
A Practical Companion to the ML Exercise Book


1 Regression Methods

1. (Gradient descent: comparison with
xxx another training algorithm
xxx on a function approximation task)
• CMU, 2004 fall, Carlos Guestrin, HW4, pr. 2
In this problem you'll compare the Gradient Descent training algorithm with
one [other] training algorithm of your choosing, on a particular function
approximation problem. For this problem, the idea is to familiarize yourself
with Gradient Descent and at least one other numerical solution technique.
The dataset data.txt contains a series of (x, y) records, where x ∈ [0, 5] and y
is a function of x given by y = a sin(bx) + w, where a and b are parameters to be
learned and w is a noise term such that w ∼ N(0, σ²). We want to learn from
the data the best values of a and b to minimize the sum of squared error:

arg min_{a,b} Σ_{i=1}^n (y_i − a sin(b x_i))².

Use any programming language of your choice and implement two training
techniques to learn these parameters. The first technique should be Gradient
Descent with a fixed learning rate, as discussed in class. The second can be
any of the other numerical solutions listed in class: Levenberg-Marquardt,
Newton's Method, Conjugate Gradient, Gradient Descent with dynamic
learning rate and/or momentum considerations, or one of your own choice not
mentioned in class.
You may want to look at a scatterplot of the data to get rough initial values
for the parameters a and b. If you are getting a large sum of squared error
after convergence (where large means > 100), you may want to try random
restarts.
Write a short report detailing the method you chose and its relative perfor-
mance in comparison to standard Gradient Descent (report the final solution
obtained (values of a and b) and some measure of the computation required
to reach it and/or the resistance of the approach to local minima). If possi-
ble, explain the difference in performance based on the algorithmic difference
between the two approaches you implemented and the function being learned.
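A minimal sketch of the fixed-learning-rate Gradient Descent part, in Python with NumPy. Since data.txt is not included here, the data is generated synthetically with assumed true parameters a = 2, b = 1.5; the gradients are those of the mean squared error of y = a sin(bx).

```python
import numpy as np

def sse(a, b, x, y):
    """Sum of squared errors for the model y = a*sin(b*x)."""
    r = y - a * np.sin(b * x)
    return float(r @ r)

def gradient_descent(x, y, a0, b0, lr=0.01, iters=5000):
    """Fixed-learning-rate gradient descent on (a, b), using MSE gradients."""
    a, b = a0, b0
    for _ in range(iters):
        r = y - a * np.sin(b * x)                      # residuals
        grad_a = -2.0 * np.mean(r * np.sin(b * x))     # d(MSE)/da
        grad_b = -2.0 * np.mean(r * a * x * np.cos(b * x))  # d(MSE)/db
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, 200)
y = 2.0 * np.sin(1.5 * x) + rng.normal(0.0, 0.1, 200)  # assumed a=2, b=1.5

# rough initial values, as one would read off a scatterplot
a_hat, b_hat = gradient_descent(x, y, a0=1.8, b0=1.4)
print(a_hat, b_hat, sse(a_hat, b_hat, x, y))
```

Starting far from the optimum instead may land in a local minimum of the sin-based loss, which is exactly the situation where the random restarts mentioned above help.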

2. (The exponential distribution: parameter estimation
xxx in the MLE sense and in the MAP sense,
xxx using the Gamma distribution as a prior)
CMU, 2015 fall, A. Smola, B. Poczos, HW1, pr. 1.1.ab
a. An exponential distribution with parameter λ has the probability density
function (p.d.f.) Exp(x) = λe^{−λx} for x ≥ 0. Given some i.i.d. data
{x_i}_{i=1}^n ∼ Exp(λ), derive the maximum likelihood estimate (MLE) λ_MLE.
b. A Gamma distribution with parameters r > 0, α > 0 has the p.d.f.

Gamma(x|r, α) = (α^r / Γ(r)) x^{r−1} e^{−αx} for x ≥ 0,

where Γ is Euler's gamma function.
If the posterior distribution is in the same family as the prior distribution,
then we say that the prior distribution is the conjugate prior for the likelihood
function.
Show that the Gamma distribution is a conjugate prior of the Exp(λ) dis-
tribution. In other words, show that if X = {x_i}_{i=1}^n with x_i ∼ Exp(λ), and
λ ∼ Gamma(r, α), then P(λ|X) ∼ Gamma(r*, α*) for some values r*, α*.

c. Derive the maximum a posteriori estimator (MAP) λ_MAP as a function of
r, α.
d. What happens [with λ_MLE and λ_MAP] as n gets large?

e. Let's perform an experiment in the above setting. Generate n = 20 random
variables drawn from Exp(λ = 0.2). Fix α = 100 and vary r over the range
(1, 30) using a stepsize of 1. Compute the corresponding MLE and MAP
estimates for λ. For each r, repeat this process 50 times and compute the
mean squared error of both estimates compared against the true value. Plot
the mean squared error as a function of r. (Note: Octave parameterizes the
Exponential distribution with θ = 1/λ.)

f. Now, fix (r, α) = (30, 100) and vary n up to 1000. Plot the MSE for each n of
the corresponding estimates.

g. Under what conditions is the MLE estimator better? Under what condi-
tions is the MAP estimator better? Explain the behavior in the two above
plots.

Answer:

a. The log likelihood is

ℓ(λ) = Σ_i (ln λ − λx_i) = n ln λ − λ Σ_i x_i.

Set the derivative to 0:

n/λ − Σ_i x_i = 0  ⇒  λ_MLE = n / Σ_i x_i = 1/x̄.

This is biased.
b.

P(λ|X) ∝ P(X|λ)P(λ)
       ∝ λ^n e^{−λ Σ_i x_i} · λ^{α−1} e^{−βλ}
       ∝ λ^{n+α−1} e^{−λ(Σ_i x_i + β)}.

Therefore P(λ|X) ∼ Gamma(α + n, Σ_i x_i + β).
c. The log posterior is, up to an additive constant,

ln P(λ|X) = −λ (Σ_i x_i + β) + (n + α − 1) ln λ.

Set the derivative to 0:

0 = −(Σ_i x_i + β) + (n + α − 1)/λ  ⇒  λ_MAP = (n + α − 1) / (Σ_i x_i + β).

d.

λ_MAP = (n + α − 1) / (Σ_i x_i + β) = (1 + (α − 1)/n) / (x̄ + β/n) → 1/x̄ = λ_MLE as n → ∞.
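A quick numeric check of the two estimators in Python with NumPy (here α and β denote the shape and rate of the Gamma prior, as in the derivation above; NumPy, like Octave, parameterizes the exponential by the scale θ = 1/λ):

```python
import numpy as np

def lambda_mle(x):
    """MLE for the exponential rate: n / sum(x) = 1 / mean(x)."""
    return len(x) / np.sum(x)

def lambda_map(x, alpha, beta):
    """MAP under a Gamma(alpha, beta) prior: (n + alpha - 1) / (sum(x) + beta)."""
    return (len(x) + alpha - 1.0) / (np.sum(x) + beta)

rng = np.random.default_rng(0)
lam_true = 0.2
x_small = rng.exponential(1.0 / lam_true, 20)      # scale = 1/lambda
x_large = rng.exponential(1.0 / lam_true, 100000)

# with little data the prior pulls the MAP estimate away from the MLE;
# with much data the two estimates (and the truth) coincide, as in part d
print(lambda_mle(x_small), lambda_map(x_small, alpha=30, beta=100))
print(lambda_mle(x_large), lambda_map(x_large, alpha=30, beta=100))
```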
e.

f.

g. The MLE is better when prior information is incorrect. The MAP is better
with low sample size and good prior information. Asymptotically they are the
same.
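The experiment of parts e and f can be sketched as follows (Python with NumPy rather than Octave; in the Gamma(r, α) prior notation of the problem statement, λ_MAP = (n + r − 1)/(Σ_i x_i + α)):

```python
import numpy as np

rng = np.random.default_rng(1)
lam_true, n, reps, alpha = 0.2, 20, 50, 100

mse_mle, mse_map = [], []
for r in range(1, 31):                               # vary the prior shape r
    err_mle, err_map = [], []
    for _ in range(reps):
        x = rng.exponential(1.0 / lam_true, n)       # scale = 1/lambda
        s = np.sum(x)
        err_mle.append((n / s - lam_true) ** 2)
        err_map.append(((n + r - 1) / (s + alpha) - lam_true) ** 2)
    mse_mle.append(np.mean(err_mle))
    mse_map.append(np.mean(err_map))

# mse_mle is flat in r; mse_map is lowest where the prior mode
# (r - 1)/alpha is near the true value 0.2
print(min(mse_map), np.mean(mse_mle))
```

Plotting mse_mle and mse_map against r (e.g. with matplotlib) reproduces the behavior discussed in part g: a well-placed prior beats the MLE at this small sample size.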

3. (The binomial distribution: MLE parameter estimation
xxx using Newton's method;
xxx the Gamma distribution: MLE parameter estimation
xxx using gradient descent and Newton's method)
• CMU, 2008 spring, Tom Mitchell, HW2, pr. 1.2
xxx •◦ CMU, 2015 fall, A. Smola, B. Poczos, HW1, pr. 1.2.
a. For the binomial sampling function, pdf f(x) = C_n^x p^x (1 − p)^{n−x}, find the MLE
using the Newton-Raphson method, starting with an estimate θ⁰ = 0.1, n = 100,
x = 8. Show the resulting θ^j until it reaches convergence (θ^{j+1} − θ^j < .01). (Note
that the binomial MLE may be calculated analytically; you may use this to
check your answer.)

b. Note: For this [part of the] exercise, please make use of the digamma and
trigamma functions. You can find the digamma and trigamma functions in
any scientific computing package (e.g. Octave, Matlab, Python...).
Inside the handout, the estimators.mat file contains a vector drawn from a
Gamma distribution. Run your implementation of gradient descent and New-
ton's method (for the latter, see ex. 41 in our exercise book) to obtain
the MLE estimators for this distribution. Create a plot showing the conver-
gence of the two above methods. How do they compare? Which took more
iterations? Lastly, provide the actual estimated values obtained.

Solution:

a.
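One way to carry out the Newton-Raphson iteration of part a, sketched in Python: the update is θ ← θ − ℓ′(θ)/ℓ″(θ) applied to the binomial log-likelihood ℓ(θ) = x ln θ + (n − x) ln(1 − θ), whose analytic maximizer is x/n = 0.08.

```python
def newton_binomial_mle(n, x, theta0=0.1, tol=0.01):
    """Newton-Raphson on l(theta) = x*ln(theta) + (n-x)*ln(1-theta)."""
    theta = theta0
    trace = [theta]
    while True:
        d1 = x / theta - (n - x) / (1.0 - theta)           # l'(theta)
        d2 = -x / theta**2 - (n - x) / (1.0 - theta)**2    # l''(theta)
        theta_new = theta - d1 / d2
        trace.append(theta_new)
        if abs(theta_new - theta) < tol:
            return theta_new, trace
        theta = theta_new

theta_hat, trace = newton_binomial_mle(n=100, x=8)
print(trace)   # the successive theta_j values, starting at 0.1
```

With this starting point the iteration converges in a couple of steps to a value very close to the analytic MLE 0.08.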

b. You should have gotten α ≈ 4, β ≈ 0.5.

4. (Linear, polynomial, regularized (L2), and kernelized regression:
xxx application on a [UCI ML Repository] dataset
xxx for housing prices in the Boston area)
•· MIT, 2001 fall, Tommi Jaakkola, HW1, pr. 1
xxx MIT, 2004 fall, Tommi Jaakkola, HW1, pr. 3
xxx MIT, 2006 fall, Tommi Jaakkola, HW2, pr. 2.d
A. Here we will be using a regression method to predict housing prices in
suburbs of Boston. You'll find the data in the file housing.data. Information
about the data, including the column interpretation, can be found in the file
housing.names. These files are taken from the UCI Machine Learning Repository,
https://archive.ics.uci.edu/ml/datasets.html.

We will predict the median house value (the 14th, and last, column of the
data) based on the other columns.

a. First, we will use a linear regression model to predict the house values,
using squared-error as the criterion to minimize. In other words y = f(x; ŵ) =
ŵ₀ + Σ_{i=1}^{13} ŵ_i x_i, where ŵ = arg min_w Σ_{t=1}^n (y_t − f(x_t; w))²; here y_t are the house
values, x_t are input vectors, and n is the number of training examples.
Write the following MATLAB functions (these should be simple functions to
code in MATLAB):

• A function that takes as input weights w and a set of input vectors
  {x_t}_{t=1,...,n}, and returns the predicted output values {y_t}_{t=1,...,n}.
• A function that takes as input training input vectors and output values,
  and returns the optimal weight vector ŵ.
• A function that takes as input a training set of input vectors and output
  values, and a test set of input vectors and output values, and returns the
  mean training error (i.e., average squared-error over all training samples)
  and mean test error.
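For reference, the three routines can be sketched in Python as well (a minimal sketch, assuming NumPy; the function names are made up, not part of the assignment, and a column of ones supplies the intercept ŵ₀):

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones for the intercept term w0."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def predict(w, X):
    """Predicted outputs for input vectors X under weights w."""
    return add_intercept(X) @ w

def fit(X, y):
    """Least-squares weights; pinv handles rank-deficient cases,
    as with MATLAB's pinv() discussed later in the solution."""
    return np.linalg.pinv(add_intercept(X)) @ y

def mean_errors(Xtrain, ytrain, Xtest, ytest):
    """Mean squared training and test errors."""
    w = fit(Xtrain, ytrain)
    return (np.mean((predict(w, Xtrain) - ytrain) ** 2),
            np.mean((predict(w, Xtest) - ytest) ** 2))

# sanity check on synthetic noiseless data: the fit should be exact
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 13))
w_true = rng.normal(size=14)
y = add_intercept(X) @ w_true
tr_err, te_err = mean_errors(X[:40], y[:40], X[40:], y[40:])
print(tr_err, te_err)
```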

b. To test our linear regression model, we will use part of the data set as a
training set, and the rest as a test set. For each training set size, use the first
lines of the data file as a training set, and the remaining lines as a test set.
Write a MATLAB function that takes as input the complete data set and the
desired training set size, and returns the mean training and test errors.
Turn in the mean squared training and test errors for each of the following
training set sizes: 10, 50, 100, 200, 300, 400.
(Quick validation: For a sample size of 100, we got a mean training error of
4.15 and a mean test error of 1328.)

c. What condition must hold for the training input vectors so that the training
error will be zero for any set of output values?

d. Do the training and test errors tend to increase or decrease as the training
set size increases? Why? Try some other training set sizes to see that this is
only a tendency, and sometimes the change is in the opposite direction.

e. We will now move on to polynomial regression. We will predict the house
values using a function of the form:

f(x; w) = w₀ + Σ_{i=1}^{13} Σ_{d=1}^{m} w_{i,d} x_i^d,
where again, the weights w are chosen so as to minimize the mean squared
error of the training set. Think about why we also include all lower-order
polynomial terms up to the highest order rather than just the highest ones.
Note that we only use features which are powers of a single input feature.
We do so mostly in order to simplify the problem. In most cases, it is more
beneficial to use features which are products of different input features, and
perhaps also their powers.
Think of why such features are usually more powerful.
Write a version of your MATLAB function from section b that also takes as
input a maximal degree m and returns the training and test error under such
a polynomial regression model.

NOTE: When the degree is high, some of the features will have extremely
high values, while others will have very low values. This causes severe nume-
rical precision problems with matrix inversion, and yields wrong answers. To
overcome this problem, you will have to appropriately scale each feature x_i^d
included in the regression model, to bring all features to roughly the same
magnitude. Be sure to use the same scaling for the training and test sets.
For example, divide each feature by the maximum absolute value of the fe-
ature, among all training and test examples. (MATLAB matrix and vector
operations can be very useful for doing such scaling operations easily.)
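A possible Python analogue of the degree expansion with the suggested max-abs scaling (a sketch assuming NumPy; the helper names are invented, not from the handout):

```python
import numpy as np

def degexpand(X, m):
    """Expand each input feature x_i into the powers x_i^1 ... x_i^m."""
    return np.hstack([X ** d for d in range(1, m + 1)])

def scale_by_max_abs(Xtrain, Xtest):
    """Divide each feature by its max absolute value over train AND test,
    so that the same scaling is applied to both sets."""
    s = np.abs(np.vstack([Xtrain, Xtest])).max(axis=0)
    s[s == 0] = 1.0                              # avoid division by zero
    return Xtrain / s, Xtest / s

Xtr = np.arange(1.0, 7.0).reshape(3, 2)          # toy data: 3 examples, 2 features
Xte = np.array([[7.0, 8.0]])
Etr, Ete = scale_by_max_abs(degexpand(Xtr, 3), degexpand(Xte, 3))
print(Etr.shape, Ete.shape)                      # (3, 6) (1, 6)
print(np.abs(np.vstack([Etr, Ete])).max())       # 1.0 after scaling
```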

f.

g. For a training set size of 400, turn in the mean squared training and test
errors for maximal degrees of zero through ten.

(Quick validation: for maximal degree two, we got a training error of 14.5 and
a test error of 32.8.)

h. Explain the qualitative behavior of the test error as a function of the
polynomial degree. Which degree seems to be the best choice?

i. Prove (in two sentences) that the training error is monotonically decreasing
with the maximal degree m. That is, that the training error using a higher
degree and the same training set is necessarily less than or equal to the
training error using a lower degree.

j. We claim that if there is at least one feature (component of the input
vector x) with no repeated values in the training set, then the training error
will approach zero as the polynomial degree increases. Why is this true?

B. In this [part of the] problem, we explore the behavior of polynomial re-
gression methods when only a small amount of training data is available.
...
We will begin by using a maximum likelihood estimation criterion for the
parameters w that reduces to least squares fitting.

k. Consider a simple 1D regression problem. The data in housing.data provides
information on how 13 different factors affect house price in the Boston area.
(Each column of data represents a different factor, and is described in brief
in the file housing.names.) To simplify matters (and make the problem easier to
visualise), we consider predicting the house price (the 14th column) from the
LSTAT feature (the 13th column).
We split the data set into two parts (in testLinear.m), train on the first part
and test on the second. We have provided you with the necessary MATLAB
code for training and testing a polynomial regression model. Simply edit the
script (ps1_part2.m) to generate the variations discussed below.

i. Use ps1_part2.m to calculate and plot training and test errors for
polynomial regression models as a function of the polynomial order
(from 1 to 7). Use 250 training examples (set numtrain=250).

ii. Briefly explain the qualitative behavior of the errors. Which of
the regression models are over-fitting to the data? Provide a brief
justification.

iii. Rerun ps1_part2.m with only 50 training examples (set num-
train=50). Briefly explain key differences between the resulting plot
and the one from part i. Which of the models are over-fitting this
time?

Comment: There are many ways of trying to avoid over-fitting. One way is
to use a maximum a posteriori (MAP) estimation criterion rather than maxi-
mum likelihood. The MAP criterion allows us to penalize parameter choices
that we would not expect to lead to good generalization. For example, very
large parameter values in linear regression make predictions very sensitive to
slight variations in the inputs. We can express a preference against such large
parameter values by assigning a prior distribution over the parameters, such
as the simple Gaussian
p(w; α²) = N(0, α²I).
This prior decreases rapidly as the parameters deviate from zero. The single
variance (hyper-parameter) α² controls the extent to which we penalize large
parameter values. This prior needs to be combined with the likelihood to get
the MAP criterion. The MAP parameter estimate maximizes

ln(p(y|Xw, σ²) p(w; α²)) = ln p(y|Xw, σ²) + ln p(w; α²).

The resulting parameter estimates are biased towards zero due to the prior.
We can find these estimates as before by setting the derivatives to zero.

l. Show that

ŵ_MAP = (X^⊤X + (σ²/α²) I)^{−1} X^⊤y.

m. In the above solution, show that in the limit of infinitely large α, the MAP
estimate is equal to the ML estimate, and explain why this happens.
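The closed form in part l can be checked numerically (a Python sketch assuming NumPy): as α² grows, the ridge term σ²/α² vanishes and the estimate approaches the ML (least squares) solution; as α² shrinks, the estimate is pulled towards zero.

```python
import numpy as np

def w_map(X, y, sigma2, alpha2):
    """MAP estimate: (X^T X + (sigma^2/alpha^2) I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + (sigma2 / alpha2) * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0.0, 0.1, 100)

w_ml = np.linalg.lstsq(X, y, rcond=None)[0]
w_weak = w_map(X, y, sigma2=0.01, alpha2=1e8)     # very weak prior -> ML
w_strong = w_map(X, y, sigma2=0.01, alpha2=1e-6)  # very strong prior -> near 0
print(w_ml, w_weak, w_strong)
```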

n. Let us see how the MAP estimate changes our solution in the housing-price
estimation problem. The MATLAB code you used above actually contains a
variable corresponding to the variance ratio var_ratio = σ²/α² for the MAP es-
timator. This has been set to a default value of zero to simulate the ML
estimator. In this part, you should vary this value from 1e-8 to 1e-4 in multi-
ples of 10 (i.e., 1e-8, 1e-7, ..., 1e-4). A larger ratio corresponds to a stronger
prior (smaller values of α² constrain the parameters w to lie closer to the origin).

iv. Plot the training and test errors as a function of the polynomial
order using the above 5 MAP estimators and 250 and 50 training
points.

v. Describe how the prior affects the estimation results.

C. Implement the kernel linear regression method (described in MIT, 2006
fall, Tommi Jaakkola, HW2, pr. 2.a-c / Estimp-56 / Estimp-72) for λ > 0. We
are interested in exploring how the regularization parameter λ ≥ 0 affects the
solution when the kernel function is the radial basis kernel

K(x, x′) = exp(−(β/2) ‖x − x′‖²), β > 0.

We have provided training and test data as well as helpful MATLAB scripts
in hw2/prob2. You should only need to complete the relevant lines in the run_prob2
script. The data pertains to the problem of predicting Boston housing prices
based on various indicators (normalized). Evaluate and plot the training and
test errors (mean squared errors) as a function of λ in the range λ ∈ (0, 1). Use
β = 0.05. Explain the qualitative behavior of the two curves.
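A Python sketch of kernel ridge regression with this RBF kernel (assuming NumPy; it uses the standard closed form α = (K + λI)^{−1}y with predictions Kα, which is algebraically equivalent to the λ(λI + K)^{−1} form, and synthetic data in place of the hw2/prob2 files):

```python
import numpy as np

def rbf_kernel(A, B, beta):
    """K(x, x') = exp(-(beta/2) * ||x - x'||^2), rows of A vs rows of B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * beta * sq)

def kernel_ridge(Xtrain, ytrain, Xtest, lam, beta):
    """Train/test predictions of RBF kernel regression with ridge lambda."""
    K = rbf_kernel(Xtrain, Xtrain, beta)
    a = np.linalg.solve(K + lam * np.eye(len(ytrain)), ytrain)
    return K @ a, rbf_kernel(Xtest, Xtrain, beta) @ a

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)

tr_small, _ = kernel_ridge(X[:40], y[:40], X[40:], lam=1e-6, beta=0.05)
tr_large, _ = kernel_ridge(X[:40], y[:40], X[40:], lam=10.0, beta=0.05)
print(np.mean((tr_small - y[:40])**2), np.mean((tr_large - y[:40])**2))
```

Sweeping lam over (0, 1) and plotting both errors reproduces the qualitative behavior discussed in the solution: near-zero training error at tiny λ, rising as λ grows.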

Solution:

A.

a.

b. First to read the data (ignoring column four):

>> data = load('housing.data');
>> x = data(:,[1:3 5:13]);
>> y = data(:,14);

To get the training and test errors for a training set of size s, we invoke the
following MATLAB command:
>> [trainE,testE] = testLinear(x,y,s)
Here are the errors I got:

training size   training error   test error
10              6.27 × 10^−26    1.05 × 10^5
50              3.437            24253
100             4.150            1328
200             9.538            316.1
300             9.661            381.6
400             22.52            41.23

[ Note that for a training size of ten, the training error should have been
zero. The very low, but still non-zero, error is a result of limited precision of
the calculations, and is not a problem. Furthermore, with only ten training
examples, the optimal regression weights are not uniquely defined. There is a
four-dimensional linear subspace of weight vectors that all yield zero training
error. The test error above (for a training size of ten) represents an arbitrary
choice of weights from this subspace (implicitly made by the pinv() function).
Using different, equally optimal, weights would yield different test errors. ]

c. The training error will be zero if the input vectors are linearly independent.
More precisely, since we are allowing an affine term w₀, it is enough that
the input vectors, with an additional term always equal to one, are linearly
independent. Let X be the matrix of input vectors, with additional 'one'
terms, y any output vector, and w a possible weight vector. If the inputs are
linearly independent, Xw = y always has a solution, and the weights w lead to
zero training error.
[ Note that if X is a square matrix with linearly independent rows, then it
is invertible, and Xw = y has a unique solution. But even if X is not a square
matrix, but its rows are still linearly independent (this can only happen if
there are fewer rows than columns, i.e., fewer training examples than features),
then there are solutions to Xw = y, which do not determine w uniquely, but
still yield zero training error (as in the case of a sample size of ten above). ]

d. The training error tends to increase. As more examples have to be fitted,
it becomes harder to 'hit', or even come close to, all of them.
The test error tends to decrease. As we take into account more examples
when training, we have more information, and can come up with a model that
better resembles the true behavior. More training examples lead to better
generalization.

e. We will use the following functions, on top of those from question b:
function xx = degexpand(x, deg)
function [trainE, testE] = testPoly(x, y, numtrain, deg)

f.
...

g. To get the training and test errors for maximum degree d, we invoke the
following MATLAB command:
>> [trainE,testE] = testPoly(x,y,400,d)
Here are the errors I got:

degree   training error   test error
0        83.8070          102.2266
1        22.5196          41.2285
2        14.8128          32.8332
3        12.9628          31.7880
4        10.8684          5262
5        9.4376           5067
6        7.2293           4.8562 × 10^7
7        6.7436           1.5110 × 10^6
8        5.9908           3.0157 × 10^9
9        5.4299           7.8748 × 10^10
10       4.3867           5.2349 × 10^13

[ These results were obtained using pinv(). Using different operations, although
theoretically equivalent, might produce different results for higher degrees. In
any case, using any of the suggested methods above, the errors should match
the above table at least up to degree five. Beyond that, using inv() starts
producing unreasonable results due to extremely small values in the matrix,
which make it almost singular (non-invertible). If you used inv() and got such
values, you should point this out.
Degree zero refers to having a constant predictor, i.e., we predict the same
output value for all inputs. The constant value that minimizes the training
error (and is thus used) is the mean training output. ]

h. Allowing more complex models, with more features, we can use as predic-
tors functions that better correspond to the true behavior of the data. And
so, the approximation error (the difference between the optimal model from
our limited class and the true behavior of the data) decreases as we increase
the degree. As long as there is enough training data to support such complex
models, the generalization error is not too bad, and the test error decrea-
ses. However, past some point we start over-fitting the training data, and
the increase in the generalization error becomes much more significant than
the continued decrease in the approximation error (which we cannot directly
observe), causing the test error to rise.
Looking at the test error, the best maximum degree seems to be three.

i. Predictors of lower maximum degree are included in the set of predictors
of higher maximum degree (they correspond to predictors in which the weights
of higher-degree features are set to zero). Since we choose from within the
set the predictor that minimizes the training error, allowing more predictors
can only decrease the training error.

j. We show that for all m ≥ n − 1 (where n is the number of training examples)
the training error is 0, by constructing weights which predict the training
examples exactly. Let j be a component of the input with no repeated values.
We let w_{i,d} = 0 for all i ≠ j and all d = 1, ..., m. Then we have

f(x) = w₀ + Σ_i Σ_d w_{i,d} x_i^d = w₀ + Σ_{d=1}^m w_{j,d} x_j^d.

Given n training points (x₁, y₁), ..., (x_n, y_n) we are required to find w₀, w_{j,1}, ..., w_{j,m}
s.t. w₀ + Σ_{d=1}^m w_{j,d} (x_i)_j^d = y_i, ∀i = 1, ..., n. That is, we want to interpolate n po-
ints with a degree m ≥ n − 1 polynomial, which can be done exactly as long as
the points (x_i)_j are distinct.

B.

k.

i.

ii. The training error is monotonically decreasing (non-increasing) with po-
lynomial order. This is because higher-order models can fully represent any
lower-order model by adequate setting of parameters, which in turn implies
that the former can do no worse than the latter when fitting to the same
training data.

(Note that this monotonicity property need not hold if the training sets to
which the higher- and lower-order models were fit were different, even if these
were drawn from the same underlying distribution.)

The test error mostly decreases with model order till about 5th order, and
then increases. This is an indication (but not proof) that higher-order models
(6th and 7th) might be overfitting to the data. Based on these results, the
best choice of model for training on the given data is the 5th-order model,
since it has the lowest error on an independent test set of around 250 examples.

iii.

We note the following differences between the plots for 250 and 50 examples:

• The training errors are lower in the present case. This is because we
  are having to fit fewer points with the same model. In this example, in
  particular, we are fitting only a subset of the points we were previously
  fitting (since there is no randomness in drawing points for training).

• The test errors for most models are higher. This is evidence of systematic
  overfitting for all model orders, relative to the case where there were many
  more training points.

• The model with the lowest test error is now the third-order model. From
  4th order onwards, the test error generally increases (though the 7th order
  is an exception, perhaps due to the particular choice of training and test
  sets). This tells us that with fewer training examples, our preference
  should switch towards lower-order models (in the interest of achieving
  low generalisation error), even though the true model responsible for
  generating the underlying data might be of much higher order. This
  relates to the trade-off between bias and variance. We typically want
  to minimise the mean-square error, which is the sum of the squared bias and
  the variance. Low-order models typically have high bias but low variance.
  Higher-order models may be unbiased, but have higher variance.

l.
...

m.
...

n.

iv.
Plots for 250 training examples. Left to right, (then) top to bottom, variance
ratio = 1e-8 to 1e-4:

Plots for 50 training examples. Left to right, (then) top to bottom, variance
ratio = 1e-8 to 1e-4:

v. We make the following observations:

• As the variance ratio (i.e. the strength of the prior) increases, the trai-
  ning error increases (slightly). This is because we are no longer solely
  interested in obtaining the best fit to the training data.

• The test error for higher-order models decreases dramatically with strong
  priors. This is because we are no longer allowing these models to overfit
  to the training data, by restricting the range of weights possible.

• Test error generally decreases with increasing prior.

• As a consequence of the above two points, the best model changes slightly
  with increasing prior, in the direction of more complex models.

• For 50 training samples, the difference in test error between ML and
  MAP is more significant than with 250 training examples. This is because
  overfitting is a more serious problem in the former case.

C. Sample code for this problem is shown below:

Ntrain = size(Xtrain,1);
Ntest = size(Xtest,1);
for i=1:length(lambda),
  lmb = lambda(i);
  alpha = lmb * ((lmb*eye(Ntrain) + K)^-1) * Ytrain;
  Atrain = (1/lmb) * repmat(alpha', Ntrain, 1);
  yhat_train = sum(Atrain.*K,2);
  Atest = (1/lmb) * repmat(alpha', Ntest, 1);
  yhat_test = sum(Atest.*(Ktrain_test'), 2);
  E(i,:) = [mean((yhat_train-Ytrain).^2), mean((yhat_test-Ytest).^2)];
end;
The resulting plot is shown in the nearby figure. As can be seen, the training
error is zero at λ = 0 and increases as λ increases. The test error initially
decreases, reaches a minimum around 0.1, and then increases again. This is
exactly as we would expect.
λ ≈ 0 results in over-fitting (the model is too powerful). Our regression
function has a low bias but high variance.
By increasing λ we constrain the model, thus increasing the training error.
While the regularization increases bias, the variance decreases faster, and we
generalize better.
High values of λ result in under-fitting (high bias, low variance) and both
training and test errors are high.

5. (Locally-weighted linear regression, regularized and kernelized)
•◦ Stanford, 2008 fall, Andrew Ng, HW1, pr. 2.d
The files q2x.dat and q2y.dat contain the inputs (x^(i)) and outputs (y^(i)) for a

a. Implement (unweighted) linear regression (y = θ^⊤x) on this dataset (using
the normal equations, [LC: i.e., the analytical / closed-form solution]), and
plot on the same figure the data and the straight line resulting from your fit.
(Remember to include the intercept term.)

b. Implement locally weighted linear regression on this dataset (using the
weighted normal equations you derived in part (b) [LC: i.e., Stanford, 2008
fall, Andrew Ng, HW1, pr. 2.b, but you may take a look at CMU, 2010 fall,
Aarti Singh, midterm, pr. 4, found in the exercise book]), and plot on the same
figure the data and the curve resulting from your fit. When evaluating h(·)
at a query point x, use weights

w(x^(i)) = exp(−(x − x^(i))² / (2τ²)),

with a bandwidth parameter τ = 0.8. (Again, remember to include the inter-
cept term.)

c. Repeat (b) four times, with τ = 0.1, 0.3, 2 and 10. Comment briefly on what
happens to the fit when τ is too small or too large.
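A sketch of parts a-b in Python (assuming NumPy, with synthetic data standing in for q2x.dat/q2y.dat): at each query point we solve the weighted normal equations θ = (X^⊤WX)^{−1}X^⊤Wy with the Gaussian weights above.

```python
import numpy as np

def lwlr_predict(xq, x, y, tau):
    """Locally weighted linear regression at a single query point xq."""
    X = np.column_stack([np.ones_like(x), x])          # intercept + feature
    w = np.exp(-(x - xq) ** 2 / (2.0 * tau ** 2))      # Gaussian weights
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # weighted normal eqs
    return theta[0] + theta[1] * xq

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 100))
y = np.sin(x) + 0.05 * rng.normal(size=100)            # a clearly nonlinear target

yhat = np.array([lwlr_predict(q, x, y, tau=0.8) for q in x])
print(np.mean((yhat - y) ** 2))
```

Rerunning with τ = 0.1 makes the curve chase individual noisy points, while τ = 10 collapses it towards the single global straight line, which is the qualitative behavior part c asks about.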

Solution:

LC: See the code in the prob2.m Matlab file that I put in the HW1 sub-
folder of the Stanford 2011f folder in the main (Stanford) archive, and also in
book/fig/Stanford.2008f.ANg.HW1.pr2d.

(Plotted in color where available.)
For a small bandwidth parameter τ, the fitting is dominated by the closest
training samples. The smaller the bandwidth, the fewer training samples are
actually taken into account when doing the regression, and the regression
results thus become very susceptible to noise in those few training samples.
For larger τ, we have enough training samples to reliably fit straight lines;
unfortunately, a straight line is not the right model for these data, so we also
get a bad fit for large bandwidths.

6. ([Weighted] Linear Regression applied to
xxx predicting the needed quantity of insulin,
xxx starting from the sugar level in the patient's blood)
•◦ CMU, 2009 spring, Ziv Bar-Joseph, HW1, pr. 4
An automated insulin injector needs to calculate how much insulin it should
inject into a patient based on the patient's blood sugar level. Let us formulate
this as a linear regression problem as follows: let y_i be the dependent, predicted
variable (the insulin dose), let x_i be the blood sugar level, and let β₀, β₁ and β₂
be the unknown coefficients of the regression function. Thus, y_i = β₀ + β₁x_i + β₂x_i²,
and we can formulate the problem of finding the unknown β = (β₀, β₁, β₂) as:

β̂ = (X^⊤X)^{−1}X^⊤y.

See data2.txt (posted on website) for data based on the above scenario, with
space-separated fields conforming to:
bloodsugarlevel insulinedose weightage
The purpose of the weightage field will be made clear in part c.

a. Write code in Matlab to estimate the regression coefficients given the
dataset consisting of pairs of independent and dependent variables. Generate
a space-separated file with the estimated parameters from the entire dataset
by writing out all the parameters.

b. Write code in Matlab to perform inference by predicting the insulin dosage
given the blood sugar level based on training data, using a leave-one-out cross-
validation scheme. Generate a space-separated file with the predicted dosages
in order. The predicted dosages:

c. However, it has been found that one group of patients is twice as sensitive
to the insulin dosage as the other. In the training data, these particular
patients are given a weightage of 2, while the others are given a weightage
of 1. Is your goodness-of-fit function flexible enough to incorporate this
information?

d. Show how to formulate the regression function, and correspondingly cal-
culate the coefficients of regression under this new scenario, by incorporating
the given weights.

e. Code up this variant of regression analysis. Write out the new coefficients
of regression you obtain by using the whole dataset as training data.

Solution:

a. The betas, in order:


−74.3825 13.4215 1.1941
b.
1417.0177 1501.4423 1966.3563 2833.7942 2953.4532 3075.472 3199.8566 3326.9663
3456.4704 5038.3777 5196.6767 5357.0094 7091.8113 7278.2709 7467.3604 7658.5256
7852.0808 9703.9748 9921.8604 10142.5438 10365.8512 12498.2968 12749.6202 13003.94
1798.3745 1869.1966 2024.7112 2265.6988 2392.0227 2756.1588 3915.9302 4004.465
4878.0057 5094.3282 6217.981 6485.8542 6544.4688 6805.7564 7073.8455 7207.5032
7285.3657 9393.5129 9515.9043 9704.0029 10060.5539 12037.6304 12361.3586 12903.5394

c. Since we need to weigh each point differently, our current goodness-of-fit
function is unable to work in this scenario. However, since the weights for this
specific dataset are 1 and 2, we may just use the old formalism and double
the data items with weightage 2. The changed formalism, which enables us to
assign weights of any precision to the data sample, is shown below.

d. Let:

y = Xβ
y = (y₁, y₂, ..., y_n)^⊤
β = (β₁, β₂, ..., β_m)^⊤
X_{i,1} = 1
X_{i,j+1} = x_{i,j}

Let us define the weight matrix as:

Ω_{i,i} = w_i
Ω_{i,j} = 0 (for i ≠ j)

So, Ωy = ΩXβ.
To minimize the weighted square error, we have to take the derivative with
respect to β:

∂/∂β [(Ωy − ΩXβ)^⊤(Ωy − ΩXβ)]
= ∂/∂β [(Ωy)^⊤(Ωy) − 2(Ωy)^⊤(ΩXβ) + (ΩXβ)^⊤(ΩXβ)]
= ∂/∂β [(Ωy)^⊤(Ωy) − 2(Ωy)^⊤(ΩXβ) + β^⊤X^⊤Ω^⊤ΩXβ]

Therefore

∂/∂β [(Ωy − ΩXβ)^⊤(Ωy − ΩXβ)] = 0
⇔ −2((Ωy)^⊤(ΩX))^⊤ + 2X^⊤Ω^⊤ΩX β̂ = 0
⇔ β̂ = (X^⊤Ω^⊤ΩX)^{−1}X^⊤Ω^⊤Ωy.
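A numeric check of this formula in Python (assuming NumPy, with synthetic data in place of data2.txt). Note that with Ω = diag(ω), each squared residual is weighted by ω_i², so duplicating a row, as suggested in part c, corresponds to ω_i = √2:

```python
import numpy as np

def weighted_lstsq(X, y, omega):
    """beta = (X^T O^T O X)^{-1} X^T O^T O y, with O = diag(omega).
    Each squared residual is weighted by omega_i**2."""
    W = np.diag(omega ** 2)                  # O^T O
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
X = np.column_stack([np.ones_like(x), x, x ** 2])   # 1, x, x^2 as in the problem
y = -74.0 + 13.4 * x + 1.2 * x ** 2 + rng.normal(0, 5, 30)

# weighting the squared error of the first 10 rows by 2 (omega = sqrt(2))
# must match ordinary least squares on the dataset with those rows duplicated
omega = np.where(np.arange(30) < 10, np.sqrt(2.0), 1.0)
beta_w = weighted_lstsq(X, y, omega)

X_dup = np.vstack([X, X[:10]])
y_dup = np.concatenate([y, y[:10]])
beta_dup = np.linalg.lstsq(X_dup, y_dup, rcond=None)[0]
print(beta_w, beta_dup)
```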

e. The new beta coefficients are, in order:

−57.808 13.821 1.199

7. (Linear [Ridge] regression applied to
xxx predicting the level of PSA in the prostate gland,
xxx using a set of medical test results)
•◦⋆⋆ CMU, 2009 fall, Geoff Gordon, HW3, pr. 3
The linear regression method is widely used in the medical domain. In this
question you will work on prostate cancer data from a study by Stamey et
al.¹ You can download the data from . . . .
Your task is to predict the level of prostate-specific antigen (PSA) using a set
of medical test results. PSA is a protein produced by the cells of the prostate
gland. High levels of PSA often indicate the presence of prostate cancer or
other prostate disorders.
The attributes are several clinical measurements on men who have prostate
cancer. There are 8 attributes: log cancer volume (lcavol), log prostate weight
(lweight), log of the amount of benign prostatic hyperplasia (lbph), seminal
vesicle invasion (svi), age, log of capsular penetration (lcp), Gleason score
(gleason), and percent of Gleason scores of 4 or 5 (pgg45). svi and gleason
are categorical, that is they take values either 1 or 0; others are real-valued.
We will refer to these attributes as A1 = lcavol, A2 = lweight, A3 = age, A4
= lbph, A5 = svi, A6 = lcp, A7 = gleason, A8 = pgg45.
Each row of the input file describes one data point: the first column is the
index of the data point, the following eight columns are attributes, and the
tenth column gives the log PSA level (lpsa), the response variable we are in-
terested in. We already randomized the data and split it into three parts
corresponding to training, validation and test sets. The last column of the file
indicates whether the data point belongs to the training set, validation set
or test set, indicated by '1' for training, '2' for validation and '3' for testing.
The training data includes 57 examples; validation and test sets contain 20
examples each.

Inspecting the Data

a. Calculate the correlation matrix of the 8 attributes and report it in a table.
The table should be 8-by-8. You can use Matlab functions.

b. Report the top 2 pairs of attributes that show the highest pairwise positive
correlation and the top 2 pairs of attributes that show the highest pairwise
negative correlation.

Solving the Linear Regression Problem

You will now try to find several models in order to predict the lpsa levels.
The linear regression model is

Y = f(X) + ǫ

where ǫ is a Gaussian noise variable, and

f(X) = Σ_{j=0}^{p} wj φj(X)

1 Stamey TA, Kabalin JN, McNeal JE et al. Prostate specific antigen in the diagnosis and treatment of the
prostate. II. Radical prostatectomy treated patients. J Urol 1989;141:1076-83.
where p is the number of basis functions (features), φj is the j-th basis function,
and wj is the weight we wish to learn for the j-th basis function. In the models
below, we will always assume that φ0(X) = 1 represents the intercept term.

c. Write a Matlab function that takes the data matrix Φ and the column
vector of responses y as an input and produces the least squares fit w as the
output (refer to the lecture notes for the calculation of w).

d. You will create the following three models. Note that before solving each
regression problem below, you should scale each feature vector to have a
zero mean and unit variance. Don't forget to include the intercept column,
φ0(X) = 1, after scaling the other features. Notice that since you shifted the
attributes to have zero mean, in your solutions, the intercept term will be the
mean of the response variable.
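A sketch of this preprocessing step on made-up numbers (not the prostate data): standardize the features, prepend φ0(X) = 1, and observe that the intercept weight indeed comes out as the mean of the response:

```python
import numpy as np

# Illustrative feature matrix: 5 samples, 2 attributes (NOT the PSA data)
A = np.array([[1.0, 7.0], [2.0, 6.0], [3.0, 9.0], [4.0, 8.0], [5.0, 10.0]])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Standardize each feature to zero mean and unit variance...
mu, sigma = A.mean(axis=0), A.std(axis=0)
A_std = (A - mu) / sigma
# ...then prepend the intercept column phi_0(X) = 1
Phi = np.column_stack([np.ones(len(A_std)), A_std])

# Least-squares fit; because the features are centered (orthogonal to the
# constant column), the intercept weight equals the mean of the response
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w[0], y.mean())
```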

• Model1: Features are equal to input attributes, with the addition of a con-
stant feature φ0. That is, φ0(X) = 1, φ1(X) = A1, . . . , φ8(X) = A8. Solve the
linear regression problem and report the resulting feature weights. Discuss
what it means for a feature to have a large negative weight, a large positive
weight, or a small weight. Would you be able to comment on the weights, if
you had not scaled the predictors to have the same variance? Report the mean
squared error (MSE) on the training and validation data.

• Model2: Include additional features corresponding to pairwise products of
the first six of the original attributes,2 i.e., φ9(X) = A1·A2, . . . , φ13(X) = A1·A6,
φ14(X) = A2·A3, . . . , φ23(X) = A5·A6. First compute the features according
to the formulas above using the unnormalized values, then shift and scale the
new features to have zero mean and unit variance and add the column for the
intercept term φ0(X) = 1. Report the five features whose weights achieved the
largest absolute values.

• Model3: Starting with the results of Model1, drop the four features with
the lowest weights (in absolute values). Build a new model using only the
remaining features. Report the resulting weights.

e. Make two bar charts, the first to compare the training errors of the three
models, the second to compare the validation errors of the three models.
Which model achieves the best performance on the training data? Which
model achieves the best performance on the validation data? Comment on
differences between training and validation errors for individual models.

f. Which of the models would you use for predicting the response variable?
Explain.

Ridge Regression

For this question you will start with Model2 and employ regularization on it.

g. Write a Matlab function to solve Ridge regression. The function should take
the data matrix Φ, the column vector of responses y, and the regularization
parameter λ as the inputs and produce the least squares fit w as the output
(refer to the lecture notes for the calculation of w). Do not penalize w0, the
intercept term. (You can achieve this by replacing the first column of the λI
matrix with zeros.)

2 These features are also called interactions, because they attempt to account for the effect of two attributes
being simultaneously high or simultaneously low.

h. You will create a plot exploring the effect of the regularization parameter
on training and validation errors. The x-axis is the regularization parameter
(on a log scale) and the y-axis is the mean squared error. Show two curves in
the same graph, one for the training error and one for the validation error.
Starting with λ = 2^−30, try 50 values: at each iteration increase λ by a factor
of 2, so that for example the second iteration uses λ = 2^−29. For each λ, you
need to train a new model.
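A sketch of parts g and h on synthetic data (the data below is made up, not from the assignment): ridge regression with an unpenalized intercept, plus the λ sweep from 2^−30 upward:

```python
import numpy as np

# Synthetic design matrix: intercept column plus 3 random features
rng = np.random.default_rng(0)
n = 30
Phi = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
y = Phi @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(n)

def ridge(Phi, y, lam):
    """Ridge solution; the entry of lam*I acting on w0 is zeroed so that
    the intercept is not penalized, as the problem statement suggests."""
    penal = lam * np.eye(Phi.shape[1])
    penal[0, 0] = 0.0
    return np.linalg.solve(Phi.T @ Phi + penal, Phi.T @ y)

# lambda sweep: 50 values starting at 2^-30, doubling at each iteration
lams = [2.0 ** k for k in range(-30, 20)]
train_mse = [np.mean((y - Phi @ ridge(Phi, y, lam)) ** 2) for lam in lams]
print(train_mse[0], train_mse[-1])
```

On the training set the MSE can only grow as λ increases, which is the behavior part i asks you to explain.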

i. What happens to the training error as the regularization parameter in-
creases? What about the validation error? Explain the curves in terms of
overfitting, bias and variance.

j. What is the λ that achieves the lowest validation error, and what is the
validation error at that point? Compare this validation error to the Model2
validation error when no regularization was applied (you solved this in part
e). How does w differ in the regularized and unregularized versions, i.e., what
effect did regularization have on the weights?

k. Is this validation error lower or higher than the validation error of the
model you chose in part f? Which one should be your final model?

l. Now that you have decided on your model (features and possibly the re-
gularization parameter), combine your training and validation data to make
a combined training set, train your model on this combined training set, and
evaluate it on the test set. Report the training and test errors.
Solution:

a.
          lcavol  lweight  age     lbph    svi     lcp     gleason  pgg45   lpsa
lcavol    1.0000  0.2805   0.2249  0.0273  0.5388  0.6753  0.4324   0.4336  0.7344
lweight   0.2805  1.0000   0.3479  0.4422  0.1553  0.1645  0.0568   0.1073  0.4333
age       0.2249  0.3479   1.0000  0.3501  0.1176  0.1276  0.2688   0.2761  0.1695
lbph      0.0273  0.4422   0.3501  1.0000 -0.0858 -0.0069  0.0778   0.0784  0.1798
svi       0.5388  0.1553   0.1176 -0.0858  1.0000  0.6731  0.3204   0.4576  0.5662
lcp       0.6753  0.1645   0.1276 -0.0069  0.6731  1.0000  0.5148   0.6315  0.5488
gleason   0.4324  0.0568   0.2688  0.0778  0.3204  0.5148  1.0000   0.7519  0.3689
pgg45     0.4336  0.1073   0.2761  0.0784  0.4576  0.6315  0.7519   1.0000  0.4223
lpsa      0.7344  0.4333   0.1695  0.1798  0.5662  0.5488  0.3689   0.4223  1.0000

b. The top 2 pairs that show the highest pairwise positive correlation are
gleason - pgg45 (0.7519) and lcavol - lcp (0.6753).
Highest negative correlations:
lbph - svi (-0.0858) and lbph - lcp (-0.0069).

c. See below:
function what = lregress(Y, X)
% least squares solution to linear regression
% X is the feature matrix
% Y is the response variable vector
what = (X'*X) \ (X'*Y);
end
d.
Model1:
The weight vector:
w = [2.68265, 0.71796, 0.17843, −0.21235, 0.25752, 0.42998, −0.14179, 0.08745, 0.02928].
Model2:
The five largest absolute values in descending order:
lweight*age, lbph, lweight, age, age*lbph.
Model3:
The features with the lowest absolute weights in Model1:
pgg45, gleason, lcp, lweight.
The resulting weights: w = [2.6827, 0.7164, −0.1735, 0.3441, 0.4095].

e.

[Two bar charts: training MSE of the three models (left) and validation MSE of the three models (right), plotted against Model ID.]

Model2 achieves the best performance on the training data, whereas Model1
achieves the best performance on the validation data. Model2 suffers from
overfitting, indicated by its very good training error but poor validation error.
Model3 seems to be too simple: it has a higher training and a higher validation
error compared to Model1. The features that were dropped are informative, as
indicated by Model1's lower training and validation errors.

f. Model1, since it achieves the best performance on the validation data.
Model2 overfits, and Model3 is too simple.

g. See below:

function what = ridgeregress(Y, X, lambda)
% X is the feature matrix
% Y is the response vector
% what are the estimated weights
penal = lambda * eye(size(X,2));
penal(:,1) = 0;   % do not penalize the intercept term w0
what = (X'*X + penal) \ (X'*Y);
end

h.

[Plot: training error and testing error (MSE, roughly 0.2 to 1.6) as a function of log2(lambda), for log2(lambda) from −30 to 20.]

i. When the model is not regularized much (the left side of the graph), the
training error is low and the validation error is high, indicating the model is
too complex and overfits the training data. In that region, the bias is
low and the variance is high.
As the regularization parameter increases, the bias increases and the variance
decreases. The overfitting problem is overcome, as indicated by the decreasing
validation error and increasing training error.
As the regularization penalty increases too much, the model becomes too
simple and starts suffering from underfitting, as can be seen from the poor
performance on the training data.

j. log2 λ = 4, i.e., λ = 16, achieves the lowest validation error, which is 0.447.
This validation error is much less than the validation error of the model wi-
thout regularization, which was 0.867. Regularized weights are smaller than
unregularized weights: regularization decreases the magnitude of the weights.

k. The validation error of the penalized model (λ = 16) is 0.447, which is lower
than Model1's validation error, 0.5005. Therefore, this model is chosen.

l. The final model's training error is 0.40661 and the test error is 0.58892.

8. (Linear weighted, unweighted, and functional regression:
xxx application to denoising quasar spectra)
•◦· Stanford, 2017 fall, Andrew Ng, Dan Boneh, HW1, pr. 5
xxx Stanford, 2016 fall, Andrew Ng, John Duchi, HW1, pr. 5

Solution:
9. (Feature selection in the context of linear regression
xxx with L1 regularization:
xxx the coordinate descent method)
•◦⋆ MIT, 2003 fall, Tommi Jaakkola, HW4, pr. 1

Solution:
10. (Logistic regression with gradient ascent:
xxx application to text classification)
•◦ CMU, 2010 fall, Aarti Singh, HW1, pr. 5
In this problem you will implement Logistic Regression and evaluate its per-
formance on a document classification task. The data for this task is taken
from the 20 Newsgroups data set,3 and is available from the course web page.
Our model will use the bag-of-words assumption. This model assumes that
each word in a document is drawn independently from a categorical distribu-
tion over possible words. (A categorical distribution is a generalization of a
Bernoulli distribution to multiple values.) Although this model ignores the
ordering of words in a document, it works surprisingly well for a number of
tasks. We number the words in our vocabulary from 1 to m, where m is the
total number of distinct words in all of the documents. Documents from class
y are drawn from a class-specific categorical distribution parameterized by θy.
θy is a vector, where θy,i is the probability of drawing word i, and Σ_{i=1}^m θy,i = 1.
Therefore, the class-conditional probability of drawing document x from our
model is

P(X = x | Y = y) = Π_{i=1}^m θ_{y,i}^{count_i(x)},

where count_i(x) is the number of times word i appears in x.


a. Provide a high-level description of the Logistic Regression algorithm. Be
sure to describe how to estimate the model parameters and how to classify a
new example.

b. Implement Logistic Regression. We found that a step size around 0.0001
worked well. Train the model on the provided training data and predict the
labels of the test data. Report the training and test error.

Solution:

a. The logistic regression model is

P(Y = 1 | X = x, w) = exp(w0 + Σ_i wi xi) / (1 + exp(w0 + Σ_i wi xi)),

where w = (w0, w1, . . . , wm)⊤ is our parameter vector. We will find ŵ by maxi-
mizing the data log-likelihood l(w):

l(w) = log Π_j [ exp(y^j (w0 + Σ_i wi x_i^j)) / (1 + exp(w0 + Σ_i wi x_i^j)) ]
     = Σ_j [ y^j (w0 + Σ_i wi x_i^j) − log(1 + exp(w0 + Σ_i wi x_i^j)) ]

We can estimate/learn the parameters w of logistic regression by optimi-
zing l(w), using gradient ascent. The gradient of l(w) is the array of partial
derivatives of l(w):

3 Full version available from http://people.csail.mit.edu/jrennie/20Newsgroups/.
∂l(w)/∂w0 = Σ_j ( y^j − exp(w0 + Σ_i wi x_i^j) / (1 + exp(w0 + Σ_i wi x_i^j)) )
          = Σ_j ( y^j − P(Y = 1 | X = x^j; w) )

∂l(w)/∂wk = Σ_j ( y^j x_k^j − x_k^j exp(w0 + Σ_i wi x_i^j) / (1 + exp(w0 + Σ_i wi x_i^j)) )
          = Σ_j x_k^j ( y^j − P(Y = 1 | X = x^j; w) )

Let w^(t) represent our parameter vector on the t-th iteration of gradient ascent.
To perform gradient ascent, we first set w^(0) to some arbitrary value (say 0).
We then repeat the following updates until convergence:

w0^(t+1) ← w0^(t) + α Σ_j ( y^j − P(Y = 1 | X = x^j; w^(t)) )

wk^(t+1) ← wk^(t) + α Σ_j x_k^j ( y^j − P(Y = 1 | X = x^j; w^(t)) )

where α is a step size parameter which controls how far we move along our
gradient at each step. We set α = 0.0001. The algorithm converges when
‖w^(t) − w^(t+1)‖ < δ, that is, when the weight vector doesn't change much during
an iteration. We set δ = 0.001.
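The update rules above can be sketched as follows; the synthetic data, the larger step size (the text's 0.0001 suits its own data), and the iteration cap are illustrative additions:

```python
import numpy as np

# Synthetic data (illustrative, not the 20 Newsgroups features)
rng = np.random.default_rng(1)
n, m = 200, 3
X = rng.standard_normal((n, m))
w_true = np.array([1.5, -2.0, 0.5])
y = (1 / (1 + np.exp(-(X @ w_true))) > rng.random(n)).astype(float)

Xb = np.column_stack([np.ones(n), X])     # column of ones for the w0 term
w = np.zeros(m + 1)                       # w^(0) = 0
alpha, delta = 0.01, 1e-3                 # step size and convergence tolerance
for _ in range(100000):
    p = 1 / (1 + np.exp(-(Xb @ w)))       # P(Y = 1 | x^j; w) for every example
    w_new = w + alpha * (Xb.T @ (y - p))  # the batch update from the text
    if np.linalg.norm(w - w_new) < delta: # converged: w barely changed
        break
    w = w_new
print(w)
```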

b. Training error: 0.00. Test error: 0.29. The large difference between training
and test error means that our model overfits our training data. A possible
reason is that we do not have enough training data to estimate either model
accurately.
11. (Logistic regression with gradient ascent:
xxx application on a synthetic dataset from R2;
xxx overfitting)
•◦ CMU, 2015 spring, T. Mitchell, N. Balcan, HW4, pr. 2.c-i
In logistic regression, our goal is to learn a set of parameters by maximizing
the conditional log-likelihood of the data.
In this problem you will implement a logistic regression classifier and apply it
to a two-class classification problem. In the archive, you will find one .m file
for each of the functions that you are asked to implement, along with a file
called HW4Data.mat that contains the data for this problem. You can load the
data into Octave by executing load(HW4Data.mat) in the Octave interpreter.
Make sure not to modify any of the function headers that are provided.

a. Implement a logistic regression classifier using gradient ascent (for the
formulas and their calculation see ex. 23 in our exercise book4) by filling in
the missing code for the following functions:

• Calculate the value of the objective function:
obj = LR_CalcObj(XTrain,yTrain,wHat)

• Calculate the gradient:
grad = LR_CalcGrad(XTrain,yTrain,wHat)

• Update the parameter value:
wHat = LR_UpdateParams(wHat,grad,eta)

• Check whether gradient ascent has converged:
hasConverged = LR_CheckConvg(oldObj,newObj,tol)

• Complete the implementation of gradient ascent:
[wHat,objVals] = LR_GradientAscent(XTrain,yTrain)

• Predict the labels for a set of test examples:
[yHat,numErrors] = LR_PredictLabels(XTest,yTest,wHat)

where the arguments and return values of each function are defined as follows:

• XTrain is an n × p dimensional matrix that contains one training instance
per row
• yTrain is an n × 1 dimensional vector containing the class labels for each
training instance
• wHat is a (p + 1) × 1 dimensional vector containing the regression parameter
estimates ŵ0, ŵ1, . . . , ŵp
• grad is a (p + 1) × 1 dimensional vector containing the value of the gradient
of the objective function with respect to each parameter in wHat
• eta is the gradient ascent step size that you should set to eta = 0.01
• obj, oldObj and newObj are values of the objective function
• tol is the convergence tolerance, which you should set to tol = 0.001
• objVals is a vector containing the objective value at each iteration of gra-
dient ascent

4 From the formal point of view you will assume that a dataset with n training examples and p features will be
given to you. The class labels will be denoted y^(i), the features x_1^(i), . . . , x_p^(i), and the parameters w0, w1, . . . , wp,
where the superscript (i) denotes the sample index.
• XTest is an m × p dimensional matrix that contains one test instance per
row
• yTest is an m × 1 dimensional vector containing the true class labels for
each test instance
• yHat is an m × 1 dimensional vector containing your predicted class labels
for each test instance
• numErrors is the number of misclassified examples, i.e. the differences be-
tween yHat and yTest

To complete the LR_GradientAscent function, you should use the helper func-
tions LR_CalcObj, LR_CalcGrad, LR_UpdateParams, and LR_CheckConvg.

b. Train your logistic regression classifier on the data provided in XTrain and
yTrain with LR_GradientAscent, and then use your estimated parameters wHat to
calculate predicted labels for the data in XTest with LR_PredictLabels.

c. Report the number of misclassified examples in the test set.

d. Plot the value of the objective function on each iteration of gradient as-
cent, with the iteration number on the horizontal axis and the objective value
on the vertical axis. Make sure to include axis labels and a title for your
plot. Report the number of iterations that are required for the algorithm to
converge.

e. Next, you will evaluate how the training and test error change as the trai-
ning set size increases. For each value of k in the set {10, 20, 30, . . . , 480, 490, 500},
first choose a random subset of the training data of size k using the following
code:

subsetInds = randperm(n, k)
XTrainSubset = XTrain(subsetInds, :)
yTrainSubset = yTrain(subsetInds)

Then re-train your classifier using XTrainSubset and yTrainSubset, and use the
estimated parameters to calculate the number of misclassified examples on
both the training set XTrainSubset and yTrainSubset and on the original test set
XTest and yTest. Finally, generate a plot with two lines: in blue, plot the value
of the training error against k, and in red, plot the value of the test error
against k, where the error should be on the vertical axis and the training set size
should be on the horizontal axis. Make sure to include a legend in your plot
to label the two lines. Describe what happens to the training and test error
as the training set size increases, and provide an explanation for why this
behavior occurs.

f. Based on the logistic regression formula you learned in class, derive the
analytical expression for the decision boundary of the classifier in terms of
w0, w1, . . . , wp and x1, . . . , xp. What can you say about the shape of the decision
boundary?

g. In this part, you will plot the decision boundary produced by your classifier.
First, create a two-dimensional scatter plot of your test data by choosing the
two features that have the highest absolute weight in your estimated parameters
wHat (let's call them features j and k), and plotting the j-th dimension stored
in XTest(:,j) on the horizontal axis and the k-th dimension stored in XTest(:,k)
on the vertical axis. Color each point on the plot so that examples with true
label y = 1 are shown in blue and label y = 0 are shown in red. Next, using
the formula that you derived in part (f), plot the decision boundary of your
classifier in black on the same figure, again considering only dimensions j and
k.

Solution:

a. See the functions LR_CalcObj, LR_CalcGrad, LR_UpdateParams, LR_CheckConvg,
LR_GradientAscent, and LR_PredictLabels in the solution code.

b. See the function RunLR in the solution code.

c. There are 13 misclassified examples in the test set.

d. See the figure below. The algorithm converges after 87 iterations.
e. See the figure below.

As the training set size increases, the test error decreases but the training error in-
creases. This pattern becomes even more evident when we perform the same
experiment using multiple random sub-samples for each training set size, and
calculate the average training and test error over these samples, the result of
which is shown in the figure below.

When the training set size is small, the logistic regression model is often
capable of perfectly classifying the training data since it has relatively little
variation. This is why the training error is close to zero. However, such a
model has poor generalization ability because its estimate of wHat is based on
a sample that is not representative of the true population from which the data
is drawn. This phenomenon is known as overfitting because the model fits too
closely to the training data. As the training set size increases, more variation
is introduced into the training data, and the model is usually no longer able
to fit the training set as well. This is also due to the fact that the complete
dataset is not 100% linearly separable. At the same time, more training data
provides the model with a more complete picture of the overall population,
which allows it to learn a more accurate estimate of wHat. This in turn leads
to better generalization ability, i.e. lower prediction error on the test dataset.

f. The analytical formula for the decision boundary is given by w0 + Σ_{j=1}^p wj xj =
0. This is the equation of a hyperplane in Rp, which indicates that the decision
boundary is linear.

g. See the function PlotDB in the solution code. See the figure below.
12. (Logistic Regression (with gradient ascent)
xxx and Rosenblatt's Perceptron:
xxx application on the Breast Cancer dataset;
xxx n-fold cross-validation; confidence interval)
•◦ (CMU, 2009 spring, Ziv Bar-Joseph, HW2, pr. 4)
For this exercise, you will use the Breast Cancer dataset, downloadable from
the course web page. Given 9 different attributes, such as uniformity of cell
size, the task is to predict malignancy.5 The archive from the course web
page contains a Matlab method loaddata.m, so you can easily load in the data
by typing (from the directory containing loaddata.m): data = loaddata. The
variables in the resulting data structure relevant for you are:
• data.X: 683 9-dimensional data points, each element in the interval [1, 10].
• data.Y: the 683 corresponding classes, either 0 (benign), or 1 (malignant).
Logistic Regression

a. Write code in Matlab to train the weights for logistic regression. To avoid
dealing with the intercept term explicitly, you can add a nonzero-constant
tenth dimension to data.X: data.X(:,10)=1. Your regression function thus be-
comes simply:

P(Y = 0 | x; w) = 1 / (1 + exp(Σ_{k=1}^{10} xk wk))

P(Y = 1 | x; w) = exp(Σ_{k=1}^{10} xk wk) / (1 + exp(Σ_{k=1}^{10} xk wk))

and the gradient-ascent update rule:

w ← w + (α/683) Σ_{j=1}^{683} x^j (y^j − P(Y^j = 1 | x^j; w))

Use the learning rate α = 1/10. Try different learning rates if you cannot get
w to converge.
b. To test your program, use 10-fold cross-validation, splitting [data.X data.Y]
into 10 random approximately equal-sized portions, training on 9 concatenated
parts, and testing on the remaining part. Report the mean classification
accuracy over the 10 runs, and the 95% confidence interval.
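The 95% confidence intervals reported in the solution below are consistent with the usual normal approximation for a proportion, p̂ ± 1.96·sqrt(p̂(1 − p̂)/N); a sketch (assuming N = 683, the total number of cross-validated predictions):

```python
import math

# Normal-approximation 95% confidence interval for a classification
# accuracy p_hat estimated from N predictions
def accuracy_ci(p_hat, n, z=1.96):
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - half, p_hat + half)

# With mean accuracy 0.965 over the 683 data points, this reproduces the
# interval (0.951217, 0.978783) reported in part b of the solution
lo, hi = accuracy_ci(0.965, 683)
print(round(lo, 6), round(hi, 6))
```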

Rosenblatt's Perceptron

A very simple and popular linear classifier is the perceptron algorithm of
Rosenblatt (1962), a single-layer neural network model of the form

y(x) = f(w⊤x),

with the activation function

f(a) = 1 if a ≥ 0, and f(a) = −1 otherwise.

5 For more information on what the individual attributes mean, see ftp://ftp.ics.uci.edu/pub/machine-
learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names.
For this classifier, we need our classes to be −1 (benign) and 1 (malignant),
which can be achieved with the Matlab command: data.Y = data.Y * 2 - 1.
Weight training usually proceeds in an online fashion, iterating through the
individual data points x^j one or more times. For each x^j, we compute the
predicted class ŷ^j = f(w⊤x^j) for x^j under the current parameters w, and update
the weight vector as follows:

w ← w + x^j [y^j − ŷ^j].

Note how w only changes if x^j was misclassified under the current model.
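The update rule above can be sketched in a few lines; the toy dataset is illustrative, not the Breast Cancer data:

```python
import numpy as np

# Toy linearly separable data with labels in {-1, +1}; the last column is
# the constant intercept feature (analogous to data.X(:,10)=1 in the text)
X = np.array([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0],
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])

w = np.zeros(3)
for _ in range(20):                      # iterate through the data 20 times
    for xj, yj in zip(X, y):
        yhat = 1 if w @ xj >= 0 else -1  # predicted class f(w^T x)
        w += xj * (yj - yhat)            # changes w only on a mistake

mistakes = sum(1 for xj, yj in zip(X, y)
               if (1 if w @ xj >= 0 else -1) != yj)
print(mistakes)
```

Because this toy set is linearly separable, the mistake count drops to zero well before the 20th pass, which is exactly the diagnostic part c asks you to apply.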
c. Implement this training algorithm in Matlab. To avoid dealing with the in-
tercept term explicitly, augment each point in data.X with a non-zero constant
tenth element. In Matlab this can be done by typing: data.X(:,10)=1. Have
your algorithm iterate through the whole training data 20 times and report the
number of examples that were still misclassified in the 20th iteration. Does
it look like the training data is linearly separable? (Hint: The perceptron
algorithm is guaranteed to converge if the data is linearly separable.)

d. To test your program, use 10-fold cross-validation, using the splits you
obtained in part b. For each split, do 20 training iterations to train the wei-
ghts. Report the mean classification accuracy over the 10 runs, and the 95%
confidence interval.

e. If the data is not linearly separable, weights can toggle back and forth
from iteration to iteration. Even in the linearly separable case, the learned
model is often very dependent on which training data points come first in the
training sequence. A simple improvement is the weighted perceptron: training
proceeds as before, but the weight vector w is saved after each update. After
training, instead of the final w, the average of all saved w is taken to be
the learned weight vector. Report the 10-fold CV accuracy for this variant and
compare it to the simple perceptron's.

Solution:

You should have gotten something like this:

b. mean accuracy: 0.965, confidence interval: (0.951217, 0.978783).

c. 30 misclassifications in the 20th iteration. (Note that using the trained
weights *after* the 20th iteration results in only around 24 misclassifications.)
When running with 200 iterations, still more than 20 misclassifications occur,
so the data is unlikely to be linearly separable, as otherwise the training error
would become zero after enough iterations.

d. Perceptron:
mean accuracy = 0.956, 95% confidence interval: (0.940618, 0.971382).

e. Weighted perceptron:
mean accuracy = 0.968, 95% confidence interval: (0.954800, 0.981200).
13. (Logistic regression using Newton's method:
xxx application on R2 data)
•◦ Stanford, 2011 fall, Andrew Ng, HW1, pr. 1.b
a. On the web page associated to this booklet, you will find the files q1x.dat
and q1y.dat which contain the inputs (x^(i) ∈ R2) and outputs (y^(i) ∈ {0, 1}) res-
pectively for a binary classification problem, with one training example per
row.
Implement Newton's method for optimizing ℓ(θ), the [conditional] log-likelihood
function

ℓ(θ) = Σ_{i=1}^m [ y^(i) ln σ(θ · x^(i)) + (1 − y^(i)) ln(1 − σ(θ · x^(i))) ],

and apply it to fit a logistic regression model to the data. Initialize Newton's
method with θ = 0 (the vector of all zeros). What are the coefficients θ
resulting from your fit? (Remember to include the intercept term.)
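A minimal sketch of these Newton updates on a small made-up 2-D dataset (not q1x.dat/q1y.dat), using the gradient Φ⊤(y − h) and Hessian −Φ⊤ diag(h(1 − h)) Φ of ℓ(θ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Small illustrative dataset in R^2: the two classes overlap near the
# origin, so the maximum-likelihood solution is finite
X = np.array([[-1.0, 0.0], [-2.0, 1.0], [-1.5, -1.0], [0.2, 0.0],
              [1.0, 0.0], [2.0, 1.0], [1.5, -1.0], [-0.2, 0.0]])
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
Phi = np.column_stack([np.ones(len(X)), X])   # include the intercept term

theta = np.zeros(3)                           # initialize with theta = 0
for _ in range(20):
    h = sigmoid(Phi @ theta)
    grad = Phi.T @ (y - h)                            # gradient of l(theta)
    H = -Phi.T @ (Phi * (h * (1.0 - h))[:, None])     # Hessian of l(theta)
    theta = theta - np.linalg.solve(H, grad)          # Newton update
print(theta)
```

Newton's method typically converges in well under 20 iterations here, which is why it is attractive for small logistic regression problems.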
b. Plot the training data (your axes should be x1 and x2, corresponding to the
two coordinates of the inputs, and you should use a different symbol for each
point plotted to indicate whether that example had label 1 or 0). Also plot
on the same figure the decision boundary fit by logistic regression. (I.e., this
should be a straight line showing the boundary separating the region where
h(x) > 0.5 from where h(x) ≤ 0.5.)

Solution:

a. θ = (−2.6205, 0.7604, 1.1719), with the first entry corresponding to the intercept
term.

b. [Figure: scatter plot of the training data with the fitted linear decision boundary.]
14. (Solving logistic regression, the kernelized version,
xxx using Newton's method:
xxx implementation + application on R2 data)
•◦ CMU, 2005 fall, Tom Mitchell, HW3, pr. 2.cd
a. Implement the kernel logistic regression described in ex. 26 in our exercise
book, using the Gaussian kernel Kσ(x, x′) = exp(−‖x − x′‖2 / (2σ2)).
Run your program on the file ds2.txt (the first two columns are X, the last
column is Y) with σ = 1. Report the training error. Set the stepsize to 0.01
and the maximum number of iterations to 100. The scatterplot of the ds2.txt data
is as follows:
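A sketch of computing the Gaussian kernel matrix used by kernel logistic regression (the toy points below are illustrative, not ds2.txt):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # pairwise squared distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])   # toy points in R^2
K = gaussian_kernel_matrix(X, sigma=1.0)
print(np.round(K, 4))
```

The matrix is symmetric with ones on the diagonal; larger σ makes distant points look more similar, which is what the cross-validation in part b tunes.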

b. Use 10-fold cross-validation to find the best σ and plot the total number
of mistakes for σ ∈ {0.5, 1, 2, 3, 4, 5, 6}.

Solution:

a. 53 misclassifications.

b. The best value of σ is 2.
15. (Locally-weighted, regularized (L2) logistic regression,
xxx using Newton's method:
xxx application on a dataset from R2)
•◦ Stanford, 2007 fall, Andrew Ng, HW1, pr. 2
In this problem you will implement a locally-weighted version of logistic re-
gression, which was described in the 56th exercise of the Estimating the para-
meters of some probabilistic distributions chapter of our exercise book. For
the entirety of this problem you can use the value λ = 0.0001.
Given a query point x, we compute the weights

wi = exp(−‖x − xi‖2 / (2τ2)).

This scheme gives more weight to the nearby points when predicting the
class of a new example[, much like the locally weighted linear regression dis-
cussed at exercise ??].

a. Implement the Newton algorithm for optimizing the log-likelihood function
(ℓ(θ) in the 56th exercise) for a new query point x, and use this to predict the
class of x. The q2/ directory contains data and code for this problem. You
should implement the y = lwlr(X_train, y_train, x, tau) function in the lwlr.m
file. This function takes as input the training set (the X_train and y_train
matrices), a new query point x and the weight bandwidth tau. Given this
input, the function should (i) compute the weights wi for each training example,
using the formula above, (ii) maximize ℓ(θ) using Newton's method, and (iii)
output y = 1{hθ(x) > 0.5} as the prediction.
We provide two additional functions that might help. The [X_train, y_train] =
load_data; function will load the matrices from files in the data/ folder. The
function plot_lwlr(X_train, y_train, tau, resolution) will plot the resulting classifier
(assuming you have properly implemented lwlr.m). This function evaluates the
locally weighted logistic regression classifier over a large grid of points and
plots the resulting prediction as blue (predicting y = 0) or red (predicting
y = 1). Depending on how fast your lwlr function is, creating the plot might
take some time, so we recommend debugging your code with resolution = 50; and
later increasing it to at least 200 to get a better idea of the decision boundary.

b. Evaluate the system with a variety of different bandwidth parameters τ.
In particular, try τ = 0.01, 0.05, 0.1, 0.5, 1.0, 5.0. How does the classification
boundary change when varying this parameter? Can you predict what the
decision boundary of ordinary (unweighted) logistic regression would look
like?

Solution:

a. Our implementation of lwlr.m:

function y = lwlr(X_train, y_train, x, tau)
m = size(X_train, 1);
n = size(X_train, 2);
theta = zeros(n, 1);
% compute weights
w = exp(-sum((X_train - repmat(x', m, 1)).^2, 2) / (2*tau^2));
% perform Newton's method
g = ones(n, 1);
while (norm(g) > 1e-6)
  h = 1 ./ (1 + exp(-X_train * theta));
  g = X_train' * (w.*(y_train - h)) - 1e-4*theta;
  H = -X_train' * diag(w.*h.*(1-h)) * X_train - 1e-4*eye(n);
  theta = theta - H \ g;
end
% return predicted y
y = double(x'*theta > 0);
b. These are the resulting decision boundaries, for the different values of τ:

For smaller τ, the classifier appears to overfit the data set, obtaining zero trai-
ning error, but outputting a sporadic looking decision boundary. As τ grows,
the resulting decision boundary becomes smoother, eventually converging (in
the limit as τ → ∞) to the unweighted logistic regression solution.
16. (Logistic regression with L2 regularization;
xxx application on handwritten digit recognition;
xxx comparison between the gradient method and Newton's method)
•◦ MIT, 2001 fall, Tommi Jaakkola, HW2, pr. 4
Here you will solve a digit classification problem with logistic regression mo-
dels. We have made available the following training and test sets: digit_x.dat,
digit_y.dat, digit_x_test.dat, digit_y_test.dat.
a. Derive the stochastic gradient ascent learning rule for a logistic regression
model starting from the regularized likelihood objective

J(w; c) = . . .

where ‖w‖2 = Σ_{i=0}^d wi2 [or by modifying your derivation of the delta rule for
the softmax model]. (Normally we would not include w0 in the regularization
penalty, but we have done so here for simplicity of the resulting update rule.)

b. Write a MATLAB function w = SGlogisticreg(X,y,c,epsilon) that takes inputs si-
milar to logisticreg from the previous section, and a learning rate parameter
ε, and uses stochastic gradient ascent to learn the weights. You may include
additional parameters to control when to stop, or hard-code it into the func-
tion.

c. Provide a rationale for setting the learning rate and the stopping criterion
in the context of the digit classification task. You should assume that the
regularization parameter remains fixed at 1. (You might wish to experiment
with different learning rates and stopping criteria but do NOT use the test
set. Your justification should be based on the available information before
seeing the test set.)

d. Set c = 1 and apply your pro edure for setting the learning rate and the
stopping riterion to evaluate the average log-probability of labels in the trai-
ning and test sets. Compare the results to those obtained with logisti reg. For
ea h optimization method, report the average log-probabilities for the labels
in the training and test sets as well as the orresponding mean lassi ation
errors (estimates of the miss- lassi ation probabilities). (Please in lude all
MATLAB ode you used for these al ulations.)

e. Are the train/test dieren es between the optimization methods reasona-


ble? Why? (Repeat the gradient as ent pro edure a ouple of times to ensure
that you are indeed looking at a typi al out ome.)

f. The lassiers we found above are both linear lassiers, as are all logisti
regression lassiers. In fa t, if we set c to a dierent value, we are still
sear hing the same set of linear lassiers. Try using logisti reg with dierent
values of c, to see that you get dierent lassi ations. Why are the resulting
lassiers dierent, even though the same set of lassiers is being sear hed?
Contrast the reason with the reason for the dieren es you explained in the
previous question.

g. Gaussian mixture models with identi al ovarian e matri es also lead to


linear lassiers. Is there a value of c su h that training a Gaussian mix-
ture model ne essarily leads to the same lassi ation as training a logisti
regression model using this value of c? Why?

Solution:

a.
    w ← (1 − εc/n) w + ε(yi − P(1|xi, w)) xi.

[LC: You can find the details in the MIT document.]
b.

function [w] = SGlogisticreg(X,y,c,epsilon,stopdelta)

[n,d] = size(X);
X = [ones(n,1),X];
w = zeros(d+1,1);
cont = 1;
while (cont)
    perm = randperm(n);
    oldw = w;
    for j = 1:n
        i = perm(j);  % visit the examples in random order
        w = (1 - epsilon * c / n) * w + epsilon * (y(i) - g(X(i,:) * w)) * X(i,:)';
    end
    cont = norm(oldw - w) >= stopdelta * norm(oldw);
end
c. Learning rate: If the learning rate is too high, any memory of previous
updates will be wiped out (beyond the last few points used in the updates).
It's important that all the points affect the resulting weights, and so the
learning rate should scale somehow with the number of examples. But how?
When the stochastic gradient updates converge, we are not changing the weights
on average. So each update can be seen as a slight random perturbation around
the correct weights. We'd like to keep such stochastic effects from pushing
the weights too far from the optimal solution. One way to deal with this is to
simply average out the random effects by making the learning rate scale as
ε = c/n for a constant c somewhat less than one.
But this would be slow. It's better to keep the variance of the sum of the
random perturbations constant and instead set ε = c/√n: you may recall that if
the Zi are Gaussian with zero mean and unit variance, then Σ_{i=1}^{n} Zi has
variance n. Here Zi corresponds to a gradient update based on the i-th
example. Dividing by the standard deviation of the sum, √n, makes the gradient
updates have an overall fixed variance.
Since the update is also proportional to the norm of the input examples, you
might also divide the learning rate by the overall scale of the inputs. If we
have d binary coordinates, the norm is at most √d. We get a learning rate of
ε = c/√(nd).
Stopping criterion: We want to stop when a full iteration through the training
set does not make much difference on average. Note that unless we can
perfectly separate the training set, we would still expect specific training
examples to cause change, but at convergence these changes should cancel each
other out. We should also not stop just because one, or a few, examples did
not cause much change; it might be that other examples will.
And so, after each full iteration through the training set, we check how much
the weights changed since before the iteration. As we do not know what the
scale of the weights will be, we check the magnitude of the change relative to
the magnitude of the weights. We stop if this ratio falls below some low
threshold, which represents our desired accuracy of the result (this ratio is
the parameter stopdelta).

Figure 1: Logistic regression log-likelihood, when trained with stochastic
gradient ascent, for varying stopping criteria.

d. To calculate also the classification errors, we use a slightly expanded
version of logisticll.m:

function [ll,err] = logisticcle(x,y,w)

p = g(w(1) + x*w(2:end));
ll = mean(y.*log(p) + (1-y).*log(1-p));
err = mean(y ~= (p > 0.5));

We set the learning rate to ε = 0.1/√(nd) = 0.1/80, try a stopping granularity
of δ = 0.0001, and get:

                              Newton-Raphson   Stochastic Gradient Ascent
Average log-probabilities:
    Train                     -0.0829          -0.1190
    Test                      -0.2876          -0.2871

Classification errors:
    Train                     0.01             0.02
    Test                      0.125            0.1125

Results for various stopping granularities are presented in figures 1 and 2.

e. Although both optimization methods are trying to optimize the same
objective function, neither of them is perfect, and so we expect to see some
discrepancies, as we do in fact see.

Figure 2: Logistic regression mean classification error, when trained with
stochastic gradient ascent, for varying stopping criteria.

In general, we would expect the Newton-Raphson method implemented in
logisticreg.m to perform better, i.e., come closer to the true optimum. This
should lead to a better objective value, which, especially for small values of
c, would translate into higher training performance / lower training error.
On the other hand, stochastic gradient ascent might not come as close to the
optimum, especially when the stopping criterion is very relaxed. This can be
clearly seen in figures 1 and 2, where training performance improves as the
stopping criterion becomes more stringent, and eventually converges to the
(almost) true optimum found with Newton-Raphson. Note also the slight
deviations from monotonicity, which are a result of the randomness in the
stochastic gradient ascent procedure.
However, the same cannot necessarily be said about the test error. In fact,
early stopping of stochastic gradient ascent can in some cases be seen as a
form of regularization, which might lead to better generalization, and hence
better test error. This can be seen in the figures (as well as in the tables
for δ = 0.001), especially when comparing the classification errors. For
values of δ of around 0.01 to 0.0005, the logistic model found with stochastic
gradient ascent outperforms the optimal logistic model found with
Newton-Raphson. This does not mean that Newton-Raphson did not correctly solve
the optimization problem: we tried to maximize the training log-likelihood,
which indeed we did. We simply did too good a job and overfit the training
data.
Early stopping can sometimes be useful as a regularization technique. In this
case, we could have also increased c to get stronger regularization.

f. We are searching the same space of classifiers, but with a different
objective function. This time it is not the optimization method that is
different (which in theory should not make much difference), but the actual
objective, and hence the true optimum is different. We would not expect to
find the same classifier.

g. There is no such value of c. The objective functions are different, even
for c = 0. The logistic regression objective function aims to maximize the
likelihood of the labels given the input vectors, while the Gaussian mixture
objective is to fit a probabilistic model for the training input vectors and
labels, by maximizing their joint likelihood.
17. (Multi-class regularized (L2) Logistic Regression
xxx with gradient descent:
xxx application to hand-written digit recognition)
•◦ CMU, 2014 fall, W. Cohen, Z. Bar-Joseph, HW3, pr. 1
xxx CMU, 2011 spring, Tom Mitchell, HW3, pr. 2
A. In this part of the exercise you will implement the two-class Logistic
Regression classifier and evaluate its performance on digit recognition.
The dataset we are using for this assignment is a subset of the MNIST
handwritten digit database,6 which is a set of 70,000 28 × 28 handwritten
digits from a mixture of high school students and government Census Bureau
employees. Your goal will be to write a logistic regression classifier to
distinguish between a collection of 4s and 7s, of which you can see some
examples in the nearby figure.
The data is given to you in the form of a design matrix X and a vector y of
labels indicating the class. There are two design matrices, one for training
and one for evaluation. The design matrix is of size m × n, where m is the
number of examples and n is the number of features. We will treat each pixel
as a feature, giving us n = 28 × 28 = 784.
Given a set of training points x1, x2, . . . , xm and a set of labels
y1, . . . , ym we want to estimate the parameters of the model w. We can do
this by maximizing the log-likelihood function.7
Given the sigmoid / logistic function

    σ(x) = 1 / (1 + e^{−x}),

the cost function and its gradient are

    J(w) = (λ/2) ||w||²₂ − Σ_{i=1}^{m} [ yi log σ(w⊤xi) + (1 − yi) log(1 − σ(w⊤xi)) ]

    ∇J(w) = λw − Σ_{i=1}^{m} (yi − σ(w⊤xi)) xi
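As a sanity check, the two formulas transcribe directly into NumPy (a sketch of ours, not the assignment's costLR.m, though it computes the same quantities), and the gradient can be verified against finite differences:

```python
import numpy as np

def cost_lr(w, X, y, lam):
    """Regularized logistic-regression cost J(w) and gradient, as above.

    X is the m x (n+1) design matrix (already augmented with a column of
    ones), y is a 0/1 vector of length m, lam is the regularization
    strength lambda.
    """
    p = 1.0 / (1.0 + np.exp(-X @ w))                   # sigma(w^T x_i)
    J = 0.5 * lam * (w @ w) \
        - np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad = lam * w - X.T @ (y - p)
    return J, grad
```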

Note (1): The cost function contains the regularization term (λ/2)||w||²₂.
Regularization forces the parameters of the model towards zero by penalizing
large w values. This helps to prevent overfitting and also makes the objective
function strictly convex, which means that there is a unique solution.

6 Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning
applied to document recognition. Proceedings of the IEEE 86, 11 (Nov 1998),
pp. 2278-2324.
7 LC: For the derivation of the update rule for logistic regression (together
with an L2 regularization term), see CMU, 2012 fall, T. Mitchell,
Z. Bar-Joseph, HW2, pr. 2.

Table 1: Summary of notation used for Logistic Regression

Not.    Meaning                                                   Type

m       number of training examples                               scalar
n       number of features                                        scalar
xi      ith augmented training data point (one digit example)     (n + 1) × 1
X       design matrix (all training examples)                     m × (n + 1)
yi      ith training label (is the digit a 7?)                    {0, 1}
Y or y  all training labels                                       m × 1
w       parameter vector                                          (n + 1) × 1
S       sigmoid function, S(t) = (1 + e^{−t})^{−1}                dim(t) → dim(t)
J       cost (loss) function                                      R^n → R
∇J      gradient of J (vector of derivatives in each dimension)   R^n → R^n
α       (parameter) gradient descent learning rate                scalar
d       (parameter) decay constant for α to decrease              scalar
        by every iteration
λ       (parameter) regularization strength                       scalar

Note (2): Please regularize the intercept term too, i.e., w(0) should also be
regularized.8 In order to keep the notation clean and make the implementation
easier, we assume that each xi has been augmented with an extra 1 at the
beginning, i.e., x′i = [1; xi]. Therefore our model of the log-odds is

    log(p/(1 − p)) = w(0) + w(1) x(1) + . . . + w(n) x(n).

Note (3): For models such as linear regression we were able to find a
closed-form solution for the parameters of the model. Unfortunately, for many
machine learning models, including Logistic Regression, no such closed-form
solutions exist. Therefore we will use a gradient-based method to find our
parameters.

The update rule for gradient descent is

    w_{i+1} = w_i − α d^i ∇J(w_i),

where α specifies the learning rate, or how large a step we wish to take, and
d is a decay term that we use to ensure that the step sizes we take gradually
get smaller, so that we converge. The iteration stops when the change of w or
J(w) is smaller than a threshold.
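The update rule above can be sketched as a minimizer in Python (a sketch of ours, not the assignment's minimize.m; the default parameter values are arbitrary):

```python
import numpy as np

def minimize(cost_grad, w0, alpha=0.1, d=0.99, tol=1e-8, max_iter=10000):
    """Gradient descent with a geometrically decaying step size alpha * d^i.

    cost_grad(w) must return the pair (J(w), gradient of J at w).
    Stops when one update changes w by less than tol.
    """
    w = np.asarray(w0, dtype=float).copy()
    for i in range(max_iter):
        _, g = cost_grad(w)
        w_new = w - alpha * (d ** i) * g       # step size alpha * d^i
        if np.linalg.norm(w_new - w) < tol:    # converged
            return w_new
        w = w_new
    return w
```

For example, minimizing the convex quadratic J(w) = ||w||² (gradient 2w) drives w toward the zero vector.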

a. Implement the cost function and the gradient for logistic regression in
costLR.m.9 Implement gradient descent in minimize.m. Use your minimizer to
complete trainLR.m.
8 Many resources about Logistic Regression on the web do not regularize the
intercept term, so be aware if you see different objective functions.
9 You can run run_logit.m to check whether your gradients match the cost. The
script should pass the gradient checker and then stop.

b. Once you have trained the model, you can then use it to make predictions.
Implement predictLR, which will generate the most likely classes for a given
xi.

B. In this part of the exercise you will implement the multi-class Logistic
Regression classifier and evaluate its performance on another digit
recognition dataset, provided by USPS. In this dataset, each hand-written
digit image is 16 by 16 pixels. If we treat the value of each pixel as a
boolean feature (either 0 for black or 1 for white), then each example has
16 × 16 = 256 {0, 1}-valued features, and hence x has 256 dimensions. Each
digit (i.e., 1,2,3,4,5,6,7,8,9,0) corresponds to a class label y
(y = 1, . . . , K, K = 10). For each digit, we have 600 training samples and
500 testing samples.10
Please download the data from the website. Load the usps_digital.mat file in
usps_digital.zip into Matlab. You will have four matrices:
• tr_X: training input matrix with the dimension 6000 × 256.
• tr_y: training label of the length 6000, each element is from 1 to 10.
• te_X: testing input matrix with the dimension 5000 × 256.
• te_y: testing label of the length 5000, each element is from 1 to 10.
For those who do NOT want to use Matlab, we also provide the text files for
these four matrices in usps_digital.zip. Note that if you want to view the
image of a particular training/testing example in Matlab, say the 1000th
training example, you may use the following Matlab command:
imshow(reshape(tr_X(1000,:),16,16)).
c. Use the gradient ascent algorithm to train a multi-class logistic
regression classifier. Plot (1) the objective value (log-likelihood), (2) the
training accuracy, and (3) the testing accuracy versus the number of
iterations. Report your final testing accuracy, i.e., the fraction of test
images that are correctly classified.
Note that you must choose a suitable learning rate (i.e., stepsize) for the
gradient ascent algorithm. A hint is that your learning rate cannot be too
large, otherwise your objective will increase only for the first few
iterations. In addition, you need to choose a suitable stopping criterion. You
might use the number of iterations, the decrease of the objective value, or
the maximum of the L2 norms of the gradients with respect to each wk. Or you
might watch the increase of the testing accuracy and stop the optimization
when the accuracy is stable.
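A compact sketch of such a trainer (plain batch gradient ascent on the multi-class log-likelihood; the variable names, the fixed iteration count, and the 0-based labels are our choices, not the assignment's):

```python
import numpy as np

def softmax_train(X, y, K, eta=0.5, iters=500):
    """Multi-class logistic (softmax) regression by batch gradient ascent.

    X is m x n, y holds labels in {0, ..., K-1}. Returns the n x K weight
    matrix W; the gradient of the log-likelihood is X^T (Y - P), where Y is
    the one-hot label matrix and P the softmax probabilities.
    """
    m, n = X.shape
    W = np.zeros((n, K))
    Y = np.eye(K)[y]                          # one-hot labels
    for _ in range(iters):
        Z = X @ W
        Z -= Z.max(axis=1, keepdims=True)     # for numerical stability
        P = np.exp(Z)
        P /= P.sum(axis=1, keepdims=True)     # softmax probabilities
        W += eta * (X.T @ (Y - P)) / m        # ascent step
    return W

def softmax_predict(W, X):
    return np.argmax(X @ W, axis=1)
```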

d. Now we add the regularization term (λ/2) Σ_{l=1}^{K−1} ||wl||²₂. For
λ = 1, 10, 100, 1000, report the final testing accuracies.

e. What can you conclude from the above experiment? (Hint: the relationship
between the regularization weight and the prediction performance.)

Solution:

b. You should get about 96% accuracy.

10 You can view these images at http://www.cs.nyu.edu/∼roweis/data/usps_0.jpg, ...,
http://www.cs.nyu.edu/∼roweis/data/usps_9.jpg.
47
. I use the stepsize η = 0.0001 and run the gradient as ent method for 5000
iterations. The obje tive value vs. the number of iterations, training error
vs. the number of iterations, testing error vs. the number of iterations are
presented in gure below:

d. For λ = 0, 1, 10, 100, 1000, the omparison of the testing a ura y is presented
in the next table:

λ 0 1 10 100 1000
Testing a ura y 91.44% 91.58% 91.92% 89.74% 79.78%

e. From the above result, we an see that adding the regularization ould
avoid overtting and lead to better generalization performan e (e.g., λ = 1, 10).
However, the regularization annot be too large. Although a larger regulari-
zation an de rease the varian e, it introdu es additional bias
and may lead
to worse generalization performan e.

18. (Multinomial/Categorical Logistic Regression,
xxx Gaussian Naive Bayes, Gaussian Joint Bayes, and k-NN:
xxx application on the ORL Faces dataset)
•· CMU 2010 spring, E. Xing, T. Mitchell, A. Singh, HW2, pr. 2
In this part, you are going to play with The ORL Database of Faces.

[Figure: 6 sample images from two persons]

Each image is 92 by 112 pixels. If we treat the luminance of each pixel as a
feature, each sample has 92 ∗ 112 = 10304 real-valued features, which can be
written as a random vector X. We will treat each person as a class Y
(Y = 1, . . . , K, K = 10). We use Xi to refer to the i-th feature. Given a
set of training data D = {(y^l, x^l)}, we will train different classification
models to classify images to their person ids. To simplify notation, we will
use P(y|x) in place of P(Y = y|X = x).

We will select our models by 10-fold cross-validation: partition the data for
each face into 10 mutually exclusive sets (folds). In our case, exactly one
image per fold. Then, for k = 1, . . . , 10, leave out the data from fold k
for all faces, train on the rest, and test on the left-out data. Average the
results of these 10 tests to estimate the training accuracy of your
classifier.

Note: Beware that we are actually not evaluating the generalization error of
the classifier here. When evaluating generalization error, we would need an
independent test set that is not at all touched during the whole development
and tuning process.

For your convenience, a piece of code loadFaces.m is provided to help load
the images as feature vectors.

From Tom Mitchell's additional book chapter,11 page 13, you will see a
generalization of logistic regression, which allows Y to have more than two
possible values.

a. Write down the objective function, and the first-order derivatives of the
multinomial logistic regression model (which is a binary classifier).12
Here we will consider an L2-norm regularized objective function (with a term
λ|θ|²).

b. Implement the logistic regression model with gradient ascent. Show your
evaluation result here. Use regularization parameter λ = 0.
11 www.cs.cmu.edu/∼tom/mlbook/NBayesLogReg.pdf.
12 Hint: In order to do k-class classification with a binary classifier, we
use a voting scheme. At training time, a classifier is trained for each pair
of classes. At testing time, all k(k − 1)/2 classifiers are applied to the
testing sample. Each classifier votes either for its first class or for its
second class. The class voted for by the largest number of classifiers is
chosen as the prediction.
Hint: The gradient ascent method (also known as steepest ascent) is a
first-order optimization algorithm. It optimizes a function f(x) by

    x_{t+1} = x_t + α_t f′(x_t),

where α_t is called the step size, which is often picked by line search. For
example, we can initialize α_t = 1.0, and then set α_t = α_t/2 while
f(x_t + α_t f′(x_t)) < f(x_t). The iteration stops when the change of x or
f(x) is smaller than a threshold.
Hint: If the training time of your model is too long, you can consider using
just a subset of the features (e.g., in Matlab, X = X(:,1:100:d)).

c. Overfitting and Regularization

Now we test how regularization can help prevent overfitting. During
cross-validation, let's use m images from each person for training, and the
rest for testing. Report your cross-validated results with varying
m = 1, . . . , 9 and varying regularization parameter λ.

d. Logistic Regression and Newton's method

Newton's method (also known as the Newton-Raphson method) is a second-order
optimization algorithm, which often converges in a few iterations. It
optimizes a function f(x) by the update equation

    x_{t+1} = x_t − f′(x_t) / f″(x_t).

The iteration stops when the change of x or f(x) is smaller than a threshold.
Write down the second-order derivatives and the update equation of the
logistic regression model.
Implement the logistic regression model with Newton's method. Show your
evaluation result here.

B. Implement the k-NN algorithm. Use the L2 norm as the distance metric. Show
your evaluation result here, and compare different values of k.
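A minimal sketch of such a classifier (majority vote over the k nearest points in L2 distance; breaking ties by the smallest label is our choice):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Predict the label of x by majority vote among its k nearest
    training points under the L2 (Euclidean) distance."""
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))  # L2 distances
    nearest = np.argsort(dists)[:k]                      # k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]  # ties go to the smallest label
```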

C. Conditional Gaussian Estimation

For a Gaussian model we have

    P(y|x) = P(x|y) P(y) / P(x),

where

    P(x|y) = 1 / ((2π)^{d/2} |Σy|^{1/2}) · exp(−(x − µy)⊤ Σy^{−1} (x − µy)/2),

and P(y) = πy. Please write down the MLE estimates of the model parameters
Σy, µy, and πy. Here we do not assume that the Xi are independent given Y.
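For reference, the standard maximum-likelihood estimates (a sketch of the expected answer, not taken from the original handout; n_y denotes the number of training examples with label y, out of n total):

```latex
\hat{\pi}_y = \frac{n_y}{n}, \qquad
\hat{\mu}_y = \frac{1}{n_y} \sum_{l \,:\, y^l = y} x^l, \qquad
\hat{\Sigma}_y = \frac{1}{n_y} \sum_{l \,:\, y^l = y}
    \left(x^l - \hat{\mu}_y\right) \left(x^l - \hat{\mu}_y\right)^{\top}.
```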
D. Gaussian Naive Bayes is a form of Gaussian model with the assumption that
the Xi are independent given Y. Implement the Gaussian NB model, and briefly
describe your evaluation result.

E. Compare the above methods by training/testing time and accuracy. Which
method do you prefer?

19. (Model selection:
xxx sentiment analysis for music reviews
xxx using a dataset provided by Amazon,
xxx using lasso logistic regression)
•· CMU, 2014 spring, B. Poczos, A. Singh, HW2, pr. 5
In this homework, you will perform model selection on a sentiment analysis
dataset of music reviews.13 The dataset consists of reviews from Amazon.com
for music. The ratings have been converted to a binary label, indicating a
negative review or a positive review. We will use lasso logistic regression
for this problem.14 The lasso logistic regression objective function to
minimize during training is:

    L(β) = Σ_i log(1 + exp(−yi β⊤xi)) + λ||β||₁

In lasso logistic regression, we penalize the loss function by an L1 norm of
the feature coefficients. Penalization with an L1 norm tends to produce
solutions where some coefficients are exactly 0. This makes it attractive for
high-dimensional data such as text, because in most cases most words can
typically be ignored. Furthermore, since we are often left with only a few
nonzero coefficients, the lasso solution is often easy to interpret.
The goal of model selection here is to choose λ, since each setting of λ
implies a different model size (number of non-zero coefficients).
You do not need to implement lasso logistic regression. You can download an
implementation from https://github.com/redpony/creg, and the dataset can be
found on the course web page. There are three feature files and three
response (label) files (all response files end with .res). They are already
in the format required by the implementation you will use. The files are:

• Training data: music.train and music.train.res
• Development data: music.dev and music.dev.res
• Test data: music.test and music.test.res

Important note: The code outputs accuracy, whereas you need to plot
classification error here. You can simply transform accuracy to error by
using 1 − accuracy.

Error on development (validation) data

In the first part of the problem, we will use the error on a development set
to choose λ. Run the model with λ = {10−8, 10−7, 10−6, . . . , 10−1, 1, 10, 100}.

a. Plot the error on training data and development data as a function of log λ.

b. Plot the model size (number of nonzero coefficients) on development data
as a function of log λ.

c. Choose the λ that gives the lowest error on development data. Run it on
the test data and report the test error.
13 John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, Bollywood,
Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In
Proceedings of ACL, 2007.
14 Robert Tibshirani. Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society B, 58(1):267-288, 1996.

Briefly discuss all the results.

Model Complexity and Bias-Variance Tradeoff

d. Give a high-level explanation of the relation between λ and the bias and
variance of the parameter estimates β̂. Does larger λ correspond to higher or
lower bias? What about the variance? Does larger λ lead to a more complex or
a less complex model?

Resolving a tie

e. If there is more than one λ that minimizes the error on the development
data, which one will you pick? Explain your choice.

Random search

f. An alternative way to search for λ is by randomly sampling its value from
an interval.

i. Sample eleven random values log-uniformly from the interval [10−8, 100]
for λ and train a lasso logistic regression model. Plot the error on
development data as a function of log λ.
ii. Choose the λ that gives the lowest error on development data. Run it on
the test data and report the test error.

Random vs. grid search

g. Which one do you think is a better method for searching values to try for
λ? Why?
20. (The [sub-]gradient method:
x various cost / loss functions and
x various regularization functions / methods)
• CMU, 2015 spring, Alex Smola, HW8, pr. 1
2 Decision Trees

21. (Decision trees: analysing the relationship between
xxx the dataset size and model complexity)
•◦ CMU, 2012 fall, T. Mitchell, Z. Bar-Joseph, HW1, pr. 2.e
Here we will use a synthetic dataset generated by the following algorithm:
To generate an (x, y) pair, first, six binary values x1, . . . , x6 are
randomly generated, each independently with probability 0.5. This six-tuple
is our x. Then, to generate the corresponding y value:

    f(x) = x1 ∨ (¬x1 ∧ x2 ∧ x6)

    y = f(x) with probability θ, and 1 − f(x) otherwise.

So Y is a possibly corrupted version of f(X), where the parameter θ controls
the noisiness. (θ = 1 is noise-free. θ = 0.51 is very noisy.) Get code and
test data from . . . .
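The generator can be sketched in Python (a stand-in for the provided generatedata(N, theta); the Matlab function's exact randomization may differ):

```python
import numpy as np

def generate_data(N, theta, seed=None):
    """Draw N samples per the algorithm above: six fair binary features,
    y = f(x) with probability theta and the flipped value otherwise."""
    rng = np.random.default_rng(seed)
    X = rng.random((N, 6)) < 0.5                  # x1..x6, each Bernoulli(0.5)
    f = X[:, 0] | (~X[:, 0] & X[:, 1] & X[:, 5])  # f(x) = x1 or (not x1 and x2 and x6)
    keep = rng.random(N) < theta                  # keep f(x) with prob. theta
    y = np.where(keep, f, ~f)                     # otherwise flip it
    return X.astype(int), y.astype(int)
```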
We will experimentally investigate the relationships between model
complexity, training size, and classifier accuracy.
We provide a Matlab implementation of ID3, without pruning, but featuring a
maxdepth parameter: traintree(trainX, trainY, maxdepth). It returns an object
representing the classifier, which can be viewed with printtree(tree).
Classify new data via classifywithtree(tree, testX). We also provide the
simulation function to generate the synthetic data: generatedata(N, theta),
which you can use to create training data. Finally, there is a fixed test set
for all experiments (generated using θ = 0.9). See tt1.m for sample code to
get started. Include printouts of your code and graphs.

a. For a depth = 3 decision tree learner, learn classifiers for training sets
of size 10 and 100 (generated using θ = 0.9). At each size, report training
and test accuracies.

b. Let's track the learning curves for simple versus complex classifiers. For
maxdepth = 1 and maxdepth = 3, perform the following experiment:
For each training set size in {2¹, 2², . . . , 2¹⁰}, generate a training set,
fit a tree, and record the train and test accuracies. For each
(depth, trainsize) combination, average the results over 20 different
simulated training sets. Make three learning-curve plots, where the
horizontal axis is training size, and the vertical axis is accuracy. First,
plot the two testing accuracy curves, for each maxdepth setting, on the same
graph. For the second and third graphs, have one for each maxdepth setting,
and on each plot its training and testing accuracy curves. Place the graphs
side-by-side, with identical axis scales. It may be helpful to use a log
scale for data size.

Next, answer several questions with no more than three sentences each:

c. When is the simpler model better? When is the more complex model better?

d. When are train and test accuracies different? If you're experimenting in
the real world and find that train and test accuracies are substantially
different, what should you do?

e. For a particular maxdepth, why do train and test accuracies converge to
the same place? Comparing different maxdepths, why do test accuracies
converge to different places? Why does it take smaller or larger amounts of
data to do so?

f. For maxdepths 1 and 3, repeat the same vary-the-training-size experiment
with θ = 0.6 for the training data. Show the graphs. Compare to the previous
ones: what is the effect of noisier data?

Solution:

a.

b. Pink stars: depth = 3. Black x's: depth = 1. Blue circles: training
accuracy. Red squares: testing accuracy.

c. It is good to have high model complexity when there is lots of training
data. When there is little training data, the simpler model is better.

d. They're different when you're overfitting. If this is happening you have
two options: (1) decrease your model complexity, or (2) get more data.

e. (1) They converge when the algorithm is learning the best possible model
from the model class prescribed by maxdepth: this gets the same accuracy on
the training and test sets. (2) The higher-complexity (maxdepth = 3) model
class learns the underlying function better, and thus gets better accuracy.
But (3) the higher-complexity model class has more parameters to learn, and
thus takes more data to get to this point.

f. It's much harder for the complex model to do better. Also, it takes much
longer for all test curves to converge. (Train/test curves don't converge to
the same place because the noise levels are different.)
Note: Colors and styles are the same as for the previous plots. These plots
test all the way up to 2¹⁵ training examples: you can see where they converge
to, which is not completely clear with only 2¹⁰ examples.
22. (Decision trees: experiment
xxx with an ID3 implementation (in C))
•◦ CMU, 2012 spring, Roni Rosenfeld, HW3

This exercise gives you the opportunity to experiment with a decision tree
learning program. You are first asked to experiment with the simple
PlayTennis data described in Chapter 3 of Tom Mitchell's Machine Learning
book, and then to experiment with a considerably larger data set.

We provide most of the decision tree code. You will have to complete the
code, test, and prune a decision tree based on the ID3 algorithm described in
Chapter 3 of the textbook. You can obtain it as a gzipped archive from . . . .

To unzip and get started on a Linux/Mac machine, do the following:

1. Download the hw3.tgz file to your working directory.
2. Issue the command tar -zxvf hw3.tgz to unzip and untar the file. This will
create a subdirectory hw3 in the current directory.
3. Type make to compile and you are ready to go. The executable is called dt.
There is a help file named README.dt which contains instructions on how to
run the dt program.

If you work from a Windows machine, you can install Cygwin and that should
give you a Linux environment. Remember to install the Devel category to get
gcc. Depending on your machine some tweaks might be needed to make it work.

A. Play Tennis Data

The training data from Table 3.2 in the textbook is available in the file
tennis.ssv. Notice that it contains the fourteen training examples repeated
twice. For question A.2, please use it as given (with the 28 training
examples). For question A.4, you will need to extract the fourteen unique
training examples and use those in addition to the ones you invent.

A1. If you try running the code now it will not work, because the function
that calculates the entropy has not been implemented. (Remember that entropy
is required in turn to compute the information gain.) It is your job to
complete it. You will have to make your changes in the file entropy.c. After
you correctly implement the entropy calculation, the program will produce the
decision tree shown in Figure 3.1 of the textbook when run on tennis.ssv
(with all examples used for training).

Hint: When you implement the entropy function, be sure to deal with casts
from int to double correctly. Note that num_pos/num_total = 0 if num_pos and
num_total are both ints. You must write ((double)num_pos)/num_total to get
the desired result or, alternatively, define num_total as a double.
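For reference, the computation being asked for looks like this in Python, where the cast issue disappears because / is true division in Python 3 (a sketch of ours; the assignment's own version lives in entropy.c):

```python
import math

def entropy(num_pos, num_total):
    """Binary entropy (in bits) of a node with num_pos positive examples
    out of num_total; this is the quantity information gain is built from."""
    if num_total == 0:
        return 0.0
    p = num_pos / num_total          # true division, even for two ints
    if p in (0.0, 1.0):
        return 0.0                   # lim p->0 of p*log(p) is 0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
```

For example, the 9-positive / 5-negative root node of the PlayTennis data has entropy about 0.940 bits, the value computed in Chapter 3 of the textbook.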

A2. Try running the program a few times with half of the data set for
training and the other half for testing (no pruning). Print out your command
for running the program. Do you get a different result each time? Why? Report
the average accuracy of 10 runs on both training and test data (use the batch
option of dt). For this question please use tennis.ssv as given.

A3. If we add the following examples:
0 sunny hot high weak
1 sunny cool normal weak
0 sunny mild high weak
0 rain mild high strong
1 sunny mild normal strong

to the original tennis.ssv that has 28 examples, which attribute would be
selected at the root node? Compute the information gain for each attribute
and find out the max. Show how you calculate the gains.

A4. By now, you should be able to teach ID3 the concept represented by
Figure 3.1; we call this the correct concept. If an example is correctly labeled
by the correct concept, we call the example a correct example. For this question,
you will need to extract the fourteen unique examples from tennis.ssv.
In each of the questions below, you will add some of your own examples to the
original fourteen, and use all of them for training. (You will not use a testing
or pruning set here.) Turn in your datasets (named according to the Hand-in
section at the end).
a. Duplicate some of the fourteen training examples in order to get ID3 to
learn a different decision tree.
b. Add new correct samples to the original fourteen samples in order to get
ID3 to include the attribute temperature in the tree.

A5. Use the fourteen unique examples from tennis.ssv. Run ID3 once using
all fourteen for training, and note the structure (which is in Figure 3.1). Now,
try flipping the label (0 to 1, or 1 to 0) of any one example, and run ID3 again.
Note the structure of the new tree.
Now change the example's label back to correct. Do the same with another
four samples. (Flip one label, run ID3, and then flip the label back.) Give
some general observations about the structures (differences and similarities
with the original tree) ID3 learns on these slightly noisy datasets.

B. Agaricus-Lepiota Data Set


The file mushroom.ssv contains records drawn from The Audubon Society
Field Guide to North American Mushrooms (1981), G. H. Lincoff (Pres.), New
York: Alfred A. Knopf, posted [LC: it on the] UCI Machine Learning
Repository. Each record contains information about a mushroom and whether or
not it is poisonous.

B1. First we will look at how the quality of the learned hypothesis varies
with the size of the training set. Run the program with training sets of size
10%, 30%, 50%, 70%, and 90%, using 10% to test each time. Please run a
particular size at least 10 times. You may want to use the batch mode option
provided by dt.
Construct a graph with training set size on the x-axis and test set accuracy
on the y-axis. Remember to place error bars on each point extending one
standard deviation above and below the point. You can do this in Matlab,
Mathematica, GNUplot or by hand.
If you use gnuplot:
1. Create a file data.txt with your results, where each line has
<training size> <accuracy> <standard deviation>.
2. Type gnuplot to get to the gnuplot command prompt. At the prompt,
type set terminal postscript, followed by set output "graph.ps", and finally
plot "data.txt" with errorbars to plot the graph.
B2. Now repeat the experiment, but with a noisy dataset, noisy10.ssv, in
which each label has been flipped with a chance of 10%. Run the program
with training sizes from 30% to 70% at steps of 5% (9 sizes in total), using
10% at each step to test and at least 10 trials for each size. Plot the graph
of test accuracy and compare it with the one from B.1. In addition, plot
the number of nodes in the resulting trees against the training %. Note that
the training accuracy decreases slightly after a certain point. You may also
observe dips in the test accuracy. What could be causing this?

B3. One way to battle these phenomena is with pruning. For this question,
you will need to complete the implementation of pruning that has been
provided. As it stands, the pruning function considers only the root of the tree
and does not recursively descend to the sub-trees. You will have to fix this by
implementing the recursive call in PruneDecisionTree() (in prune-dt.c). Recall
that pruning traverses the tree removing nodes which do not help classification
over the validation set. Note that pruning a node entails removing the
sub-tree(s) below a node, and not the node itself.
In order to implement the recursive call, you will need to familiarize yourself
with the tree's representation in C, in particular, how to get at the children of
a node. Look at dt.h for details. A decision you will make is when to prune
a sub-tree: before or after pruning the current node. Bottom-up pruning
is when you prune the subtree of a node before considering the node as a
pruning candidate. Top-down pruning is when you first consider the node
as a pruning candidate and only prune the subtree should you decide not to
eliminate the node. Please do NOT mix the two up. If you are in doubt,
consult the book.
Write out on paper the code that you would need to add for both bottom-up
and top-down pruning. Implement only the bottom-up code and repeat the
experiments in B.2 using 20% of the data for pruning at each trial. Plot the
graph of test accuracy and number of nodes and comment on the differences
from B.2.

B4. Answer the following questions with explanation.

a. Which pruning strategy would make more calls: top-down or bottom-up?
Or is it data-dependent?
b. Which pruning strategy would result in better accuracy on the pruning
set: top-down or bottom-up? Or is it data-dependent?
c. Which pruning strategy would result in better accuracy on the testing set:
top-down or bottom-up? Or is it data-dependent?

Hand-In Instructions

Besides your assignment write-up, here are the additional materials you will
need to hand in. Your write-up should include the graphs asked for.
A.1. Hand in your modified entropy.c.
A.2. Nothing to hand in.
A.3. Nothing to hand in.
A.4. For part a, hand in tennis1.4.a.ssv, which should contain the original
data plus the samples you created appended at the end of the file. Likewise
for b.
B.1. Nothing to hand in.
B.2. Nothing to hand in.
B.3. Hand in your modified prune-dt.c.
B.4. Nothing to hand in.

Hints:

− If you are unsure about your answer, play with the code to see if you can
experimentally verify your intuitions.

− It is very helpful to include explanations with your examples, or at least
mention how you constructed the example, what was the reasoning behind
your choices, etc.

− Please label the axes and specify what accuracy/performance metric you
are measuring and on what dataset: e.g. training, testing, validation, noisy10,
etc.

Solution:

A1. The Entropy function should be like:

double Entropy( int num_pos, int num_neg )
{
    if (num_pos == 0 || num_neg == 0)
        return 0.0;
    double total = (double) (num_pos + num_neg);
    double entropy = - (num_pos / total) * LogBase2( num_pos / total )
                     - (num_neg / total) * LogBase2( num_neg / total );
    return entropy;
}
A2. The command is: ./dt 0.5 0 0.5 tennis.ssv
The results are different, because the code randomly splits the data, and each
time a different training set is used.
The training accuracy is always 100%. The testing accuracy should be around
82%.

A3. The data set now has 13 negatives and 20 positives. So the overall entropy
is: 0.967.
Using outlook, the information gain is: 0.218.
Using temperature, the information gain is: 0.047.
Using humidity, the information gain is: 0.221.
Using wind, the information gain is: 0.025.
Therefore, humidity should be selected.

A4.
a. Examples given in question A.3 are actually duplicates that changed the
tree.
b. The idea is to let temperature determine Play-Tennis. For example, we
can add the following:

1 sunny cool normal weak
1 overcast cool normal weak
1 rain cool normal weak
1 sunny cool normal weak
0 rain hot normal strong
0 sunny hot high weak
0 sunny hot high weak
0 rain hot normal strong

A5. Generally, noisy data sets produce bigger trees. However the rules
implied by these trees are quite stable. Some trees may have the same top
structure as the true structure. These overall similarities to the true structure
give some intuition for why pruning helps; pruning can cut away the
extra subtrees which model small effects which might be from noise.

B1. I ran each size 20 times, and got a graph like this:

B2.
This decrease in testing accuracy with larger training sets may be caused by
a form of overfitting; that is, the algorithm tries to perfectly match the data
in the training set, including the noise, and as a result the complexity of
the learned tree increases very rapidly as the number of training examples
increases.

Note that this is not the usual sense of overfitting, since typically overfitting is
more of a problem when the number of training examples is small. However,
here we also have the problem that the complexity of the hypothesis space
is an increasing function of the number of training examples. See how the
number of nodes grows.

There are also dips in the accuracy on the test set, a point where the
accuracy decreased before increasing again. This is because of more complex
concepts; there are always two competing forces here: the information content
of the training data, which increases with the number of training examples
and pushes toward higher accuracies, and the complexity of the hypothesis
space, which gets worse as the number of training examples increases.

You may also notice that the training accuracy slightly decreases as the size
of the training set grows. This seems to be purely due to the noisy labels,
which make it impossible to construct a consistent tree, and the more pairs
of examples you have in the training set that have contradicting labels, the
worse will be the training error.

B3.
For bottom-up pruning, add to the beginning of the function:

/*******************************************************************
  You could insert the recursive call BEFORE you check the node
 *******************************************************************/
for (i = 0 ; i < node->num_children ; i++)
    PruneDecisionTree(root, node->children[i], data, num_data, pruning_set, num_prune, ssvinfo);

For top-down pruning, add to the end of the function:

/*******************************************************************
  Or you could do the recursive call AFTER you check the node
  (given that you decided to keep it)
 *******************************************************************/
for (i = 0 ; i < node->num_children ; i++)
    PruneDecisionTree(root, node->children[i], data, num_data, pruning_set, num_prune, ssvinfo);

By running each size 20 times I got a graph like this:

B4.
a. Bottom-up. Bottom-up pruning examines all the nodes. Top-down pruning
may eliminate a subtree without examining the nodes in the subtree, leading
to fewer calls than bottom-up.
b. Bottom-up. By the property of the algorithm, bottom-up pruning returns
the tree with the LOWEST POSSIBLE ERROR over the pruning set. Since
top-down can aggressively eliminate subtrees without considering each of the
nodes in the subtree, it could return a non-optimal tree (over the pruning
set, that is). Keep in mind that the function used to decide whether a node
should be removed or not is the same for both BU and TD and only the search
strategy differs.
c. Data-dependent. If the test set is very different from the training set, a
shorter tree yielded by top-down pruning may perform better, because of its
potentially better generalization power.

23. (ID3 with continuous attributes:
xxx experiment with a Matlab implementation
xxx on the Breast Cancer dataset)
•◦ CMU, 2011 fall, T. Mitchell, A. Singh, HW1, pr. 2
One very interesting application area of machine learning is in making medical
diagnoses. In this problem you will train and test a binary decision tree to
detect breast cancer using real world data. You may use any programming
language you like.
The Dataset

We will use the Wisconsin Diagnostic Breast Cancer (WDBC) dataset.15 The
dataset consists of 569 samples of biopsied tissue. The tissue for each sample
is imaged and 10 characteristics of the nuclei of cells present in each image
are characterized. These characteristics are:

(a) Radius
(b) Texture
(c) Perimeter
(d) Area
(e) Smoothness
(f) Compactness
(g) Concavity
(h) Number of concave portions of contour
(i) Symmetry
(j) Fractal dimension

Each of the 569 samples used in the dataset consists of a feature vector of
length 30. The first 10 entries in this feature vector are the mean of the
characteristics listed above for each image. The second 10 are the standard
deviation and the last 10 are the largest value of each of these characteristics
present in each image.
Each sample is also associated with a label. A label of value 1 indicates the
sample was for malignant (cancerous) tissue. A label of value 0 indicates the
sample was for benign tissue.
This dataset has already been broken up into training, validation and test sets
for you and is available in the compressed archive for this problem on the class
website. The names of the files are trainX.csv, trainY.csv, validationX.csv,
validationY.csv, testX.csv and testY.csv. The file names ending in X.csv
contain feature vectors and those ending in Y.csv contain labels. Each file is
in comma separated value format where each row represents a sample.

A. Programming

A1. Learning a binary decision tree

As discussed in class and the reading material, to learn a binary decision tree
we must determine which feature attribute to select as well as the threshold
value to use in the split criterion for each non-leaf node in the tree. This
can be done in a recursive manner, where we first find the optimal split for
the root node using all of the training data available to us. We then split
the training data according to the criterion selected for the root node, which
will leave us with two subsets of the original training data. We then find
the optimal split for each of these subsets of data, which gives the criterion
for splitting on the second-level children nodes. We recursively continue this
process until the subsets of training data we are left with at a set of children
nodes are pure (i.e., they contain only training examples of one class) or the
feature vectors associated with a node are all identical (in which case we
cannot split them) but their labels are different.
In this problem, you will implement an algorithm to learn the structure of a
tree. The optimal splits at each node should be found using the information
gain criterion discussed in class.
While you are free to write your algorithm in any language you choose, if
you use the provided Matlab code included in the compressed archive for
this problem on the class website, you only need to complete one function,
computeOptimalSplit.m. This function is currently empty and only contains
comments describing how it should work. Please complete this function so
that given any set of training data it finds the optimal split according to the
information gain criterion.
Include a printout of your completed computeOptimalSplit.m along with any
other functions you needed to write with your homework submission. If you
choose to not use the provided Matlab code, please include a printout of all
the code you wrote to train a binary decision tree according to the description
given above.

15 Original dataset available at http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).

Note: While there are multiple ways to design a decision tree, in this problem
we constrain ourselves to those which simply pick one feature attribute to split
on. Further, we restrict ourselves to performing only binary splits. In other
words, each split should simply determine if the value of a particular attribute
in the feature vector of a sample is less than or equal to a threshold value or
greater than the threshold value.

Note: Please note that the feature attributes in the provided dataset are
continuously valued. There are two things to keep in mind with this.
First, this is slightly different than working with feature values which are
discrete because it is no longer possible to try splitting at every possible
feature value (since there is an infinite number of possible feature values).
One way of dealing with this is by recognizing that given a set of training data
of N points, there are only N − 1 places we could place splits for the data (if
we constrain ourselves to binary splits). Thus, the approach you should take
in this function is to sort the training data by feature value and then test split
values that are the means of adjacent ordered training points. For example, if
the points to split between were 1, 2, 3, you would test two split values: 1.5 and 2.5.
Second, when working with feature values that can only take on one
of two values, once we split using one feature attribute, there is no point in
trying to split on that feature attribute later. (Can you think of why this
would be?) However, when working with continuously valued data, this is no
longer the case, so your splitting algorithm should consider splitting on all
feature attributes at every split.

A2. Pruning a binary decision tree

The method of learning the structure and splitting criterion for a binary
decision tree described above terminates when the training examples associated
with a node are all of the same class or there are no more possible splits.
In general, this will lead to overfitting. As discussed in class, pruning is one
method of using validation data to avoid overfitting.

You will implement an algorithm that uses validation data to greedily prune
a binary decision tree in an iterative manner. Specifically, the algorithm
that we will implement will start with a binary decision tree and perform an
exhaustive search for the single node for which removing it (and its children)
produces the largest increase (or smallest decrease) in classification accuracy
as measured using validation data. Once this node is identified, it and its
children are removed from the tree, producing a new tree. This process is
repeated, where we iteratively prune one node at a time until we are left with
a tree which consists only of the root node.16

Implement a function which starts with a tree and selects the single best
node to remove in order to produce the greatest increase (or smallest
decrease) in classification accuracy as measured with validation data. If you
are using Matlab, this means you only need to complete the empty function
pruneSingleGreedyNode.m. Please see the comments in that function for
details on what you should implement. We suggest that you make use of the
provided Matlab functions pruneAllNodes.m, which will return a listing of all
possible trees that can be formed by removing a single node from a base tree,
and batchClassifyWithDT.m, which will classify a set of samples given a decision
tree.

Please include your version of pruneSingleGreedyNode.m along with any other
functions you needed to write with your homework. If not using Matlab,
please attach the code for a function which performs the same function
described for pruneSingleGreedyNode.m.

B. Data analysis

B1. Training a binary decision tree

In this section, we will make use of the code that we have written above.
We will start by training a basic decision tree. Please use the training data
provided to train a decision tree. (In Matlab, assuming you have completed
the computeOptimalSplit.m function, the function trainDT.m can do this training
for you.)

Please specify the total number of nodes and the total number of leaf nodes
in the tree. (In Matlab, the function gatherTreeStats.m will be useful.) Also,
please report the classification accuracy (percent correct) of the learned
decision tree on the provided training and testing data. (In Matlab, the function
batchClassifyWithDT.m will be useful.)

B2. Pruning a binary decision tree

16 In practice, you can often simply continue the pruning process until the validation error fails to increase by
a predefined amount. However, for illustration purposes, we will continue until there is only one node left in the
tree.

Now we will make use of the pruning code we have written. Please start
with the tree that was just trained in the previous part of the problem and
make use of the validation data to iteratively remove nodes in the greedy
manner described in the section above. Please continue iterations until a
degenerate tree with only a single root node remains. For each tree that
is produced, please calculate the classification accuracy for that tree on the
training, validation and testing datasets.
After collecting this data, please plot a line graph relating classification
accuracy on the test set to the number of leaf nodes in each tree (so the number of
leaf nodes should be on the X-axis and classification accuracy should be on
the Y-axis). Please add to this same figure similar plots for percent accuracy
on training and validation data. The number of leaf nodes should range from
1 (for the degenerate tree) to the number present in the unpruned tree.
The Y-axis should be scaled between 0 and 1.
Please comment on what you notice and how this illustrates overfitting.
Include the produced figure and any code you needed to write to produce the
figure and calculate intermediate results with your homework submission.

B3. Drawing a binary decision tree

One of the benefits of decision trees is that the classification scheme they encode
is easily understood by humans. Please select the binary decision tree from
the pruning analysis above that produced the highest accuracy on the
validation dataset and diagram it. (In the event that two trees have the same
accuracy on validation data, select the tree with the smaller number of leaf
nodes.) When stating the feature attributes that are used in splits, please
use the attribute names (instead of the index) listed in the dataset section of this
problem. (If using the provided Matlab code, the function trainDT has a
section of comments which describes how you can interpret the structure used
to represent a decision tree in the code.)

Hint: The best decision tree as measured on validation data for this problem
should not be too complicated, so if drawing this tree seems like a lot of work,
then something may be wrong.

B4. An alternative splitting method

While information gain is one criterion to use when estimating the optimal
split, it is by no means the only one. Consider instead using a criterion where
we try to minimize the weighted misclassification rate.

Formally, assume a set of D data samples {⟨x^(i), y^(i)⟩}_{i=1}^{D}, where
y^(i) is the label of sample i, and x^(i) is the feature vector for sample i.
Let x(j)^(i) refer to the value of the j-th attribute of the feature vector for
data point i.
Now, to pick a split criterion, we pick a feature attribute, a, and a threshold
value, t, to use in the split. Let:

p_below(a, t) = (1/D) Σ_{i=1}^{D} I( x(a)^(i) ≤ t )

p_above(a, t) = (1/D) Σ_{i=1}^{D} I( x(a)^(i) > t )

and let:

l_below(a, t) = Mode( { y^(i) }_{i : x(a)^(i) ≤ t} )

l_above(a, t) = Mode( { y^(i) }_{i : x(a)^(i) > t} )

The split that minimizes the weighted misclassification rate is then the one
which minimizes:

O(a, t) = p_below(a, t) · Σ_{i : x(a)^(i) ≤ t} I( y^(i) ≠ l_below(a, t) )
        + p_above(a, t) · Σ_{i : x(a)^(i) > t} I( y^(i) ≠ l_above(a, t) )

Please modify the code for your computeOptimalSplit.m (or equivalent function
if not using Matlab) to perform splits according to this criterion. Attach the
code of your modified function when submitting your homework.

After modifying computeOptimalSplit.m, please retrain a decision tree (without
doing any pruning). In your homework submission, please indicate the total
number of nodes and total number of leaf nodes in this tree. How does this
compare with the tree that was trained using the information gain criterion?

Erratum: It is important to note that there is an error in question B.4, the
alternative splitting method. The question stated that if you minimized the
equation for O(a, t) with respect to a and t, you would find the optimal split for
the misclassification rate criterion. However, this function was missing something
important. The terms summing the number of samples misclassified above
and below the split point should have been normalized. Specifically, the term
summing the number of samples misclassified above the split should have been
divided by the total number of samples above the split, and the term summing
the number of samples misclassified below the split should have been divided
by the total number of samples below the split.

Solution:

B1. There are 29 total nodes and 15 leaves in the unpruned tree. The training
accuracy is 100% and the test accuracy is 92.98%.

B2. The correct plot is shown below.

Overfitting is evident: as the number of leaves in the decision tree grows,
performance on the training set of data increases. However, after a certain point,
adding more leaf nodes (after 5 in this case) detrimentally affects performance
on test data as the more complicated decision boundaries that are formed
essentially reflect noise in the training data.

B3. The correct diagram is shown below.

B4. The new tree has 16 leaves and 31 nodes. The new tree has 1 more leaf
and 2 more nodes than the original tree.

24. (AdaBoost: application on a synthetic dataset in R10 )
•· CMU, ? spring (10-701), HW3, pr. 3

25. (AdaBoost: application on a synthetic dataset in R2 )
•· CMU, 2016 spring, W. Cohen, N. Balcan, HW4, pr. 3.5
26. (AdaBoost: application on Bupa Liver Disorder dataset)
• CMU, 2007 spring, Carlos Guestrin, HW2, pr. 2.3
Implement the AdaBoost algorithm using a decision stump as the weak
classifier.
AdaBoost trains a sequence of classifiers. Each classifier is trained on the
same set of training data (x_i, y_i), i = 1, ..., m, but with the significance D_t(i)
of each example {x_i, y_i} weighted differently. At each iteration, a
classifier, h_t(x) → {−1, 1}, is trained to minimize the weighted classification
error, Σ_{i=1}^{m} D_t(i) · I(h_t(x_i) ≠ y_i), where I is the indicator function (0 if the
predicted and actual labels match, and 1 otherwise). The overall prediction
of the AdaBoost algorithm is a linear combination of these classifiers,
H_T(x) = sign(Σ_{t=1}^{T} α_t h_t(x)).

A decision stump is a decision tree with a single node. It corresponds to a
single threshold in one of the features, and predicts the class for examples falling
above and below the threshold respectively, h_t(x) = C_1 I(x_j ≥ c) + C_2 I(x_j < c),
where x_j is the j-th component of the feature vector x. For this algorithm, split
the data based on the weighted classification accuracy described above, and
find the class assignments C_1, C_2 ∈ {−1, 1}, threshold c, and feature choice j
that maximize this accuracy.

a. Evaluate your AdaBoost implementation on the Bupa Liver Disorder
dataset that is available for download from the ... website. The classification
problem is to predict whether an individual has a liver disorder (indicated
by the selector feature) based on the results of a number of blood tests and
levels of alcohol consumption. Use 90% of the dataset for training and 10% for
testing. Average your results over 50 random splits of the data into training
sets and test sets. Limit the number of boosting iterations to 100. In a single
plot show:

• average training error after each boosting iteration

• average test error after each boosting iteration

b. Using all of the data for training, display the selected feature component
j, threshold c, and class label C_1 of the decision stump h_t(x) used in each of
the first 10 boosting iterations (t = 1, 2, ..., 10).

c. Using all of the data for training, in a single plot, show the empirical
cumulative distribution functions of the margins y_i f_T(x_i) after 10, 50 and 100
iterations respectively, where f_T(x) = Σ_{t=1}^{T} α_t h_t(x). Notice that in this problem,
before calculating f_T(x), you should normalize the α_t's so that Σ_{t=1}^{T} α_t = 1.
This is to ensure that the margins are between −1 and 1.
Hint: The empirical cumulative distribution function of a random variable X
at x is the proportion of times X ≤ x.

27. (AdaBoost (basic, and randomized decision stumps versions):
xxx application to high energy physics)

•◦ Stanford, 2016 fall, A. Ng, J. Duchi, HW2, pr. 6.d

In this problem, we apply [two versions of the] AdaBoost algorithm to
detect particle emissions in a high-energy particle accelerator. In high energy
physics, such as at the Large Hadron Collider (LHC), one accelerates small
particles to relativistic speeds and smashes them into one another, tracking
the emitted particles. The goal is to detect the emission of certain interesting
particles based on other observed particles and energies.17 Here we explore
the application of boosting to a high energy physics problem, where we use
decision stumps applied to 18 low- and high-level physics-based features. All
data for the problem is available at ....

You will implement AdaBoost using decision stumps and run it on data
developed from a physics-based simulation of a high-energy particle accelerator.

We provide two datasets, boosting-train.csv and boosting-test.csv, which
consist of training data and test data for a binary classification problem. The
files are comma-separated files, the first column of which consists of binary
±1 labels y^(i); the remaining 18 columns are the raw attributes (low- and
high-level physics-based features).

The Matlab file load_data.m, which we provide, loads the datasets into
memory, storing training data and labels in appropriate vectors and matrices, and
then performs boosting using your implemented code, and plots the results.

a. Implement a method that finds the optimal thresholded decision stump
for a training set {x^(i), y^(i)}_{i=1}^{m} and distribution p ∈ R_+^m on the training set. In
particular, fill out the code in the method find_best_threshold.m.

b. Implement boosted decision stumps by filling out the code in the method
stump_booster.m. Your code should implement the weight updating at each
iteration t = 1, 2, ... to find the optimal value θ_t given the feature index and
threshold.

c. Implement random boosting, where at each step the choice of decision stump
is made completely randomly. In particular, at iteration t random boosting
chooses a random index j ∈ {1, 2, ..., n}, then chooses a random threshold
s from among the data values {x_j^(i)}_{i=1}^{m}, and then chooses the t-th weight θ_t
optimally for this (random) classifier φ_{s,+}(x) = sign(x_j − s). Implement this by
filling out the code in random_booster.m.
d. Run the method load_data.m with your implemented boosting methods.
Include the plots this method displays, which show the training and test error
for boosting at each iteration t = 1, 2, .... Which method is better?

[A few notes: we do not expect boosting to get classification accuracy better
than approximately 80% for this problem.]

Solution:
17 For more information, see the following paper: Baldi, Sadowski, Whiteson. Searching for Exotic
Particles in High-Energy Physics with Deep Learning. Nature Communications 5, Article 4308.
http://arxiv.org/abs/1402.4735.

Random decision stumps require about 200 iterations to get to error .22 or
so, while regular boosting (with greedy decision stumps) requires about 15
iterations to get this error. See figure below.

[Caption:] Boosting error for random selection of decision stumps and the
greedy selection made by boosting.

73
28. (AdaBoost with logisti loss,
xxx applied on a breast an er dataset)
•◦ MIT, 2003 fall, Tommi Jaakkola, HW4, pr. 2.4-5
a. We have provided you with most of [MatLab ode for℄ the boosting
algorithm with the logisti loss and de ision stumps. The available om-
ponents are build_stump.m, eval_boost.m, eval_stump.m, and the skeleton of
boost_logisti .m. The skeleton in ludes a bi-se tion sear h of the optimizing
α but is missing the pie e of ode that updates the weights. Please ll in the
appropriate weight update.
model = boost logisti (X,y,10); returns a ell array of 10 stumps. The ro-
utine eval_boost(model,X) evaluates the ombined dis riminant fun tion or-
responding to any su h array.

b. We have provided a dataset pertaining to cancer classification (see cancer.txt
for details). You can get the data by data = loaddata; which gives you
training examples data.xtrain and labels data.ytrain. The test examples are in
data.xtest and data.ytest. Run the boosting algorithm with the logistic loss
for 50 iterations and plot the training and test errors as a function of the
number of iterations. Interpret the resulting plot.
Note that since the boosting algorithm returns a cell array of component
stumps, stored for example in model, you can easily evaluate the predictions
based on any smaller number of iterations by selecting a part of this array as
in model{1:10}.

Solution:

Plot of number of misclassified test cases (out of 483 cases) vs. number of
boosting iterations.

29. (AdaBoost with confidence-rated decision stumps;
xxx application to handwritten digit recognition;
xxx analysis of the evolution of voting margins)

•◦ MIT, 2001 fall, Tommi Jaakkola, HW3, pr. 1.4

Let's explore how AdaBoost behaves in practice. We have provided you with
MatLab code that finds and evaluates (confidence-rated) decision stumps.18
These are the hypotheses that our boosting algorithm assumes we can
generate. The relevant Matlab files are boost_digit.m, boost.m, eval_boost.m,
find_stump.m, eval_stump.m. You'll only have to make minor modifications
to boost.m and, a bit later, to eval_boost.m and boost_digit.m to make these
work.

a. Complete the weight update in boost.m and run boost_digit to plot the
training and test errors for the combined classifier as well as the corresponding
training error of the decision stump, as a function of the number of iterations.
Are the errors what you would expect them to be? Why or why not?

We will now investigate the classification margins of training examples. Recall
that the classification margin of a training point in the boosting context
reflects the confidence with which the point was classified correctly. You can
view the margin of a training example as the difference between the weighted
fraction of votes assigned to the correct label and those assigned to the
incorrect one. Note that this is not a geometric notion of margin but one based
on votes. The margin will be positive for correctly classified training points
and negative for others.

b. Modify eval_boost.m so that it returns normalized predictions (normalized
by the total number of votes). The resulting predictions should be in the range
[−1, 1]. Fill in the missing computation of the training set margins in
boost_digit.m (that is, the classification margins for each of the training points).
You should also uncomment the plotting script for cumulative margin distributions
(what is plotted is, for each −1 < r < 1 on the horizontal axis, what fraction
of the training points have a margin of at least r). Explain the differences
between the cumulative distributions after 4 and 16 boosting iterations.

Solution:

18 LC: For some theoretical properties of confidence-rated [weak] classifiers [when used] in connection with
AdaBoost, see MIT, 2001 fall, Tommi Jaakkola, HW3, pr. 2.1-3.

a.

As can be seen in the figure, the training and test errors decrease as we
perform more boosting iterations. Eventually the training error reaches zero,
but we do not overfit, and the test error remains low (though higher than the
training error). However, no single stump can predict the training set well,
and especially since we continue to emphasize difficult parts of the training
set, the error of each particular stump remains high, and does not drop below
about 1/3.

b.

The key difference between the cumulative distributions after 4 and 16
boosting iterations is that the additional iterations seem to push the left
(low-end) tail of the cumulative distribution to the right. To understand the effect,
note that the examples that are difficult to classify have poor or negative
classification margins and therefore define the low-end tail of the cumulative
distribution. Additional boosting iterations concentrate on the difficult
examples and ensure that their margins will improve. As the margins improve, the
left tail of the cumulative distribution moves to the right, as we see in the
figure.

30. (AdaBoost with logistic loss:
xxx studying the evolution of voting margins
xxx as a function of boosting iterations)
•◦ MIT, 2009 fall, Tommi Jaakkola, HW3, pr. 2.4
We have provided you with MatLab code that you can run to test how AdaBoost
works.
mod = boost(X,y,ncomp) generates an ensemble (a cell array of ncomp base
learners) based on training examples X and labels y.
load data.mat gives you X and y for a simple classification task. You can
then generate the ensemble with any number of components (e.g., 50). The
cell array mod simply lists the base learners in the order in which they were
found. You can therefore plot the ensemble corresponding to the first i
base learners by plot_decision(mod(1:i),X,y), or individual base learners via
plot_decision({mod{i}},X,y).
plot_voting_margin(mod,X,y,th) helps you study how the voting margins change
as a function of boosting iterations. For example, the plot with th = 0 gives
the fraction of correctly classified training points (voting margin > 0) as a
function of boosting iterations. You can also plot the curves for multiple
thresholds at once as in plot_voting_margin(mod,X,y,[0,0.05,0.1,0.5]). Explain
why some of these tend to increase while others decrease as a function
of boosting iterations. Why does the curve corresponding to th = 0.05
continue to increase even after all the points are correctly classified?

Solution:

Let hm(x) = Σ_{i=1}^m αi hi(x) denote the ensemble classifier after m boosting
iterations, and let ĥm(x) = hm(x) / Σ_{i=1}^m αi be its normalized version.
Let f(τ, m) denote the fraction of training examples (xt, yt) with voting margin
yt ĥm(xt) = yt hm(xt) / Σ_{i=1}^m αi > τ. From our plot, we notice that f(τ, m)
is increasing with m (quite roughly and not at all monotonically) for small
values of τ, like τ = 0, 0.05, 0.1, but decreasing for large values of τ, like
τ = 0.5. (The threshold at which the transition occurs seems to be somewhere
in the interval 0.105 < τ < 0.115.)

To explain this, consider the boosting loss function, Jm = Σ_{t=1}^n L(yt hm(xt)),
which is decreasing in the voting margins yt ĥm(xt). To minimize Jm, AdaBoost
will try to make all the voting margins yt ĥm(xt) as positive as possible. As m
increases, Σ_{i=1}^m αi only grows, so a negative voting margin yt ĥm(xt) < 0 only
becomes more costly. So, after a sufficient number of iterations, we know that
boosting will be able to classify all points correctly, and all points will have
positive voting margins. So, f(0, m) roughly increases from 0.5 to 1, and stays
at 1 once m is sufficiently large.
As m increases even more, we should expect that the minimum voting margin
min_t yt ĥm(xt) continues to increase. This is because there is little incentive to
make the larger yt ĥm(xt) any more positive; it is more effective to make the
smaller yt ĥm(xt) more positive. Using an argument similar to the one from
part a of this problem (MIT, 2009 fall, T. Jaakkola, HW3, pr. 2.1), we can
show that the examples which are barely correct have larger weight (Wm(t))
than the examples which are clearly correct, since the derivative of L is larger
in magnitude near 0.

However, our decision stumps are fairly weak classifiers. If we want to perform
better on some subset of points (namely, the ones with smaller margin),
we must compromise on the rest (namely, the ones with larger margin). Thus,
what we get is that the minimum voting margin (which costs more) will become
larger at the expense of the maximum voting margin (which costs less).
Similarly, f(τ, m) for a small threshold τ will increase at the expense of f(τ, m)
for a large τ.

A visual way to see this is to consider a graph of the rescaled loss function
L((Σ_{i=1}^m αi) τ) vs. the voting margin τ = yt ĥm(xt). As the number of boosting
iterations increases, the graph is compressed along the horizontal axis
(although increasingly slowly). So to make Jm smaller, we must basically shift the
entire distribution of voting margins to the right as much as possible (though
we can only do so increasingly slowly). In doing this, we are forced to
compromise some of the points farthest to the right, moving them inward.
Thus, with more iterations, the distribution of margins narrows. Here, f(τ, m)
can be related to the cumulative density of the empirical distribution of voting
margins. So, f(τ, m) = P(margin > τ) for a small τ will increase, while
1 − f(τ, m) = P(margin < τ) for a large τ will also increase (or at least be
non-decreasing).
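The normalized voting margin used throughout this argument is easy to compute directly; a toy illustration with hypothetical stump votes and weights:

```python
def voting_margin(y, votes, alphas):
    """Normalized voting margin y * h_m(x) / sum_i(alpha_i), always in [-1, 1]."""
    return y * sum(a * h for a, h in zip(alphas, votes)) / sum(alphas)

# Three hypothetical stumps voting +1, +1, -1 on a point with true label +1:
m = voting_margin(+1, [+1, +1, -1], [0.5, 0.3, 0.2])  # (0.5 + 0.3 - 0.2) / 1.0
```

Here two of three weighted votes go to the correct label, so the margin is positive (0.6) but well below the maximum of 1.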

[Caption:] Voting Margins, 10-20 iterations.

[Caption:] Voting Margins, 40-80 iterations.

[Caption:] Voting Margins, 200-400 iterations.

[Caption:] Empirical Distributions of Voting Margins.

[Caption:] Decision boundaries, 10-20 iterations.

[Caption:] Decision boundaries, 40-80 iterations.

[Caption:] Decision boundaries, 200-400 iterations.

[Caption:] Decision boundaries.

31. (Linear regression vs.
xxx AdaBoost using [weighted] linear weak classifiers
xxx and mean square error as loss function;
xxx cross-validation;
xxx application on the Wine dataset)
• ◦ CMU, 2009 spring, Ziv Bar-Joseph, HW3, pr. 1
In this problem, you are going to compare the performance of a basic linear
classifier and its boosted version on the Wine dataset (available on our
website). The dataset, given in the file wine.mat, contains the results of a chemical
analysis of wines grown in the same region in Italy but derived from three
different cultivars. The analysis determined the quantities of 13 constituents
found in each of the three types of wines. Note that when you are doing
cross-validation, you want to ensure that across all the folds the proportion of
examples from each class is roughly the same.

a. Implement a basic linear classifier using linear regression. All data points
are equally weighted.
A linear classifier is defined as:

f(x; β) = sign(β⊤ · x) = −1 if β⊤ · x < 0; 1 if β⊤ · x ≥ 0.

Your algorithm should minimize the classification error defined as:

err(f) = Σ_{i=1}^n (yi − f(xi))² / (4n)

Note: The first step for data preprocessing is to augment the data. In MatLab,
this can be done as:

X_new = [ones(size(X,1), 1) X];

Hint: You may want to use the MatLab function fminsearch to get the optimal
solution for β.
Handin: Please turn in a MatLab source file linear_learn.m which takes in
two inputs, data matrix x and label y, and returns a linear model. You may
have additional functions/files if you want.
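The exercise asks for MATLAB's fminsearch; since err is piecewise constant in β, any derivative-free search works. A Python sketch with a crude random search standing in for fminsearch (the toy data and search settings are illustrative, not part of the exercise):

```python
import random

def sign(z):
    return 1.0 if z >= 0 else -1.0

def err(beta, X, y):
    """Classification error sum_i (y_i - f(x_i))^2 / (4n): each mistake costs 1/n."""
    n = len(X)
    return sum((yi - sign(sum(b * x for b, x in zip(beta, xi)))) ** 2
               for xi, yi in zip(X, y)) / (4.0 * n)

def linear_learn(X, y, iters=2000, seed=0):
    """Derivative-free random search standing in for MATLAB's fminsearch."""
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    best = err(beta, X, y)
    for _ in range(iters):
        cand = [b + rng.gauss(0.0, 1.0) for b in beta]
        e = err(cand, X, y)
        if e < best:
            beta, best = cand, e
    return beta, best

# Toy augmented data [1, x], labeled by the sign of x - 1.5 (hypothetical):
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [-1.0, -1.0, 1.0, 1.0]
beta, e = linear_learn(X, y)
```

Because the error counts mistakes (each wrong point contributes (±2)²/(4n) = 1/n), the objective has flat plateaus, which is exactly why a simplex or random search is preferred here over gradient methods.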

b. Do 10-fold cross-validation for the linear classifier. Report the average
training and test errors for all the folds.
Handin: Please turn in a MatLab source file cv.m.

c. Modify your algorithm in linear_learn.m to accommodate weighted samples.
Given the weights w for the sample data X, what is the classification error?
You may want to refer to part a. Please implement the weighted version of
the learning algorithm for the linear classifier.
Note: originally the unweighted version could be viewed as one with equal
weights 1/n.
Handin: Please turn in a MatLab source file linear_learn.m which takes in
three inputs, data matrix x, label y and weights w, and returns a linear model.
You may have additional functions/files if you want. Note that your code
should have backward compatibility: it behaves like the unweighted version if w
is not given.

d. Implement AdaBoost for the linear classifier using the re-weighting and
re-training idea. Refer to the lecture slides or to Ciortuz et al's ML exercise
book for the AdaBoost algorithm.
Handin: Please turn in a MatLab source file adaBoost.m.

e. Do 10-fold cross-validation on the Wine dataset using AdaBoost with the
linear classifier as the weak learner, for 1 to 100 iterations. Plot the average
training and test errors for all the folds as a function of the number of boosting
iterations. Also, draw horizontal lines corresponding to the training and test
errors for the linear classifier that you obtained in part b. Discuss your results.
Handin: Please turn in a MatLab source file cv_ab.m. You may reuse functions
from part b.

Solution:

c.

errw(f) = Σ_{i=1}^n wi · (yi − f(xi))² / 4.
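A quick sanity check that this weighted error reduces to the unweighted err of part a when wi = 1/n (the toy data and model below are hypothetical):

```python
def sign(z):
    return 1.0 if z >= 0 else -1.0

def err_weighted(beta, X, y, w):
    """Weighted classification error: sum_i w_i * (y_i - f(x_i))^2 / 4."""
    return sum(wi * (yi - sign(sum(b * x for b, x in zip(beta, xi)))) ** 2 / 4.0
               for xi, yi, wi in zip(X, y, w))

X = [[1.0, 0.0], [1.0, 2.0]]   # augmented points [1, x]
y = [-1.0, 1.0]
beta = [0.0, 1.0]              # hypothetical model: f(x) = sign(x)
n = len(X)
e = err_weighted(beta, X, y, [1.0 / n] * n)   # one mistake out of two -> 0.5
```

With uniform weights 1/n, each mistake contributes (±2)²/(4n) = 1/n, exactly the misclassification rate from part a; AdaBoost then simply replaces 1/n with its per-example weights.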

e. Sample plot:

[Plot: error (y-axis, 0 to 0.7) vs. number of AdaBoost iterations (x-axis, 0 to 100);
curves for the AdaBoost training and test errors, and horizontal lines for the
linear classifier's training and test errors.]

32. (AdaBoost using Naive Bayes as weak classifier;
xxx application on the US House of Representatives votes dataset)
• ◦ CMU, 2005 spring, C. Guestrin, T. Mitchell, HW2, pr. 1.2

Solution:

3 Bayesian Classification

33. (Naive Bayes: weather prediction;
xxx feature selection based on CVLOO)
•◦ CMU, 2010 fall, Ziv Bar-Joseph, HW1, pr. 4
xxx CMU, 2009 fall, Ziv Bar-Joseph, HW1, pr. 3
You need to decide whether to carry an umbrella to school in Pittsburgh as
the local weather channel has been giving inconsistent predictions recently.
You are given several input features (observations). These observations are
discrete, and you are expected to use a Naive Bayes classification scheme to
decide whether or not you will take your umbrella to school. The domain of
each of the features is as follows:
season = (w, sp, su, f)
yesterday = (dry, rainy)
daybeforeyesterday = (dry, rainy)
cloud = (sunny, cloudy)
and the possible classes of the output being: umbrella = (y, n).
See data1.txt (posted on website with problem set) for data based on the
above scenario with space-separated fields conforming to:
season yesterday daybeforeyesterday cloud umbrella
a. Write code in MATLAB to estimate the conditional probabilities of each of
the features given the outcome. Generate a space-separated file with the
estimated parameters from the entire dataset by writing out all the conditional
probabilities.

b. Write code in MATLAB to perform inference by predicting the maximum
likelihood class based on training data using a leave-one-out cross-validation
scheme. Generate a [space-separated] file with the maximum likelihood classes
in order.

c. Are the features yesterday and daybeforeyesterday independent of each
other?

d. Does the Naive Bayes assumption hold on this pair of input features? Why
or why not?

e. Find a subset of 3 features from this set of 4 features where your algorithm
improves its predictive ability based on a leave-one-out cross-validation scheme.
Report your improvement.

Solution:

a. Without pseudocounts, the conditional probabilities are:

cloud cloudy y 0.8
cloud sunny y 0.2
cloud sunny n 0.6
cloud cloudy n 0.4
daybeforeyesterday dry y
daybeforeyesterday rainy
daybeforeyesterday dry n
daybeforeyesterday rainy
season w n 0.3
season sp n 0
season su n 0.3
season f n 0.4
season sp y 0.2
season su y 0.1
season f y 0.4
season w y 0.3
yesterday dry y 0.4
yesterday rainy y 0.6
yesterday rainy n 0.4
yesterday dry n 0.6
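These estimates are just class-conditional relative frequencies. A sketch of the computation on a hypothetical five-record sample (the actual exercise reads data1.txt and uses MATLAB):

```python
from collections import Counter

def cond_probs(records, feature, label):
    """P(feature = v | label = c) as relative frequencies within each class."""
    class_counts = Counter(r[label] for r in records)
    pair_counts = Counter((r[feature], r[label]) for r in records)
    return {(v, c): pair_counts[(v, c)] / class_counts[c]
            for (v, c) in pair_counts}

# Hypothetical mini-sample in the spirit of data1.txt:
data = [
    {"cloud": "cloudy", "umbrella": "y"},
    {"cloud": "cloudy", "umbrella": "y"},
    {"cloud": "sunny",  "umbrella": "y"},
    {"cloud": "sunny",  "umbrella": "n"},
    {"cloud": "cloudy", "umbrella": "n"},
]
p = cond_probs(data, "cloud", "umbrella")
```

On the real dataset one would repeat this for every feature and write the resulting (feature, value, class, probability) rows to the space-separated output file.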
b. After adding pseudocounts, the ML classes for the data are:
yyynynnnnnnynnynyyynnynynnnnnyyynnnyynnn
Results may vary slightly because of the way pseudocounts are implemented.

c. Let yesterday be denoted by Y, daybeforeyesterday by D, and umbrella
by U.

             Empirical     Empirical  Empirical  Product
Y     D      joint prob.   prob. Y    prob. D    of the two
dry   dry    0.3           0.5        0.5        0.25
dry   rain   0.2           0.5        0.5        0.25
rain  dry    0.2           0.5        0.5        0.25
rain  rain   0.3           0.5        0.5        0.25

Thus, they do not look independent from the data, but a stricter way to test
is to subject them to a chi-square test of independence. Since the number of
data samples is low, we do not obtain a statistically significant result either
way.
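The joint-vs-product comparison in the table can be scripted; the sketch below reproduces the table's proportions from a hypothetical 10-record sample of (Y, D) pairs:

```python
from collections import Counter

def independence_table(pairs):
    """Compare the empirical joint P(Y, D) against the product P(Y) * P(D)."""
    n = len(pairs)
    joint = Counter(pairs)
    py = Counter(yv for yv, _ in pairs)
    pd = Counter(dv for _, dv in pairs)
    return {(yv, dv): (joint[(yv, dv)] / n, (py[yv] / n) * (pd[dv] / n))
            for (yv, dv) in joint}

# Hypothetical counts matching the proportions in the table above:
pairs = ([("dry", "dry")] * 3 + [("dry", "rain")] * 2 +
         [("rain", "dry")] * 2 + [("rain", "rain")] * 3)
t = independence_table(pairs)
```

If the two features were independent, the two numbers in each pair would coincide; here the joint (0.3 vs 0.25 for dry/dry) deviates from the product, matching the table.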

d. To check if the Naive Bayes assumption holds, we need to check for the
conditional independence of the two given the umbrella label, so we partition
the data based on the umbrella label and check for independence.

                  Empirical     Empirical  Empirical  Product
U  Y     D        joint prob.   prob. Y    prob. D    of the two
y  dry   dry      0.2           0.4        0.4        0.16
y  dry   rain     0.2           0.4        0.6        0.24
y  rain  dry      0.2           0.6        0.4        0.24
y  rain  rain     0.4           0.6        0.6        0.36
n  dry   dry      0.4           0.6        0.6        0.36
n  dry   rain     0.2           0.6        0.4        0.24
n  rain  dry      0.2           0.4        0.6        0.24
n  rain  rain     0.2           0.4        0.4        0.16

Again, they do not look independent, but the empirical joint probabilities
and the product of the individual empirical probabilities look slightly closer.
Again, a stricter way to test is to subject them to a chi-square test of
independence. Again, since the number of data samples is low, we do not obtain a
statistically significant result either way.

e. Leaving out yesterday, using a LOOCV scheme, the percentage of correctly
predicted instances jumps from 55% to 75%.

34. (Naive Bayes: spam filtering)
• ◦ Stanford, 2012 spring, Andrew Ng, pr. 6
xxx Stanford, 2015 fall, Andrew Ng, HW2, pr. 3.a-c
xxx Stanford, 2009 fall, Andrew Ng, HW2, pr. 3.a-c
In this exercise, you will use Naive Bayes to classify email messages into spam
and nonspam groups. Your dataset is a preprocessed subset of the Ling-Spam
Dataset,19 provided by Ion Androutsopoulos. It is based on 960 real email
messages from a linguistics mailing list.
There are two ways to complete this exercise. The first option is to use the
Matlab/Octave-formatted features we have generated for you. This requires
using Matlab/Octave to read prepared data and then writing an
implementation of Naive Bayes. To choose this option, download the data pack
ex6DataPrepared.zip.
The second option is to generate the features yourself from the emails and then
implement Naive Bayes on top of those features. You may want this option
if you want more practice with features and a more open-ended exercise. To
choose this option, download the data pack ex6DataEmails.zip.

Data Description:

The dataset you will be working with is split into two subsets: a 700-email
subset for training and a 260-email subset for testing. Each of the training
and testing subsets contains 50% spam messages and 50% nonspam messages.
Additionally, the emails have been preprocessed in the following ways:
1. Stop word removal: Certain words like "and", "the", and "of", are very
common in all English sentences and are not very meaningful in deciding
spam/nonspam status, so these words have been removed from the emails.
2. Lemmatization: Words that have the same meaning but different endings
have been adjusted so that they all have the same form. For example,
"include", "includes", and "included" would all be represented as "include". All
words in the email body have also been converted to lower case.
3. Removal of non-words: Numbers and punctuation have both been removed.
All white spaces (tabs, newlines, spaces) have been trimmed to a
single space character.

As an example, here are some messages before and after preprocessing:
Nonspam message 5-1361msg1 before preprocessing:
Subject: Re: 5.1344 Native speaker intuitions
The discussion on native speaker intuitions has been extremely
interesting, but I worry that my brief intervention may have
muddied the waters. I take it that there are a number of
separable issues. The first is the extent to which a native
speaker is likely to judge a lexical string as grammatical
or ungrammatical per se. The second is concerned with the
relationships between syntax and interpretation (although even
here the distinction may not be entirely clear cut).
Nonspam message 5-1361msg1 after preprocessing:
re native speaker intuition discussion native speaker intuition
extremely interest worry brief intervention muddy waters number
separable issue first extent native speaker likely judge lexical
string grammatical ungrammatical per se second concern relationship
between syntax interpretation although even here distinction entirely clear
cut

19 http://csmining.org/index.php/ling-spam-datasets.html, accessed on 21st September 2016.
For comparison, here is a preprocessed spam message:

Spam message spmsgc19 after preprocessing:

financial freedom follow financial freedom work ethic
extraordinary desire earn least per month work home special skills
experience required train personal support need ensure success
legitimate homebased income opportunity put back control finance
life ve try opportunity past fail live promise

As you can discover from browsing these messages, preprocessing has left
occasional word fragments and nonwords. In the end, though, these details
do not matter so much in our implementation (you will see this for yourself).

Categorical Naive Bayes

To classify our email messages, we will use a Categorical Naive Bayes model.
The parameters of our model are as follows:

φk|y=1 = p(xj = k | y = 1) = ( Σ_{i=1}^m Σ_{j=1}^{ni} 1{xj^(i) = k and y^(i) = 1} + 1 ) / ( Σ_{i=1}^m 1{y^(i) = 1} · ni + |V| )

φk|y=0 = p(xj = k | y = 0) = ( Σ_{i=1}^m Σ_{j=1}^{ni} 1{xj^(i) = k and y^(i) = 0} + 1 ) / ( Σ_{i=1}^m 1{y^(i) = 0} · ni + |V| )

φy = p(y = 1) = ( Σ_{i=1}^m 1{y^(i) = 1} ) / m,

where

φk|y=1 estimates the probability that a particular word in a spam email will
be the k-th word in the dictionary,
φk|y=0 estimates the probability that a particular word in a nonspam email
will be the k-th word in the dictionary,
φy estimates the probability that any particular email will be a spam email.

Here are some other notation conventions:

m is the number of emails in our training set,
the i-th email contains ni words,
the entire dictionary contains |V| words.

You will calculate the parameters φk|y=1, φk|y=0 and φy from the training data.
Then, to make a prediction on an unlabeled email, you will use the parameters
to compare p(x|y = 1)p(y = 1) and p(x|y = 0)p(y = 0) [A. Ng: as described in
the lecture videos]. In this exercise, instead of comparing the probabilities
directly, it is better to work with their logs. That is, you will classify an email
as spam if you find

log p(x|y = 1) + log p(y = 1) > log p(x|y = 0) + log p(y = 0).

A1. Implementing Naive Bayes using prepared features

If you want to complete this exercise using the formatted features we provided,
follow the instructions in this section.
In the data pack for this exercise, you will find a text file named train-features.txt
that contains the features of emails to be used in training. The lines of this
document have the following form:
2 977 2
2 1481 1
2 1549 1
The first number in a line denotes a document number, the second number
indicates the ID of a dictionary word, and the third number is the number
of occurrences of the word in the document. So in the snippet above, the
first line says that Document 2 has two occurrences of word 977. To look up
what word 977 is, use the feature-tokens.txt file, which lists each word in the
dictionary alongside an ID number.

Load the features

Now load the training set features into Matlab/Octave in the following way:

numTrainDocs = 700;
numTokens = 2500;
M = dlmread('train-features.txt', ' ');
spmatrix = sparse(M(:,1), M(:,2), M(:,3), numTrainDocs, numTokens);
train_matrix = full(spmatrix);

This loads the data in our train-features.txt into a sparse matrix (a matrix
that only stores information for non-zero entries). The sparse matrix is then
converted into a full matrix, where each row of the full matrix represents one
document in our training set, and each column represents a dictionary word.
The individual elements represent the number of occurrences of a particular
word in a document.
For example, if the element in the i-th row and the j-th column of train_matrix
contains a 4, then the j-th word in the dictionary appears 4 times in the
i-th document of our training set. Most entries in train_matrix will be zero,
because one email includes only a small subset of the dictionary words.
Next, we'll load the labels for our training set.
train_labels = dlmread('train-labels.txt');
This puts the y-labels for each of the m documents into an m × 1 vector.
The ordering of the labels is the same as the ordering of the documents in the
features matrix, i.e., the i-th label corresponds to the i-th row in train_matrix.

A note on the features

In a Categorical Naive Bayes model, the formal definition of a feature vector
~x for a document says that xj = k if the j-th word in this document is the
k-th word in the dictionary. This does not exactly match our Matlab/Octave
matrix layout, where the j-th term in a row (corresponding to a document)
is the number of occurrences of the j-th dictionary word in that document.
Representing the features in the way we have allows us to have uniform rows
whose lengths equal the size of the dictionary. On the other hand, in the
formal Categorical Naive Bayes definition, the feature ~x has a length that
depends on the number of words in the email. We've taken the uniform-row
approach because it makes the features easier to work with in Matlab/Octave.
Though our representation does not contain any information about the
position within an email that a certain word occupies, we do not lose anything
relevant for our model. This is because our model assumes that each φk|y
is the same for all positions of the email, so it's possible to calculate all the
probabilities we need without knowing about these positions.

Training

You now have all the training data loaded into your program and are ready
to begin training your data. Here are the recommended steps for proceeding:
1. Calculate φy.
2. Calculate φk|y=1 for each dictionary word and store all results in a vector.
3. Calculate φk|y=0 for each dictionary word and store all results in a vector.

Testing

Now that you have calculated all the parameters of the model, you can use
your model to make predictions on test data. If you are putting your program
into a script for Matlab/Octave, you may find it helpful to have separate
scripts for training and testing. That way, after you've trained your model,
you can run the testing independently as long as you don't clear the variables
storing your model parameters.
Load the test data in test-features.txt in the same way you loaded the
training data. You should now have a test matrix of the same format as the
training matrix you worked with earlier. The columns of the matrix still
correspond to the same dictionary words. The only difference is that now the
number of documents is different.
Using the model parameters you obtained from training, classify each test
document as spam or non-spam. Here are some general steps you can take:
1. For each document in your test set, calculate log p(~x|y = 1) + log p(y = 1).
2. Similarly, calculate log p(~x|y = 0) + log p(y = 0).
3. Compare the two quantities from (1) and (2) above and make a decision
about whether this email is spam. In Matlab/Octave, you should store your
predictions in a vector whose i-th entry indicates the spam/nonspam status
of the i-th test document.
Once you have made your predictions, answer the questions in the Questions
section.

Note

Be sure you work with log probabilities in the way described in the earlier
instructions [A. Ng: and in the lecture videos]. The numbers in this exercise are
small enough that Matlab/Octave will be susceptible to numerical underflow
if you attempt to multiply the probabilities. By taking the log, you will be
doing additions instead of multiplications, avoiding the underflow problem.
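Putting the training steps and the log-space decision rule together: a compact Python sketch of the Laplace-smoothed estimates and the spam decision (the exercise itself uses Matlab/Octave; the 3-word toy corpus and function names here are illustrative):

```python
import math

def train_nb(X, y, V):
    """Laplace-smoothed Categorical Naive Bayes on word-count rows.
    X[i][k] = count of dictionary word k in document i; y[i] in {0, 1}."""
    phi_y = sum(y) / len(y)
    phi = {}
    for c in (0, 1):
        word = [1] * V          # "+1" smoothing in the numerator
        total = V               # "+|V|" smoothing in the denominator
        for xi, yi in zip(X, y):
            if yi == c:
                for k in range(V):
                    word[k] += xi[k]
                total += sum(xi)  # n_i, the number of words in document i
        phi[c] = [w / total for w in word]
    return phi_y, phi

def predict_nb(x, phi_y, phi):
    """Compare log p(x|y=c) + log p(y=c) for c = 1, 0; logs avoid underflow."""
    def score(c, prior):
        return math.log(prior) + sum(cnt * math.log(phi[c][k])
                                     for k, cnt in enumerate(x))
    return 1 if score(1, phi_y) > score(0, 1 - phi_y) else 0

# Tiny hypothetical corpus over a 3-word dictionary:
X = [[3, 0, 0], [2, 1, 0], [0, 0, 3], [0, 1, 2]]
y = [1, 1, 0, 0]   # word 0 marks "spam" documents, word 2 "nonspam" ones
phi_y, phi = train_nb(X, y, V=3)
```

Each per-class estimate matches the formulas above: (total count of word k in class c + 1) / (total words in class c + |V|), and prediction sums count-weighted log probabilities rather than multiplying raw ones.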

A2. Implementing Naive Bayes without prepared features

Here are some guidelines that will help you if you choose to generate your
own features. After reading this, you may find it helpful to read the previous
section, which tells you how to work with the features.

Data contents

The data pack you downloaded contains 4 folders:

a. The folders nonspam-train and spam-train contain the preprocessed emails
you will use for training. They each have 350 emails.
b. The folders spam-test and nonspam-test constitute the test set containing
130 spam and 130 nonspam emails. These are the documents you will
make predictions on. Notice that even though the separate folders tell you the
correct labeling, you should make your predictions on all the test documents
without this knowledge. After you make your predictions, you can use the
correct labeling to check whether your classifications were correct.

Dictionary

You will need to generate a dictionary for your model. There is more than
one way to do this, but an easy method is to count the occurrences of all
words that appear in the emails and choose your dictionary to be the most
frequent words. If you want your results to match ours exactly, you should
pick the dictionary to be the 2500 most frequent words.
To check that you have done this correctly, here are the 5 most common words
you will find, along with their counts.
1. email 2172
2. address 1650
3. order 1649
4. language 1543
5. report 1384
Remember to take the counts over all of the emails: spam, nonspam, training
set, testing set.
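Building such a dictionary is a frequency count over all emails followed by taking the top entries; a Python sketch (the mini-corpus is hypothetical; the real exercise keeps the 2500 most frequent words):

```python
from collections import Counter

def build_dictionary(documents, size):
    """Pick the `size` most frequent words across ALL emails (train + test,
    spam + nonspam), ordered by popularity; word IDs start at 1."""
    counts = Counter(word for doc in documents for word in doc.split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common(size))}

# Hypothetical preprocessed emails:
docs = ["email address email", "order email language", "report order email"]
vocab = build_dictionary(docs, size=2)
```

Ordering by popularity gives exactly the wordID convention used in the feature files (wordID 1 is the most common word overall).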

Feature generation

Once you have the dictionary, you will need to represent your documents
as feature vectors over the space of the dictionary words. Again, there are
several ways to do this, but here are the steps you should take if you want to
match the prepared features we described in the previous section.
1. For each document, keep track of the dictionary words that appear, along
with the count of the number of occurrences.
2. Produce a feature file where each line of the file is a triplet of (docID,
wordID, count). In the triplet, docID is an integer referring to the email, wordID
is an integer referring to a word in the dictionary, and count is the number
of occurrences of that word. For example, here are the first five entries of
a training feature file we produced (the lines are sorted by docID, then by
wordID):

1 19 2
1 45 1
1 50 1
1 75 1
1 85 1

In this snippet, Document 1 refers to the first document in the nonspam-train
folder, 3-380msg4.txt. Our dictionary is ordered by the popularity of the words
across all documents, so a wordID of 19 refers to the 19th most common word.
This format makes it easy for Matlab/Octave to load your features as an
array. Notice that this way of representing the emails does not contain any
information about the position within an email that a certain word occupies.
This is not a problem in our model, since we're assuming each φk|y is the same
for all positions.

Training and testing

Finally, you will need to train your model on the training set and predict
the spam/nonspam classification on the test set. For some ideas on how to
do this, refer to the instructions in the previous section about working with
already-generated features.
When you are finished, answer the questions in the following Questions
section.

B. Questions

Classification error

Load the correct labeling for the test documents into your program. If you
used the pre-generated features, you can just read test-labels.txt into your
program. If you generated your own features, you will need to write your own
labeling based on which documents were in the spam folder and which were
in the nonspam folder.
Compare your Naive Bayes predictions on the test set to the correct labeling.
How many documents did you misclassify? What percentage of the test set
was this?

Smaller training sets

Let's see how the classification error changes when you train on smaller
training sets, but test on the same test set as before. So far you have been
working with a 700-document training set. You will now modify your program
to train on 50, 100, and 400 documents (the spam to nonspam ratio will
still be one-to-one).
If you are using our prepared features for Matlab/Octave, you will see text
documents in the data pack named train-features-#.txt and train-labels-#.txt,
where the # tells you how many documents make up these training sets.
For each of the training set sizes, load the corresponding training data into
your program and train your model. Then record the test error after testing
on the same test set as before.
If you are generating your own features from the emails, you will need to
select email subsets of 50, 100, and 400, keeping each subset 50% spam and
50% nonspam. For each of these subsets, generate the training features as you
did before and train your model. Then, test your model on the 260-document
test set and record your classification error.

Solution:

An m-le implementation of Naive Bayes training for Matlab/O tave an be


found here here
[. . .℄, and another m-le for testing is [. . .℄. In order for test.m
to work, you must rst run train.m without learing the variables in the
workspa e after training.

Classi ation error


After training on the full training set (700 documents), you should find that
your algorithm misclassifies 5 documents. This amounts to 1.9% of your test
set.
If your test error was different, you will need to debug your program. Make
sure that you are working with log probabilities, and that you are taking logs
on the correct expressions. Also, check that you understand the dimensions
of your features matrix and what each dimension means.
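To make the log-probability advice concrete, here is a minimal sketch (in Python rather than the assignment's Matlab/Octave; the function names and the toy class probabilities are illustrative, not part of the assignment) of a Naive Bayes prediction computed entirely in log space, so that products of many small probabilities never underflow:

```python
import math

def nb_log_posterior(doc_counts, log_prior, log_cond):
    """Score one class: log P(Y) plus, for each word, count * log P(word | Y)."""
    score = log_prior
    for word, count in doc_counts.items():
        score += count * log_cond[word]
    return score

def nb_predict(doc_counts, classes):
    """classes maps label -> (log prior, {word: log P(word | label)})."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, log_cond) in classes.items():
        score = nb_log_posterior(doc_counts, log_prior, log_cond)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Tiny hand-built example (hypothetical numbers, not the assignment's data).
toy_classes = {
    "spam":    (math.log(0.5), {"free": math.log(0.8), "hello": math.log(0.2)}),
    "nonspam": (math.log(0.5), {"free": math.log(0.1), "hello": math.log(0.9)}),
}
```

The same comparison done with raw products of probabilities would underflow to zero for realistic document lengths, which is exactly the bug the debugging advice above is pointing at.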

Smaller training sets


Here are the errors on the smaller training sets. Your answers may differ
slightly if you generated your own features and did not use the same document
subsets we used.
1. 50 training documents: 7 misclassified, 2.7%.
2. 100 training documents: 6 misclassified, 2.3%.
3. 400 training documents: 6 misclassified, 2.3%.

35. (Naive Bayes: application to
xxx document [n-ary] classification)
•◦ CMU, 2011 spring, Tom Mitchell, HW2, pr. 3
In this exercise, you will implement the Naive Bayes document classifier and
apply it to the classic 20 newsgroups dataset.20 In this dataset, each document
is a posting that was made to one of 20 different usenet newsgroups. Our goal
is to write a program which can predict which newsgroup a given document
was posted to.21

Model

Let's say we have a document D containing n words; call the words {X1, ..., Xn}.
The value of random variable Xi is the word found in position i in the docu-
ment. We wish to predict the label Y of the document, which can be one of
m categories. We could use the model:

P(Y | X1, ..., Xn) ∝ P(X1, ..., Xn | Y) · P(Y) = P(Y) · ∏i P(Xi | Y)

That is, each Xi is sampled from some distribution that depends on its position
i and the document category Y. As usual with discrete data, we assume
that P(Xi | Y) is a multinomial distribution over some vocabulary V; that is,
each Xi can take one of |V| possible values corresponding to the words in the
vocabulary. Therefore, in this model, we are assuming (roughly) that for any
pair of document positions i and j, P(Xi | Y) may be completely different from
P(Xj | Y).
a. Explain in a sentence or two why it would be difficult to accurately estimate
the parameters of this model on a reasonable set of documents (e.g. 1000
documents, each 1000 words long, where each word comes from a 50,000 word
vocabulary).

To improve the model, we will make the additional assumption that:

∀ i, j: P(Xi | Y) = P(Xj | Y)

Thus, in addition to estimating P(Y), you must estimate the parameters for
the single distribution P(X | Y), which we define to be equal to P(Xi | Y) for
all Xi. Each word in a document is assumed to be an i.i.d. draw from this
distribution.

Data

The data file (available on the website) contains six files:

1. vocabulary.txt is a list of the words that may appear in documents. The
line number is the word's id in other files. That is, the first word (archive) has
wordId 1, the second word (name) has wordId 2, etc.
2. newsgrouplabels.txt is a list of newsgroups from which a document may
have come. Again, the line number corresponds to the label's id, which is
used in the .label files. The first line (alt.atheism) has id 1, etc.
3. train.label: Each line corresponds to the label for one document from the
20 http://qwone.com/~jason/20Newsgroups/, accessed on 22nd September 2016.
21 For this question, you may write your code and solution in teams of at most 2 students.

training set. Again, the document's id (docId) is the line number.
4. test.label: The same as train.label, except that the labels are for the
test documents.
5. train.data specifies the counts for each of the words used in each of the
documents. Each line is of the form docId wordId count, where count specifies
the number of times the word with id wordId appears in the training document
with id docId. All word/document pairs that do not appear in the file have
count 0.
6. test.data: Same as train.data, except that it specifies counts for test
documents. If you are using Matlab, the functions textread and sparse will
be useful in reading these files.

Implementation

Your first task is to implement the Naive Bayes classifier specified above.
You should estimate P(Y) using the MLE, and estimate P(X | Y) using a MAP
estimate with the prior distribution Dirichlet(1 + α, ..., 1 + α), where α = 1/|V|
and V is the vocabulary.
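As a sketch of the MAP estimate described above (in Python rather than Matlab; the function name and toy counts are illustrative only), the Dirichlet(1 + α, ..., 1 + α) prior amounts to adding α to every word count before normalizing:

```python
def map_word_probs(word_counts, vocab, alpha):
    """MAP estimate of P(X = w | Y = y) under a Dirichlet(1+alpha, ..., 1+alpha)
    prior: (count(w, y) + alpha) / (total words in class y + alpha * |V|)."""
    total = sum(word_counts.get(w, 0) for w in vocab)
    denom = total + alpha * len(vocab)
    return {w: (word_counts.get(w, 0) + alpha) / denom for w in vocab}
```

Note that every word in the vocabulary gets nonzero probability, so unseen words no longer force a class's posterior to zero; with α = 1/|V| the total "hallucinated" mass adds up to one pseudo-word per class.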

b. Report the overall testing accuracy (the number of correctly classified
documents in the test set over the total number of test documents), and print
out the confusion matrix (the matrix C, where cij is the number of times a
document with ground truth category j was classified as category i).

c. Are there any newsgroups that the algorithm confuses more often than
others? Why do you think this is?

In your initial implementation, you used a prior Dirichlet(1 + α, ..., 1 + α) to
estimate P(X | Y), and we told you to set α = 1/|V|. Hopefully you wondered
where this value came from. In practice, the choice of prior is a difficult
question in Bayesian learning: either we must use domain knowledge, or we
must look at the performance of different values on some validation set. Here
we will use the performance on the testing set to gauge the effect of α.22

d. Re-train your Naive Bayes classifier for values of α between .00001 and 1
and report the accuracy over the test set for each value of α. Create a plot
with values of α on the x-axis and accuracy on the y-axis. Use a logarithmic
scale for the x-axis (in Matlab, the semilogx command). Explain in a few
sentences why accuracy drops for both small and large values of α.

Identifying Important Features

One useful property of Naive Bayes is that its simplicity makes it easy to
understand why the classifier behaves the way it does. This can be useful both
while debugging your algorithm and for understanding your dataset in general.
For example, it is possible to identify which words are strong indicators of the
category labels we're interested in.

e. Propose a method for ranking the words in the dataset based on how
much the classifier `relies on' them when performing its classification (hint:
22 It is tempting to choose α to be the one with the best performance on the testing set. However, if we do
this, then we can no longer assume that the classifier's performance on the test set is an unbiased estimate of
the classifier's performance in general. The act of choosing α based on the test set is equivalent to training on
the test set; like any training procedure, this choice is subject to overfitting.

information theory will help). Your metric should use only the classifier's
estimates of P(Y) and P(X | Y). It should give high scores to those words that
appear frequently in one or a few of the newsgroups but not in other ones.
Words that are used frequently in general English (`the', `of', etc.) should
have lower scores, as should words that appear extremely rarely throughout
the whole dataset. Finally, your method should produce an overall ranking
for the words, not a per-category ranking.23

f. Implement your method, set α back to 1/|V|, and print out the 100 words
with the highest measure.

g. If the points in the training dataset were not sampled independently at
random from the same distribution of data we plan to classify in the future,
we might call that training set biased. Dataset bias is a problem because the
performance of a classifier on a biased dataset will not accurately reflect its
future performance in the real world. Look again at the words your classifier
is `relying on'. Do you see any signs of dataset bias?

Solution:

a. In this model, each position in a given document is assumed to have its own
probability distribution. Each document has only one word at each position,
so if there are M documents then we must estimate the parameters of roughly
50,000-dimensional distributions using only M samples from that distribution.
In only a thousand documents, there will not be enough samples.

To see it another way, the fact that a word w appeared at the i-th position of
the document gives us information about the distribution at another position
j. Namely, in English, it is possible to rearrange the words in a document
without significantly altering the document's meaning, and therefore the fact
that w appeared at position i means that it is likely that w could appear at
position j. Thus, it would be statistically inefficient not to make use of this
information in estimating the parameters of the distribution of Xj.

b. The final accuracy of this classifier is 78.52%, with the following confusion

23 Some students might not like the open-endedness of this problem. I [Carl Doersch, TA at CMU] hate to
say it, but nebulous problems like this are common in machine learning; this problem was actually inspired by
something I worked on last summer in industry. The goal was to design a metric for finding documents similar
to some query document, and part of the procedure involved classifying words in the query document into one
of 100 categories, based on the word itself and the word's context. The algorithm initially didn't work as well
as I thought it should have, and the only path to improving its performance was to understand what these
classifiers were `relying on' in order to do their classification: some way of understanding the classifiers' internal
workings, and even I wasn't sure what I was looking for. In the end I designed a metric based on information
theory and, after looking at hundreds of word lists printed from these classifiers, I eventually found a way to fix
the problem. I felt this experience was valuable enough that I should pass it on to all of you.

matrix:

c. From the confusion matrix, it is clear that newsgroups with similar
topics are confused frequently. Notably, those related to computers (e.g.,
comp.os.ms-windows.misc and comp.sys.ibm.pc.hardware), those related to po-
litics (e.g., talk.politics.guns and talk.politics.misc), and those related to
religion (alt.atheism and talk.religion.misc). Newsgroups with similar topics
have similar words that identify them. For example, we would expect the
computer-related groups to all use computer terms frequently.

d. For very small values of α, we have that the probability of rare words not
seen during training for a given class tends to zero. There are many testing
documents that contain words seen only in one or two training documents, and
often these training documents are of a different class than the test document.
As α tends to zero, the probabilities of these rare words tend to dominate.24

For large values of α, we see a classic underfitting behavior: the final para-
meter estimates tend toward the prior as α increases, and the prior is just
something we made up. In particular, the classifier tends to underestimate
the importance of rare words: for example, if α is 1 and we see only one
occurrence of the word w in the category C (and we see the same number
of words in each category), then the final parameter estimates are 2/21 for
category C and 19/21 that it would be something else. Furthermore, the most
informative words tend to be relatively uncommon, and so we would like to

24 One may attribute the poor performance at small values of α to overfitting. While this is strictly speaking
correct (the classifier estimates P(X|Y) to be smaller than is realistic simply because that was the case in the
data), simply attributing this to overfitting is not a sophisticated answer. Different classifiers overfit for different
reasons, and understanding the differences is an important goal for you as students.

rely on these rare words more.

e. There were many acceptable solutions to this question. First we will look
at H(Y | Xi = True), the entropy of the label given a document with a single
word wi. Intuitively, this value will be low if a word appears most of the time
in a single class, because the distribution P(Y | Xi = True) will be highly peaked.
More concretely (and abbreviating True as T),

H(Y | Xi = T) = − Σk P(Y = yk | Xi = T) · log P(Y = yk | Xi = T)
             = − E_{P(Y = yk | Xi = T)} [ log P(Y = yk | Xi = T) ]
             = − E_{P(Y = yk | Xi = T)} [ log ( P(Xi = T | Y = yk) · P(Y = yk) / P(Xi = T) ) ]
             = − E_{P(Y = yk | Xi = T)} [ log ( P(Xi = T | Y = yk) / P(Xi = T) ) ]
               − E_{P(Y = yk | Xi = T)} [ log P(Y = yk) ]

Note that

log [ P(Xi = T | Y = yk) / P(Xi = T) ]

is exactly what gets added to Naive Bayes' internal estimate of the posterior
probability log P(Y) at each step of the algorithm (although in implementa-
tions we usually ignore the constant P(Xi = T)). Furthermore, the expectation
is over the posterior distribution of the class labels given the appearance of
word wi. Thus, the first term of this measure can be interpreted as the ex-
pected change in the classifier's estimate of the log-probability of the `correct'
class given the appearance of word wi. The second term tends to be very
small relative to the first term since P(Y) is close to uniform.25 26
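A minimal sketch of this metric (in Python; the function name and toy probabilities are illustrative, not part of the assignment) computes the posterior P(Y | Xi = T) by Bayes' rule and then its entropy; words would be ranked in ascending order of this value, since low entropy means a highly indicative word:

```python
import math

def posterior_entropy(p_y, p_x_given_y):
    """H(Y | Xi = True): entropy of the class posterior after observing word i.
    p_y[k] = P(Y = yk); p_x_given_y[k] = P(Xi = True | Y = yk)."""
    joint = [py * px for py, px in zip(p_y, p_x_given_y)]
    z = sum(joint)                       # P(Xi = True)
    posterior = [j / z for j in joint]   # Bayes' rule
    # Entropy of the posterior; terms with zero probability contribute nothing.
    return -sum(p * math.log(p) for p in posterior if p > 0.0)
```

A word that appears in only one class yields a posterior entropy of zero (maximally informative), while a word equally common in all classes yields the entropy of the (near-uniform) prior.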
25 I found that the word list is the same with or without it.
26 Another measure indicated by many students was I(Xi, Y). Prof. Mitchell said that this was quite useful

f. For the metric H(Y | Xi = True):
nhl, stephanopoulos, leafs, alomar, wolverine, crypto, lemieux,
coname, rsa, athos, ripem, rbi, firearm, powerbook, pitcher,
bruins, dyer, lindros, lciii, ahl, fprintf, candida, azerbaijan,
baerga, args, iisi, gilmour, clh, gfci, pitchers, gainey,
clemens, dodgers, jagr, sabretooth, liefeld, hawks, hobgoblin, rlk,
adb, crypt, anonymity, aspi, countersteering, xfree, punisher,
recchi, cipher, oilers, soderstrom, azerbaijani, obp, goalie,
libxmu, inning, xmu, sdpa, argic, serdar, sumgait, denning,
ioc, obfuscated, umu, nsmca, dineen, ranck, xdm, rayshade,
gaza, stderr, dpy, cardinals, potvin, orbiter, sandberg, imake,
plaintext, whalers, moncton, jaeger, ucxkvb, mydisplay, wip,
hicnet, homicides, bontchev, canadiens, messier, bure, bikers,
cryptographic, ssto, motorcycling, infante, karabakh, baku, mutants,
keown, cousineau
For the metric I(Xi, Y):
windows, god, he, scsi, car, drive, space, team, dos, bike,
file, of, that, mb, game, key, mac, jesus, window, dod,
hockey, the, graphics, card, image, his, gun, encryption, sale,
apple, government, season, we, games, israel, disk, files, ide,
controller, players, shipping, chip, program, was, cars, nasa,
win, year, were, they, turkish, motif, people, armenian, play,
drives, bible, use, widget, pc, clipper, offer, jpeg, baseball,
bus, my, nhl, software, is, db, server, jews, os, israeli,
output, data, system, who, league, armenians, for, christian,
christians, entry, mhz, ftp, price, christ, guns, thanks, church,
color, teams, privacy, condition, launch, him, com, monitor, ram
Note the presence of the words car, of, that, etc.

g. It is certain that the dataset was collected over some finite time period in
the past. That means our classifier will tend to rely on some words that are
specific to this time period. For the first word list, stephanopoulos refers
to a politician who may not be around in the future, and whalers refers to
the Connecticut hockey team that was actually being dissolved at the same
time as this dataset was being collected. For the second list, ghz has almost
certainly replaced mhz in modern computer discussions, and the controversy
regarding Turkey and Armenia is far less newsworthy today. As a result, we
should expect the classification accuracy on the 20-newsgroups testing set to

in functional Magnetic Resonance Imaging (fMRI) data. Intuitively, this measures the amount of information
we learn by observing Xi. An issue with this measure is that Naive Bayes only really learns from Xi in the event
that Xi = True, and essentially ignores this variable when Xi = False (thus, the issue was introduced because
we're computing our measure on Xi rather than on X). Note that this is not the case in fMRI data (i.e., you
compute the mutual information directly on the features used for classification), which explains why mutual
information works better in that domain. Note that Xi = False most of the time for informative words, so in
the formula:

I(Xi, Y) = H(Xi) − H(Xi | Y)
         = − Σ_{xi ∈ {T,F}} P(Xi = xi) log P(Xi = xi)
           + Σ_{xi ∈ {T,F}} Σk P(Y = yk) P(Xi = xi | Y = yk) log P(Xi = xi | Y = yk)

we see that the term for xi = F tends to dominate even though it is essentially meaningless. Another disadvan-
tage of this metric is that it's more difficult to implement.

significantly overestimate the classification accuracy our algorithm would have
on a testing sample from the same newsgroups taken today.27

27 Sadly, there is a lot of bad machine learning research that has resulted from biased datasets. Researchers
will train an algorithm on some dataset and find that the performance is excellent, but then apply it in the real
world and find that the performance is terrible. This is especially common in computer vision datasets, where
there is a tendency to always photograph a given object in the same environment or in the same pose. In your
own research, make sure your datasets are realistic!

36. (The relationship between Logistic Regression and Naive Bayes;
xxx evaluation on a text classification task
xxx (hockey and baseball newsgroups);
xxx feature selection based on the norm of weights computed by LR;
xxx analysis of the effect of feature (i.e., word) duplication on both NB and LR)
• ◦ CMU, 2009 spring, Tom Mitchell, HW3, pr. 2
In this assignment you will train a Naive Bayes and a Logistic Regression
classifier to predict the class of a set of documents, represented by the words
which appear in them.
Please download the data from the ML Companion's site. The .data file
is formatted docIdx wordIdx count. Note that this only has words with
nonzero counts. The .label file is simply a list of label id's. The i-th line of this
file gives you the label of the document with docIdx i. The .map file maps
from label id's to label names.
In this assignment you will classify documents into two classes: rec.sport.baseball
(10) and rec.sport.hockey (11). The vocabulary.txt file contains the vocabu-
lary for the indexed data. The line number in vocabulary.txt corresponds to
the index number of the word in the .data file.

A. Implement Logistic Regression and Naive Bayes

a. Implement regularized Logistic Regression using gradient descent. We
found that a learning rate η around 0.0001 and a regularization parameter λ
around 1 work well for this dataset. This is just a rough point to begin your
experiments with; please feel free to change the values based on what results
you observe. Report the values you use.
One way to determine convergence might be by stopping when the maximum
entry in the absolute difference between the current and the previous weight
vectors falls below a certain threshold. You can use other criteria for con-
vergence if you prefer. Please specify what you are using. In each iteration
report the log-likelihood, the training-set misclassification rate and the norm
of weight difference you are using for determining convergence.
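The training loop described above might be sketched as follows (in Python rather than Matlab; sigmoid and train_logreg are illustrative names, and the toy hyperparameters in the defaults and the test differ from the suggested η = 0.0001, λ = 1). It uses the max-absolute-weight-change stopping rule mentioned above:

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logreg(X, y, eta=0.0001, lam=1.0, tol=1e-6, max_iter=10000):
    """Gradient ascent on the L2-penalized log-likelihood; stops when the
    largest absolute change over all weights falls below tol."""
    n_feat = len(X[0])
    w, b = [0.0] * n_feat, 0.0            # bias left unpenalized
    for _ in range(max_iter):
        grad_w = [-lam * wj for wj in w]  # gradient of -(lam/2) * ||w||^2
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = yi - sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        new_w = [wj + eta * gj for wj, gj in zip(w, grad_w)]
        new_b = b + eta * grad_b
        delta = max(abs(a - c) for a, c in zip(new_w + [new_b], w + [b]))
        w, b = new_w, new_b
        if delta < tol:
            break
    return w, b
```

On the sparse bag-of-words data of this assignment one would of course vectorize the inner loops, but the update rule is the same.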

b. Implement the Naive Bayes classifier for text classification using the prin-
ciples presented in class. You can use a hallucinated count of 1 for the MAP
estimates.

B. Feature Selection

c. Train your Logistic Regression algorithm on the 200 randomly selected
datapoints provided in random_points.txt. Now look for the indices of the
words baseball, hockey, nhl and runs. If you sort the absolute values
of the weight vector obtained from LR in descending order, where do these
words appear? Based on this observation, how would you select interesting
features from the parameters learnt from LR?

d. Use roughly 1/3 of the data as training and 2/3 of it as test. About half
the number of documents are from one class. So pick the training set with an
equal number of positive and negative points (198 of each in this case). Now
using your feature selection scheme from the last question, pick the [20, 50,
100, 500, all] most interesting features and plot the error-rates of Naive Bayes
and Logistic Regression. Remember to average your results on 5 random
training-test partitions. What general trend do you notice in your results?
How does the error rate change when you do feature selection? How would
you pick the number of features based on this?

C. Highly Dependent Features: How do NB and LR differ?


In question 1.1 (i.e., CMU, 2009 spring, Tom Mitchell, HW3, pr. 1.1, aka
exercise 12 in our ML exercise book) you considered the impact on Naive
Bayes when the conditional independence assumption is violated (by adding
a duplicate copy of a feature). Also question 1.3 (see exercise 14.b in our ML
exercise book) formulates the discriminative analog of Naive Bayes, where we
explicitly model the joint distribution of two features. In the current question,
we introduce highly dependent features to our Baseball vs. Hockey dataset
and see the effect on the error rates of LR and NB. A simple way of doing this
is by simply adding a few duplicate copies of a given feature to your dataset.
First create a dataset D with the wordIds provided in the good_features file.
For each of the three words baseball, hockey, and runs:
e. Add 3 and 6 duplicate copies of it to the dataset D and train LR and NB
again. Now report the respective average errors obtained by using 5 random
train-test splits of the data (as in part A). For each feature report the average
error-rates of LR and NB for the following:

• Dataset with no duplicate feature added (D).
• Dataset with 3 duplicate copies of the feature added (D′).
• Dataset with 6 duplicate copies added (D′′).
In order to have a fair comparison, use the same set of test-train splits for
each of the above cases.

f. How do Naive Bayes and Logistic Regression behave in the presence of
duplicate features?

g. Now compute the weight vectors for each of these datasets using logistic
regression. Let W, W′, and W′′ be the weight vectors learned on the datasets
D, D′, and D′′ respectively. You do not have to do any test-train splits.
Compute these on the entire dataset. Look at the weights on the duplicate
features for each case. Based on your observation can you find a relation
between the weight of the duplicated feature in W′, W′′ and the same (not
duplicated) feature in W? How would you use this observation to explain the
behavior of NB and LR?

Solution:

c. baseball is at rank 4, hockey is at rank 1, nhl is at rank 3, and runs
is at rank 2. This shows that a simple feature selection algorithm is to pick
the top k elements from a list of features, which are sorted in a descending
order of their absolute w values.28
28 Some of the students pointed out that words which occur very often like of, and, at come up towards
the top of the list. My understanding is that these words have very large counts and hence pick up large weight
values even if they are very common in both classes. One way to fix this would be to regularize different words
differently. The way we would do this is by introducing a penalization term Σi λi wi² instead of λ Σi wi² in the
log-likelihood.

d. In the figure we see that the error-rate is high for a very small set of
features (which means the top k features (for a small k) are missing some
good discriminative features). The error rate goes down as we increase the
number of interesting features. With about 500 good features we obtain as
good classification accuracy as we can get with all the features included. This
implies that feature selection helps.
I would pick 500 words using this scheme, since that would help reduce both
time and space consumption of the learning algorithms and at the same time
give me a small error-rate.

e.

Word = baseball:
Dataset   LR       NB
D         0.1766   0.1615
D′        0.1751   0.1889
D′′       0.1746   0.2252

Word = hockey:
Dataset   LR       NB
D         0.1746   0.1618
D′        0.1668   0.1746
D′′       0.1711   0.2242

Word = runs:
Dataset   LR       NB
D         0.1635   0.1595
D′        0.1731   0.1965
D′′       0.1728   0.2450

f. The error-rate of Naive Bayes increases a lot compared to the error-rate of
Logistic Regression, as we keep duplicating features.

g. We see that each of the duplicated features in one dataset has identical
weight values. Here is the table with the weights of different words for datasets
D, D′, and D′′. I have excluded the weights of all 3 or 6 duplicates, since they
are all identical.

Dataset   baseball   hockey    runs
D         1.2835     −3.1645   1.8859
D′        0.3279     −0.9926   0.5722
D′′       0.1878     −0.5826   0.3302

Note that in each case LR divides the weight of a feature in D roughly equally
among its duplicates in D′ and D′′. For example, for the word runs,
3 × 0.5722 ≈ 1.7 and 6 × 0.3302 ≈ 2.0, whereas the original feature weight is
≈ 1.9. Since NB treats each duplicate feature as conditionally independent of
each other given the class variable, its error rate goes up as the number of
duplicates increases. As a result LR suffers less from double counting than
NB does.

4 Instance-Based Learning

37. (k-NN vs Gaussian Naive Bayes:
xxx application on a [given] dataset of points from R2)
•◦ CMU, (?) spring, ML course 10-701, HW1, pr. 5
In this problem, you are asked to implement and compare the k-nearest neigh-
bor and Gaussian Naive Bayes classifiers in Matlab. You are only permitted
to use existing tools for simple linear algebra such as matrix multiplication.
Do NOT use any toolkit that performs machine learning functions. The pro-
vided data (traindata.txt for training, testdata.txt for testing) has two real
features X1, X2 and the variable Y representing a class. Each line in the data
files represents a data point (X1, X2, Y).

a. How many parameters does the Gaussian Naive Bayes classifier need to es-
timate? How many parameters for k-NN (for a fixed k)? Write down the
equation for each parameter estimation.

b. Implement k-NN in MATLAB and test each point in testdata.txt using
traindata.txt as the set of possible neighbors, using k = 1, ..., 20. Plot the test
error vs. k. Which value of k is optimal for your test dataset?

c. Implement Gaussian Naive Bayes in MATLAB and report the estimated
parameters, train error, and test error.

d. Plot the learning curves of k-NN (using the k selected in part b) and Naive
Bayes: this is a graph where the x-axis is the number of training examples and
the y-axis is the accuracy on the test set (i.e., the estimated future accuracy
as a function of the amount of training data). To create this graph, randomize
the order of your training examples (you only need to do this once). Create a
model using the first 10% of training examples, measure the resulting accuracy
on the test set, then repeat using the first 20%, 30%, ..., 100% training
examples. Compare the performance of the two classifiers and summarize your
findings.

Solution:

a. For a Gaussian Naive Bayes classifier with n features for X and k classes for Y,
we have to estimate the mean μij and variance σ²ij of each feature i conditioned
on each class j. So we have to estimate 2nk parameters. In addition, we need
the prior probabilities for Y, so there are k such probabilities πj = P(Y = j),
where the last one (πk) can be determined from the first k − 1 values by
P(Y = k) = 1 − Σ_{j=1}^{k−1} P(Y = j). Therefore, we have 2nk + k − 1 parameters in
total.

μ̂ij = Σl Xi(l) · 1{Y(l) = j} / Σl 1{Y(l) = j}

σ̂²ij = Σl (Xi(l) − μ̂ij)² · 1{Y(l) = j} / Σl 1{Y(l) = j}

π̂j = (Σl 1{Y(l) = j}) / N.

In the given example, where we consider two features with binary labels, we
have 8 + 1 = 9 parameters. k-NN is a nonparametric method, and there is no
parameter to estimate.
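The estimators above can be sketched as follows (a Python illustration with a hypothetical gnb_fit helper; the assignment itself asks for Matlab):

```python
def gnb_fit(X, y):
    """MLE of per-class feature means, variances, and class priors,
    following mu_ij, sigma^2_ij, and pi_j defined above."""
    params = {}
    for c in sorted(set(y)):
        rows = [x for x, yi in zip(X, y) if yi == c]  # points with label c
        n, d = len(rows), len(rows[0])
        mu = [sum(r[j] for r in rows) / n for j in range(d)]
        var = [sum((r[j] - mu[j]) ** 2 for r in rows) / n for j in range(d)]
        params[c] = {"mu": mu, "var": var, "prior": n / len(y)}
    return params
```

Note this uses the 1/n (MLE) variance rather than the 1/(n − 1) unbiased estimate, matching the formulas above.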

b. [Plot: test error vs. k, for k = 1, ..., 20; the test error varies roughly
between 0.09 and 0.12.]

LC: The least test error was obtained for k = 14. However, for a better
compromise between accuracy and efficiency (knowing that more computa-
tions are required for computing distances in a space with a higher number
of dimensions / attributes), one might instead choose another value for k, for
instance k = 9.

c. The estimated parameters:

μ̂ij = [ −0.7438  0.9717 ; −0.9848  0.9769 ]
σ̂²ij = [ 1.0468  0.8861 ; 0.8889  1.1822 ]
π̂j = [ 0.5100  0.4900 ]

The training error was 0.0700, and the test error was 0.0975.

d. [Learning curves: test-set accuracy vs. number of training examples, for
k-NN and Gaussian Naive Bayes.]
[LC: The results for k-NN here were generated using the value of k chosen in
part b.]
LC's observations:
1. One can see in the above graph that when 40-180 training examples are
used, the test errors produced by the two classifiers are very close (slightly
lower for Gaussian Naive Bayes). For 200 training examples k-NN becomes
slightly better.
2. The variances are in general larger for k-NN, even very large when few
training examples (less than 40) are used.

38. (k-NN applied on hand-written digits
xxx from postal zip codes;
xxx compare different methods to choose k)
•◦ CMU, 2004 fall, Carlos Guestrin, HW4, pr. 3.2-8
You will implement a classifier in Matlab and test it on a real data set. The
data was generated from handwritten digits, automatically scanned from en-
velopes by the U.S. Postal Service. Please download the knn.data file from the
course web page. It contains 364 points. In each row, the first attribute is the
class label (0 or 1); the remaining 256 attributes are features (all columns are
continuous values). You could use the Matlab function load('knn.data') to load
this data into Matlab.

a. Now you will implement a k-nearest-neighbor (k-NN) classifier using Ma-
tlab. For each unknown example, the k-NN classifier collects its k nearest
neighbors among the training points, and then takes the most common category
among these k neighbors as the predicted label of the test point.
We assume our classification is a binary classification task. The class label
would be either 0 or 1. The classifier uses the Euclidean distance metric.
(But you should keep in mind that the normal k-NN classifier supports multi-
class classification.) Here is the prototype of the Matlab function you need to
implement:

function [Y_test] = knn(k, X_train, Y_train, X_test);

X_train contains the features of the training points, where each row is a 256-
dimensional vector. Y_train contains the known labels of the training points,
where each row is a 1-dimensional integer, either 0 or 1. X_test contains the
features of the testing points, where each row is a 256-dimensional vector.
k is the number of nearest neighbors we would consider in the classification
process.
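A Python sketch of the same classifier for a single test point (the assignment asks for Matlab; knn_predict is an illustrative name, and the tie-breaking rule shown, falling back to the label of the single nearest neighbor, is just one possible choice for part b):

```python
import math
from collections import Counter

def knn_predict(k, X_train, Y_train, x_test):
    """Predict the label of one test point by majority vote among its
    k nearest training points under Euclidean distance."""
    neighbors = sorted(
        zip(X_train, Y_train),
        key=lambda pair: math.dist(pair[0], x_test),
    )[:k]
    labels = [label for _, label in neighbors]
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        # Tie among the top classes: fall back to the single nearest neighbor.
        return labels[0]
    return counts[0][0]
```

Other reasonable tie-breaking choices include decrementing k until the tie disappears, or breaking ties at random.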

b. For k = 2, 4, 6, ..., you may encounter ties in the classification. Describe
how you handle this situation in your above implementation.

c. The choice of k is essential in building the k-NN model. In fact, k can be
regarded as one of the most important factors of the model that can strongly
influence the quality of predictions.
One simple way to find k is to use the train-test style. Randomly choose 30%
of your data to be a test set. The remainder is a training set. Build the
classification model on the training set and estimate the future performance
with the test set. Try different values of k to find which works best for the
testing set.
Here we use the error rate to measure the performance of a classifier. It
equals the percentage of incorrectly classified cases on a test set.
Please implement a Matlab function to implement the above train-test way
for finding a good k for the k-NN classifier. Here is the prototype of the Matlab
function you need to implement:

function [TestsetErrorRate, TrainsetErrorRate] =
        knn_train_test(kArrayToTry, XData, YData);

XData contains the features of the data points, where each row is a 256-
dimensional vector. YData contains the known labels of the points, where each
row is a 1-dimensional integer, either 0 or 1. kArrayToTry is a k × 1 column
vector, containing the k possible values of k you want to try. TestsetErrorRate
is a k × 1 column vector containing the testing error rate for each possible k.
TrainsetErrorRate is a k × 1 column vector containing the training error rate
for each possible k.

Then test your function knn_train_test on the data set knn.data.
Report the plot of train error rate vs. k and the plot of test error rate vs. k
for this data. (Put these two curves together in one figure. You could use the
hold on function in Matlab to help you.) What is the best k you would choose
according to these two plots?

d. Instead of the above train-test style, we could also do n-fold Cross-
validation to find the best k. n-fold Cross-validation is a well established
technique that can be used to obtain estimates of model parameters that are
unknown. The general idea of this method is to divide the data sample into
a number of n folds (randomly drawn, disjointed sub-samples or segments).
For a fixed value of k, we apply the k-NN model to make predictions on the
i-th segment (i.e., use the other n − 1 segments as the train examples) and evaluate
the error. This process is then successively applied to all possible choices of
i (i ∈ {1, ..., n}). At the end of the n folds (cycles), the computed errors are
averaged to yield a measure of the stability of the model (how well the model
predicts query points). The above steps are then repeated for various k and
the value achieving the lowest error rate is then selected as the optimal value
for k (optimal in a cross-validation sense).29

Then please implement a cross-validation function to choose k. Here is the
prototype of the Matlab function you need to implement:

function [cvErrorRate] = knn_cv(kArrayToTry, XData, YData, numCVFolds);

The dimensionality of all input parameters is the same as in part c. cvErrorRate
is a k × 1 column vector containing the cross-validation error rate for each po-
ssible k.

Apply this function on the data set knn.data using 10 folds. Report a
performance curve of cross-validation error rate vs. k. What is the best k you
would choose according to this curve?
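The fold logic described above can be sketched in Python as follows (illustrative names; the assignment itself asks for Matlab). Each fold in turn is held out, the k-NN error on it is measured, and the n per-fold errors are averaged.

```python
import numpy as np

def knn_cv(k_values, X, y, num_folds=10, rng=None):
    """Python analogue of the knn_cv prototype: for each k, average the
    k-NN 0/1 error over num_folds randomly drawn, disjoint segments."""
    rng = np.random.default_rng(rng)
    folds = np.array_split(rng.permutation(len(X)), num_folds)
    cv_err = []
    for k in k_values:
        fold_errs = []
        for i in range(num_folds):
            te = folds[i]                              # held-out segment
            tr = np.concatenate([folds[j] for j in range(num_folds) if j != i])
            d2 = ((X[te][:, None, :] - X[tr][None, :, :]) ** 2).sum(-1)
            nn = np.argsort(d2, axis=1)[:, :k]
            yhat = (y[tr][nn].mean(axis=1) > 0.5).astype(int)
            fold_errs.append(np.mean(yhat != y[te]))
        cv_err.append(np.mean(fold_errs))
    return np.array(cv_err)
```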

e. Besides the train-test style and n-fold cross-validation, we could also
use leave-one-out Cross-validation (LOOCV) to find the best k. LOOCV
means omitting each training case in turn, training the classifier model on
the remaining R − 1 datapoints, and testing on the omitted training case. When
you've done all points, report the mean error rate. Implement a LOOCV
function to choose k for our k-NN classifier. Here is the prototype of the
Matlab function you need to implement:

function [LoocvErrorRate] = knn_loocv(kArrayToTry, XData, YData);


29 If you want to understand more about cross-validation, please look at Andrew Moore's Cross-Validation
slides online: http://www-2.cs.cmu.edu/~awm/tutorials/overfit.html.

The dimensionality of all input parameters is the same as in part c.
LoocvErrorRate is a k × 1 column vector containing the LOOCV error rate for each
possible k.
Apply this function on the data set knn.data and report the performance curve
of LOOCV error rate vs. k. What is the best k you would choose according
to this curve?
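Since LOOCV is deterministic (there is no random split), it can be computed efficiently from one precomputed distance matrix whose diagonal is masked out, so that no point votes for itself. A Python sketch with illustrative names:

```python
import numpy as np

def knn_loocv(k_values, X, y):
    """Python analogue of the knn_loocv prototype: leave each point out,
    classify it from the remaining n - 1 points, average the 0/1 errors."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)       # a point may not be its own neighbor
    order = np.argsort(d2, axis=1)     # neighbors of each point, nearest first
    errs = []
    for k in k_values:
        yhat = (y[order[:, :k]].mean(axis=1) > 0.5).astype(int)
        errs.append(np.mean(yhat != y))
    return np.array(errs)
```

Because no randomness is involved, repeated calls return identical error rates, which is exactly the behavior the solution below points out for the LOOCV curve.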

f. Compare the four performance curves (from parts c, d and e). Make the
four curves together in one figure here. Can you draw some conclusion about
the difference between train-test, n-fold cross-validation and leave-one-out
cross-validation?

Note: We provide a Matlab file TestKnnMain.m to help you test the above
functions. You could download it from the course web site.

Solution:

b. There are many possible ways to handle this tie case. For example: i.
choose one of the classes; ii. use k − 1 neighbors to decide; iii. weighted k-NN,
etc.

c-f. We would get four curves roughly having a similar trend. The best
error rate is around 0.02. If you run the program several times, you will
find that the LOOCV curve is the same among multiple runs, because it
does not have randomness involved. The CV curves vary roughly around the
LOOCV curve. The train-test curve varies a lot among different runs.
But anyway, roughly, as k increases, the error rate increases. From the curves,
we can actually choose a small range of k (1−5) as our model selection result.

39. (k-NN and SVM: application on
xxx a facial attractiveness task)
• CMU, 2007 fall, Carlos Guestrin, HW3, pr. 3
xxx CMU, 2009 fall, Carlos Guestrin, HW3, pr. 3
In this question, you will explore how cross-validation can be used to fit ma-
gic parameters. More specifically, you'll fit the constant k in the k-Nearest
Neighbor algorithm, and the slack penalty C in the case of Support Vector
Machines.

Dataset

Download the file hw3_matlab.zip and unpack it. The file faces.mat contains
the Matlab variables traindata (training data), trainlabels (training labels),
testdata (test data), testlabels (test labels) and evaldata (evaluation data,
needed later).
This is a facial attractiveness classification task: given a picture of a face, you
need to predict whether the average rating of the face is hot or not. So, each
row corresponds to a data point (a picture). Each column is a feature, a pixel.
The value of the feature is the value of the pixel in a grayscale image.30 For
fun, try showface(evaldata(1,:)), showface(evaldata(2,:)), . . . .
cosineDistance.m implements the cosine distance, a simple distance function.
It takes two feature vectors x and y, and computes a nonnegative, symmetric
distance between x and y. To check your data, compute the distance between
the first training example from each class. (It should be 0.2617.)
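The exact formula inside cosineDistance.m is not reproduced here; a common definition consistent with the stated properties (nonnegative, symmetric) is one minus the cosine of the angle between the two feature vectors, sketched below in Python for reference.

```python
import numpy as np

def cosine_distance(x, y):
    """Cosine distance: 1 - cos(angle between x and y). Zero for parallel
    vectors, 1 for orthogonal ones, symmetric in its arguments."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return 1.0 - x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))
```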

A. k-NN
a. Implement the k-Nearest Neighbor (k-NN) algorithm in Matlab. Hint:
You might want to precompute the distances between all pairs of points, to
speed up the cross-validation later.

b. Implement n-fold cross-validation for k-NN. Your implementation should
partition the training data and labels into n parts of approximately equal size.

c. For k = 1, 2, . . . , 100, compute and plot the 10-fold (i.e., n = 10) cross-
validation error for the training data, the training error, and the test error.
How do you interpret these plots? Does the value of k which minimizes the
cross-validation error also minimize the test set error? Does it minimize the
training set error? Either way, can you explain why? Also, what does this
tell us about using the training error to pick the value of k?

B. SVM

d. Now download libsvm using the link from the course website and unpack it
to your working directory. It has a Matlab interface which includes binaries
for Windows. It can be used on OS X or Unix but has to be compiled (requires
g++ and make); see the README file from the libsvm zip package.
hw3_matlab.zip, which you downloaded earlier, contains the files testSVM.m (an
example demonstration script), trainSVM.m (for training) and classifySVM.m
30 This is an easier version of the dataset presented in Ryan White, Ashley Eden, Michael Maire, Automatic
Prediction of Human Attractiveness, CS 280 class report, December 2003, on the project website.

(for classification), which will show you how to use libsvm for training and
classifying using an SVM. Run testSVM. This should report a test error of
0.4333.
In order to train an SVM with slack penalty C on training set data with labels
labels, call
svmModel = trainSVM(data, labels, C)
In order to classify examples test, call
testLabels = classifySVM(svmModel, test)
Train an SVM on the training data with C = 500, and report the error on the
test set.

e. Now implement n-fold cross-validation for SVMs.

f. For C = 10, 10^2, 10^3, 10^4, 5 · 10^4, 10^5, 5 · 10^5, 10^6, compute and plot the 10-fold
(i.e., n = 10) cross-validation error for the training data, the training error,
and the test error, with the axis for C in log-scale (try semilogx).
How do you interpret these plots? Does the value of C which minimizes the
cross-validation error also minimize the test set error? Does it minimize the
training set error? Either way, can you explain why? Also, what does this
tell us about using the training error to pick the value of C?

5 Clustering

40. (Hierarchical clustering and K-means:
xxx application on the yeast gene expression dataset)
• ◦ CMU, 2010 fall, Carlos Guestrin, HW4, pr. 3
Now that you have been in the Machine Learning class for almost 2 months,
you have earned the first rank of number crunching machine learner. Recently,
you landed a job at the Nexus lab of Cranberry Melon University. Your first
task at the lab is to analyze gene expression data (which measures the levels
of genes in cells) for some mysterious yeast cells. You are given two datasets:
a set of 12 yeast genes in yeast1.txt, and a set of 52 yeast genes in yeast2.txt.
You are told that these genes are critical to solve the mystery of these cells.
You just learnt clustering, so you hope that this technique could help you
pinpoint groups of genes that may explain the mystery. The format of the
files is as follows: the first column lists an identifier for each gene, the second
column lists a common name for the gene and a description of its function,
and the remaining columns list expression values for the gene under various
conditions.
Your program should not use the gene descriptions when performing cluste-
ring. However, it may be informative to see them in the output. These genes
belong to four categories for which the genes in each should exhibit fairly
similar expression profiles.

a. Implement the agglomerative hierarchical clustering that you learnt in
class. You cannot use Matlab's linkage function or any function that com-
putes the linkage/tree. Use the Euclidean distance as the distance between
expression vectors of two genes. You only need to implement single linkage
clustering.
Your output should be a tab-delimited text file. Each line of the file describes
one internal node of the tree: the first column is the identifier of the first node,
the second column is the identifier of the second node, and the third column
is the linkage between the two nodes. Note that each agglomeration occurs
at a greater distance between clusters than the previous agglomeration. For
leaf nodes, use the gene name as the identifier and for internal nodes, use the
line number where the node was described.
To test your method, the output using single linkage on the small set is pro-
vided in single1.txt:

YKL145W YGL048C 2.391171


YFL018C YGR183C 3.383814
YLR038C YLR395C 3.461156
3 2 4.144297
4 1 4.163976
YOR369C YPL090C 4.328152
5 YDL066W 4.463093
6 YOR182C 4.837613
7 YPR001W 5.246656
9 YGR270W 5.373565
10 8 6.050942

Submit the code and the following output file single2.txt for single linkage clus-
tering of the big set.

b. From the output tree, we can get K clusters by cutting the tree at a cer-
tain threshold d. That is, any internal nodes with linkage greater than
d are discarded. The genes are clustered according to the remaining nodes.
Implement a function that outputs K clusters given the value K. Your func-
tion should find the threshold d automatically from the constructed tree. The
output file lists the genes belonging to each cluster. Each line of the file contains
two columns: the gene identifier (the first column in the original input file)
and the description (the second column). A blank line is used to separate the
clusters. For the tree in single1.txt, to get 2 clusters, we use the threshold
6.01 to cut the tree. The file 2single1.txt is an example output file as shown
here:

YPL090C RPS6A PROTEIN SYNTHESIS RIBOSOMAL PROTEIN S6A


YOR182C RPS30B PROTEIN SYNTHESIS RIBOSOMAL PROTEIN S30B
YOR369C RPS12 PROTEIN SYNTHESIS RIBOSOMAL PROTEIN S12

YPR001W CIT3 TCA CYCLE CITRATE SYNTHASE


YLR038C COX12 OXIDATIVE PHOSPHORYLATIO CYTOCHROME-C OXIDASE, SUBUNIT VIB
YGR270W YTA7 PROTEIN DEGRADATION 26S PROTEASOME SUBUNIT; ATPASE
YLR395C COX8 OXIDATIVE PHOSPHORYLATIO CYTOCHROME-C OXIDASE CHAIN VIII
YKL145W RPT1 PROTEIN DEGRADATION, UBI 26S PROTEASOME SUBUNIT
YGL048C RPT6 PROTEIN DEGRADATION 26S PROTEASOME REGULATORY SUBUNIT
YDL066W IDP1 TCA CYCLE ISOCITRATE DEHYDROGENASE (NADP+)
YFL018C LPD1 TCA CYCLE DIHYDROLIPOAMIDE DEHYDROGENASE
YGR183C QCR9 OXIDATIVE PHOSPHORYLATIO UBIQUINOL CYTOCHROME-C REDUCTASE SUBUNIT 9
Submit your code and the following tab-delimited output files: 2single2.txt,
4single2.txt, 6single2.txt: 2, 4 and 6 clusters using single linkage on the big
dataset.

c. Describe another way to get K clusters from the constructed tree. Try to
be as succinct as possible. Implement your method. Submit the code and the
3 output files 2user2.txt, 4user2.txt, 6user2.txt of running your method on
the big set to get 2, 4, 6 clusters respectively.

d. Implement K-means to cluster these genes. Make sure you use at least 10
random initializations. Submit the code and the 3 output files 2kmeans2.txt,
4kmeans2.txt, 6kmeans2.txt of running K-means on the big set to get 2, 4, 6
clusters respectively.

e. We can quantitatively compare the clusterings as follows. For each cluster
k, we can calculate the mean expression values of all its genes, which we call mk.
The residual sum of squares (RSS) is defined as

    RSS = \sum_i (x_i − m_{c_i})^⊤ (x_i − m_{c_i}),

where x_i is the gene expression of a gene i and c_i is its cluster number.

For K = 2, 4, 6 clusters, report the RSS of the method in part b, your proposed
method and K-means on the big dataset. Which method is better with respect
to RSS?
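The RSS definition above translates directly into code; a short Python sketch (illustrative, the graded version would be in Matlab):

```python
import numpy as np

def rss(X, labels):
    """Residual sum of squares: for each cluster compute its mean m_k,
    then sum (x_i - m_{c_i})^T (x_i - m_{c_i}) over all points i."""
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        m = members.mean(axis=0)       # cluster mean m_k
        total += ((members - m) ** 2).sum()
    return total
```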

f. Qualitatively compare the result of getting 6 clusters using the method in
part b and your proposed method with K-means on the big dataset. What do
you observe? Hint: The gene descriptions may give you some clues on what
these genes do in the cells.

Solution:

a. The code is available online in hclust.m and 2single.txt.

b. The code is available online in cuttree.m.
c. One way is to cut the K longest internal edges in the tree. The length of
an internal edge = the increase in linkage when the cluster is combined in
the next step. This tells how far apart this cluster is from the neighboring
cluster. The code is available online in cuttree2.m.

d. The code is available online in kmean.m.

e. The code is available in calcRSS.m. The RSS is reported in the following
table. K-means performs best in terms of RSS.

RSS:       K-means      CutTree (part b)   CutTree (user)
K = 2      839.2200     1.6288e+03         1.6583e+03
K = 4      638.7983     845.6614           1.6498e+03
K = 6      526.3841     854.7895           1.6455e+03

f. By looking at the gene annotations, which include a short description of
the gene function, we see that the method in part b provides clusters with
more coherent sets of genes.

41. (K-means: application to image compression)

•◦ Stanford, 2012 spring, Andrew Ng, HW9

In this exercise, you will use K-means to compress an image by reducing the
number of colors it contains. To begin, download ex9Data.zip and unpack its
contents into your Matlab/Octave working directory.

Photo credit: The bird photo used in this exercise belongs to Frank Wouters
and is used with his permission.

Image Representation

The data pack for this exercise contains a 538-pixel by 538-pixel TIFF image
named bird_large.tiff. It looks like the picture below.

In a straightforward 24-bit color representation of this image, each pixel is
represented as three 8-bit numbers (ranging from 0 to 255) that specify red,
green and blue intensity values. Our bird photo contains thousands of colors,
but we'd like to reduce that number to 16. By making this reduction, it would
be possible to represent the photo in a more efficient way by storing only the
RGB values of the 16 colors present in the image.

In this exercise, you will use K-means to reduce the color count to K = 16.
That is, you will compute 16 colors as the cluster centroids and replace each
pixel in the image with its nearest cluster centroid color.
Because computing cluster centroids on a 538 × 538
image would be time-consuming on a desktop compu-
ter, you will instead run K-means on the 128 × 128 image
bird_small.tiff.

Once you have computed the cluster centroids on the small image, you will
then use the 16 colors to replace the pixels in the large image.

K-means in Matlab/Octave

In Matlab/Octave, load the small image into your program with the following
command:

A = double(imread('bird_small.tiff'));

This creates a three-dimensional matrix A whose first two indices identify
a pixel position and whose last index represents red, green, or blue. For
example, A(50, 33, 3) gives you the blue intensity of the pixel at position y =
50, x = 33. (The y-position is given first, but this does not matter so much
in our example because the x and y dimensions have the same size.)

Your task is to compute 16 cluster centroids from this image, with each cen-
troid being a vector of length three that holds a set of RGB values. Here is
the K-means algorithm as it applies to this problem:

K-means algorithm

1. For initialization, sample 16 colors randomly from the original small picture.
These are your K means µ1, µ2, . . . , µK.
2. Go through each pixel in the small image and calculate its nearest mean:

    c^{(i)} = \arg\min_j \| x^{(i)} - \mu_j \|^2

3. Update the values of the means based on the pixels assigned to them:

    \mu_j = \frac{\sum_i 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_i 1\{c^{(i)} = j\}}

4. Repeat steps 2 and 3 until convergence. This should take between 30 and
100 iterations. You can either run the loop for a preset maximum number of
iterations, or you can decide to terminate the loop when the locations of the
means are no longer changing by a significant amount.

Note: In Step 3, you should update a mean only if there are pixels assigned to
it. Otherwise, you will see a divide-by-zero error. For example, it's possible
that during initialization, two of the means will be initialized to the same color
(i.e., black). Depending on your implementation, all of the pixels in the photo
that are closest to that color may get assigned to one of the means, leaving
the other mean with no assigned pixels.
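Steps 1-4, including the guard against empty clusters from the Note, can be sketched as follows (illustrative Python on an (n, 3) array of RGB rows; the exercise itself asks for Matlab/Octave):

```python
import numpy as np

def kmeans_colors(pixels, k=16, iters=100, rng=0):
    """K-means on RGB rows: sample k pixels as initial means, then
    alternate nearest-mean assignment (step 2) and mean updates (step 3)."""
    rng = np.random.default_rng(rng)
    means = pixels[rng.choice(len(pixels), k, replace=False)].astype(float)
    for _ in range(iters):
        d2 = ((pixels[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        c = d2.argmin(axis=1)                      # step 2: nearest mean
        new = means.copy()
        for j in range(k):
            if np.any(c == j):                     # step 3, guarding the
                new[j] = pixels[c == j].mean(0)    # divide-by-zero case
        if np.allclose(new, means):                # step 4: convergence
            break
        means = new
    return means, c
```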

Reassigning colors to the large image

After K-means has converged, load the large image into your program and
replace each of its pixels with the nearest of the centroid colors you found
from the small image.
When you have recalculated the large image, you can display and save it in
the following way:
imshow(uint8(round(large_image)))
imwrite(uint8(round(large_image)), 'bird_kmeans.tiff');
When you are finished, compare your image to the one in the solutions.

Solution:

Here are the 16 colors appearing in the image:

© 2010-2012 Andrew Ng, Stanford University. All rights reserved.


42. (K-means: how to select K
xxx and the initial centroids (the K-means++ algorithm);
xxx the importance of scaling the data across different dimensions)
•◦ CMU, 2012 fall, E. Xing, A. Singh, HW3, pr. 1
In K-means clustering, we are given points x1, . . . , xn ∈ R^d and an integer K > 1,
and our goal is to minimize the within-cluster sum of squares (also known as
the K-means objective)

    J(C, L) = \sum_{i=1}^{n} \| x_i - C_{l_i} \|^2,

where C = (C1, . . . , CK) are the cluster centers (Cj ∈ R^d), and L = (l1, . . . , ln) are
the cluster assignments (li ∈ {1, . . . , K}).
Finding the exact minimum of this function is computationally difficult. The
most common algorithm for finding an approximate solution is Lloyd's algo-
rithm, which takes as input the set of points and some initial cluster centers
C, and proceeds as follows:

i. Keeping C fixed, find cluster assignments L to minimize J(C, L). This step
only involves finding nearest neighbors. Ties can be broken using arbitrary
(but consistent) rules.

ii. Keeping L fixed, find C to minimize J(C, L). This is a simple step that only
involves averaging points within a cluster.

iii. If any of the values in L changed from the previous iteration (or if this was
the first iteration), repeat from step i.

iv. Return C and L.

The initial cluster centers C given as input to the algorithm are often picked
randomly from x1, . . . , xn. In practice, we often repeat multiple runs of Lloyd's
algorithm with different initializations, and pick the best resulting clustering
in terms of the K-means objective. You're about to see why.

a. Briefly explain why Lloyd's algorithm is always guaranteed to converge
(i.e., stop) in a finite number of steps.

b. Implement Lloyd's algorithm. Run it until convergence 200 times, each
time initializing using K cluster centers picked at random from the set {x1, . . . , xn},
with K = 5 clusters, on the 500 two-dimensional data points in . . .. Plot in
a single figure the original data (in gray), and all 200 × 5 cluster centers (in
black) given by each run of Lloyd's algorithm. You can play around with
the plotting options such as point sizes so that the cluster centers are clearly
visible. Also compute the minimum, mean, and standard deviation of the
within-cluster sums of squares for the clusterings given by each of the 200
runs.

c. K-means++ is an initialization algorithm for K-means proposed by David
Arthur and Sergei Vassilvitskii in 2007:

i. Pick the first cluster center C1 uniformly at random from the data x1, . . . , xn.
In other words, we first pick an index i uniformly at random from {1, . . . , n},
then set C1 = xi.

ii. For j = 2, . . . , K:
• For each data point, compute its distance Di to the nearest cluster center
picked in a previous iteration:

    D_i = \min_{j' = 1, \dots, j-1} \| x_i - C_{j'} \|.

• Pick the cluster center Cj at random from x1, . . . , xn with probabilities
proportional to D_1^2, . . . , D_n^2. Precisely, we pick an index i at random from
{1, . . . , n} with probabilities equal to D_1^2 / (\sum_{i'=1}^{n} D_{i'}^2), . . . , D_n^2 / (\sum_{i'=1}^{n} D_{i'}^2),
and set Cj = xi.

iii. Return C as the initial cluster centers for Lloyd's algorithm.
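The seeding steps i-iii above can be sketched compactly (illustrative Python; variable names are not from the assignment):

```python
import numpy as np

def kmeanspp_init(X, K, rng=0):
    """K-means++ seeding: first center uniform at random, each later
    center drawn with probability proportional to the squared distance
    to its nearest already-chosen center."""
    rng = np.random.default_rng(rng)
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, K):
        # D_i^2: squared distance to the nearest existing center
        d2 = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

Note that a point coinciding with an existing center has D_i = 0 and can never be picked again, which is what spreads the initial centers out.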

Replicate the figure and calculations in part b using K-means++ as the initi-
alization algorithm, instead of picking C uniformly at random.31

Picking the number of clusters K is a difficult problem. Now we will see one
of the most common heuristics for choosing K in action.

d. Explain how the exact minimum of the K-means objective behaves on any
data set as we increase K from 1 to n.

A common way to pick K is as follows. For each value of K in some range
(e.g., K = 1, . . . , n, or some subset), we find an approximate minimum of the K-
means objective using our favorite algorithm (e.g., multiple runs of randomly
initialized Lloyd's algorithm). Then we plot the resulting values of the K-
means objective against the values of K.

Often, if our data set is such that there exists a natural value for K,
we see a knee in this plot, i.e., a value for K where the rate at which
the within-cluster sum of squares is decreasing sharply reduces. This
suggests we should use the value for K where this knee occurs. In the
toy example in the nearby figure, this value would be K = 6.

e. Produce a plot similar to the one in the above figure for K = 1, . . . , 15 using
the data set in part b, and show where the knee is. For each value of K, run
K-means with at least 200 initializations and pick the best resulting clustering
(in terms of the objective) to ensure you get close to the global minimum.

f. Repeat part e with the data set in . . .. Find 2 knees in the resulting plot
(you may need to plot the square root of the within-cluster sum of squares
instead, in order to make the second knee obvious). Explain why we get 2
knees for this data set (consider plotting the data to see what's going on).
31 Hopefully your results make it clear how sensitive Lloyd's algorithm is to initializations, even in such a
simple, two-dimensional data set!

We conclude our exploration of K-means clustering with the critical impor-
tance of properly scaling the dimensions of your data.

g. Load the data in . . .. Perform K-means clustering on this data with K = 2
with 500 initializations. Plot the original data (in gray), and overplot the 2
cluster centers (in black).

h. Normalize the features in this data set, i.e., first center the data to have
mean 0 in every dimension, then rescale each dimension to have unit variance.
Repeat part g with this modified data.
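The normalization asked for in part h is a two-step transform; a small Python sketch (illustrative, with a guard for constant dimensions that the exercise does not mention):

```python
import numpy as np

def normalize_features(X):
    """Center each dimension to mean 0, then rescale it to unit variance.
    Constant dimensions are left centered but unscaled, as an assumption."""
    X = np.asarray(X, float)
    Z = X - X.mean(axis=0)       # step 1: mean 0 in every dimension
    s = Z.std(axis=0)
    s[s == 0] = 1.0              # avoid dividing a constant column by zero
    return Z / s                 # step 2: unit variance per dimension
```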

As you can see, the results are radically different. You should not take this to
mean that data should always be normalized. In some problems, the relative
values of the dimensions are meaningful and should be preserved (e.g., the co-
ordinates of earthquake epicenters in a region). But in others, the dimensions
are on entirely different scales (e.g., age in years vs. income in thousands of
dollars). Proper pre-processing of data for clustering is often part of the art
of machine learning.

Solution:

a. The cluster assignments L can take finitely many values (K^n, to be precise).
The cluster centers C are uniquely determined by the assignments L, so after
executing step ii, the algorithm can be in finitely many possible states. Thus
either the algorithm stops in finitely many steps, or at least one value of
L is repeated more than once in non-consecutive iterations. However, the
latter case is not possible, since after every iteration we have J(C(t), L(t)) ≥
J(C(t + 1), L(t + 1)), with equality only when L(t) = L(t + 1), which coincides
with the termination condition. (Note that this statement depends on the
assumption that the tie-breaking rule used in step i is consistent; otherwise
infinite loops are possible.)

b. Minimum: 222.37, mean: 249.66, standard deviation: 65.64.
Plot in the nearby figure.
R code: see file kmeans.r on the course web page.

c. Minimum: 222.37, mean: 248.33, standard deviation: 64.96.
Plot in the nearby figure.
R code: see file kmeans++.r on the course web page.

d. The exact minimum decreases (or stays the same) as K increases, because
the set of possible clusterings for K is a subset of the possible clusterings for
K + 1. With K = n, the objective of the optimal solution is 0 (every point is
in its own cluster, and has 0 distance to the cluster center).

e. Plot in the nearby figure. The knee is at K = 5.

f. Plot in the nearby figure (square root of objective plotted).
The knees are at K = 3 and K = 9. There are two knees because the
data are composed of 3 natural clusters, each of which can further be
divided into 3 smaller clusters.

g., h. Plots in the nearby figures.

43. (EM/GMM: implementation in Matlab
xxx and application on data from R^1)
•· CMU, 2010 fall, Aarti Singh, HW4, pr. 2.3-5
a. Implement the EM/GMM algorithm using the update equations derived
in exercise 39 of the Clustering chapter in Ciortuz et al's book.

b. Download the data set from . . .. Each row of this file is a training instance
xi. Run your EM/GMM implementation on this data, using µ = [1, 2] and
θ = [.33, .67] as your initial parameters. What are the final values of µ and θ?
Plot a histogram of the data and your estimated mixture density P(X). Is the
mixture density an accurate model for the data?
To plot the density in Matlab, you can use:

density = @(x) (<class 1 prior> * normpdf(x, <class 1 mean>, 1)) + ...
    (<class 2 prior> * normpdf(x, <class 2 mean>, 1));
fplot(density, [-5, 6]);

Recall from class that EM attempts to maximize the marginal data loglikelihood
ℓ(µ, θ) = \sum_{i=1}^{n} \log P(X = x^i; µ, θ), but that EM can get stuck in local optima. In this
part, we will explore the shape of the loglikelihood function and determine if local
optima are a problem. For the remainder of the problem, we will assume that both
classes are equally likely, i.e., θ_y = 1/2 for y = 0, 1. In this case, the data loglikelihood
ℓ only depends on the mean parameters µ.

c. Create a contour plot of the loglikelihood ℓ as a function of the two mean para-
meters, µ. Vary the range of each µ_k from −1 to 4, evaluating the loglikelihood at
intervals of .25. You can create a contour plot in Matlab using the contourf function.
Print out your plot and include it with your solution.

Does the loglikelihood have multiple local optima? Is it possible for EM to find a
non-globally optimal solution? Why or why not?

44. (K-means and EM/GMM:
xxx comparison on data from R^2)

•· CMU, 2010 spring, E. Xing, T. Mitchell, A. Singh, HW3, pr. 3

Clustering means partitioning your data into natural groups, usually because you
suspect points in a cluster have something in common. The EM algorithm and K-
means are two common algorithms (there are many others). This problem will have
you implement these algorithms, and explore their limitations.

The datasets for you to use are available online, along with a Matlab script for
loading them. [Ask me if you're having any trouble with it.] You can use any
language for your implementations, but you may not use libraries which already
implement these algorithms (you can, however, use fancy built-in mathematical
functions, like Matlab or Mathematica provide).

a. In K-means clustering, the goal is to pick your clusters such that you minimize
the sum, over all points x, of |x − c_x|^2, where c_x is the mean of the cluster containing
x. [This should remind you of least-squares line fitting.] K-means clustering is NP-
hard, but in practice this algorithm, also called Lloyd's algorithm, works extremely
well.

Implement Lloyd's algorithm, and apply it to the datasets provided. Plot each
dataset, indicating for each point which cluster it was placed in. How well do you
think K-means did for each dataset? Explain, intuitively, what (if anything) went
badly and why.

b. A disadvantage of K-means is that the clusters cannot overlap at all. The
Expectation-Maximization algorithm deals with this by only probabilistically assig-
ning points to clusters.

The thing to understand about the EM algorithm is that it's a special case of MLE;
you have some data, you assume a parameterized form for the probability distri-
bution (a mixture of Gaussians is, after all, an exotic parameterized probability
distribution), and then you pick the parameters to maximize the probability of your
data. But the usual MLE approach, solving ∂P(X|θ)/∂θ = 0, isn't tractable, so we use
the iterative EM algorithm to find θ. The EM algorithm is guaranteed to converge
to a local optimum (I'm resisting the temptation to make you prove this :) ).

Implement the EM algorithm, and apply it to the datasets provided. Assume that
the data is a mixture of two Gaussians; you can assume equal mixing ratios. What
parameters do you get for each dataset? Plot each dataset, indicating for each point
which cluster it was placed in.

c. Modeling dataset 2 as a mixture of Gaussians is unrealistic, but the EM algorithm
still gives an answer. Is there anything fishy about your answers which suggests
something is wrong?

We usually do the EM algorithm with mixed Gaussians, but you can use any dis-
tributions; a Gaussian and a Laplacian, three exponentials, etc. Write down the
formula for a parameterized probability density suitable for modeling ring-shaped
clusters in 2D; don't let the density be 0 anywhere. You don't need to work out the
EM calculations for this density, but you would if this came up in your research.

d. With high-dimensional data we cannot perform visual checks, and problems can
go unnoticed if we assume nice round, filled clusters. Describe in words a clustering
algorithm which works even for weirdly-shaped clusters with unknown mixing ratios.
However, you can assume that the clusters do not overlap at all, and that you have
a LOT of training data. Discuss the weaknesses of your algorithm. Don't work out
the details for this problem; just convince me that you know the basic idea and
understand its limitations.

45. (EM for mixtures of Gaussians
xxx with independent components (along axes):
xxx application to handwritten digit recognition)

• CMU, 2012 spring, Ziv Bar-Joseph, HW4, pr. 3.2

In this problem we will be implementing Gaussian mixture models and working
with the digits data set. The provided data set is a Matlab file consisting of 5000
10×10-pixel handwritten digits between 0 and 9. Each digit is a grayscale image
represented as a 100-dimensional row vector (the images have been downsampled
from the original 28×28-pixel images). The variable X is a 5000×100 matrix and the
vector Y contains the true number for each image. Please submit your code and
include in your write-up a copy of the plots that you generated for this problem.

a. Implement the Expectation-Maximization (EM) algorithm for the axis-aligned
Gaussian mixture model. Recall that the axis-aligned Gaussian mixture model
uses the Gaussian Naive Bayes assumption that, given the class, all features are
conditionally independent Gaussians. The specific form of the model is given below:

    Z_i ∼ Categorical(p_1, . . . , p_K)

    X_i | Z_i = z ∼ N(µ^z, Σ^z), where µ^z = (µ_1^z, . . . , µ_d^z)^⊤ and
    Σ^z = diag((σ_1^z)^2, (σ_2^z)^2, . . . , (σ_d^z)^2).
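As a reference point for the structure of the E- and M-steps (not the graded Matlab implementation, and with illustrative initialization choices), the axis-aligned model admits a compact sketch: the E-step evaluates per-component log densities in log-space for stability, and the M-step re-estimates the mixing weights, per-dimension means, and per-dimension variances from the responsibilities.

```python
import numpy as np

def em_diag_gmm(X, K, iters=50, rng=0):
    """EM for a mixture of axis-aligned (diagonal-covariance) Gaussians.
    Returns mixing weights p, means mu (K x d), variances var (K x d),
    and the final responsibilities r (n x K)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    p = np.full(K, 1.0 / K)                        # mixing weights p_z
    mu = X[rng.choice(n, K, replace=False)]        # means mu^z: random points
    var = np.full((K, d), X.var(axis=0) + 1e-6)    # variances (sigma_j^z)^2
    for _ in range(iters):
        # E-step: log p_z + log N(x_i | mu^z, diag(var^z)) for every (i, z)
        logp = (np.log(p)[None, :]
                - 0.5 * np.log(2 * np.pi * var).sum(1)[None, :]
                - 0.5 * (((X[:, None, :] - mu[None, :, :]) ** 2)
                         / var[None, :, :]).sum(-1))
        logp -= logp.max(axis=1, keepdims=True)    # stabilize before exp
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)          # responsibilities
        # M-step: weighted counts, means, and per-dimension variances
        nk = r.sum(axis=0)
        p = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ (X ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return p, mu, var, r
```

The small additive constant on the variances is an assumption made here to keep components from collapsing onto single points; the exercise itself leaves that choice to you.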

b. Run EM to fit a Gaussian mixture model with 16 Gaussians on the digits data.
Plot each of the means, using subplot(4, 4, i) to save paper.

c. Evaluating clustering performance is difficult. However, because we have infor-
mation about the ground truth data, we can roughly assess clustering performance.
One possible metric is to label each cluster with the majority label for that cluster
using the ground truth data. Then, for each point we predict the cluster label and
measure the mean 0/1 loss. For the digits data set, report your loss for settings
k = 1, 10 and 16.

46. (EM for mixtures of multi-variate Gaussians:
xxx application to stylus-written digit recognition)

• CMU, 2001 fall, Tommi Jaakkola, HW4, pr. 1

47. (Applying the EM algorithm [LC: for GMM]
xxx to document clustering,
xxx using the WEKA system)

• Edinburgh, Chris Williams and Victor Lavrenko
xxx Introductory Applied Machine Learning course, 3 Nov. 2008

A. Description of the dataset

This assignment is based on the 20 Newsgroups Dataset.32 This dataset is a col-
lection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly
across 20 different newsgroups, each corresponding to a different topic. Some of
the newsgroups are very closely related to each other (e.g. comp.sys.ibm.pc.hardware,
comp.sys.mac.hardware), while others are highly unrelated (e.g. misc.forsale, soc.religion.christian).

There are three versions of the 20 Newsgroups Dataset. In this assignment we will
use the bydate Matlab version in which documents are sorted by date into trai-
ning (60%) and test (40%) sets, newsgroup-identifying headers are dropped and
duplicates are removed. This collection comprises roughly 61,000 different words,
which results in a bag-of-words representation with frequency counts. More specifi-
cally, each document is represented by a 61,000-dimensional vector that contains the
counts for each of the 61,000 different words present in the respective document.

To save you time and to make the problem manageable with limited computational
resources, we preprocessed the original dataset. We will use documents from only 5
out of the 20 newsgroups, which results in a 5-class problem. More specifically the 5
classes correspond to the following newsgroups: 1: alt.atheism, 2: comp.sys.ibm.pc.hardware,
3: comp.sys.mac.hardware, 4: rec.sport.baseball and 5: rec.sport.hockey. However, note here
that classes 2-3 and 4-5 are rather closely related. Additionally, we computed the
mutual information of each word with the class attribute and selected the 520 words
out of 61,000 that had highest mutual information. Therefore, our dataset is a N ×
520 dimensional matrix, where N is the number of documents.

The resulting representation is much more compact and can be used directly to
perform our experiments in WEKA. There is, however, a potential caveat: The
preprocessed dataset has been prepared by a busy and heavily underpaid teaching
assistant who might have been a bit careless when preparing the dataset. You
should keep this in mind and be aware of anomalies in the data when answering the
questions below.
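The word-selection step mentioned above (mutual information of each word with the class attribute) can be sketched as follows for binary word-presence features. This is our own illustrative code, not the course's preprocessing script, which worked with frequency counts:

```python
import numpy as np

def word_class_mi(X_bin, y):
    """Mutual information I(word; class), in bits, for each binary
    word-presence feature. X_bin: (n_docs, n_words) 0/1 matrix,
    y: class ids. Illustrative sketch only."""
    X_bin = np.asarray(X_bin, float)
    y = np.asarray(y)
    p_w1 = X_bin.mean(axis=0)                      # P(W = 1)
    mi = np.zeros(X_bin.shape[1])
    for c in np.unique(y):
        mask = (y == c)
        p_c = mask.mean()                          # P(C = c)
        p_w1_c = X_bin[mask].mean(axis=0) * p_c    # P(W = 1, C = c)
        p_w0_c = p_c - p_w1_c                      # P(W = 0, C = c)
        for p_joint, p_w in ((p_w1_c, p_w1), (p_w0_c, 1.0 - p_w1)):
            nz = p_joint > 1e-12                   # skip empty cells
            mi[nz] += p_joint[nz] * np.log2(p_joint[nz] / (p_w[nz] * p_c))
    return mi
```

Selecting the 520 best words is then a matter of taking the indices of the 520 largest entries of the returned vector.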

B. Clustering

We are interested in clustering the newsgroups documents using the EM algorithm.
The most common measure to evaluate the resulting clusters is the log-likelihood
of the data. We will additionally use the Classes to Clusters evaluation which
is straightforward to perform in WEKA, and look at the percentage of correctly
clustered instances. Note here that the data likelihood computed during EM is
a probability density (NOT a probability mass function) and therefore the log-
likelihood can be greater than 0. Use the train_20news_clean_best_tdf.arff dataset
and the default seed (100) to train the clusterers.

a. First, train and evaluate an EM clusterer with 5 clusters (you need to change the
numClusters option) using the Classes to Clusters evaluation option. Report the
log-likelihood and write down the percentage of correctly clustered instances (PC),
you will need it in question b. Look at the Classes to Clusters confusion matrix.
Do the clusters correspond to classes? Which classes are more confused with each
other? Interpret your results. Keep the result buffer for the clusterer, you will need
it in question c.

32 http://people.csail.mit.edu/jrennie/20Newsgroups/

HINT: WEKA outputs the percentage of incorrectly classified instances.

b. Now, train and evaluate different EM clusterers using 3, 4, 6 and 7 clusters and
the Classes to Clusters evaluation option. Tabulate the PC as a function of the
number of clusters, include the PC for 5 clusters from the previous question. What
do you notice? Why do you think we get higher PC for 3 clusters than for 4? Keep
the result buffers for all the clusterers, you will need them in question c.

c. Re-evaluate the five clusterers using the validation set val_20news_best_tdf.arff.
Tabulate the log-likelihood on the validation set as a function of the number of
clusters. If the dataset was unlabeled, how many clusters would you choose to model
it in light of your results? Is it safe to make this decision based on experiments with
only one random seed? Why?

HINT: To re-evaluate the models, first select the Supplied test set option and choose
the appropriate dataset. Then right-click on the model and select Re-evaluate model
on current test set.

d. Now consider the model with 5 clusters learned using EM. After EM converges,
each cluster is described in terms of the mean and standard deviation for each of
the 500 attributes computed from the documents assigned to the respective cluster.
Since the attributes are the normalized tf-idf weights for each word in a document,
the mean vectors learned by EM correspond to the tf-idf weights for each word in
each cluster.

For each of the 5 clusters, we selected the 20 attributes with the highest mean
values. Open the file cluster_means.txt. The 20 attributes for each cluster are displayed
columnwise together with their corresponding mean value. By looking at the words
with the highest tf-idf weights per cluster, which column (cluster) would you assign
to each class (newsgroup topic) and why? Which two clusters are closest to each
other? Imagine that we want to assign a new document to one of the clusters and
that the document contains only the words "pitching" and "hit". Would this be an
easy task for the clusterer? What about a document that contains only the words
"drive" and "mac"? Write down three examples of 2-word documents that would be
difficult test cases for the clusterer.

48. (EM for GMM: application on
xxx the yeast gene expression dataset)
• ◦ CMU, 2004 fall, Carlos Guestrin, HW2, pr. 3
In this problem you will implement a Gaussian mixture model algorithm and will
apply it to the problem of clustering gene expression data. Gene expression measures
the levels of messenger RNA (mRNA) in the cell. The data you will be working
with is from a model organism called yeast, and the measurements were taken to
study the cell cycle system in that organism. The cell cycle system is one of the
most important biological systems, playing a major role in development and cancer.
All implementation should be done in Matlab. At the end of each sub-problem where
you need to implement a new function we specify the prototype of the function.
The file alphaVals.txt contains 18 time points (every 7 minutes from 0 to 119)
measuring the log expression ratios of 745 cycling genes. Each row in this file
corresponds to one of the genes. The file geneNames.txt contains the names of these
genes. For some of the genes, we are missing some of their values due to problems
with the microarray technology (the tools used to measure gene expression). These
cases are represented by values greater than 100.

a. Implement (in Matlab) an EM algorithm for learning a mixture of five (18-
dimensional) Gaussians. It should learn means, covariance matrices and weights for
each of the Gaussians. You can assume, however, independence between the different
data points [LC, correct: features/attributes], resulting in a diagonal covariance
matrix. How can you deal with the missing data? Why is this correct?
Plot the centers identified for each of the five classes. Each center should be plotted
as a time-series of 18 time points.
Here is the prototype of the Matlab function you need to implement:

function [mu, s, w] = emcluster(x, k, ploton);

where

− x is input data, where each row is an 18-dimensional sample. Values above 100
represent missing values;

− k is the number of desired clusters;

− ploton is either 1 or 0. If 1, then before returning the function plots the log-likelihood
of the data after each EM iteration (the function will have to store the log-likelihood
of the data after each iteration, and then plot these values as a function of iteration
number at the end). If 0, the function does not plot anything;

− s is a k by 18 matrix, with each row being the diagonal elements of the corresponding
covariance matrix;

− w is a column vector of size k, where w(i) is the weight for the i-th cluster.

The function outputs mu, a matrix with k rows and 18 columns (each row is a center
of a cluster).
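One standard way to handle the missing entries (the values above 100) follows from the diagonal-covariance assumption: the missing coordinates can simply be marginalized out of each component density, so each sample contributes a Gaussian likelihood over its observed dimensions only. A sketch, in Python rather than the required Matlab (names are ours):

```python
import numpy as np

def log_gauss_missing(x, mu, var, missing_thresh=100):
    """Log-density of one sample under a diagonal Gaussian, using only
    the observed coordinates (entries > missing_thresh are treated as
    missing, as in the data format above). Marginalizing the missing
    dimensions out is exact here because the covariance is diagonal."""
    obs = x <= missing_thresh          # observed coordinates
    d = x[obs] - mu[obs]
    return -0.5 * np.sum(d ** 2 / var[obs] + np.log(2 * np.pi * var[obs]))
```

The E-step then uses these per-sample log-densities; in the M-step, sufficient statistics for each dimension are accumulated only over samples where that dimension is observed.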

b. How many more parameters would you have had to assign if we remove the
independence assumption above? Explain.

c. Suggest and implement a method for determining the number of Gaussians (or
classes) that are the most appropriate for this data. Please confine the set of choices
to values in between 2 and 7. (Hint: The method can use an empirical evaluation
of clustering results for each possible number of classes). Explain the method.
Here is the prototype of the Matlab function you need to implement:
function [k, mu, s, w] = clust(x);

where

− x is input data, where each row is an 18-dimensional sample. Once again values
above 100 represent missing values;

− k is the number of classes selected by the function;

− mu, s and w are defined as in part a.

d. Use the Gaussians determined in part c to perform hard clustering of your data
by finding, for each gene i, the Gaussian j that maximizes the likelihood P (i|j). Use
the function printSelectedGenes.m to write the names of the genes in each of the
clusters to a separate file.

Here is the prototype of the Matlab function you need to implement:

function [c] = hardclust(x, k, mu, s, w);

where

− x is defined as before;
− k, mu, s, w are the output variables from the function written in part c and are
therefore defined there;

− c is a column vector of the same length as the number of rows in x. For each row,
it should indicate the cluster the corresponding gene belongs to.

The function should also write out files as specified above. The filenames should be:
clust1, clust2, . . . , clustk.

e. Use compSigClust.m to perform the statistical significance test (everything is
already implemented here, so just use the function). Hand in a printout with the
top three categories for each cluster (this is the output of compSigClust.m).

Solution:

a. We have put a student code online. The implementation is pretty clear in terms of
each step of the GMM iteration. The plot of the log-likelihood should be increasing.
The plots of the centers of each cluster should look like a sinusoid shape though
with different phases (starting at a different point in the time series).

b. The number of clusters times the number of covariances, which is

k((d − 1) + (d − 2) + . . . + 1) = k d(d − 1)/2,

where d = 18 in our case.

c. This is essentially a model selection question. You could use different model
selection methods to solve it: cross-validation, train-test, minimum description length,
BIC.
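As one concrete option, BIC for the diagonal-covariance mixture can be computed as below (our own illustrative Python; the free-parameter count is k·d means plus k·d variances plus k − 1 mixing weights, and with the −2·loglik convention lower scores are better):

```python
import numpy as np

def bic_diag_gmm(loglik, k, d, n):
    """BIC score for a k-component diagonal-covariance GMM in d
    dimensions fitted to n samples. Lower is better here."""
    n_params = k * d + k * d + (k - 1)   # means + variances + weights
    return -2.0 * loglik + n_params * np.log(n)

def choose_k(logliks, d, n, ks=range(2, 8)):
    """Pick k in 2..7 minimizing BIC, given the fitted log-likelihood
    for each candidate k (sketch of one model-selection strategy)."""
    scores = {k: bic_diag_gmm(ll, k, d, n) for k, ll in zip(ks, logliks)}
    return min(scores, key=scores.get)
```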

d. For each data point, assign the cluster that has the maximum probability for this
point.

e. Just run the code we provided on the cluster files you got above.


49. (EM for Bernoulli MM, using the Naive Bayes assumption,
xxx and a penalty term;
xxx application to handwritten digit recognition)
• ◦ U. Toronto, Radford Neal,
xxx Statistical Methods for Machine Learning and Data Mining course,
xxx 2014 spring, HW 2

In this assignment, you will classify handwritten digits with mixture models fitted
by maximum penalized likelihood using the EM algorithm. The data you will use
consists of 800 training images and 1000 test images of handwritten digits (from
US zip codes). We derived these images from the well-known MNIST dataset, by
randomly selecting images from the total 60000 training cases provided, reducing
the resolution of the images from 28 × 28 to 14 × 14 by averaging 2 × 2 blocks of pixel
values, and then thresholding the pixel values to get binary values. A data file with
800 lines each containing 196 pixel values (either 0 or 1) is provided on the webpage
associated with this book. Another file containing the labels for these 800 digits (0 to
9) is also provided. Similarly, there is a file with 1000 test images, and another file
with the labels for these 1000 test images. You should look at the test labels only
at the very end, to see how well the methods do.
In this assignment, you should try to classify these images of digits using a generative
model, from which you can derive the probabilities of the 10 possible classes given
the observed image of a digit. You should guess that the class for a test digit is the
one with highest probability (i.e., we will use a loss function in which all errors are
equally bad).
The generative model we will use estimates the class probabilities by their frequ-
encies in the training set (which will be close to, but not exactly, uniform over the
10 digits) and estimates the probability distributions of images within each class by
mixture models with K components, with each component modeling the 196 pixel
values as being independent. It will be convenient to combine all 10 of these mixture
models into a single mixture model with 10K components, which model both the
pixel values and the class label. The probabilities for class labels in the components
will be fixed, however, so that K components give probability 1 to digit 0, K com-
ponents give probability 1 to digit 1, K components give probability 1 to digit 2,
etc.
The model for the distribution of the label, yi , and pixel values xi,1 , . . . , xi,196 , for
digit i is therefore as follows:

P (yi , xi ) = Σ_{k=1}^{10K} πk qk,yi Π_{j=1}^{196} (θk,j )^{xi,j} (1 − θk,j )^{1−xi,j}

The data items, (yi , xi ), are assumed to be independent for different cases i. The
parameters of this model are the mixing proportions, π1 , . . . , π10K , and the probabi-
lities of pixels being 1 for each component, θk,j for k = 1, . . . , 10K and j = 1, . . . , 196.
The probabilities of class labels for each component are fixed, as

qk,y = 1 if k ∈ {Ky + 1, . . . , Ky + K}, and qk,y = 0 otherwise,

for k = 1, . . . , 10K and y = 0, . . . , 9.


You should write an R fun tion to try to nd the parameter values that maximize the
log-likelihood from the training data plus a penalty. (Note that with this penalty

134
higher values are better.) The EM algorithm an easily be adapted to nd maximum
penalized likelihood estimates rather than maximum likelihood estimates  referring
to the general version of the algorithm, the E step remains the same, but the M
step will now maximize EQ [log P (x, z|θ) + G(θ)], where G(θ) is the penalty.
The penalty to use is designed to avoid estimates for pixel probabilities that are
zero or lose to zero, whi h ould ause problems when lassifying test ases (for
example, zero pixel probabilities ould result in a test ase having zero probability
for every possible digit that it might be). The penalty to add to the log likelihood
should be
10K
XX 196
G(θ) = α [log(θk,j ) + log(1 − θk,j )].
k=1 j=1

Here, α controls the magnitude of the penalty. For this assignment, you should fix
α to 0.05, though in a real application you would probably need to set it by some
method such as cross-validation. The resulting formula for the update in the M step
is

θ̂k,j = (α + Σ_{i=1}^{n} ri,k xi,j ) / (2α + Σ_{i=1}^{n} ri,k ),

where ri,k is the probability that case i came from component k, estimated in the E
step. You should write a derivation of this formula from the general form of the EM
algorithm presented in the lecture slides (modified as above to include a penalty
term).
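As a check on this update (the full derivation is what the assignment asks you to hand in), note that the terms of EQ [log P (x, z|θ) + G(θ)] involving a single θk,j are, up to constants,

```latex
\sum_{i=1}^{n} r_{i,k}\,\big[x_{i,j}\log\theta_{k,j} + (1-x_{i,j})\log(1-\theta_{k,j})\big]
  \;+\; \alpha\big[\log\theta_{k,j} + \log(1-\theta_{k,j})\big].
```

Setting the derivative with respect to θk,j to zero gives

```latex
\frac{\alpha + \sum_i r_{i,k}\,x_{i,j}}{\theta_{k,j}}
  = \frac{\alpha + \sum_i r_{i,k}\,(1 - x_{i,j})}{1-\theta_{k,j}}
\quad\Longrightarrow\quad
\hat\theta_{k,j} = \frac{\alpha + \sum_{i=1}^{n} r_{i,k}\,x_{i,j}}{2\alpha + \sum_{i=1}^{n} r_{i,k}},
```

which matches the stated formula.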
Your function implementing the EM algorithm should take as arguments the ima-
ges in the training set, the labels for these training cases, the number of mixture
components for each digit class (K), the penalty magnitude (α), and the number of
iterations of EM to do. It should return a list with the parameter estimates (π and
θ) and responsibilities (r). You will need to start with some initial values for the
responsibilities (and then start with an M step). The responsibility of component
k for item i should be zero if component k has qk,yi = 0. Otherwise, you should
randomly set ri,k from the uniform distribution between 1 and 2 and then rescale
these values so that for each i, the sum over k of ri,k is one.
After each iteration, your EM function should print the value of the log-likelihood
and the value of the log-likelihood plus the penalty function. The latter should never
go down; if it does, you have a bug in your EM function. You should use enough
iterations that these values have almost stabilized by the last iteration.
You will also need to write an R function that takes the fitted parameter values
from running EM and uses them to predict the class of a test image. This function
should use Bayes' Rule to find the probability that the image came from each of the
10K mixture components, and then add up the probabilities for the K components
associated with each digit, to obtain the probabilities of the image being of each
digit from 0 to 9. It should return these probabilities, which can then be used to
guess what the digit is, by finding the digit with the highest probability.
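This prediction step can be sketched as follows (in Python rather than the R the assignment uses; names are ours). Working in log space avoids underflow in the product over 196 pixels:

```python
import numpy as np

def predict_digit(x, pi, theta, K):
    """Posterior over digits 0..9 for one binary image x under the
    10K-component mixture above, where components Ky+1..Ky+K carry
    digit y. pi: (10K,), theta: (10K, n_pixels). Sketch only."""
    # log P(x, component k) = log pi_k + sum_j log Bernoulli(x_j; theta_kj)
    logp = (np.log(pi)
            + x @ np.log(theta).T + (1 - x) @ np.log(1 - theta).T)
    logp -= logp.max()                 # stabilize before exponentiating
    p = np.exp(logp)
    # add up the K components belonging to each digit (Bayes' rule)
    per_digit = p.reshape(10, K).sum(axis=1)
    return per_digit / per_digit.sum()
```

The predicted digit is then the argmax of the returned probability vector.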
You should first run your EM and prediction functions for K = 1, which
should produce the same results as the naive Bayes method would. (Note that EM
should converge immediately with K = 1.) You should then do ten runs with K = 5
using different random number seeds, and see what the predictive accuracy is for
each run. Finally, for each test case, you should average the class probabilities ob-
tained from each of the ten runs, and then use these averaged probabilities to classify
the test cases. You should compare the accuracy of these "ensemble" predictions
with the accuracy obtained using the individual runs that were averaged.
You should hand in your derivation of the update formula for θ̂ above, a listing of
the R functions you wrote for fitting by EM and predicting digit labels, the R scripts
you used to apply these functions to the data provided, the output of these scripts,
including the classification error rates on the test set you obtained (with K = 1,
with K = 5 for each of ten initializations, and with the ensemble of ten fits with
K = 5), and a discussion of the results. Your discussion should consider how naive
Bayes (K = 1) compares to using a mixture (with K = 5), and how the ensemble
predictions compare with predicting using a single run of EM, or using the best run
of EM according to the log-likelihood (with or without the penalty).

Solution:

With K = 1, which is equivalent to a naive Bayes model, the classification error rate
on test cases was 0.190.
With K = 5, 80 iterations of EM seemed sufficient for all ten random initializations.
The resulting models had the following error rates on the test cases:

0.157 0.151 0.158 0.156 0.166 0.162 0.163 0.159 0.158 0.153

These are all better than the naive Bayes result, showing that using more than one
mixture component for each digit is beneficial.
I used the show_digit function to display the theta parameters of the 50 mixture
components as pictures (for the run started with the last random seed). It is clear
that the five components for each digit have generally captured reasonable variations
in writing style, except perhaps for a few with small mixing proportion (given as
the number above the plot), such as the second "1" from the top.

Using the ensemble predictions (averaging probabilities of digits over the ten runs
above), the classification error rate on test cases was 0.139. This is substantially
better than the error rate from every one of the individual runs, showing the benefits
of using an ensemble when there is substantial random variation in the results.

Note that the individual run with highest log-likelihood (and also highest log-likeli-
hood + penalty) was the sixth run, whose error rate of 0.162 was actually the third
worst. So at least in this example, picking a single run based on log-likelihood would
certainly not do better than using the ensemble.

50. (EM for a mixture of two exponential distributions)
• · U. Toronto, Radford Neal,
xxx Statistical Computation course,
xxx 2000 fall, HW 4

Suppose that the time from when a machine is manufactured to when it fails
is exponentially distributed (a common, though simplistic, assumption). However,
suppose that some machines have a manufacturing defect that causes them to be
more likely to fail early than machines that don't have the defect.

Let the probability that a machine has the defect be p, the mean time to failure for
machines without the defect be µg , and the mean time to failure for machines with
the defect be µd . The probability density for the time to failure will then be the
following mixture density:

p · (1/µd ) · exp(−x/µd ) + (1 − p) · (1/µg ) · exp(−x/µg )

Suppose that you have a number of independent observations of times to failure
for machines, and that you wish to find maximum likelihood estimates for p, µg ,
and µd . Write a program to find these estimates using the EM algorithm, with
the unobserved variables being the indicators of whether or not each machine is
defective. Note that the model is not identifiable: swapping µd and µg while
replacing p with 1 − p has no effect on the density. This isn't really a problem; you
can just interpret whichever mean is smaller as the mean for the defective machines.

You may write your program so that it simply runs for however many iterations you
specify (i.e., you don't have to come up with a convergence test). However, your
program should have the option of printing the parameter estimates and the log-
likelihood at each iteration, so that you can manually see whether it has converged.
(This will also help debugging.)
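A sketch of the EM iteration, in Python rather than the R used by the course. The E-step responsibilities and the weighted-mean M-step below are the standard updates for this mixture, and the optional per-iteration printing matches what the problem asks for; the initialization is our own crude choice:

```python
import math

def em_exp_mixture(x, n_iter=200, verbose=False):
    """EM for the two-component exponential mixture above, treating
    the defect indicators as the unobserved variables. Sketch only."""
    n = len(x)
    m = sum(x) / n
    p, mu_d, mu_g = 0.5, m / 2.0, 2.0 * m        # crude initialization
    for it in range(n_iter):
        # E-step: responsibility of the "defective" (mean mu_d) component
        r = []
        for xi in x:
            fd = math.exp(-xi / mu_d) / mu_d
            fg = math.exp(-xi / mu_g) / mu_g
            r.append(p * fd / (p * fd + (1 - p) * fg))
        # M-step: mixing weight and weighted-mean MLEs of the two means
        sr = sum(r)
        p = sr / n
        mu_d = sum(ri * xi for ri, xi in zip(r, x)) / sr
        mu_g = sum((1 - ri) * xi for ri, xi in zip(r, x)) / (n - sr)
        if verbose:
            ll = sum(math.log(p * math.exp(-xi / mu_d) / mu_d
                              + (1 - p) * math.exp(-xi / mu_g) / mu_g)
                     for xi in x)
            print(f"iter {it}: p={p:.4f} mu_d={mu_d:.4f} "
                  f"mu_g={mu_g:.4f} loglik={ll:.3f}")
    return p, mu_d, mu_g
```

As the problem notes, the labeling of the two components is arbitrary; interpret the smaller fitted mean as the defective one.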

You should test your program on two data sets (ass4a.data and ass4b.data), each with 1000
observations, which are on the web page associated with this book. You can read this
data with a command like

> x <- scan("ass4a.data")

For both data sets, run your algorithm for as long as you need to be sure that
you have obtained close to the correct maximum likelihood estimates. To be sure,
we recommend that you run it for hundreds of iterations or more (this shouldn't
take long in R). Discuss how rapidly the algorithm converges on the two data sets.
You may find it useful to generate your own data sets, for which you know the true
parameter values, in order to debug your program. You could start with data sets
where µg and µd are very different.
You should hand in your derivation of the formulas needed for the EM algorithm,
your program, the output of your tests, and your discussion of the results.

7 Artificial Neural Networks

51. (Neural networks: basic questions)

• MPI, 2005 spring, Jörg Rahnenfuhrer, Adrian Alexa, HW2, pr. 5


a. Simulate data by randomly drawing two-dimensional data points uniformly dis-
tributed in [0, 1]2 . The class Y of a sample X = (X1 , X2 ) is 1 if X1 + X2 > 1 and −1
otherwise. You can add some noise to the data by using the rule X1 + X2 > 1 + ε
with ε ∼ N (0, 0.1). Use 100 samples for the training set (you can use training data
with or without noise).

b. Write a function to train a perceptron for a two-class classification problem.
A perceptron is a classifier which constructs a linear decision boundary that tries
to separate the data into different classes as well as possible. Section 4.5.1 in The
Elements of Statistical Learning book by Hastie, Tibshirani, Friedman describes
how the perceptron learning algorithm works.

c. Write a function to predict the class for new data points. Write a function that
performs LOOCV (Leave-One-Out Cross-Validation) for your classifier. Use these
functions to estimate the training error and the prediction error. Generate a test
set of 1000 samples and compute the test error of your classifier.

d. Use the data generator xor.data() from the tutorial homepage to generate a new
training (ca. 100 samples) and test set (ca. 1000 samples). Train the perceptron on
this data. Report the train and test errors. Plot the test samples to see how they
are classified by the perceptron.

e. Comment on your findings. Is the perceptron able to learn the XOR data? What is
the main difference between the data generated in part a and the data from part d?

f. Use the nnet R package to train a neural network for the XOR data. The nnet()
function fits a neural network. Use the predict() function to assess the train
and test errors.

h. Vary the number of units in the hidden layer and report the train and test errors.
We have seen at part e that a perceptron, a neural network with no unit in the hidden
layer, cannot correctly classify the XOR data. Argue what is the minimal number
of units in the hidden layer that a neural network must have to correctly classify the
XOR data. Train such a network and report the train, prediction and test errors.
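For intuition about the minimal architecture, here is a hand-weighted network with two hidden units that computes XOR (our own illustration with step activations; nnet would of course learn sigmoid weights rather than use these hand-set ones):

```python
import numpy as np

def xor_net(x1, x2):
    """Two hidden units suffice for XOR: h1 computes OR, h2 computes
    AND, and the output unit fires when OR holds but AND does not."""
    step = lambda z: (z > 0).astype(float)
    h1 = step(x1 + x2 - 0.5)       # OR gate
    h2 = step(x1 + x2 - 1.5)       # AND gate
    return step(h1 - h2 - 0.5)     # OR and not AND = XOR
```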

52. (The Perceptron algorithm:
xxx spam identification)
• New York University, 2016 spring, David Sontag, HW1
In this problem set you will implement the Perceptron algorithm and apply it to
the problem of e-mail spam classification.
Instructions. You may use the programming language of your choice (we recommend
Python, and using matplotlib for plotting). However, you are not permitted to use
or reference any machine learning code or packages not written by yourself.
Data files. We have provided you with two files: spam_train.txt and spam_test.txt.
Each row of the data files corresponds to a single email. The first column gives the
label (1=spam, 0=not spam).
Pre-processing. The dataset included for this exercise is based on a subset of the
SpamAssassin Public Corpus. Figure 1 shows a sample email that contains a URL,
an email address (at the end), numbers, and dollar amounts. While many emails
would contain similar types of entities (e.g., numbers, other URLs, or other email
addresses), the specific entities (e.g., the specific URL or specific dollar amount)
will be different in almost every email. Therefore, one method often employed
in processing emails is to "normalize" these values, so that all URLs are treated
the same, all numbers are treated the same, etc. For example, we could replace
each URL in the email with the unique string "httpaddr" to indicate that a URL
was present. This has the effect of letting the spam classifier make a classification
decision based on whether any URL was present, rather than whether a specific
URL was present. This typically improves the performance of a spam classifier, since
spammers often randomize the URLs, and thus the odds of seeing any particular
URL again in a new piece of spam is very small.
We have already implemented the following email preprocessing steps: lower-casing;
removal of HTML tags; normalization of URLs, e-mail addresses, and numbers.
In addition, words are reduced to their stemmed form. For example, "discount",
"discounts", "discounted" and "discounting" are all replaced with "discount". Finally,
we removed all non-words and punctuation. The result of these preprocessing steps
is shown in Figure 2.

Figure 1: Sample e-mail in SpamAssassin corpus before pre-processing.

> Anyone knows how much it costs to host a web portal?
> Well, it depends on how many visitors youre expecting. This can be anywhere from
less than 10 bucks a month to a couple of $100. You should checkout http://www.rackspace.com/
or perhaps Amazon EC2 if youre running something big.
To unsubscribe yourself from this mailing list, send an email to: groupname-unsubscribe@egroups.com.

Figure 2: Pre-processed version of the sample e-mail from Figure 1.

anyon know how much it cost to host a web portal well it depend on how mani
visitor your expect thi can be anywher from less than number buck a month to a
coupl of dollarnumb you should checkout httpaddr or perhap amazon ec numb if your
run someth big to unsubscrib yourself from thi mail list send an email to emailaddr

a. This problem set will involve your implementing several variants of the Perceptron
algorithm. Before you can build these models and measure their performance, split
your training data (i.e. spam_train.txt) into a training and validation set, putting the
last 1000 emails into the validation set. Thus, you will have a new training set with
4000 emails and a validation set with 1000 emails. You will not use spam_test.txt
until problem j. Explain why measuring the performance of your final classifier
would be problematic had you not created this validation set.

b. Transform all of the data into feature vectors. Build a vocabulary list using
only the 4000 e-mail training set by finding all words that occur across the training
set. Note that we assume that the data in the validation and test sets is completely
unseen when we train our model, and thus we do not use any information contained
in them. Ignore all words that appear in fewer than X = 30 e-mails of the 4000
e-mail training set; this is both a means of preventing overfitting and of improving
scalability. For each email, transform it into a feature vector x̄ where the ith entry,
xi , is 1 if the ith word in the vocabulary occurs in the email, and 0 otherwise.
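This feature construction can be sketched as follows (illustrative Python; it assumes each e-mail has already been tokenized into a list of word strings):

```python
from collections import Counter

def build_features(train_docs, min_df=30):
    """Vocabulary from the training e-mails only, keeping words that
    appear in at least min_df distinct e-mails (X = 30 above), plus a
    closure mapping a document to its binary presence vector."""
    df = Counter()
    for doc in train_docs:
        df.update(set(doc))                 # document frequency
    vocab = sorted(w for w, c in df.items() if c >= min_df)
    index = {w: i for i, w in enumerate(vocab)}

    def to_vector(doc):
        x = [0] * len(vocab)
        for w in set(doc):
            if w in index:
                x[index[w]] = 1             # 1 iff the word occurs
        return x

    return vocab, to_vector
```

Because the closure only looks words up in the training vocabulary, validation and test e-mails are mapped without leaking any information from them into the features.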

c. Implement the functions perceptron_train(data) and perceptron_test(w, data).
The function perceptron_train(data) trains a perceptron classifier using the examples
provided to the function, and should return w̄, k, and iter: the final classification
vector, the number of updates (mistakes) performed, and the number of passes
through the data, respectively. You may assume that the input data provided to
your function is linearly separable (so the stopping criterion should be that all points
are correctly classified). For the corner case of w · x = 0, predict the +1 (spam) class.
For this exercise, you do not need to add a bias feature to the feature vector (it turns
out not to improve classification accuracy, possibly because a frequently occurring
word already serves this purpose). Your implementation should cycle through the
data points in the order as given in the data files (rather than randomizing), so that
results are consistent for grading purposes.
The function perceptron_test(w, data) should take as input the weight vector w̄ (the
classification vector to be used) and a set of examples. The function should return
the test error, i.e. the fraction of examples that are misclassified by w̄.
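A direct transcription of this specification (illustrative Python; the problem permits any language, and the {0,1} labels are mapped to ∓1 updates internally):

```python
def perceptron_train(data):
    """data: list of (x, y) with y in {0, 1}. Cycles through the points
    in the given order until all are classified correctly; w.x = 0 is
    predicted as the +1 (spam) class, as specified above. Returns
    (w, k, iters): weights, number of updates, number of passes."""
    n_feat = len(data[0][0])
    w = [0.0] * n_feat
    k = iters = 0
    while True:
        mistakes = 0
        for x, y in data:
            s = sum(wi * xi for wi, xi in zip(w, x))
            if (1 if s >= 0 else 0) != y:
                sign = 1 if y == 1 else -1       # move w toward/away from x
                w = [wi + sign * xi for wi, xi in zip(w, x)]
                k += 1
                mistakes += 1
        iters += 1
        if mistakes == 0:
            return w, k, iters

def perceptron_test(w, data):
    """Fraction of examples misclassified by w."""
    errs = sum((1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0) != y
               for x, y in data)
    return errs / len(data)
```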

d. Train the linear classifier using your training set. How many mistakes are made
before the algorithm terminates? Test your implementation of perceptron_test by
running it with the learned parameters and the training data, making sure that the
training error is zero. Next, classify the emails in your validation set. What is the
validation error?

e. To better understand how the spam classifier works, we can inspect the parame-
ters to see which words the classifier thinks are the most predictive of spam. Using
the vocabulary list together with the parameters learned in the previous question,
output the 15 words with the most positive weights. What are they? Which 15
words have the most negative weights?

f. Implement the averaged perceptron algorithm, which is the same as your current
implementation but which, rather than returning the final weight vector, returns
the average of all weight vectors considered during the algorithm (including exam-
ples where no mistake was made). Averaging reduces the variance between the
different vectors, and is a powerful means of preventing the learning algorithm from
overfitting (serving as a type of regularization).
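The averaging can be layered onto the same training loop (our own sketch; the running sum is updated after every example visit, mistake or not, exactly as described above, and the pass cap is an extra argument of ours):

```python
def averaged_perceptron_train(data, max_iters=50):
    """Averaged perceptron: same update rule as the plain perceptron,
    but returns the average of the weight vector over every example
    visit. data: list of (x, y) with y in {0, 1}."""
    n_feat = len(data[0][0])
    w = [0.0] * n_feat
    w_sum = [0.0] * n_feat
    visits = 0
    for _ in range(max_iters):
        mistakes = 0
        for x, y in data:
            s = sum(wi * xi for wi, xi in zip(w, x))
            if (1 if s >= 0 else 0) != y:
                sign = 1 if y == 1 else -1
                w = [wi + sign * xi for wi, xi in zip(w, x)]
                mistakes += 1
            # accumulate after every visit, including mistake-free ones
            w_sum = [ws + wi for ws, wi in zip(w_sum, w)]
            visits += 1
        if mistakes == 0:
            break
    return [ws / visits for ws in w_sum]
```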

g. One should expect that the test error decreases as the amount of training data
increases. Using only the first N rows of your training data, run both the perceptron
and the averaged perceptron algorithms on this smaller training set and evaluate
the corresponding validation error (using all of the validation data). Do this for N =
100, 200, 400, 800, 2000, 4000, and create a plot of the validation error of both algorithms
as a function of N.

h. Also for N = 100, 200, 400, 800, 2000, 4000, create a plot of the number of perceptron
iterations as a function of N, where by iteration we mean a complete pass through
the training data. As the amount of training data increases, the margin of the trai-
ning set decreases, which generally leads to an increase in the number of iterations
perceptron takes to converge (although it need not be monotonic).

i. One consequence of this is that the later iterations typically perform updates on
only a small subset of the data points, which can contribute to overfitting. A way to
solve this is to control the maximum number of iterations of the perceptron algori-
thm. Add an argument to both the perceptron and averaged perceptron algorithms
that controls the maximum number of passes over the data.

j. Congratulations, you now understand various properties of the perceptron algorithm.
Try various configurations of the algorithms on your own using all 4000
training points, and find a good configuration having a low error on your validation
set. In particular, try changing the choice of perceptron algorithm and the maximum
number of iterations. You could additionally change X from question b (this
is optional). Report the validation error for several of the configurations that you
tried; which configuration works best?
You are ready to train on the full training set, and see if it works on completely new
data. Combine the training set and the validation set (i.e., use all of spam_train.txt)
and learn using the best of the configurations previously found. You do not need to
rebuild the vocabulary when re-training on the train+validate set.
What is the error on the test set (i.e., now you finally use spam_test.txt)?

Note: This problem set is based partly on an assignment developed by Andrew Ng
of Stanford University and Coursera.

53. (The backpropagation algorithm:
xxx application on the Breast Cancer dataset)

• MPI, 2005 spring, Jörg Rahnenfuhrer, Adrian Alexa, HW2, pr. 6

Download the breast cancer data set breastcancer.zip from the tutorial homepage.
The data are described in Mike West et al.: Predicting the clinical status of human
breast cancer by using gene expression profiles, PNAS 98(20):11462-11467, 2001.
The file contains expression profiles of 46 patients from two different classes: 23
patients are estrogen receptor positive (ER+) and 23 are estrogen receptor negative
(ER-). For every patient, the expression values of 7129 genes were measured. Use
the nnet R package to train a neural network.

a. Load breastcancer.Rdata and apply summary() to get an overview of this data
object: breastcancer$x contains the expression data and breastcancer$y the class
labels. Reformat the data by transposing the gene expression matrix and renaming
the classes {ER+, ER−} to {+1, −1}.

b. Train a Neural Network using the nnet() function. Check if the inputs are
standardized (mean zero and standard deviation one) and if this is not the case,
standardize them.

c. Apply the function predict() to the training data and calculate the training error.
Perform a LOOCV to estimate the prediction error (you must implement the
cross-validation procedure yourself).
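The exercise asks for the LOOCV loop in R around nnet(); the skeleton of the procedure is language-independent, so here is a minimal Python sketch. The nearest-centroid classifier stands in for the neural net and the toy data are illustrative assumptions — only the fold structure mirrors what the exercise requires:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    # stand-in classifier: one centroid per class (replaces nnet() here)
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(model, X):
    classes, centroids = model
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

def loocv_error(X, y, fit, predict):
    """Leave-one-out CV: train on n-1 points, test on the held-out one."""
    n = len(y)
    errors = 0
    for i in range(n):
        mask = np.arange(n) != i          # hold out example i
        model = fit(X[mask], y[mask])
        if predict(model, X[i:i + 1])[0] != y[i]:
            errors += 1
    return errors / n

# toy data: two well-separated classes (illustrative)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.3, (10, 2)), rng.normal(2, 0.3, (10, 2))])
y = np.array([-1] * 10 + [1] * 10)
err = loocv_error(X, y, nearest_centroid_fit, nearest_centroid_predict)
```

In the actual exercise, fit/predict would wrap nnet() and predict() on the 46-patient expression matrix, giving 46 folds.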

d. Predict the classes of the three new patients (newpatients). The true class labels
are stored in (trueclasses). Are they correctly classified?

e. Try different parameters in the nnet() function (the number of units in the hidden
layer, the weights, the activation function, the weight decay parameter, etc.) and
report the parameters for which you obtained the best result. Comment on the way
the parameters affect the performance of the network.

54. (Artificial neural networks:
xxx Digit classification competition)
• CMU, 2014 fall, William Cohen, Ziv Bar-Joseph, HW3
In this section, you are asked to construct a neural network using a real-world
dataset. The training samples and training labels are provided in the handout
folder. Each sample is a 28 × 28 gray scale image. Each pixel (feature) is a real value
between 0 and 1 denoting the pixel intensity. Each label is an integer from 0 to 9
which corresponds to the digit in the image.

A. Getting Started

Getting Familiar with Data

As mentioned above, each sample is an image with 784 pixels. Load the data using
the following command:
load('digits.mat')
Visualize an image using the following command:
imshow(vec2mat(XTrain(i,:),28)')
where X ∈ R^(n×784) is the matrix of training samples and i is the row index of a
training sample.

Neural Network Structure

In this competition, you are free to use any neural network structure. A simple feed-forward
neural network with one hidden layer is shown in Figure... . The input
layer has a bias neuron and 784 neurons, each corresponding to one pixel in the
image. The output layer has 10 neurons, each representing the probability of one
digit given the image. You need to decide the size of the hidden layer.

Code Structure

You should implement your training algorithm (typically the forward propagation
and back propagation) in train_ann.m and your testing algorithm (using the trained
weights to predict labels) in test_ann.m. In your training algorithm, you need to store
your initial and final weights into a .mat file. In the simple example below, two weight
matrices Wih and Who are stored into weights.mat:
save('weights.mat','Wih','Who');
Be sure your test_ann.m runs fast enough. It is always good to vectorize your code
in Matlab.

Separating Data
digits.mat contains the 3000 instances which you used in the previous section. The
number of instances is fairly balanced across digits, so you do not need to worry
about skewness of the data. However, you need to handle the overfitting problem.
Neural networks are very powerful models which are capable of expressing extremely
complicated functions, but are very prone to overfitting.
The standard approach for building a model on a dataset can be described as follows:

• Divide your data into three sets: a training set, a validation set, and a test set.
You can use any sizes for the three sets as long as they are reasonable (e.g. 60%,
20%, 20%). You can also combine the training set and the validation set and
do k-fold cross-validation. Make sure to have balanced numbers of instances
for each class in every set.
• Train your model on the training set and tune your parameters on the validation
set. By tuning the parameters (e.g. number of neurons, number of layers,
regularization, etc...) to achieve maximum performance on the validation set,
the overfitting problem can be somewhat alleviated. The following webpage
provides some reasonable ranges for parameter selection:
http://en.wikibooks.org/wiki/Artificial_Neural_Networks/Neural_Network_Basics

• If the training accuracy is much higher than the validation accuracy, the model is
overfitting; if the training accuracy and validation accuracy are both very low,
the model is underfitting; if both accuracies are high but the test accuracy is low,
the model should be discarded.

B. Bag of Tricks for Training a Neural Network

Overfitting vs Underfitting
This is related to the model selection problem [that we are going to discuss later in
this course]. It is extremely important to determine whether the model is overfitting
or underfitting. The table below shows several general approaches to discover and
alleviate these problems:

                 Overfit                          Underfit
Performance      Training accuracy much higher    Both accuracies are low
                 than validation accuracy
Data             Need more data                   If the two accuracies are close,
                                                  no need for extra data
Model            Use a simpler model              Use a more complicated model
Features         Reduce the number of features    Increase the number of features
Regularization   Increase regularization          Reduce regularization

There are other ways to reduce overfitting and underfitting that are specific to
neural networks; we discuss them among the other tricks below.

Early Stopping

A common reason for overfitting is that the neural net converges
to a bad minimum. In the nearby figure, the solid line
corresponds to the error surface of a trained neural net while
the dashed line corresponds to the true model. Point A is very
likely to be a bad minimum since the narrow valley is very
likely to be caused by overfitting to the training data. Point B is
a better minimum since it is much smoother and more likely
to be the true minimum.

To alleviate overfitting, we can stop the training process before the network converges.
In the nearby figure, if the training procedure stops when the network
achieves its best performance on the validation set, the overfitting problem is
somewhat reduced. However, in reality, the error surface may be very irregular.
A common approach is to store the weights after each epoch until the network
converges, and then pick the weights that perform best on the validation set.
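The store-per-epoch procedure above can be sketched as follows. This is a minimal sketch assuming a logistic-regression stand-in for the network (one weight vector instead of per-layer matrices); the toy data, function name, and hyper-parameters are illustrative:

```python
import numpy as np

def train_with_early_stopping(Xtr, ytr, Xva, yva, epochs=200, lr=0.5):
    """Logistic-regression stand-in for a neural net: after every epoch,
    snapshot the weights and keep the ones that did best on validation."""
    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.01, Xtr.shape[1])      # small random init
    best_w, best_acc = w.copy(), -1.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(Xtr @ w)))               # forward pass
        w -= lr * Xtr.T @ (p - (ytr > 0)) / len(ytr)       # gradient step
        acc = np.mean(np.sign(Xva @ w) == yva)             # validation check
        if acc > best_acc:                                 # keep best snapshot
            best_acc, best_w = acc, w.copy()
    return best_w, best_acc

# toy separable data: two Gaussian blobs (illustrative)
rng = np.random.default_rng(1)
Xtr = np.vstack([rng.normal(1, 0.5, (20, 2)), rng.normal(-1, 0.5, (20, 2))])
ytr = np.array([1] * 20 + [-1] * 20)
Xva = np.vstack([rng.normal(1, 0.5, (10, 2)), rng.normal(-1, 0.5, (10, 2))])
yva = np.array([1] * 10 + [-1] * 10)
w_best, acc_best = train_with_early_stopping(Xtr, ytr, Xva, yva)
```

For a real network one would snapshot all layer weights (e.g. save them to a .mat file per epoch, as the assignment's code structure suggests) rather than a single vector.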

Multiple Initialization
When training a neural net, people typically initialize the weights to very small numbers
(e.g. Gaussian random numbers with mean 0 and variance 0.005). This process is
called symmetry breaking. If all the weights are initialized to zero, all the neurons
will end up learning the same feature. Since the error surface of neural networks
is highly non-convex, different weight initializations will potentially converge to
different minima. You should store the initialized weights into ini_weights.mat.

Momentum

Another way to escape from a bad minimum is adding a momentum term to the weight
updates. The momentum term is α∆W(n − 1) in equation (1), where n denotes the
number of epochs. By adding this term to the update rule, the weights have
some chance to escape from the minimum. You can set the initial momentum to zero.

∆W(n) = ∇W J(W, b) + α∆W(n − 1)     (1)

The intuition behind this approach is the same as that of the momentum term in
physical systems. In the nearby figure, assume the weights grow positively during
training; without the momentum term, the neural net will converge to point A. If we
add the momentum term, the weights may jump over the barrier and converge to a
better minimum at point B.
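Equation (1) combined with the usual step W ← W − η∆W(n) can be sketched as follows; the quadratic test objective and the hyper-parameter values are illustrative assumptions:

```python
import numpy as np

def gd_momentum(grad, w0, lr=0.1, alpha=0.9, steps=300):
    """Gradient descent with the momentum term of equation (1):
    dW(n) = grad(W) + alpha * dW(n-1), then W <- W - lr * dW(n)."""
    w = np.array(w0, dtype=float)
    dw = np.zeros_like(w)          # initial momentum is zero
    for _ in range(steps):
        dw = grad(w) + alpha * dw  # accumulate velocity
        w = w - lr * dw
    return w

# quadratic bowl J(w) = 0.5 * ||w||^2, whose gradient is simply w
w_final = gd_momentum(lambda w: w, w0=[5.0, -3.0])
```

On a barrier-free quadratic the momentum just accelerates convergence; on the rugged surfaces described above, the accumulated velocity is what can carry the weights over a barrier from A to B.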

Batch Gradient Descent vs Stochastic Gradient Descent

As we discussed in the lectures, given enough memory space, batch gradient descent
usually converges faster than stochastic gradient descent. However, if working on
a large dataset (which exceeds the capacity of memory space), stochastic gradient
descent is preferred because it uses memory space more efficiently. Mini-batch is a
compromise between these two approaches.

Change Activation Function

As we mentioned in the theoretical questions, there are many activation functions
other than the logistic sigmoid, such as (but not limited to) the rectified
linear function, arctangent function, hyperbolic function, Gaussian function, polynomial
function and softmax function. Each activation has different expressiveness and
computational complexity. The selection of the activation function is problem dependent.
Make sure to calculate the gradients correctly before implementing them.

Pre-training
An autoencoder is an unsupervised learning algorithm that automatically learns features
from unlabeled data. It has a neural network structure with its input being exactly
the same as its output. From the input layer to the hidden layer(s), the features are abstracted
to a lower dimensional space. From the hidden layer(s) to the output layer, the features are
reconstructed. If the activation is linear, the network performs very similarly to
Principal Component Analysis (PCA). After training an autoencoder, you should keep
the weights from the input layer and hidden layers, and build a classifier on top of the hidden
layer(s). For implementation details, please refer to Andrew Ng's cs294A course handout
at Stanford: http://web.stanford.edu/class/cs294a/sparseAutoencoder_2011new.pdf

More Neurons vs Fewer Neurons

As mentioned above, we should use more complicated models for underfitting cases,
and simpler models for overfitting cases. In terms of neural networks, more neurons
mean higher complexity. You should pick the size of the hidden layer based on training
accuracy and validation accuracy.

More Layers?
Adding one or two hidden layers may be useful, since the model expressiveness grows
exponentially with extra hidden layers. You can apply the same back propagation
technique as when training a single-hidden-layer network. However, if you use even
more layers (e.g. 10 layers), you are definitely going to get extremely bad results.
Any network with more than one hidden layer is called a deep network. Large
deep networks encounter the vanishing gradient problem when using the standard back
propagation algorithm (except convolutional neural nets). If you are not familiar
with convolutional neural nets, or with training stacks of Restricted Boltzmann Machines,
you should stick with a few hidden layers.

Sparsity
Sparsity on weights (LASSO penalty) forces neurons to learn localized information.
Sparsity on activations (KL-divergence penalty) forces neurons to learn complicated
features.

Other Techniques
All the tricks above can be applied to both shallow networks and deep networks.
If you are interested, there are other tricks which can be applied to (usually deep)
neural networks:

• Dropout
• Model Averaging

[You can find more information on these in the Coursera lectures by Geoffrey Hinton.]

55. (The Rosenblatt perceptron:
xxx computing mistake bounds;
xxx computing margins; comparison with SVM)
• ◦ MIT, 2006 fall, Tommi Jaakkola, HW1, pr. 1-2
Implement a Perceptron classifier in MATLAB. Start by implementing the following
functions:
− a function perceptron_train(X, y) where X and y are n × d and n × 1 matrices
respectively. This function trains a Perceptron classifier on a training set of n
examples, each of which is a d-dimensional vector. The labels for the examples are
in y and are 1 or −1. The function should return [theta, k], the final classification
vector and the number of updates performed, respectively. You may assume that the
input data provided to your function is linearly separable. Training the Perceptron
should stop when it makes no errors at all on the training data.
− a function perceptron_test(theta, X_test, y_test) where theta is the classification
vector to be used. X_test and y_test are m × d and m × 1 matrices respectively,
corresponding to m test examples and their true labels. The function should return
test_err, the fraction of test examples which were misclassified.
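The exercise asks for MATLAB; as a reference point, the same two-function interface can be sketched in NumPy (the toy data below are illustrative, not the provided datasets):

```python
import numpy as np

def perceptron_train(X, y):
    """Cycle through the data, updating on mistakes, until a full
    error-free pass; returns (theta, k) with k = number of updates."""
    n, d = X.shape
    theta = np.zeros(d)
    k = 0
    converged = False
    while not converged:
        converged = True
        for i in range(n):
            if y[i] * (theta @ X[i]) <= 0:     # mistake (or on the boundary)
                theta = theta + y[i] * X[i]
                k += 1
                converged = False
    return theta, k

def perceptron_test(theta, X_test, y_test):
    """Fraction of test examples misclassified by sign(theta . x)."""
    return float(np.mean(np.sign(X_test @ theta) != y_test))

# toy linearly separable data (illustrative)
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
theta, k = perceptron_train(X, y)
err = perceptron_test(theta, X, y)
```

Note the loop only terminates because the data are assumed linearly separable, exactly as the problem statement allows.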
For this problem, we have provided you two custom-created datasets. The dimension
d of both datasets is 2, for ease of plotting and visualization.
a. Load data using the load_p1_a script and train your Perceptron classifier on it.
Using the function perceptron_test, ensure that your classifier makes no errors on
the training data. What is the angle between theta and the vector (1, 0)⊤? What is
the number of updates k_a required before the Perceptron algorithm converges?

b. Repeat the above steps for data loaded from the script load_p1_b. What is the angle
between theta and the vector (1, 0) now? What is the number of updates k_b now?

c. For parts a and b, compute the geometric margins, γ_geom^a and γ_geom^b, of your
classifiers with respect to their corresponding training datasets. Recall that the
distance of a point x_t from the hyperplane θ⊤x = 0 is |θ⊤x_t| / ||θ||.
d. For parts a and b, compute R_a and R_b, respectively. Recall that for any dataset
χ, R = max{||x|| : x ∈ χ}.
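The two quantities in parts c and d reduce to one line each; a small sketch with illustrative data and theta (the real values come from your trained classifier):

```python
import numpy as np

def geometric_margin(theta, X, y):
    """Minimum signed distance y_t * (theta . x_t) / ||theta|| over the data;
    positive iff every point is correctly classified."""
    return float(np.min(y * (X @ theta) / np.linalg.norm(theta)))

def radius(X):
    """R = max ||x|| over the dataset."""
    return float(np.max(np.linalg.norm(X, axis=1)))

# illustrative data and classifier
X = np.array([[1.0, 2.0], [2.0, 0.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
theta = np.array([1.0, 1.0])
gamma = geometric_margin(theta, X, y)
R = radius(X)
```

These are the γ_geom and R that appear together in the classical perceptron mistake bound (R/γ)².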
e. Plot the data (as points in the X-Y plane) from part a, along with the decision
boundary that your Perceptron classifier computed. Create another plot, this time
using data from part b and the corresponding decision boundary. Your plots should
clearly indicate the class of each point (e.g., by choosing different colors or symbols
to mark the points from the two classes). We have provided a MATLAB function
plot_points_and_classifier which you may find useful.

Implement an SVM classifier in MATLAB, arranged like the [above] Perceptron
algorithm, with functions svm_train(X, y) and svm_test(theta, X_test, y_test). Again,
include a printout of your code for these functions.

Hint: Use the built-in quadratic program solver quadprog(H, f, A, b), which solves
the quadratic program: min_x (1/2) x⊤Hx + f⊤x subject to the constraint Ax ≤ b.
f. Try the SVM on the two datasets from parts a and b. How different are the values
of theta from the values the Perceptron achieved? To do this comparison, should you
compute the difference between the two vectors or something else?

g. For the decision boundaries computed by the SVM, compute the corresponding
geometric margins (as in part c). How do the margins achieved using the SVM
compare with those achieved by using the Perceptron?

56. (Kernelized perceptron)
• ◦ MIT, 2006 fall, Tommi Jaakkola, HW2, pr. 3

Most linear classifiers can be turned into a kernel form. We will focus here on the
simple perceptron algorithm and use the resulting kernel version to classify data
that are not linearly separable.

a. First we need to turn the perceptron algorithm into a form that involves only
inner products between the feature vectors. We will focus on hyperplanes through
the origin in the feature space (any offset component [LC: is assumed to be] provided
as part of the feature vectors). The mistake-driven parameter updates are: θ ←
θ + y_t φ(x_t) if y_t θ⊤φ(x_t) ≤ 0, where θ = 0 initially. Show that we can rewrite the
perceptron updates in terms of simple additive updates on the discriminant function
f(x) = θ⊤φ(x):
f(x) ← f(x) + y_t K(x_t, x) if y_t f(x_t) ≤ 0,
where K(x_t, x) = φ(x_t)⊤φ(x) is any kernel function and f(x) = 0 initially.
b. We can replace K(x_t, x) with any kernel function of our choice, such as the radial
basis kernel, where the corresponding feature mapping is infinite dimensional. Show
that there always is a separating hyperplane if we use the radial basis kernel. Hint:
Use the answers to the previous exercise in this homework (MIT, 2006 fall, Tommi
Jaakkola, HW2, pr. 2).

c. With the radial basis kernel we can therefore conclude that the perceptron
algorithm will converge (stop updating) after a finite number of steps for any dataset
with distinct points. The resulting function can therefore be written as

f(x) = Σ_{i=1}^{n} w_i y_i K(x_i, x)

where w_i is the number of times we made a mistake on example x_i. Most of the w_i's
are exactly zero, so our function won't be difficult to handle. The same form holds
for any kernel, except that we can no longer tell whether the w_i's remain bounded
(i.e., whether the problem is separable with the chosen kernel). Implement the new kernel
perceptron algorithm in MATLAB using radial basis and polynomial kernels. The data and
helpful scripts are provided in...

Define functions

alpha = train_kernel_perceptron(X, y, kernel_type) and
f = discriminant_function(alpha, X, kernel_type, X_test)

to train the perceptron and to evaluate the resulting f(x) for test examples, respectively.
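The exercise asks for MATLAB; as a reference, here is a NumPy sketch of the mistake-count form above. It deviates slightly from the stated signatures (it passes the kernel as a function and y explicitly to the discriminant); the XOR-style toy data and gamma value are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # K(u, v) = exp(-gamma * ||u - v||^2), computed pairwise
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def train_kernel_perceptron(X, y, kernel, max_passes=100):
    """alpha_i counts the mistakes on x_i; f(x) = sum_i alpha_i y_i K(x_i, x)."""
    n = X.shape[0]
    K = kernel(X, X)
    alpha = np.zeros(n)
    for _ in range(max_passes):
        mistakes = 0
        for t in range(n):
            f_t = (alpha * y) @ K[:, t]
            if y[t] * f_t <= 0:        # mistake: additive update on f
                alpha[t] += 1
                mistakes += 1
        if mistakes == 0:              # converged: no updates in a full pass
            break
    return alpha

def discriminant_function(alpha, X, y, kernel, X_test):
    return (alpha * y) @ kernel(X, X_test)

# XOR-like data: not linearly separable, but separable with an RBF kernel
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])
alpha = train_kernel_perceptron(X, y, rbf_kernel)
preds = np.sign(discriminant_function(alpha, X, y, rbf_kernel, X))
```

As part c predicts, only a handful of the alpha entries become nonzero before the algorithm stops updating.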

d. Load the data using the load_p3_a script. When you use a polynomial kernel
to separate the classes, what degree polynomials do you need? Draw the decision
boundary (see the provided script plot_dec_boundary) for the lowest-degree polynomial
kernel that separates the data. Repeat the process for the radial basis kernel.
Briefly discuss your observations.

57. (Convolutional neural networks:
xxx implementation and application on the MNIST dataset)

• CMU, 2016 fall, N. Balcan, M. Gormley, HW6
xxx CMU, 2016 spring, W. Cohen, N. Balcan, HW7

In this assignment, we are going to implement a Convolutional Neural Network (CNN)
to classify handwritten digits of the MNIST data.^33 Since the breakthrough of CNNs
on ImageNet classification (A. Krizhevsky, I. Sutskever, G. E. Hinton, 2012), CNNs
have been widely applied and achieved state-of-the-art results in many areas of
computer vision. The recent AI programs that can beat humans in playing Atari
games (V. Mnih, K. Kavukcuoglu et al., 2015) and Go (D. Silver, A. Huang et al.,
2016) also used CNNs in their models.

We are going to implement the earliest CNN model, LeNet (Y. LeCun, L. Bottou,
Y. Bengio, P. Haffner, 1998), which was successfully applied to classify handwritten
digits. You will get familiar with the workflow needed to build a neural network
model after this assignment.
The Stanford CNN course^34 and the UFLDL material^35 are excellent reading for
beginners. You are encouraged to read some of them before doing this assignment.

A. We begin by introducing the basic structure and building blocks of CNNs. CNNs
are made up of layers that have learnable parameters, including weights and biases.
Each layer takes the output from the previous layer, performs some operations and
produces an output. The final layer is typically a softmax function which outputs the
probability of the input being in different classes. We optimize an objective function
over the parameters of all the layers and then use stochastic gradient descent (SGD)
to update the parameters to train a model.

Depending on the operation in the layers, we can divide the layers into the following
types:

1. Inner product layer (fully connected layer)

As the name suggests, every output neuron of an inner product layer has full connections
to the input neurons. The output is the multiplication of the input with a weight
matrix plus a bias offset, i.e.:

f(x) = Wx + b.     (2)

This is simply a linear transformation of the input. The weight parameter W and
bias parameter b are learnable in this layer. The input x is a d-dimensional column
vector, W is an n × d matrix and b is an n-dimensional column vector.

2. Activation layer

We add nonlinear activation functions after the inner product layers to model the
non-linearity of real data. Here are some of the popular choices for non-linear
activation:

• Sigmoid: σ(x) = 1 / (1 + e^{−x});
• tanh: tanh(x) = (e^{2x} − 1) / (e^{2x} + 1);
• ReLU: relu(x) = max(0, x).
33 http://yann.lecun.com/exdb/mnist/
34 http://cs231n.github.io/
35 http://ufldl.stanford.edu/tutorial/

The Rectified Linear Unit (ReLU) has been found to work well in vision-related problems.
There are no learnable parameters in the ReLU layer. In this homework, you will use
ReLU, and a recently proposed modification of it called the Exponential Linear Unit
(ELU).

Note that the activation is usually combined with the inner product layer as a single
layer, but here we separate them in order to make the code modular.
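The activations listed above, together with the ELU just mentioned (in one common parameterization, a(e^x − 1) for x ≤ 0, assumed here) and the gradients that back propagation needs, can be sketched as:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)

def relu(x):
    return np.maximum(0.0, x)

def elu(x, a=1.0):
    # Exponential Linear Unit: linear for x > 0, a*(e^x - 1) otherwise
    return np.where(x > 0, x, a * (np.exp(x) - 1))

# gradients needed for the backward pass
def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu_grad(x):
    return (x > 0).astype(float)
```

As the text warns, deriving and checking these gradients (e.g. against a finite-difference approximation) before wiring them into back propagation saves a lot of debugging.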

3. Convolution layer
The convolution layer is the core building block of CNNs. Unlike the inner product
layer, each output neuron of a convolution layer is connected only to some input
neurons. As the name suggests, in the convolution layer, we apply convolution operations
with filters on input feature maps (or images). In image processing, there
are many types of kernels (filters) that can be used to blur or sharpen an image, or
detect edges in an image. Read the Wikipedia page^36 if you are not familiar
with the convolution operation.

In a convolution layer, the filter (or kernel) parameters are learnable and we want
to adapt the filters to the data. There is also more than one filter at each convolution
layer. The input to the convolution layer is a three-dimensional tensor (often
referred to as the input feature map in the rest of this document), rather than a
vector as in the inner product layer, and it is of shape h × w × c, where h is the height
of each input image, w is the width and c is the number of channels. Note that we
represent each channel of the image as a different slice in the input tensor.

The nearby figure shows the detailed convolution operation. The input is a feature
map, i.e., a three-dimensional tensor with size h × w × c. The convolution operation
involves applying filters on this input. Each filter is a sliding window, and the
output of the convolution layer is the sequence of outputs produced by each of
those filters during the sliding operation.

Let us assume each filter has a square window of size k × k per channel, thus making
the filter size k × k × c. We use n filters in a convolution layer, making the number of
parameters in this layer k × k × c × n. In addition to these parameters, the convolution
layer also has two hyper-parameters: the padding size p and the stride step s. In the
sliding window process described above, the output from each filter is a function of a
neighborhood of the input feature map. Since the edges have fewer neighbors, applying
a filter directly is not feasible. To avoid this problem, inputs are typically padded
(with zeros) on all sides, effectively making the height and width of the padded
input h + 2p and w + 2p respectively, where p is the size of the padding. The stride s is
the step size of the convolution operation.

As the above figure shows, the red square on the left is a filter applied locally on
the input feature map. We multiply the filter weights (of size k × k × c) with a local
region of the input feature map and then sum the products to get the output feature
map. Hence, the first two dimensions of the output feature map are [(h + 2p − k)/s + 1] ×
[(w + 2p − k)/s + 1]. Since we have n filters in a convolution layer, the output feature
map is of size [(h + 2p − k)/s + 1] × [(w + 2p − k)/s + 1] × n.

36 https://en.wikipedia.org/wiki/Kernel_(image_processing)

For more details about the convolutional layer, see Stanford's course on CNNs for
visual recognition.^37
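The output-size and parameter-count formulas above are worth sanity-checking in code before implementing the layer itself; a small sketch (the LeNet-like layer configuration used as the example is an assumption based on this assignment's setup):

```python
def conv_output_shape(h, w, c, k, s, p, n):
    """Output feature-map shape [(h+2p-k)/s + 1, (w+2p-k)/s + 1, n]
    for n filters of size k x k x c, stride s, zero-padding p."""
    h_out = (h + 2 * p - k) // s + 1
    w_out = (w + 2 * p - k) // s + 1
    return (h_out, w_out, n)

def conv_num_params(k, c, n, bias=True):
    """k*k*c weights per filter, n filters, plus one bias per filter."""
    return k * k * c * n + (n if bias else 0)

# first convolution on a 28 x 28 x 1 MNIST image with k=5, s=1, p=0, n=20
shape1 = conv_output_shape(28, 28, 1, k=5, s=1, p=0, n=20)
# a second convolution on a 12 x 12 x 20 map with k=5, s=1, p=0, n=50
shape2 = conv_output_shape(12, 12, 20, k=5, s=1, p=0, n=50)
```

Checking these numbers against your stored weight shapes catches most off-by-one padding and stride bugs early.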

4. Pooling layer
It is common to use pooling layers after convolutional layers to reduce the spatial
size of the feature maps. Pooling layers are also called down-sampling layers, and perform
an aggregation operation on the output of a convolution layer. Like the convolution
layer, the pooling operation also acts locally on the feature maps. A popular kind
of pooling is max-pooling, which simply computes the maximum value within each
feature window. This allows us to extract more salient feature maps and to reduce
the number of parameters of CNNs, which reduces over-fitting. Pooling is
typically applied independently within each channel of the input feature map.
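Max-pooling as described above can be sketched as follows (a straightforward loop version for clarity; the window size k = 2 with stride s = 2 matches the LeNet configuration used later):

```python
import numpy as np

def max_pool(x, k=2, s=2):
    """Max-pooling applied independently per channel.
    x: (h, w, c) feature map; windows are non-overlapping when s == k."""
    h, w, c = x.shape
    h_out = (h - k) // s + 1
    w_out = (w - k) // s + 1
    out = np.zeros((h_out, w_out, c))
    for i in range(h_out):
        for j in range(w_out):
            window = x[i * s:i * s + k, j * s:j * s + k, :]
            out[i, j, :] = window.max(axis=(0, 1))   # max per channel
    return out

# a 4 x 4 single-channel map pools down to 2 x 2
x = np.arange(16, dtype=float).reshape(4, 4, 1)
pooled = max_pool(x)
```

For the backward pass, one additionally records which position in each window attained the maximum, since only that input receives the gradient.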

5. Loss layer
For a classification task, we use a softmax function to assign a probability to each class
given the input feature map:

p = softmax(Wx + b).     (3)

In training, we know the label of the input image; hence, we want to minimize
the negative log probability of the given label:

l = − log(p_j),     (4)

where j is the label of the input. This is the objective function we would like to
optimize.
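Equations (3) and (4) can be sketched directly (the max-subtraction for numerical stability is a standard implementation detail, not part of the equations; the zero-initialized W, b are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

def softmax_loss(W, b, x, j):
    """Negative log probability of the true label j, equations (3)-(4)."""
    p = softmax(W @ x + b)
    return -np.log(p[j]), p

# illustrative: 10 classes, 4 input features, zero-initialized parameters
W = np.zeros((10, 4))
b = np.zeros(10)
x = np.ones(4)
loss, p = softmax_loss(W, b, x, j=3)
```

With zero parameters the 10 class probabilities are uniform (0.1 each), so the loss starts at log 10 ≈ 2.3 — a useful sanity check for the first training iteration.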

B. LeNet
Having introduced the building components of CNNs, we now introduce the
architecture of LeNet.

Layer Type     Configuration
Input          size: 28 × 28 × 1
Convolution    k = 5, s = 1, p = 0, n = 20
Pooling        MAX, k = 2, s = 2, p = 0
Convolution    k = 5, s = 1, p = 0, n = 50
Pooling        MAX, k = 2, s = 2, p = 0
IP             n = 500
ReLU
Loss

The architecture of LeNet is shown in the table above. The name of each layer type
explains itself. LeNet is composed of interleaved convolution and pooling layers,
followed by an inner product layer and finally a loss layer. This is the typical
structure of CNNs.

37 http://cs231n.github.io/convolutional-networks/

8 Support Vector Machines

58. (An implementation of SVM using the quadprog MatLab function;
xxx application on different datasets from R^2)
• · MIT, 2001 fall, Tommi Jaakkola, HW3, pr. 2

59. (Choosing an SVM kernel)
• · CMU, 2013 fall, A. Smola, B. Poczos, HW3, pr. 2

60. (An implementation of SVM using the quadprog MatLab function;
xxx comparison with the perceptron;
xxx using SVMlight for digit recognition)
• · MIT, 2006 fall, Tommi Jaakkola, HW1, section B

61. (ν-SVM: implementation using the quadprog MatLab function;
xxx train and test it on a given dataset)
• · MIT, 2004 fall, Tommi Jaakkola, HW3, pr. 3.3

62. (Minimum Enclosing Ball (MEB) / Anomaly detection)
• · MIT, 2009 fall, Tommi Jaakkola, HW2, pr. 2.d
• ◦ CMU, 2017 fall, Nina Balcan, HW4, pr. 4.1
Let S = {(x_1, y_1), . . . , (x_n, y_n)} be n labeled examples in R^d with label set {1, −1}. Recall
that the primal formulation of SVM is given by

Primal:

min_{w ∈ R^d, ξ_1,...,ξ_n}  ||w||^2 + C Σ_{i=1}^{n} ξ_i,
s.t. ∀i, y_i ⟨w, X_i⟩ ≥ 1 − ξ_i,  ξ_i ≥ 0.

It can be proven (see ex. 21) that the primal problem can be rewritten as the
following equivalent problem in the Empirical Risk Minimization (ERM) framework:

min_{w ∈ R^d}  λ||w||_2^2 + (1/n) Σ_{i=1}^{n} max(1 − y_i ⟨w, X_i⟩, 0),     (5)

where λ = 1/(nC).

We will now optimize the above unconstrained formulation of SVM using Stochastic
Sub-Gradient Descent. In this problem you will be using a binary (two-class) version
of the mnist dataset. The data and code template can be downloaded from the class
website:
https://sites.google.com/site/10715advancedmlintro2017f/homework-exams
The data folder has the mnist2.mat file which contains the train, test and validation
datasets. The python folder has the python code template (and the matlab folder has the
matlab code template) which you will use for your implementation. You can either
use python or matlab for this programming question.

We slightly modify Equation (5) and use the following formulation in this problem:

min_{w ∈ R^d}  (λ/2)||w||_2^2 + (1/n) Σ_{i=1}^{n} max(1 − y_i ⟨w, X_i⟩, 0).

This is only done to simplify calculations. You will optimize this objective using
Stochastic Sub-Gradient Descent (SSGD).^38 This approach is very simple and scales
well to large datasets.^39 In SSGD we randomly sample a training data point in each
iteration and update the weight vector by taking a small step along the direction of
the negative sub-gradient of the loss.^40

The SSGD algorithm is given by:

• Initialize the weight vector w = 0.
• For t = 1 . . . T:
   * Choose i_t ∈ {1, . . . , n} uniformly at random.
   * Set η_t = 1/(λt).
   * If y_{i_t} ⟨w, X_{i_t}⟩ < 1 then:
      Set w ← (1 − λη_t)w + η_t y_{i_t} X_{i_t}
   * Else:
      Set w ← (1 − λη_t)w
• Return w

Note that we don't consider the bias/intercept term in this problem.
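The SSGD loop above translates almost line for line into code. This is a self-contained sketch, not the assignment's train(w0, Xtrain, ytrain, T, lambda) template (the function name, the RNG seeding, and the toy Gaussian data are assumptions made for illustration):

```python
import numpy as np

def ssgd_svm_train(Xtrain, ytrain, T, lam, seed=0):
    """Pegasos-style SSGD for the bias-free hinge-loss objective
    (lam/2)||w||^2 + (1/n) sum_i max(1 - y_i <w, x_i>, 0)."""
    rng = np.random.default_rng(seed)
    n, d = Xtrain.shape
    w = np.zeros(d)                        # initialize w = 0
    for t in range(1, T + 1):
        i = int(rng.integers(n))           # sample a point uniformly at random
        eta = 1.0 / (lam * t)              # step size eta_t = 1/(lam*t)
        if ytrain[i] * (w @ Xtrain[i]) < 1:
            w = (1 - lam * eta) * w + eta * ytrain[i] * Xtrain[i]
        else:
            w = (1 - lam * eta) * w        # only the regularizer contributes
    return w

# toy separable data (illustrative stand-in for mnist2.mat)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2, 0.5, (50, 2)), rng.normal(-2, 0.5, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)
w = ssgd_svm_train(X, y, T=200 * len(y), lam=0.1)
acc = float(np.mean(np.sign(X @ w) == y))
```

Note that at t = 1 the factor (1 − λη_t) is zero, so the first update discards the initial w entirely; this is a known property of the 1/(λt) step-size schedule, not a bug.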

a. Complete the train(w0, Xtrain, ytrain, T, lambda) function in the svm.py file
(matlab users: complete the train.m file).

b. The function train(w0, Xtrain, ytrain, T, lambda) runs the SSGD algorithm,
taking in an initial weight vector w0, a matrix of covariates Xtrain, and a vector of labels
ytrain. T is the number of iterations of SSGD and lambda is the hyper-parameter in
the objective. It outputs the learned weight vector w.

c. Run svm_run.py to perform training and see the performance on the training and test
sets.

d. Use the validation dataset for picking a good lambda (λ) from the set {1e3, 1e2, 1e1,
1, 0.1}.
e. Report the accuracy numbers on the train and test datasets obtained using the best
lambda, after running SSGD for 200 epochs (i.e., T = 200 ∗ n). Generate the training
accuracy vs. training time and test accuracy vs. training time plots.

38 See Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated
sub-gradient solver for SVM. Mathematical Programming, 127(1):3-30, 2011.
39 To estimate the optimal w, one can also optimize the dual formulation of this problem. Some of the popular
SVM solvers such as LIBSVM solve the dual problem. Other fast approaches for solving the dual formulation on
large datasets use dual coordinate descent.
40 The sub-gradient generalizes the notion of gradient to non-differentiable functions.
