Homework #4
RELEASE DATE: 10/21/2024
RED CORRECTION: 10/27/2024 10:45
DUE DATE: 11/04/2024, BEFORE 13:00 on GRADESCOPE
QUESTIONS ARE WELCOMED ON DISCORD (INFORMALLY) OR VIA EMAILS (FORMALLY).
You will use Gradescope to upload your scanned/printed solutions. Any programming language/platform
is allowed.
Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail
the class and/or be kicked out of school and/or receive other punishments for such misconduct.
Discussions on course materials and homework solutions are encouraged. But you should write the final
solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but
not copied from.
Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework
solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness
in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will
be punished according to the honesty policy.
You should write your solutions in English with the common math notations introduced in class or in the
problems. We do not accept solutions written in any other languages.
This homework set comes with 200 points and 20 bonus points. In general, every homework set would
come with a full credit of 200 points, with some possible bonus points.
1. (10 points, auto-graded) In class, we introduced our version of the cross-entropy error function
$$E_{\text{in}}(\mathbf{w}) = -\frac{1}{N}\sum_{n=1}^{N} \ln \theta(y_n \mathbf{w}^T \mathbf{x}_n)$$
based on the definition of $y_n \in \{-1, +1\}$. If we transform $y_n$ to $y'_n \in \{0, 1\}$ by $y'_n = \frac{y_n + 1}{2}$, which
of the following error functions is equivalent to $E_{\text{in}}$ above?
[a] $\frac{1}{N}\sum_{n=1}^{N} \bigl[ +y'_n \ln \theta(+\mathbf{w}^T \mathbf{x}_n) + (1 - y'_n) \ln \theta(-\mathbf{w}^T \mathbf{x}_n) \bigr]$
[b] $\frac{1}{N}\sum_{n=1}^{N} \bigl[ +y'_n \ln \theta(-\mathbf{w}^T \mathbf{x}_n) + (1 - y'_n) \ln \theta(+\mathbf{w}^T \mathbf{x}_n) \bigr]$
[c] $\frac{1}{N}\sum_{n=1}^{N} \bigl[ -y'_n \ln \theta(+\mathbf{w}^T \mathbf{x}_n) - (1 - y'_n) \ln \theta(-\mathbf{w}^T \mathbf{x}_n) \bigr]$
[d] $\frac{1}{N}\sum_{n=1}^{N} \bigl[ -y'_n \ln \theta(-\mathbf{w}^T \mathbf{x}_n) - (1 - y'_n) \ln \theta(+\mathbf{w}^T \mathbf{x}_n) \bigr]$
[e] none of the other choices
(Note: In the error functions in the choices, there is a form of p log q + (1 − p) log(1 − q), which is
the origin of the name “cross-entropy.”)
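(Note: the original error function can also be checked numerically. Below is a minimal Python sketch; the function names and toy data are illustrative only.)

    import numpy as np

    def theta(s):
        # logistic function: theta(s) = 1 / (1 + exp(-s))
        return 1.0 / (1.0 + np.exp(-s))

    def cross_entropy_pm1(w, X, y):
        # E_in(w) = -(1/N) * sum_n ln theta(y_n w^T x_n), with y_n in {-1, +1}
        return -np.mean(np.log(theta(y * (X @ w))))

    # toy check (illustrative data only): at w = 0, E_in equals ln 2
    X = np.array([[1.0, 2.0], [1.0, -1.0]])
    y = np.array([+1.0, -1.0])
    print(cross_entropy_pm1(np.zeros(2), X, y))  # ~0.6931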
2. (10 points, auto-graded) In the perceptron learning algorithm, we find one example $(\mathbf{x}_{n(t)}, y_{n(t)})$
that the current weight vector $\mathbf{w}_t$ mis-classifies, and then update $\mathbf{w}_t$ by
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}.$$
A variant of the algorithm finds all examples $(\mathbf{x}_n, y_n)$ that the weight vector $\mathbf{w}_t$ mis-classifies
(i.e. $y_n \neq \operatorname{sign}(\mathbf{w}_t^T \mathbf{x}_n)$), and then updates $\mathbf{w}_t$ by
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \frac{\eta}{N} \sum_{n \,:\, y_n \neq \operatorname{sign}(\mathbf{w}_t^T \mathbf{x}_n)} y_n \mathbf{x}_n.$$
The variant can be viewed as optimizing some $E_{\text{in}}(\mathbf{w})$ that is composed of one of the following
pointwise error functions with gradient descent under a fixed learning rate (neglecting any non-differentiable
spots of $E_{\text{in}}$). What is the error function?
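(Note: for reference, the variant's batch update rule can be coded directly. The following is a minimal Python sketch under illustrative names; it only implements the update given above, not the answer to this problem.)

    import numpy as np

    def variant_update(w_t, X, y, eta):
        # one update of the variant: sum y_n * x_n over all examples with
        # y_n != sign(w_t^T x_n), then scale by eta / N
        N = len(y)
        mis = np.sign(X @ w_t) != y
        return w_t + (eta / N) * (y[mis][:, None] * X[mis]).sum(axis=0)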
for some $0 < \alpha < 1$. When using stochastic gradient descent to minimize an $E_{\text{in}}(\mathbf{w})$ that is
composed of the asymmetric squared error function, which of the following is the update direction
$-\nabla \operatorname{err}_\alpha(\mathbf{w}, \mathbf{x}_n, y_n)$ for the chosen $(\mathbf{x}_n, y_n)$ with respect to $\mathbf{w}_t$?
[a] $2\bigl(1 + \alpha^2 \cdot \operatorname{sign}(\mathbf{w}_t^T \mathbf{x}_n - y_n)\bigr)\bigl(y_n - \mathbf{w}_t^T \mathbf{x}_n\bigr)\mathbf{x}_n$
[b] $2\bigl(1 - \alpha^2 \cdot \operatorname{sign}(\mathbf{w}_t^T \mathbf{x}_n - y_n)\bigr)\bigl(y_n - \mathbf{w}_t^T \mathbf{x}_n\bigr)\mathbf{x}_n$
[c] $2\bigl(1 + \alpha \cdot \operatorname{sign}(\mathbf{w}_t^T \mathbf{x}_n - y_n)\bigr)\bigl(y_n - \mathbf{w}_t^T \mathbf{x}_n\bigr)\mathbf{x}_n$
[d] $2\bigl(1 - \alpha \cdot \operatorname{sign}(\mathbf{w}_t^T \mathbf{x}_n - y_n)\bigr)\bigl(y_n - \mathbf{w}_t^T \mathbf{x}_n\bigr)\mathbf{x}_n$
[e] none of the other choices
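(Note: once the correct direction is identified, it plugs into a standard SGD loop. Below is a minimal Python sketch with neg_grad as a placeholder for the chosen $-\nabla \operatorname{err}_\alpha$; all names are illustrative.)

    import numpy as np

    def sgd(w0, X, y, neg_grad, eta=0.1, iters=1000, seed=0):
        # generic SGD: pick one (x_n, y_n) at random each iteration and move w
        # along the supplied update direction neg_grad(w, x_n, y_n)
        rng = np.random.default_rng(seed)
        w = np.array(w0, dtype=float)
        for _ in range(iters):
            n = rng.integers(len(y))
            w = w + eta * neg_grad(w, X[n], y[n])
        return w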
4. (10 points, auto-graded) After “visualizing” the data and noticing that all $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$ are
distinct, Dr. Transformer magically decides the following transform
$$\Phi(\mathbf{x}) = \bigl(\,[\![\mathbf{x} = \mathbf{x}_1]\!],\ [\![\mathbf{x} = \mathbf{x}_2]\!],\ \ldots,\ [\![\mathbf{x} = \mathbf{x}_N]\!]\,\bigr).$$
That is, $\Phi(\mathbf{x})$ is an $N$-dimensional vector whose $n$-th component is 1 if and only if $\mathbf{x} = \mathbf{x}_n$. If we
run linear regression (i.e. squared error) after applying this transform, what is the optimal w̃? For
simplicity, please ignore z0 = 1. That is, we do not pad the Z space examples with a constant
feature.
[a] 1N , the N -dimensional vector of all 1’s.
[b] y
[c] 0N , the N -dimensional vector of all 0’s.
[d] −y
[e] $\frac{1}{2}(\mathbf{1}_N + \mathbf{y})$
(Note: Be sure to also check what Ein (w̃) is, and think about what Eout (w̃) might be!)
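(Note: the transform itself is easy to code. A minimal Python sketch, with toy data for illustration only.)

    import numpy as np

    def indicator_transform(x, X_train):
        # Phi(x): N-dimensional vector whose n-th component is 1 iff x equals x_n
        return np.array([float(np.array_equal(x, xn)) for xn in X_train])

    # toy usage (illustrative data, not from the homework)
    X_train = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
    print(indicator_transform(np.array([3.0, 4.0]), X_train))  # [0. 1. 0.]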
5. (20 points, human-graded) Let $E(\mathbf{w}): \mathbb{R}^d \to \mathbb{R}$ be a function. Denote the gradient $\mathbf{b}_E(\mathbf{w})$ and the
Hessian $A_E(\mathbf{w})$ by
$$\mathbf{b}_E(\mathbf{w}) = \nabla E(\mathbf{w}) = \begin{bmatrix} \frac{\partial E}{\partial w_1}(\mathbf{w}) \\ \frac{\partial E}{\partial w_2}(\mathbf{w}) \\ \vdots \\ \frac{\partial E}{\partial w_d}(\mathbf{w}) \end{bmatrix}_{d \times 1} \quad\text{and}\quad A_E(\mathbf{w}) = \begin{bmatrix} \frac{\partial^2 E}{\partial w_1^2}(\mathbf{w}) & \frac{\partial^2 E}{\partial w_1 \partial w_2}(\mathbf{w}) & \cdots & \frac{\partial^2 E}{\partial w_1 \partial w_d}(\mathbf{w}) \\ \frac{\partial^2 E}{\partial w_2 \partial w_1}(\mathbf{w}) & \frac{\partial^2 E}{\partial w_2^2}(\mathbf{w}) & \cdots & \frac{\partial^2 E}{\partial w_2 \partial w_d}(\mathbf{w}) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 E}{\partial w_d \partial w_1}(\mathbf{w}) & \frac{\partial^2 E}{\partial w_d \partial w_2}(\mathbf{w}) & \cdots & \frac{\partial^2 E}{\partial w_d^2}(\mathbf{w}) \end{bmatrix}_{d \times d}.$$
An iterative optimization algorithm using the “optimal direction” above for updating w is called
Newton’s method, which can be viewed as “improving” gradient descent by using more infor-
mation about $E$. Now, consider minimizing $E_{\text{in}}(\mathbf{w})$ in the logistic regression problem with Newton's
method on a data set $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$ with the cross-entropy error function for $E_{\text{in}}$:
$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \ln\bigl(1 + \exp(-y_n \mathbf{w}^T \mathbf{x}_n)\bigr).$$
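(Note: for reference, one Newton update can be sketched in Python as below. Here grad_fn and hess_fn are placeholders for $\mathbf{b}_E$ and $A_E$ of the cross-entropy $E_{\text{in}}$ above; the sketch only encodes the generic update rule, not any derivation asked for in this problem.)

    import numpy as np

    def newton_step(w, grad_fn, hess_fn):
        # one Newton update: w <- w - A_E(w)^{-1} b_E(w)
        b = grad_fn(w)      # gradient of E at w
        A = hess_fn(w)      # Hessian of E at w
        return w - np.linalg.solve(A, b)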
6. (20 points, human-graded) In Lecture 11, we solve multiclass classification by OVA or OVO decompositions.
One alternative to deal with multiclass classification is to extend the original logistic regression model to
Multinomial Logistic Regression (MLR). For a $K$-class classification problem, we will denote the output
space $\mathcal{Y} = \{1, 2, \cdots, K\}$. The hypotheses considered by MLR can be indexed by a matrix
$$W = \begin{bmatrix} | & | & \cdots & | & \cdots & | \\ \mathbf{w}_1 & \mathbf{w}_2 & \cdots & \mathbf{w}_k & \cdots & \mathbf{w}_K \\ | & | & \cdots & | & \cdots & | \end{bmatrix}_{(d+1) \times K},$$
that contains weight vectors $(\mathbf{w}_1, \cdots, \mathbf{w}_K)$, each of length $d+1$. The matrix represents a hypothesis
$$h_y(\mathbf{x}) = \frac{\exp(\mathbf{w}_y^T \mathbf{x})}{\sum_{k=1}^{K} \exp(\mathbf{w}_k^T \mathbf{x})}$$
that can be used to approximate the target distribution $P(y|\mathbf{x})$ for any $(\mathbf{x}, y)$. MLR then seeks
the maximum likelihood solution over all such hypotheses. For a given data set $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$
generated i.i.d. from some $P(\mathbf{x})$ and target distribution $P(y|\mathbf{x})$, the likelihood of $h_y(\mathbf{x})$ is proportional
to $\prod_{n=1}^{N} h_{y_n}(\mathbf{x}_n)$. That is, minimizing the negative log likelihood is equivalent to minimizing
an $E_{\text{in}}(W)$ that is composed of the following error function
$$\operatorname{err}(W, \mathbf{x}, y) = -\ln h_y(\mathbf{x}) = -\sum_{k=1}^{K} [\![y = k]\!] \ln h_k(\mathbf{x}).$$
When minimizing $E_{\text{in}}(W)$ with SGD, we update $W^{(t)}$ at the $t$-th iteration to $W^{(t+1)}$ by
$$W^{(t+1)} \leftarrow W^{(t)} + \eta \cdot V,$$
where $V$ is a $(d + 1) \times K$ matrix whose $k$-th column is an update direction for the $k$-th weight
vector. Assume that an example $(\mathbf{x}_n, y_n)$ is used for the SGD update above. It can be proved that
$$V = \mathbf{x}_n \cdot \mathbf{u}^T,$$
9. (20 points, human-graded) In Lecture 13, we discussed adding “virtual examples” (hints)
to help combat overfitting. One way of generating virtual examples is to add small noise to the
input vector $\mathbf{x} \in \mathbb{R}^{d+1}$ (including the 0-th component $x_0$). For each $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)$
in our training data set, assume that we generate virtual examples $(\tilde{\mathbf{x}}_1, y_1), (\tilde{\mathbf{x}}_2, y_2), \ldots, (\tilde{\mathbf{x}}_N, y_N)$,
where $\tilde{\mathbf{x}}_n$ is simply $\mathbf{x}_n + \boldsymbol{\epsilon}$ and the noise vector $\boldsymbol{\epsilon} \in \mathbb{R}^{d+1}$ is generated i.i.d. from a multivariate
normal distribution $\mathcal{N}(\mathbf{0}_{d+1}, \sigma^2 \cdot I_{d+1})$. The vector $\boldsymbol{\epsilon}$ is a random vector that varies for each virtual
example. Here $\mathbf{0}_{d+1} \in \mathbb{R}^{d+1}$ denotes the all-zero vector and $I_{d+1}$ is an identity matrix of size $d + 1$.
Recall that when training the linear regression model, we need to calculate $X^T X$ first. Define the
hinted input matrix
$$X_h = \begin{bmatrix} | & \cdots & | & | & \cdots & | \\ \mathbf{x}_1 & \cdots & \mathbf{x}_N & \tilde{\mathbf{x}}_1 & \cdots & \tilde{\mathbf{x}}_N \\ | & \cdots & | & | & \cdots & | \end{bmatrix}^T.$$
The expected value $\mathbb{E}(X_h^T X_h)$, where the expectation is taken over the Gaussian noise-generating
process above, is of the form
$$\alpha X^T X + \beta \sigma^2 I_{d+1}.$$
Derive the correct $(\alpha, \beta)$.
(Note: The form here “hints” you that the expected value is related to the matrix being inverted
for regularized linear regression—see Lecture 14. That is, data hinting “by noise” is closely related
to regularization. If x contains the pixels of an image, the virtual example is a Gaussian-noise-
contaminated image with the same label, e.g. https://en.wikipedia.org/wiki/Gaussian_noise.
Adding such noise is a very common technique to generate virtual examples for images.)
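(Note: the noise-generating process can also be simulated, which gives an empirical check of the derived $(\alpha, \beta)$. A minimal Python sketch; the function name is illustrative.)

    import numpy as np

    def hinted_matrix(X, sigma, rng=None):
        # stack the original examples with their noisy virtual copies:
        # x_tilde_n = x_n + eps, eps ~ N(0_{d+1}, sigma^2 I_{d+1}), drawn fresh per example
        rng = np.random.default_rng() if rng is None else rng
        noise = rng.normal(0.0, sigma, size=X.shape)
        return np.vstack([X, X + noise])   # X_h, of shape (2N, d+1)

Averaging $X_h^T X_h$ over many independent draws of the noise should approach the claimed form.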
10. (20 points, code needed, human-graded) Next, we use a real-world data set to study linear and
polynomial regression. We will reuse the cpusmall_scale data set at
https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/cpusmall_scale
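(Note: as a starting point, the sketch below only covers reading the libsvm-format data and one plain least-squares fit, not the full set of steps required in this problem. scikit-learn's load_svmlight_file is one convenient way to parse the format; the file name assumes a local download.)

    import numpy as np
    from sklearn.datasets import load_svmlight_file

    # load the libsvm-format file (downloaded separately from the URL above)
    X_sparse, y = load_svmlight_file("cpusmall_scale")
    X = X_sparse.toarray()

    # pad with the constant feature x_0 = 1 and fit ordinary least squares
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    w_lin = np.linalg.pinv(X) @ y
    print("squared error on the full set:", np.mean((X @ w_lin - y) ** 2))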
11. (20 points, code needed, human-graded) Next, consider the following homogeneous order-$Q$ polynomial transform
Transform the training and testing data according to $\Phi(\mathbf{x})$ with $Q = 3$, and again implement steps
1-4 in the previous problem on $\mathbf{z}_n = \Phi(\mathbf{x}_n)$ instead of $\mathbf{x}_n$ to get $\mathbf{w}_{\text{poly}}$. Repeat the four steps above
1126 times, each with a different random seed. Plot a histogram of $E_{\text{in}}^{\text{sqr}}(\mathbf{w}_{\text{lin}}) - E_{\text{in}}^{\text{sqr}}(\mathbf{w}_{\text{poly}})$. What
is the average $E_{\text{in}}^{\text{sqr}}(\mathbf{w}_{\text{lin}}) - E_{\text{in}}^{\text{sqr}}(\mathbf{w}_{\text{poly}})$? This is the $E_{\text{in}}$ gain for using the (homogeneous) polynomial
transform. Describe your findings.
Finally, provide a snapshot of the first page of your code as proof that you have written the code.
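(Note: the exact form of $\Phi$ is the one defined in this problem; the sketch below assumes a power-only homogeneous transform, i.e. concatenating $x_i^q$ for $q = 1, \ldots, Q$ with no cross terms, and should be adjusted if your $\Phi$ differs. Apply it to the raw features before padding the constant feature.)

    import numpy as np

    def poly_transform(X, Q=3):
        # assumed homogeneous order-Q transform: concatenate x_i^q for q = 1, ..., Q
        # (no cross terms); adjust to match the transform defined in the problem
        return np.hstack([X ** q for q in range(1, Q + 1)])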
12. (20 points, code needed, human-graded) Following from the 1126 experiments in the previous problem,
plot a histogram of $E_{\text{out}}^{\text{sqr}}(\mathbf{w}_{\text{lin}}) - E_{\text{out}}^{\text{sqr}}(\mathbf{w}_{\text{poly}})$. What is the average $E_{\text{out}}^{\text{sqr}}(\mathbf{w}_{\text{lin}}) - E_{\text{out}}^{\text{sqr}}(\mathbf{w}_{\text{poly}})$?
This is the $E_{\text{out}}$ change for using the (homogeneous) polynomial transform. Describe your findings.
Finally, provide a snapshot of the first page of your code as proof that you have written the code.
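(Note: for the plotting part, a minimal matplotlib sketch is given below; the array and file names are illustrative, and diffs should hold your 1126 computed differences.)

    import numpy as np
    import matplotlib.pyplot as plt

    # diffs: the 1126 values of E_out(w_lin) - E_out(w_poly) from your experiments
    diffs = np.load("eout_differences.npy")   # illustrative file name
    plt.hist(diffs, bins=50)
    plt.xlabel("E_out(w_lin) - E_out(w_poly)")
    plt.ylabel("count")
    plt.title("E_out change from the polynomial transform")
    plt.savefig("hw4_p12_hist.png")
    print("average difference:", diffs.mean())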