
Machine Learning (NTU, Fall 2024) instructor: Hsuan-Tien Lin

Homework #4
RELEASE DATE: 10/21/2024
RED CORRECTION: 10/27/2024 10:45
DUE DATE: 11/04/2024, BEFORE 13:00 on GRADESCOPE
QUESTIONS ARE WELCOMED ON DISCORD (INFORMALLY) OR VIA EMAILS (FORMALLY).

You will use Gradescope to upload your scanned/printed solutions. Any programming language/platform
is allowed.
Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail
the class and/or be kicked out of school and/or receive other punishments for such misconduct.
Discussions on course materials and homework solutions are encouraged. But you should write the final
solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but
not copied from.
Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework
solutions and/or source code to your classmates at any time. In order to maximize the level of fairness
in this class, lending and borrowing homework solutions are both regarded as dishonest behavior and will
be punished according to the honesty policy.
You should write your solutions in English with the common math notations introduced in class or in the
problems. We do not accept solutions written in any other languages.

This homework set comes with 200 points and 20 bonus points. In general, every homework set
comes with a full credit of 200 points, with some possible bonus points.

1. (10 points, auto-graded) In class, we introduced our version of the cross-entropy error function

    Ein(w) = −(1/N) Σ_{n=1}^{N} ln θ(yn w^T xn)

based on the definition of yn ∈ {−1, +1}. If we transform yn to yn′ ∈ {0, 1} by yn′ = (yn + 1)/2, which
of the following error functions is equivalent to Ein above?
[a] (1/N) Σ_{n=1}^{N} [ +yn′ ln θ(+w^T xn) + (1 − yn′) ln θ(−w^T xn) ]
[b] (1/N) Σ_{n=1}^{N} [ +yn′ ln θ(−w^T xn) + (1 − yn′) ln θ(+w^T xn) ]
[c] (1/N) Σ_{n=1}^{N} [ −yn′ ln θ(+w^T xn) − (1 − yn′) ln θ(−w^T xn) ]
[d] (1/N) Σ_{n=1}^{N} [ −yn′ ln θ(−w^T xn) − (1 − yn′) ln θ(+w^T xn) ]
[e] none of the other choices
(Note: In the error functions in the choices, there is a form of p log q + (1 − p) log(1 − q), which is
the origin of the name “cross-entropy.”)
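
For concreteness, a minimal numpy sketch of computing this Ein under the yn ∈ {−1, +1} convention; the array names X, y, and w are illustrative assumptions, not part of the problem statement:

    import numpy as np

    def theta(s):
        # logistic function theta(s) = 1 / (1 + exp(-s))
        return 1.0 / (1.0 + np.exp(-s))

    def cross_entropy_Ein(w, X, y):
        # X: N x (d+1) input matrix, y: length-N vector with entries in {-1, +1}
        # Ein(w) = -(1/N) * sum_n ln theta(y_n w^T x_n)
        return -np.mean(np.log(theta(y * (X @ w))))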


2. (10 points, auto-graded) In the perceptron learning algorithm, we find one example (xn(t) , yn(t) )
that the current weight vector wt mis-classifies, and then update wt by

wt+1 ← wt + yn(t) xn(t) .

A variant of the algorithm finds all examples (xn, yn) that the weight vector wt mis-classifies
(i.e. yn ≠ sign(wt^T xn)), and then updates wt by

    wt+1 ← wt + (η/N) Σ_{n : yn ≠ sign(wt^T xn)} yn xn.

The variant can be viewed as running fixed-learning-rate gradient descent on some Ein(w) that is composed
of one of the following pointwise error functions (neglecting any non-differentiable points of Ein).
What is the error function?

[a] err(w, x, y) = −y w^T x
[b] err(w, x, y) = max(0, 1 − y w^T x)
[c] err(w, x, y) = max(0, −y w^T x)
[d] err(w, x, y) = min(0, 1 − y w^T x)
[e] err(w, x, y) = min(0, −y w^T x)
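
For concreteness, a minimal numpy sketch of the batch update in Problem 2, assuming X is the N × (d+1) input matrix and y the {−1, +1} label vector (these names are illustrative):

    import numpy as np

    def variant_update(w, X, y, eta):
        # find all examples the current w misclassifies: y_n != sign(w^T x_n)
        mis = np.sign(X @ w) != y
        # w_{t+1} <- w_t + (eta / N) * sum over misclassified n of y_n x_n
        return w + (eta / len(y)) * (X[mis].T @ y[mis])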
3. (10 points, auto-graded) In regression, we sometimes want to deal with the situation where an
over-estimate is worse than an under-estimate. This calls for the following asymmetric squared
error

    errα(w, x, y) = (1 + α) ⟦w^T x ≥ y⟧ (w^T x − y)^2 + (1 − α) ⟦w^T x < y⟧ (w^T x − y)^2

for some 0 < α < 1. When using stochastic gradient descent to minimize an Ein (w) that is
composed of the asymmetric squared error function, which of the following is the update direction
−∇ errα (w, xn , yn ) for the chosen (xn , yn ) with respect to wt ?
 
[a] 2(1 + α^2 · sign(wt^T xn − yn)) (yn − wt^T xn) xn
[b] 2(1 − α^2 · sign(wt^T xn − yn)) (yn − wt^T xn) xn
[c] 2(1 + α · sign(wt^T xn − yn)) (yn − wt^T xn) xn
[d] 2(1 − α · sign(wt^T xn − yn)) (yn − wt^T xn) xn
[e] none of the other choices
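
A minimal sketch of the asymmetric squared error above for a single example (plain Python/numpy; the names are illustrative):

    import numpy as np

    def err_alpha(w, x, y, alpha):
        # (1 + alpha) * [w^T x >= y] * (w^T x - y)^2 + (1 - alpha) * [w^T x < y] * (w^T x - y)^2
        r = np.dot(w, x) - y
        weight = (1.0 + alpha) if r >= 0 else (1.0 - alpha)
        return weight * r ** 2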

4. (10 points, auto-graded) After “visualizing” the data and noticing that all x1, x2, . . ., xN are
distinct, Dr. Transformer magically decides on the following transform

    Φ(x) = (⟦x = x1⟧, ⟦x = x2⟧, . . . , ⟦x = xN⟧).

That is, Φ(x) is an N-dimensional vector whose n-th component is 1 if and only if x = xn. If we
run linear regression (i.e. squared error) after applying this transform, what is the optimal w̃? For
simplicity, please ignore z0 = 1. That is, we do not pad the Z-space examples with a constant
feature.
[a] 1N, the N-dimensional vector of all 1’s.
[b] y
[c] 0N, the N-dimensional vector of all 0’s.
[d] −y
[e] (1/2)(1N + y)

(Note: Be sure to also check what Ein (w̃) is, and think about what Eout (w̃) might be!)


5. (20 points, human-graded) Let E(w) : R^d → R be a function. Denote the gradient bE(w) and the
Hessian AE(w) by

    bE(w) = ∇E(w) = ( ∂E/∂w1 (w), ∂E/∂w2 (w), . . . , ∂E/∂wd (w) )^T,   a d × 1 vector,

and

    AE(w) = [ ∂²E/(∂wi ∂wj) (w) ]_{d×d},   the d × d matrix whose (i, j)-th entry is ∂²E/(∂wi ∂wj)(w).

Then, the second-order Taylor’s expansion of E(w) around u is:

    E(w) ≈ E(u) + bE(u)^T (w − u) + (1/2) (w − u)^T AE(u) (w − u).
Suppose AE(u) is positive definite. The optimal direction v such that w ← u + v minimizes the
right-hand side of the Taylor expansion above is simply −(AE(u))^{−1} bE(u).

Hint: Homework 0! :-)
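
For intuition, a minimal sketch of one update step using this optimal direction; grad_E and hess_E stand for user-supplied routines returning bE(w) and AE(w) (illustrative assumptions, not part of the problem):

    import numpy as np

    def newton_step(w, grad_E, hess_E):
        # optimal direction v = -(A_E(w))^{-1} b_E(w); solve A v = -b rather than inverting A
        b = grad_E(w)               # gradient b_E(w), shape (d,)
        A = hess_E(w)               # Hessian A_E(w), shape (d, d), assumed positive definite
        v = np.linalg.solve(A, -b)
        return w + v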

An iterative optimization algorithm using the “optimal direction” above for updating w is called
Newton’s method, which can be viewed as “improving” gradient descent by using more information
about E. Now, consider minimizing Ein(w) in the logistic regression problem with Newton’s
method on a data set {(xn, yn)}_{n=1}^{N} with the cross-entropy error function for Ein:

    Ein(w) = (1/N) Σ_{n=1}^{N} ln(1 + exp(−yn w^T xn)).

For any given wt, let

    ht(x) = θ(wt^T x).

Express the Hessian AE(wt) with E = Ein as X^T D X, where D is an N by N diagonal matrix.
Derive what D should be in terms of ht, wt, xn, and yn.


6. (20 points, human-graded) In Lecture 11, we solve multiclass classification by OVA or OVO
decompositions. One alternative for dealing with multiclass classification is to extend the original logistic
regression model to Multinomial Logistic Regression (MLR). For a K-class classification problem,
we will denote the output space Y = {1, 2, · · · , K}. The hypotheses considered by MLR can be
indexed by a matrix

    W = [ w1  w2  · · ·  wk  · · ·  wK ]_{(d+1)×K} ,

that contains weight vectors (w1 , · · · , wK ), each of length d+1. The matrix represents a hypothesis

    hy(x) = exp(wy^T x) / Σ_{k=1}^{K} exp(wk^T x)

that can be used to approximate the target distribution P(y|x) for any (x, y). MLR then seeks
the maximum likelihood solution over all such hypotheses. For a given data set {(x1, y1), . . . , (xN, yN)}
generated i.i.d. from some P(x) and target distribution P(y|x), the likelihood of hy(x) is proportional
to Π_{n=1}^{N} h_{yn}(xn). That is, minimizing the negative log likelihood is equivalent to minimizing
an Ein(W) that is composed of the following error function

    err(W, x, y) = − ln hy(x) = − Σ_{k=1}^{K} ⟦y = k⟧ ln hk(x).

When minimizing Ein(W) with SGD, we update W(t) at the t-th iteration to W(t+1) by

W(t+1) ← W(t) + η · V,

where V is a (d + 1) × K matrix whose k-th column is an update direction for the k-th weight
vector. Assume that an example (xn , yn ) is used for the SGD update above. It can be proved that

V = xn · u^T,

where u is a vector in RK . What is u? List your derivation steps.
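
For concreteness, a minimal sketch of evaluating the MLR hypothesis hy(x) of this problem, assuming W is stored as a (d+1) × K numpy array (names are illustrative):

    import numpy as np

    def mlr_probabilities(W, x):
        # returns the vector (h_1(x), ..., h_K(x)) with h_y(x) = exp(w_y^T x) / sum_k exp(w_k^T x)
        s = W.T @ x                  # scores w_k^T x, shape (K,)
        s = s - np.max(s)            # shift for numerical stability (does not change the ratios)
        e = np.exp(s)
        return e / np.sum(e)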


7. (20 points, human-graded) Following the previous problem, consider a data set with K = 2 and
obtain the optimal solution from MLR as (w1∗ , w2∗ ). Now, relabel the same data set by replacing
yn with yn′ = 2yn − 3 to form a binary classification data set, and run logistic regression (what we
have learned in class) to get an optimal solution wlr . Express wlr as a function of (w1∗ , w2∗ ). List
your derivation steps.
8. (20 points, human-graded) Consider the target function f(x) = 1 − 2x^2. Sample x uniformly from
[0, 1], and use all linear hypotheses h(x) = w0 + w1 · x to approximate the target function with
respect to the squared error. Then, sample two examples x1 and x2 uniformly from [0, 1] to form
the training set D = {(x1 , f (x1 )), (x2 , f (x2 ))}, and use linear regression to get g for approximating
the target function with respect to the squared error. You can neglect the degenerate cases where
x1 and x2 are the same. What is ED (|Ein (g) − Eout (g)|)? List your derivation steps.
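
The problem asks for an analytic derivation; as an optional sanity check, a rough Monte Carlo sketch of ED(|Ein(g) − Eout(g)|) under the stated sampling process (all names and sample sizes here are illustrative choices, not a substitute for the derivation):

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: 1.0 - 2.0 * x ** 2          # target function

    def one_trial():
        x = rng.uniform(0.0, 1.0, 2)          # training inputs x1, x2 (x1 == x2 has probability 0)
        y = f(x)
        X = np.column_stack([np.ones(2), x])  # linear regression with h(x) = w0 + w1 * x
        w = np.linalg.lstsq(X, y, rcond=None)[0]
        g = lambda t: w[0] + w[1] * t
        Ein = np.mean((g(x) - y) ** 2)        # ~0, since the line interpolates both points
        xs = rng.uniform(0.0, 1.0, 100_000)   # Monte Carlo estimate of Eout over uniform [0, 1]
        Eout = np.mean((g(xs) - f(xs)) ** 2)
        return abs(Ein - Eout)

    print(np.mean([one_trial() for _ in range(5000)]))   # estimate of E_D(|Ein - Eout|)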


9. (20 points, human-graded) In Lecture 13, we discussed adding “virtual examples” (hints)
to help combat overfitting. One way of generating virtual examples is to add small noise to the
input vector x ∈ R^{d+1} (including the 0-th component x0). For each (x1, y1), (x2, y2), . . . , (xN, yN)
in our training data set, assume that we generate virtual examples (x̃1 , y1 ), (x̃2 , y2 ), . . . , (x̃N , yN )
where x̃n is simply xn + ϵ and the noise vector ϵ ∈ Rd+1 is generated i.i.d. from a multivariate
normal distribution N (0d+1 , σ 2 · Id+1 ). The vector ϵ is a random vector that varies for each virtual
example. Here 0d+1 ∈ Rd+1 denotes the all-zero vector and Id+1 is an identity matrix of size d + 1.
Recall that when training the linear regression model, we need to calculate X^T X first. Define the
hinted input matrix

    Xh = [ x1  · · ·  xN  x̃1  · · ·  x̃N ]^T ,

a 2N × (d + 1) matrix whose rows are x1^T, . . . , xN^T, x̃1^T, . . . , x̃N^T.
The expected value E(Xh^T Xh), where the expectation is taken over the Gaussian-noise-generating
process above, is of the form

    α X^T X + β σ^2 Id+1.

Derive the correct (α, β).
(Note: The form here “hints” you that the expected value is related to the matrix being inverted
for regularized linear regression—see Lecture 14. That is, data hinting “by noise” is closely related
to regularization. If x contains the pixels of an image, the virtual example is a Gaussian-noise-
contaminated image with the same label, e.g. https://en.wikipedia.org/wiki/Gaussian_noise.
Adding such noise is a very common technique to generate virtual examples for images.)
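
A minimal numpy sketch of the noise-hinting process described above (building Xh by stacking the original and virtual examples); sigma, rng, and the array names are illustrative:

    import numpy as np

    def hinted_matrix(X, sigma, rng=None):
        # X: N x (d+1) matrix whose rows are x_1^T, ..., x_N^T (with x_0 = 1 included)
        rng = np.random.default_rng() if rng is None else rng
        N, d1 = X.shape
        eps = rng.normal(0.0, sigma, size=(N, d1))   # i.i.d. N(0, sigma^2 I_{d+1}) noise, one draw per example
        X_tilde = X + eps                            # virtual examples x~_n = x_n + eps (labels unchanged)
        return np.vstack([X, X_tilde])               # X_h: a 2N x (d+1) matrix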

10. (20 points, code needed, human-graded) Next, we use a real-world data set to study linear and
polynomial regression. We will reuse the cpusmall scale data set at
https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/cpusmall_scale

to save you some effort.


First, execute the following steps (a minimal code sketch of these steps is given after this problem).
The first four steps are similar to what you have done in Homework 3.
The last step is new. We will take N = 64 (instead of 32).
The data set contains 8192 examples. In each experiment, you are asked to
(1) randomly sample N out of 8192 examples as your training data.
(2) add x0 = 1 to each example, as always.
(3) run linear regression on the N examples, using any reasonable implementation of X† on the
input matrix X, to get wlin.
(4) evaluate Ein (wlin ) by averaging the squared error over the N examples; estimate Eout (wlin )
by averaging the squared error over the remaining (8192 − N ) examples.
(5) implement the SGD algorithm for linear regression using the results on pages 10 and 12 of
Lecture 11. Pick one example uniformly at random in each iteration, take η = 0.01 and
initialize w with w0 = 0. Run the algorithm for 100000 iterations, and record Ein (wt ) and
Eout (wt ) whenever t is a multiple of 200.
Repeat the steps above 1126 times, each with a different random seed. Calculate the average
(Ein (wlin ), Eout (wlin )) over the 1126 experiments. Also, for every t = 200, 400, . . ., calculate the
average (Ein (wt ), Eout (wt )) over the 1126 experiments. Plot the average Ein (wt ) as a function of t.
On the same figure, plot the average Eout (wt ) as a function of t. Then, show the average Ein (wlin )
as a horizontal line, and the average Eout (wlin ) as another horizontal line. Describe your findings.
Finally, provide the first page of the snapshot of your code as a proof that you have written the
code.
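
A minimal, non-authoritative sketch of steps (1)-(5) for one experiment, assuming the data have been loaded into numpy arrays X_all (8192 × 13, already padded with x0 = 1 as in step (2)) and y_all; how you load the libsvm-format file, repeat over 1126 seeds, and plot is up to you:

    import numpy as np

    def one_experiment(X_all, y_all, seed, N=64, eta=0.01, T=100_000):
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(y_all), size=N, replace=False)      # (1) sample N training examples
        X, y = X_all[idx], y_all[idx]
        mask = np.ones(len(y_all), dtype=bool)
        mask[idx] = False
        X_out, y_out = X_all[mask], y_all[mask]                  # remaining 8192 - N examples

        w_lin = np.linalg.pinv(X) @ y                            # (3) linear regression via pseudo-inverse
        sqerr = lambda w, A, b: np.mean((A @ w - b) ** 2)
        Ein_lin, Eout_lin = sqerr(w_lin, X, y), sqerr(w_lin, X_out, y_out)   # (4)

        w = np.zeros(X.shape[1])                                 # (5) SGD for linear regression, w_0 = 0
        history = []
        for t in range(1, T + 1):
            n = rng.integers(N)                                  # one example, uniformly at random
            w += eta * 2.0 * (y[n] - X[n] @ w) * X[n]            # stochastic gradient step on squared error
            if t % 200 == 0:
                history.append((t, sqerr(w, X, y), sqerr(w, X_out, y_out)))
        return Ein_lin, Eout_lin, history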


11. (20 points, code needed, human-graded) Next, consider the following homogeneous order-Q
polynomial transform

    Φ(x) = (1, x1, x2, . . . , x12, x1^2, x2^2, . . . , x12^2, . . . , x1^Q, x2^Q, . . . , x12^Q).

Transform the training and testing data according to Φ(x) with Q = 3, and again implement steps
1-4 in the previous problem on zn = Φ(xn) instead of xn to get wpoly. Repeat the four steps above
1126 times, each with a different random seed. Plot a histogram of Ein^sqr(wlin) − Ein^sqr(wpoly). What
is the average Ein^sqr(wlin) − Ein^sqr(wpoly)? This is the Ein gain from using the (homogeneous) polynomial
transform. Describe your findings.
Finally, provide the first page of the snapshot of your code as a proof that you have written the
code.
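
A minimal sketch of the order-Q homogeneous polynomial transform used above, assuming X holds the 12 raw input features per row (the constant 1 is produced by the transform itself; names are illustrative):

    import numpy as np

    def poly_transform(X, Q=3):
        # X: N x 12 matrix of raw features; returns the N x (1 + 12 * Q) matrix of
        # Phi(x) = (1, x_1, ..., x_12, x_1^2, ..., x_12^2, ..., x_1^Q, ..., x_12^Q)
        N = X.shape[0]
        powers = [X ** q for q in range(1, Q + 1)]
        return np.hstack([np.ones((N, 1))] + powers)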
12. (20 points, code needed, human-graded) Following from the 1126 experiments in the previous
problem, plot a histogram of Eout^sqr(wlin) − Eout^sqr(wpoly). What is the average Eout^sqr(wlin) − Eout^sqr(wpoly)?
This is the Eout change from using the (homogeneous) polynomial transform. Describe your findings.
Finally, provide the first page of the snapshot of your code as a proof that you have written the
code.

13. (Bonus 20 points, human-graded) Define the multiplicative hypotheses as

    H2 = { h̃u(x) : h̃u(x) = sign( u0 + Π_{i=1}^{d} (|xi| + 1)^{ui} ) }.

Compared with the linear hypotheses

    H1 = { hw(x) : hw(x) = sign( w0 + Σ_{i=1}^{d} wi xi ) },

the multiplicative hypotheses appear to be implementing much more complicated boundaries.


Prove or disprove that dvc (H2 ) > dvc (H1 ).
