hw3 Red
Homework #3
RELEASE DATE: 10/07/2024
RED CORRECTION: 10/12/2024 16:30
DUE DATE: 10/21/2024, BEFORE 13:00 on GRADESCOPE
QUESTIONS ARE WELCOME ON DISCORD (INFORMALLY) OR VIA EMAIL (FORMALLY).
You will use Gradescope to upload your scanned/printed solutions. For problems marked with (*), please
follow the guidelines on the course website and upload your source code to Gradescope as well. Any
programming language/platform is allowed.
Any form of cheating, lying, or plagiarism will not be tolerated. Students can get zero scores and/or fail
the class and/or be kicked out of school and/or receive other punishments for those kinds of misconduct.
Discussions on course materials and homework solutions are encouraged. But you should write the final
solutions alone and understand them fully. Books, notes, and Internet resources can be consulted, but
not copied from.
Since everyone needs to write the final solutions alone, there is absolutely no need to lend your homework
solutions and/or source codes to your classmates at any time. In order to maximize the level of fairness
in this class, lending and borrowing homework solutions are both regarded as dishonest behaviors and will
be punished according to the honesty policy.
You should write your solutions in English with the common math notations introduced in class or in the
problems. We do not accept solutions written in any other languages.
This homework set comes with 200 points and 20 bonus points. In general, every homework set would come with a full credit of 200 points, with some possible bonus points.
1. (10 points, auto-graded) Which of the following hypothesis sets, each parameterized by only one parameter, has the largest dvc?
[a] {cs (x) : s ∈ {−1, +1}} where cs (x) = s
[b] {rθ (x) : θ ∈ R} where rθ (x) = sign(x1 − θ)
[c] {qi (x) : i ∈ {1, 2, . . . , d}} where qi (x) = sign(xi )
[d] {uα (x) : α ∈ R} where uα (x) = sign(sin(αx1 ))
[e] {vβ (x) : β ∈ R} where vβ (x) = sign(βx1 )
2. (10 points, auto-graded) Consider a hypothesis set that contains hypotheses of the form h(x) = wx
for x ∈ R. Combine the hypothesis set with the squared error function to minimize
$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \big(h(x_n) - y_n\big)^2 .$$
3. (10 points, auto-graded) In Lecture 9, we introduced the hat matrix H = XX† for linear regression.
The matrix projects the label vector y to the “predicted” vector ŷ = Hy and helps us analyze the
error of linear regression. Assume that XT X is invertible, which makes H = X(XT X)−1 XT . Now,
consider the following operations on X. Which operation can possibly change H?
[a] multiplying the $n$-th row of X by $\frac{1}{n}$ for each $n$ (which is equivalent to scaling the $n$-th example by $\frac{1}{n}$)
[b] multiplying the $i$-th column of X by $i^2$ for each $i$ (which is equivalent to scaling the $i$-th feature by $i^2$)
[c] multiplying the whole matrix X by 2 (which is equivalent to scaling all input vectors by 2)
[d] adding three randomly-chosen columns i, j, k to column 1 of X (i.e., $x_{n,1} \leftarrow x_{n,1} + x_{n,i} + x_{n,j} + x_{n,k}$ for every $n$)
[e] none of the other choices (i.e. all other choices are guaranteed to keep H unchanged.)
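For those who want to probe this numerically before proving anything, the sketch below (Python with numpy; not part of the required solution) builds a small random X, computes the hat matrix through the pseudo-inverse, and compares it before and after one of the listed operations. The sizes, the random seed, and the choice to test operation [b] are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 3                                  # small, arbitrary sizes for illustration
X = rng.standard_normal((N, d))

def hat(X):
    """Hat matrix H = X (X^T X)^{-1} X^T, computed via the pseudo-inverse."""
    return X @ np.linalg.pinv(X)

H0 = hat(X)

# Example probe: scale the i-th column of X by i^2 (operation [b]) and compare.
Xb = X * (np.arange(1, d + 1) ** 2)
print(np.allclose(hat(Xb), H0))
```

The other operations can be probed the same way; of course, a numerical check only suggests an answer and does not replace the reasoning.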
4. (10 points, auto-graded) Let y1, y2, . . . , yN be N values generated i.i.d. from a uniform distribution on [θ, 1] with some unknown θ. For any θ̂ ≤ min(y1, y2, . . . , yN), what is its likelihood?
[a] $\left(\frac{1}{\hat{\theta}}\right)^{N}$

[b] $\prod_{n=1}^{N} \frac{y_n}{1-\hat{\theta}}$

[c] $\left(\frac{1}{1-\hat{\theta}}\right)^{N}$

[d] $\frac{\max(y_1, \ldots, y_N)}{\hat{\theta}}$

[e] $\frac{\min(y_1, \ldots, y_N)}{1-\hat{\theta}}$
(Hint: Those who are interested in more math [who isn’t? :-)] are encouraged to try to derive the
maximum-likelihood estimator.)
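As a companion to the hint, here is a minimal Python sketch (not part of the required solution) that simulates data from a uniform distribution on [θ, 1] and locates the maximizer of the log-likelihood over a grid of candidate θ̂ values. The true θ, the sample size, the grid resolution, and the use of scipy.stats.uniform are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import uniform

rng = np.random.default_rng(1)
theta_true, N = 0.3, 50                          # arbitrary illustrative values
y = rng.uniform(theta_true, 1.0, size=N)

def log_likelihood(theta_hat):
    # Sum of log-densities of the observations under Uniform[theta_hat, 1].
    return np.sum(uniform(loc=theta_hat, scale=1.0 - theta_hat).logpdf(y))

grid = np.linspace(0.0, 0.999, 2000)
best = max(grid, key=log_likelihood)
print("min(y) =", y.min(), " numerical maximizer of the likelihood =", best)
```

Comparing the printed maximizer with min(y) may help you guess the closed-form maximum-likelihood estimator before deriving it.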
5. (20 points, human-graded) Prove or disprove that for any two non-empty hypothesis sets H1 and H2
for binary classification that operate on the same input space, dvc (H1 ∪ H2 ) ≤ dvc (H1 ) + dvc (H2 ).
Note that the ∪ operation represents set-union. That is, {h1 , h2 , h3 } ∪ {h2 , h4 } = {h1 , h2 , h3 , h4 }.
6. (20 points, human-graded) Consider a binary classification problem, where Y = {−1, +1}. Assume
a noisy scenario where the data is generated i.i.d. from some P (x, y). In class, we discussed
that when the 0/1 error function (i.e. classification error) is considered, calculating the “ideal
mini-target” on each x reveals the hidden target function of
$$f_{0/1}(x) = \mathop{\mathrm{argmax}}_{y \in \{-1,+1\}} P(y \mid x) = \mathrm{sign}\left(P(y = +1 \mid x) - \frac{1}{2}\right).$$
Instead of the 0/1 error, if we consider the super-market error function, where a false negative
(classifying a positive example as a negative one) is 10 times more important than a false positive,
the hidden target should be changed to
$$f_{\mathrm{mkt}}(x) = \mathrm{sign}\big(P(y = +1 \mid x) - \alpha\big).$$
7. (20 points, human-graded) In class, we had two definitions of Eout (h) for binary classification. The
first definition compares the hypothesis h against the target function f .
$$E_{\mathrm{out}}^{(1)}(h) = \mathbb{E}_{x \sim P(x)}\, [\![\, h(x) \ne f(x) \,]\!].$$
The second definition extends from the first definition, and compares the hypothesis h against the
noisy distribution P (y | x).
$$E_{\mathrm{out}}^{(2)}(h) = \mathbb{E}_{x \sim P(x),\, y \sim P(y \mid x)}\, [\![\, h(x) \ne y \,]\!].$$
Note that when considering the 0/1 error, we know that the target function f (x) hides itself within
P (y | x) by
$$f(x) = \mathop{\mathrm{argmax}}_{y \in \{-1,+1\}} P(y \mid x) = \mathrm{sign}\left(P(y = +1 \mid x) - \frac{1}{2}\right).$$
With all the definitions above, prove that for any hypothesis h,
$$E_{\mathrm{out}}^{(2)}(h) \le E_{\mathrm{out}}^{(1)}(h) + E_{\mathrm{out}}^{(2)}(f).$$
(Hint: Technically, $E_{\mathrm{out}}^{(2)}(f)$ is a constant that represents the irreducible error (i.e. noise) of the learning problem.)
8. (20 points, human-graded) Consider running linear regression on $\{(x_n, y_n)\}_{n=1}^{N}$, where $x_n$ includes
the constant dimension x0 = 1 as usual. For simplicity, you can assume that XT X is invertible.
Assume that the unique (why :-)) solution wlin is obtained after running linear regression on the
data above. Then, if every x0 is changed to 1126 instead of 1, run linear regression again to get the
unique solution wlucky . Prove that wlin = Dwlucky , where D is some diagonal matrix, by deriving
the correct D.
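Once you have a candidate D on paper, a quick numerical check like the following Python sketch (not part of the required solution) can catch algebra slips. The data sizes are arbitrary, and the identity matrix below is only a placeholder to be replaced by your derived D.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 100, 4                                # arbitrary sizes for illustration
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
y = rng.standard_normal(N)

w_lin = np.linalg.pinv(X) @ y                # solution with x0 = 1

X_lucky = X.copy()
X_lucky[:, 0] = 1126.0                       # change every x0 to 1126
w_lucky = np.linalg.pinv(X_lucky) @ y

D = np.eye(d + 1)                            # placeholder: substitute your derived D
print(np.allclose(w_lin, D @ w_lucky))       # should print True for the correct D
```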
9. (20 points, human-graded) In logistic regression, we consider the logistic hypotheses
$$h(x) = \frac{1}{1 + \exp(-w^T x)}$$
to approximate the target function f (x) = P (+1 | x). We use the property that the hypotheses
are sigmoid (s-shaped) to simplify the likelihood function and then take maximum likelihood to
derive the error function Ein . Now, consider another family of sigmoid hypotheses,
$$\tilde{h}(x) = \frac{1}{2}\left(\frac{w^T x}{\sqrt{1 + (w^T x)^2}} + 1\right).$$
Follow the same derivation steps to obtain the corresponding Ẽin when using h̃. What is ∇Ẽin (w)?
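After working out Ẽin and ∇Ẽin on paper, a central-difference check is a common way to verify the gradient numerically. The sketch below (Python/numpy, not part of the required solution) only provides the finite-difference machinery; Ein_tilde and my_gradient in the usage comment are hypothetical names standing in for your own implementation of the derived error and gradient.

```python
import numpy as np

def numerical_gradient(E, w, eps=1e-6):
    """Central-difference estimate of the gradient of E at w."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (E(w + e) - E(w - e)) / (2 * eps)
    return g

# Example usage, once you have written Ein_tilde(w) and my_gradient(w) yourself:
#   analytic = my_gradient(w)
#   numeric  = numerical_gradient(Ein_tilde, w)
#   print(np.allclose(analytic, numeric, atol=1e-5))
```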
10. (20 points, code needed, human-graded) Next, we use a real-world data set to study linear regression. Please download the cpusmall_scale data set at
https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/cpusmall_scale
We use the column-scaled version instead of the original version so you’d likely encounter fewer
numerical issues.
The data set contains 8192 examples. In each experiment, you are asked to
For N = 32, run the experiment above 1126 times, and plot a scatter plot of the (Ein(wlin), Eout(wlin)) pairs obtained from the 1126 experiments. Describe your findings.
Then, provide the first page of the snapshot of your code as a proof that you have written the code.
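For reference, a minimal Python sketch for this kind of experiment is given below (any language/platform is allowed, and this is not the required solution). It loads the libsvm-format file with sklearn, adds the constant dimension, and fits wlin by the pseudo-inverse; the protocol of sampling N training examples uniformly at random, using the squared error, and estimating Eout on the remaining examples is an assumption for illustration, so follow the experiment description given in the problem.

```python
import numpy as np
from sklearn.datasets import load_svmlight_file

# Load the column-scaled cpusmall data set (libsvm format).
X_raw, y = load_svmlight_file("cpusmall_scale")
X_raw = X_raw.toarray()
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])    # prepend x0 = 1

def one_experiment(N, rng):
    """Assumed protocol: sample N training examples at random, fit w_lin by
    the pseudo-inverse, and estimate Eout on the remaining examples."""
    idx = rng.permutation(len(y))
    tr, te = idx[:N], idx[N:]
    w = np.linalg.pinv(X[tr]) @ y[tr]
    Ein = np.mean((X[tr] @ w - y[tr]) ** 2)
    Eout = np.mean((X[te] @ w - y[te]) ** 2)
    return Ein, Eout

rng = np.random.default_rng(1126)
results = np.array([one_experiment(32, rng) for _ in range(1126)])
# results[:, 0] versus results[:, 1] can then be drawn as the scatter plot.
```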
11. (20 points, code needed, human-graded) For each of N = 25, 50, 75, 100, . . . , 2000, calculate Ēin (N )
and Ēout (N ) by averaging Ein and Eout over 16 experiments. Then, plot the learning curves that
show Ēin (N ) and Ēout (N ) as functions of N on the same figure. Describe your findings.
Then, provide the first page of the snapshot of your code as a proof that you have written the code.
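Continuing the sketch above (again only as an illustration under the same assumed protocol), the learning curves can be averaged and plotted as follows, assuming one_experiment together with the loaded X and y are still in scope and matplotlib is available.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1126)
Ns = np.arange(25, 2001, 25)
Ein_bar, Eout_bar = [], []
for N in Ns:
    runs = np.array([one_experiment(N, rng) for _ in range(16)])
    Ein_bar.append(runs[:, 0].mean())
    Eout_bar.append(runs[:, 1].mean())

plt.plot(Ns, Ein_bar, label="average Ein(N)")
plt.plot(Ns, Eout_bar, label="average Eout(N)")
plt.xlabel("N")
plt.ylabel("average squared error")
plt.legend()
plt.show()
```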
12. (20 points, code needed, human-graded) Repeat Problem 11, but using the first 2 features for each
example instead of all 12 features. That is, run linear regression with x = [x0 , x1 , x2 ] instead.
Describe your findings. In particular, compare your results here to those of Problem 11.
Then, provide the first page of the snapshot of your code as a proof that you have written the code.
13. (Bonus 20 points, human graded) Please note that this part is related to the “optional” lecture
6 of the course. If you want to get the bonus, you need to do something “extra” to at least
understand the definitions below. We hope that this reminds everyone that you do not always
need to solve the bonus problem! In Lecture 6, we proved that $B(N, k) \le \sum_{i=0}^{k-1} \binom{N}{i}$. Now, prove that $B(N, k) \ge \sum_{i=0}^{k-1} \binom{N}{i}$. Thus, $B(N, k) = \sum_{i=0}^{k-1} \binom{N}{i}$.