Tut1 Questions
a) Show that Bob’s procedure will fit the same function as Alice’s original procedure.
b) Could Bob’s procedure be better than Alice’s if the matrix A is not invertible?
[If you need a hint, it may help to remind yourself of the discussion involving
invertible matrices in the pre-test answers.]
c) Alice becomes worried about overfitting, adds a regularizer λwᵀw to the least-squares
error function, and refits the model. Assuming A is invertible, can Bob
choose a regularizer so that he will still always obtain the same function as Alice?
d) Bonus part: Only do this part this week if you have time. Otherwise review it later.
Suppose we wish to find the vector v that minimizes the function
(y − Φv)ᵀ(y − Φv) + vᵀMv.
i) Show that vᵀMv = vᵀ(½M + ½Mᵀ)v, and hence that we can assume without
loss of generality that M is symmetric.
ii) Assume we can find a factorization M = AAᵀ. Can we minimize the function
above using a standard routine that can minimize (z − Xw)ᵀ(z − Xw) with
respect to w?
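A minimal NumPy sketch of the idea behind part ii), using synthetic Φ, y, and M (none of these values come from the question): appending Aᵀ as extra "data" rows turns the penalized cost into an ordinary least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 5
Phi = rng.standard_normal((N, D))   # synthetic feature matrix
y = rng.standard_normal(N)          # synthetic targets

# A symmetric positive-definite M factors as M = A A^T (e.g. Cholesky).
B = rng.standard_normal((D, D))
M = B @ B.T + np.eye(D)             # make M strictly positive definite
A = np.linalg.cholesky(M)           # lower-triangular A with M = A A^T

# Augment: (y - Phi v)^T (y - Phi v) + v^T M v  ==  ||z - X v||^2
# with z = [y; 0] and X = [Phi; A^T], since v^T M v = ||A^T v||^2.
z = np.concatenate([y, np.zeros(D)])
X = np.vstack([Phi, A.T])
v_aug, *_ = np.linalg.lstsq(X, z, rcond=None)

# Cross-check against the closed-form solution (Phi^T Phi + M)^{-1} Phi^T y.
v_closed = np.linalg.solve(Phi.T @ Phi + M, Phi.T @ y)
assert np.allclose(v_aug, v_closed)
```

The augmented problem can then be handed to any standard least-squares routine, which is the point of the question.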
i) Sketch — with pen and paper — a contour plot of the sigmoidal function
φ(x) = σ(vᵀx + b).
Indicate the precise location of the φ = 0.5 contour on your sketch, and give at
least a rough indication of some other contours. Also mark the vector v on your
diagram, and indicate how its direction is related to the contours.
ii) If x and v were three-dimensional, what would the contours of φ look like, and
how would they relate to v? (A sketch is not expected.)
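A quick numerical sanity check on the geometry your sketch should show (v, b, and the test points below are arbitrary illustrative choices, not given in the question): φ is constant along directions orthogonal to v, and increases in the direction of v.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

v = np.array([2.0, -1.0])        # arbitrary weight vector
b = 0.5                          # arbitrary bias

phi = lambda x: sigmoid(v @ x + b)

# The phi = 0.5 contour is the line v^T x + b = 0.
x0 = np.array([0.0, 0.5])        # satisfies v @ x0 + b == 0
assert np.isclose(phi(x0), 0.5)

# Moving along a direction orthogonal to v leaves phi unchanged...
u = np.array([1.0, 2.0])         # v @ u == 0
for t in [-3.0, -1.0, 2.0]:
    assert np.isclose(phi(x0 + t * u), 0.5)

# ...while moving along v itself changes phi monotonically.
assert phi(x0 + 0.1 * v) > 0.5 > phi(x0 - 0.1 * v)
```

In 2D the contours are parallel straight lines perpendicular to v; the same check in 3D would show parallel planes with v as their normal.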
share a common user-specified bandwidth h, while the positions of the centers are set
to make the basis functions overlap: c_k = (k − 51)h/√2, with k = 1 . . . 101. The free
parameters of the model are the bandwidth h and weights w.
The model is used to fit a dataset with N = 70 observations each with inputs x ∈
[−1, +1]. Assume each of the observations has outputs y ∈ [−1, +1] also. The model
is fitted for any particular h by transforming the inputs using that bandwidth into a
feature matrix Φ, then minimizing the regularized least squares cost
(y − Φw)ᵀ(y − Φw) + λwᵀw.
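The setup can be made concrete with a short sketch. It assumes Gaussian basis functions of the form exp(−(x − c_k)²/h²) and an arbitrary regularization constant λ; the dataset is synthetic, chosen only to match the stated sizes and ranges.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, lam = 70, 101, 0.1                 # lam is an assumed constant

x = rng.uniform(-1, 1, size=N)           # synthetic inputs in [-1, +1]
y = np.sin(3 * x)                        # synthetic outputs in [-1, +1]

def rbf_features(x, h):
    """Feature matrix Phi: K Gaussian bumps, centers c_k = (k - 51) h / sqrt(2)."""
    k = np.arange(1, K + 1)
    c = (k - 51) * h / np.sqrt(2)
    return np.exp(-(x[:, None] - c[None, :]) ** 2 / h ** 2)

h = 0.2
Phi = rbf_features(x, h)                 # shape (70, 101)
# Regularized least squares: minimize (y - Phi w)^T (y - Phi w) + lam w^T w
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(K), Phi.T @ y)
```

Refitting with a different h only requires rebuilding Φ with the new bandwidth; the solve is unchanged.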
a) Explain why many of the weights will be close to zero when h = 0.2, and why
even more weights will probably be close to zero when h = 1.0.
c) Another data set with inputs x ∈ [−1, +1] arrives, but now you notice that all
of the observed outputs are larger, y ∈ [1000, 1010]. What problem would we
encounter if we applied the linear regression above to this data? How could
this problem be fixed?
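A minimal sketch of one possible fix, with synthetic data and a plain linear feature (both hypothetical, for illustration only): the λwᵀw penalty shrinks predictions toward zero, a poor default when every output sits near 1005, so we can subtract the mean of y before fitting and add it back at prediction time. Leaving a bias term unregularized would achieve the same thing.

```python
import numpy as np

rng = np.random.default_rng(2)
N, lam = 70, 0.1                           # assumed sizes and constant

x = rng.uniform(-1, 1, size=N)
y = 1005 + 2 * x + 0.5 * rng.standard_normal(N)  # synthetic outputs near [1000, 1010]

Phi = np.column_stack([x])                 # hypothetical feature matrix

# Naive regularized fit: predictions are pulled toward zero, far from ~1005.
I = lam * np.eye(Phi.shape[1])
w_naive = np.linalg.solve(Phi.T @ Phi + I, Phi.T @ y)

# Fix: center the outputs, fit the residuals, add the mean back when predicting.
y_bar = y.mean()
w = np.linalg.solve(Phi.T @ Phi + I, Phi.T @ (y - y_bar))
predict = lambda x_new: np.column_stack([x_new]) @ w + y_bar

# The centered fit has far lower error on the training data.
assert np.mean((Phi @ w + y_bar - y) ** 2) < np.mean((Phi @ w_naive - y) ** 2)
```

With the RBF model of the previous question the problem is the same: the regularizer biases the fitted function toward zero everywhere, so centering (or an unpenalized bias weight) is needed before the shrinkage is harmless.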