Solns Recitation5-6 Fall24
1. Go through the demo from lecture 7 on visualizing the SVD basis: https://canvas.uchicago.edu/courses/59515/files/11965539?wrap=1. Note that the 3D plots' aspect ratios may be a bit off, so you can try inserting ax.set_box_aspect(aspect=(1, 1, 1)) into the plotting code so that the columns of U from the SVD look more obviously orthogonal.
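If it helps to see where that call fits, here is a minimal, self-contained 3D plotting sketch; the random data and figure setup are placeholders, not the demo's actual code.

import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: the columns of U from the SVD of a random 3 x 3 matrix.
rng = np.random.default_rng(0)
U, S, Vt = np.linalg.svd(rng.standard_normal((3, 3)))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.set_box_aspect(aspect=(1, 1, 1))   # equal axis scaling so right angles look right
for i in range(3):
    ax.quiver(0, 0, 0, U[0, i], U[1, i], U[2, i])
plt.show()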
2. Pseudoinverse and the minimum-norm least squares solution. Prove that for the case of $X \in \mathbb{R}^{n \times p}$, $p \gg n$, if $X$ has linearly independent rows then the choice $\hat{w} = V \Sigma^\dagger U^\top y$ has the smallest $\ell_2$ norm of any $w$ solving $\min_w \|Xw - y\|_2^2$. Recall that $\Sigma^\dagger = \Sigma^\top (\Sigma \Sigma^\top)^{-1}$, so
$$\hat{w} = V \Sigma^\top (\Sigma \Sigma^\top)^{-1} U^\top y = X^\top (X X^\top)^{-1} y.$$
SOLUTION: Let $w$ be any minimizer of $\|Xw - y\|_2^2$ and write $w = \hat{w} + \Delta w$ with $\Delta w = w - \hat{w}$. The cross term between $\hat{w}$ and $\Delta w$ vanishes:
$$\hat{w}^\top \Delta w = y^\top (X X^\top)^{-1} X (w - \hat{w}) = y^\top (X X^\top)^{-1} (Xw - X\hat{w}) = y^\top (X X^\top)^{-1} (y - y) = 0,$$
where the second-to-last substitution occurs because both $w$ and $\hat{w}$ are solutions to $Xw = y$. Note that the perfect solution with zero residual exists because we assumed $X$ has full row rank, so $y$ lives in the column space of $X$. However, even when $y$ does not live in the column space of $X$, it is still the case that $Xw = \hat{y}$ for all least-squares solutions $w$, where $\hat{y}$ is the projection of $y$ onto the column space of $X$. This means the difference $Xw - X\hat{w}$ is zero in either case, and thus $\hat{w}^\top \Delta w = 0$.
Even though there are possible degeneracies in the value of $w$ responsible for a given value of $Xw$, the best value for $Xw$ is $\hat{y} = \mathrm{proj}_{\operatorname{col}(X)}\, y$, the projection of $y$ onto the column space of $X$ (in this specific case $y = \hat{y}$, though we do not necessarily need to rely on this). So, $X(w - \hat{w}) = X \Delta w = 0$. With that, we can expand the squared norm of $w$ to find
$$\|w\|_2^2 = \|\hat{w} + \Delta w\|_2^2 = \|\hat{w}\|_2^2 + 2\,\hat{w}^\top \Delta w + \|\Delta w\|_2^2 = \|\hat{w}\|_2^2 + \|\Delta w\|_2^2 \ge \|\hat{w}\|_2^2.$$
In particular, the last inequality becomes strict, $\|w\|_2^2 > \|\hat{w}\|_2^2$, when $\Delta w \neq 0$.
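As a quick numerical sanity check of the minimum-norm property, the sketch below compares the pseudoinverse solution against another exact solution obtained by adding a null-space component; the random sizes and data are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 20                          # wide X with p >> n; full row rank almost surely
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

w_hat = np.linalg.pinv(X) @ y         # the minimum-norm solution V Sigma^+ U^T y

# Construct a different exact solution by adding a vector from the null space of X.
d = rng.standard_normal(p)
d -= X.T @ np.linalg.solve(X @ X.T, X @ d)   # remove the row-space component of d
w_other = w_hat + d

print(np.allclose(X @ w_hat, y), np.allclose(X @ w_other, y))   # both solve Xw = y
print(np.linalg.norm(w_hat) <= np.linalg.norm(w_other))         # True: w_hat has the smaller norm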
3. Frobenius matrix norm. The Frobenius matrix norm is defined on any matrix $A \in \mathbb{R}^{n \times p}$ as
$$\|A\|_F = \Big( \sum_{i=1}^{n} \sum_{j=1}^{p} A_{ij}^2 \Big)^{1/2}.$$
Here, we will show that the squared Frobenius norm of a matrix is equal to the sum of its squared singular values, in other words
$$\|A\|_F^2 = \sum_{i=1}^{r} \sigma_i^2,$$
where $r$ is the rank of $A$.
a) The trace of a square matrix is defined as the sum of the elements on its diagonal,
$$\operatorname{tr}(X) = \sum_{i=1}^{n} X_{ii}.$$
Show that the trace is cyclic, that is, for matrices $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{p \times n}$, $\operatorname{tr}(XY) = \operatorname{tr}(YX)$. (Note that $X$ and $Y$ need not be square, but $XY$ and $YX$ are square.)
SOLUTION: Using the index-based definition of matrix multiplication, the $i$-th diagonal entry of the product is $(XY)_{ii} = \sum_{j=1}^{p} X_{ij} Y_{ji}$, so
$$\operatorname{tr}(XY) = \sum_{i=1}^{n} \sum_{j=1}^{p} X_{ij} Y_{ji} = \sum_{j=1}^{p} \sum_{i=1}^{n} Y_{ji} X_{ij} = \sum_{j=1}^{p} (YX)_{jj} = \operatorname{tr}(YX).$$
An alternate explanation that does not rely on the index-based definition of matrix multiplication is to take $A = X^\top$, $B = Y$ so that we are interested in $\operatorname{tr}(A^\top B)$. We can write the matrices in terms of their columns,
$$A = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} b_1 & b_2 & \cdots & b_n \end{bmatrix},$$
so that
$$\operatorname{tr}(A^\top B) = \operatorname{tr}\begin{bmatrix} a_1^\top b_1 & a_1^\top b_2 & \cdots & a_1^\top b_n \\ a_2^\top b_1 & a_2^\top b_2 & \cdots & a_2^\top b_n \\ \vdots & \vdots & \ddots & \vdots \\ a_n^\top b_1 & a_n^\top b_2 & \cdots & a_n^\top b_n \end{bmatrix} = a_1^\top b_1 + a_2^\top b_2 + \cdots + a_n^\top b_n,$$
which basically corresponds to the standard dot product of $A$ and $B$ if we had first stacked each of the columns on top of each other to make each into a tall $np \times 1$ vector.
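Both arguments are easy to verify numerically; a quick sketch with arbitrarily chosen shapes:

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 7))   # n x p
Y = rng.standard_normal((7, 4))   # p x n
print(np.trace(X @ Y), np.trace(Y @ X))   # equal up to floating-point error
print(np.sum(X * Y.T))                    # the "stacked columns" dot-product view of tr(XY)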
b) Show that the squared Frobenius norm satisfies $\|A\|_F^2 = \operatorname{tr}(A^\top A)$.
SOLUTION: The $j$-th diagonal entry of $A^\top A$ is $(A^\top A)_{jj} = \sum_{i} A_{ij}^2$, so
$$\operatorname{tr}(A^\top A) = \sum_{j} \sum_{i} A_{ij}^2 = \|A\|_F^2,$$
and by part a) this is also equal to $\operatorname{tr}(A A^\top)$.
c) Let A = U ΣV ⊤ be the SVD of A. Use parts a and b to conclude that the squared
Frobenius norm of A is the sum of its squared singular values.
SOLUTION:
$$\begin{aligned}
\|A\|_F^2 &= \sum_{i,j} A_{ij}^2 \\
&= \operatorname{tr}(A^\top A) \\
&= \operatorname{tr}(V \Sigma^\top U^\top U \Sigma V^\top) \\
&= \operatorname{tr}(V \Sigma^\top \Sigma V^\top) \\
&= \operatorname{tr}(V^\top V \Sigma^\top \Sigma) \\
&= \operatorname{tr}(\Sigma^\top \Sigma) \\
&= \operatorname{tr}\begin{bmatrix} \sigma_1^2 & & & \\ & \ddots & & \\ & & \sigma_r^2 & \\ & & & 0 \end{bmatrix} \\
&= \sum_{i=1}^{r} \sigma_i^2,
\end{aligned}$$
where the second equality uses part b), the fifth uses the cyclic property from part a), and $\Sigma^\top \Sigma$ is a $p \times p$ diagonal matrix whose nonzero entries are the squared singular values.
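A quick numerical check of the identity (the random matrix is arbitrary):

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 8))
sigma = np.linalg.svd(A, compute_uv=False)
print(np.sum(A**2), np.trace(A.T @ A), np.sum(sigma**2))   # all equal up to floating point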
4. Work through the basics of PCA with the students from Eldén 6.4.
5. PCR. In class and above, we looked at PCA and the closely related idea of using the SVD to
perform dimensionality reduction. When solving a regression problem, one approach you can
use is called Principal Components Regression (PCR). The idea here is that you are given
training samples (xi , yi ) where xi ∈ Rp for i = 1, . . . , n. We are going to find a reduced-
dimension version of each xi , denoted zi ∈ Rk for some k < p, and then we are going to solve
least squares on the (zi , yi ) pairs instead of on the (xi , yi ) pairs. You will experiment with
PCR on the MNIST dataset, a well-known handwritten digit recognition dataset. A small data subset is provided in mnist.mat. It contains two types of digits (1 and 4), and each digit has 100 training images and 100 testing images. Each image xi has 28 × 28 pixels, which we can flatten to form a p = 784-dimensional feature vector; the corresponding target is yi ∈ {−1, 1}. Since the raw data lives in a high-dimensional space, we first use PCA to compress it into a lower-dimensional representation in which the decision boundary is easy to draw.
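The solution snippets below assume that the data arrays (train_data, train_target, test_data, test_target) and the truncated-SVD quantities (vh and the reduced coordinates z) have already been constructed, roughly as in the sketch below; the field names read from mnist.mat are assumptions and may need adjusting to match the file.

import numpy as np
from scipy.io import loadmat

mat = loadmat("mnist.mat")
# Assumed field names and shapes; adjust to match the actual contents of mnist.mat.
train_data = mat["train_data"].reshape(-1, 784).astype(np.float64)   # 200 x 784
test_data = mat["test_data"].reshape(-1, 784).astype(np.float64)     # 200 x 784
train_target = mat["train_target"]   # shape (1, 200), labels in {-1, +1}
test_target = mat["test_target"]     # shape (1, 200), labels in {-1, +1}

# Truncated SVD of the training data; keep the top two right singular vectors.
u, s, vh = np.linalg.svd(train_data, full_matrices=False)
z = (vh[:2, :] @ train_data.T).T     # n x 2 reduced representation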
a) Let zi be the 2-dimensional representation of xi (found using the truncated SVD), and
find a weight w so that Zw ≈ y. What is ∥y − Zw∥22 (training loss)? Given test
samples x′i and yi′ , compute 2-dimensional zi′ ’s, and measure ∥y ′ − Z ′ w∥22 (test loss).
SOLUTION:
# Least-squares weight on the 2-dimensional training representation z (n x 2).
w = np.linalg.inv(z.T @ z) @ z.T @ train_target[0]

# Classify by the sign of z @ w, mapping {0, 1} to {-1, +1}.
y_pred = (z @ w >= 0).astype(np.int32)
y_pred[y_pred <= 0] = -1
print("training accuracy: {}".format((y_pred == train_target[0]).mean()))

# Project the test images onto the same top-2 right singular vectors.
z_test = vh[:2, :] @ test_data.T
z_test = z_test.T
y_pred_test = (z_test @ w >= 0).astype(np.int32)
y_pred_test[y_pred_test <= 0] = -1
print("testing accuracy: {}".format((y_pred_test == test_target[0]).mean()))
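The code above reports classification accuracy; the quantities the problem actually asks for, the squared-error training and test losses, follow directly from the same variables:

train_loss = np.sum((train_target[0] - z @ w) ** 2)        # ||y - Zw||_2^2
test_loss = np.sum((test_target[0] - z_test @ w) ** 2)     # ||y' - Z'w||_2^2
print("training loss: {:.3f}, test loss: {:.3f}".format(train_loss, test_loss))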
b) Plot the classification decision boundary in 2d space. (Hint: the plot should look some-
thing like figure 2.)
SOLUTION:
Figure 2: Reference for what the plot should roughly look like. Here, the decision boundary is not required
to pass through the origin (i.e., we have added a bias term to the data matrix). Each point is one sample,
and the color is determined by its digit label. The axes are given by the first two principal component
directions, and the coordinates are given by the principal components.
import matplotlib.pyplot as plt

# Grid of x-coordinates spanning the range of the first principal component.
point_num = 1000
zmin = z.min(axis=0)
zmax = z.max(axis=0)
point_x = (np.arange(point_num).astype(np.float32) / point_num
           * (zmax[0] - zmin[0]) + zmin[0])

# Decision boundary w[0] * x + w[1] * y = 0, i.e. y = -x * w[0] / w[1].
point_y = -1 * point_x * w[0] / w[1]
idx = np.logical_and(point_y < zmax[1], point_y >= zmin[1])

# The first 100 samples are one digit class, the remaining 100 are the other.
plt.scatter(z[:100, 0], z[:100, 1], c='b')
plt.scatter(z[100:, 0], z[100:, 1], c='y')
plt.scatter(point_x[idx], point_y[idx], s=1, color='red')
plt.savefig('so4.pdf')
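The boundary drawn above passes through the origin; to reproduce the offset boundary described in the Figure 2 caption, one option (a sketch, not necessarily how the reference figure was produced) is to append a bias column of ones to the reduced data and refit:

# Append a column of ones so the fitted boundary need not pass through the origin.
z_b = np.hstack([z, np.ones((z.shape[0], 1))])
w_b = np.linalg.inv(z_b.T @ z_b) @ z_b.T @ train_target[0]
# Boundary: w_b[0] * x + w_b[1] * y + w_b[2] = 0.
point_y_b = -(point_x * w_b[0] + w_b[2]) / w_b[1]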