
CMSC 25300 / 35300

Recitations 5-6, Fall 2024


The singular value decomposition is a big topic, so we will be splitting the questions across two
weeks. In week 5 (10/31 and 11/1), we will discuss the SVD’s relationship to the pseudoinverse
and least-squares problems (questions 1-3). In week 6 (11/7 and 11/8), we will discuss PCA
(questions 4-5), a method built on the SVD. We include it in this document in case you would
like to review PCA before it is covered in recitation, as it may help build your geometric
intuition for the SVD.

1. Go through the demo from lecture 7 on visualizing the SVD basis: https://canvas.uchicago.edu/courses/59515/files/11965539?wrap=1. Note that the 3D plots’ aspect ratios may be
a bit off, so you can try inserting ax.set_box_aspect(aspect=(1, 1, 1)) into the plotting
code so that the columns of U from the SVD look more obviously orthogonal.
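
If you want to reproduce the effect outside the demo notebook, here is a minimal sketch of plotting the columns of U with an equal box aspect. The synthetic data matrix and figure layout below are made up purely for illustration; this is not the lecture demo itself.

import numpy as np
import matplotlib.pyplot as plt

# Synthetic data matrix, made up for illustration: 3 features, 50 samples.
rng = np.random.default_rng(0)
X = np.diag([3.0, 1.0, 0.3]) @ rng.normal(size=(3, 50))

# Thin SVD: the columns of U are orthonormal directions in feature space.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X[0], X[1], X[2], s=5, alpha=0.4)
for i, c in enumerate(["r", "g", "b"]):
    # Draw each column of U, scaled by its singular value for visibility.
    d = S[i] * U[:, i] / np.sqrt(X.shape[1])
    ax.quiver(0, 0, 0, d[0], d[1], d[2], color=c)

# Equal box aspect so the orthogonal columns of U actually look orthogonal.
ax.set_box_aspect(aspect=(1, 1, 1))
plt.show()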

2. Pseudoinverse and the minimum-norm least squares solution. Prove that for the
case of X ∈ R^{n×p}, p ≫ n, if X has linearly independent rows then the choice ŵ = V Σ†U⊤y
has the smallest ℓ2 norm of any w minimizing ∥Xw − y∥₂². Recall that Σ† = Σ⊤(ΣΣ⊤)⁻¹,
so ŵ = X⊤(XX⊤)⁻¹y.

SOLUTION: Proof: Consider another w which is a solution to y = Xw. Then

∥w∥₂² = ∥w − ŵ + ŵ∥₂²
      = ∥(w − ŵ) + ŵ∥₂²
      = ∥w − ŵ∥₂² + 2(w − ŵ)⊤ŵ + ∥ŵ∥₂²

What is (w − ŵ)⊤ŵ?

(w − ŵ)⊤ŵ = (w − ŵ)⊤X⊤(XX⊤)⁻¹y
          = (X(w − ŵ))⊤(XX⊤)⁻¹y
          = (Xw − Xŵ)⊤(XX⊤)⁻¹y
          = (y − y)⊤(XX⊤)⁻¹y
          = 0

where the second-to-last substitution occurs because both w and ŵ are solutions to Xw = y.
Note that the perfect solution with zero residual exists because we assumed X has full row-
rank, so y lives in the column space of X. However, even when y does not live in the column
space of X, it is still the case that Xw = ŷ for all solutions w, where ŷ is the projection of
y onto the column space of X. This means this term is zero and thus

∥w∥₂² = ∥w − ŵ∥₂² + ∥ŵ∥₂²     (1)

and then for any w, ∥w − ŵ∥₂² ≥ 0, so ∥w∥₂² ≥ ∥ŵ∥₂².


Here is an alternate way of setting things up that may be slightly more straightforward:
consider some arbitrary weight vector w = ŵ + ∆w satisfying ∥Xw − y∥₂² = ∥Xŵ − y∥₂².

Even though many different values of w can produce the same value of Xw, the best value of
Xw is ŷ = proj_{col(X)} y, the projection of y onto the column space of X (in this specific case
y = ŷ, though we do not necessarily need to rely on this). So, X(w − ŵ) = X∆w = 0. With
that, we can expand the squared norm of w to find

∥w∥₂² = ∥ŵ + ∆w∥₂²
      = ∥ŵ∥₂² + ∥∆w∥₂² + 2∆w⊤ŵ
      = ∥ŵ∥₂² + ∥∆w∥₂² + 2∆w⊤X⊤(XX⊤)⁻¹y
      = ∥ŵ∥₂² + ∥∆w∥₂² + 2(X∆w)⊤(XX⊤)⁻¹y
      = ∥ŵ∥₂² + ∥∆w∥₂²          since X∆w = 0
      ≥ ∥ŵ∥₂²                   since ∥∆w∥₂² ≥ 0.

In particular, the last inequality can be made strict, ∥w∥₂² > ∥ŵ∥₂², when ∆w ≠ 0.
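
A quick numerical sanity check of this result is sketched below, using a randomly generated wide X (which has full row rank with probability 1); the variable names here are ours, not part of the problem.

import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 20                        # wide matrix, p >> n
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Minimum-norm solution, w_hat = X^T (X X^T)^{-1} y, which matches the pseudoinverse.
w_hat = X.T @ np.linalg.inv(X @ X.T) @ y
assert np.allclose(w_hat, np.linalg.pinv(X) @ y)

# Any other solution w = w_hat + d, with d in the null space of X, still fits y exactly
# but has a strictly larger norm.
d = rng.normal(size=p)
d -= np.linalg.pinv(X) @ (X @ d)    # remove the component of d in the row space of X
w = w_hat + d
assert np.allclose(X @ w, y)
print(np.linalg.norm(w_hat), np.linalg.norm(w))   # the first norm is the smaller one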

3. Frobenius matrix norm. The Frobenius matrix norm is defined on any matrix A ∈ R^{n×p}
as

∥A∥_F = ( Σ_{i,j} A_{ij}² )^{1/2}.

Here, we will show that the squared Frobenius norm of a matrix is equal to the sum of its
squared singular values, in other words

∥A∥_F² = Σ_{i=1}^r σᵢ²,

where r is the rank of A (the number of nonzero singular values).

a) The trace of a square matrix is defined as the sum of the elements on its diagonal,

tr(X) = Σ_{i=1}^n X_{ii}.

Show that the trace is cyclic, i.e., that for matrices X ∈ R^{n×p} and Y ∈ R^{p×n},
tr(XY) = tr(YX). (Note that X and Y need not be square, but XY and YX are square.)
SOLUTION: Writing out the product entrywise, the (i, i) diagonal entry of XY is

(XY)_{ii} = x_{i1}y_{1i} + x_{i2}y_{2i} + ⋯ + x_{ip}y_{pi} = Σ_{j=1}^p x_{ij}y_{ji}.

Therefore,

tr(XY) = Σ_{i=1}^n Σ_{j=1}^p x_{ij}y_{ji} = Σ_{j=1}^p Σ_{i=1}^n y_{ji}x_{ij} = Σ_{j=1}^p (YX)_{jj} = tr(YX).

An alternate explanation that does not rely on the index-based definition of matrix
multiplication is to take A = X⊤ and B = Y, so that we are interested in tr(A⊤B). We can
write the matrices in terms of their columns,

A = [ a_1  a_2  ⋯  a_n ]   and   B = [ b_1  b_2  ⋯  b_n ],

so that tr(A⊤B) is the trace of the matrix whose (i, j) entry is a_i⊤b_j, i.e.,

tr(A⊤B) = a_1⊤b_1 + a_2⊤b_2 + ⋯ + a_n⊤b_n,

which is exactly the standard dot product of A and B if we had first stacked the columns of
each on top of one another to make each into a tall np × 1 vector. The same stacked dot
product is obtained from tr(YX) = tr(BA⊤), since (BA⊤)_{ii} = Σ_k B_{ik}A_{ik}, so the two
traces agree.
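
If you want to convince yourself numerically, here is a tiny sketch (random sizes chosen arbitrarily):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 7))   # X in R^{n x p}
Y = rng.normal(size=(7, 4))   # Y in R^{p x n}

# XY is 4 x 4 and YX is 7 x 7, but their traces agree.
assert np.isclose(np.trace(X @ Y), np.trace(Y @ X))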

b) Show that ∥A∥_F² = tr(AA⊤) = tr(A⊤A).


SOLUTION:

∥A∥_F² = Σ_{i,j} A_{ij}²
       = Σ_{i=1}^n Σ_{j=1}^p A_{ij} A_{ij}
       = Σ_{i=1}^n Σ_{j=1}^p A_{ij} (A⊤)_{ji}
       = Σ_{i=1}^n (AA⊤)_{ii}
       = tr(AA⊤),

and by the cyclic property from part (a), tr(AA⊤) = tr(A⊤A) as well.

c) Let A = U ΣV ⊤ be the SVD of A. Use parts a and b to conclude that the squared
Frobenius norm of A is the sum of its squared singular values.

SOLUTION:

∥A∥_F² = Σ_{i,j} A_{ij}²
       = tr(A⊤A)
       = tr(V Σ⊤U⊤U ΣV⊤)
       = tr(V Σ⊤ΣV⊤)          since U⊤U = I
       = tr(V⊤V Σ⊤Σ)          by the cyclic property from part (a)
       = tr(Σ⊤Σ)              since V⊤V = I
       = σ₁² + σ₂² + ⋯ + σ_r²
       = Σ_{i=1}^r σᵢ²,

since Σ⊤Σ is a diagonal matrix whose nonzero diagonal entries are the squared singular
values σ₁², …, σ_r².

An alternate last step is to say tr(Σ⊤Σ) = ∥Σ∥_F² = σ₁² + ⋯ + σ_r².
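
Both identities can be sanity-checked numerically; here is a small sketch with a random matrix (shape chosen arbitrarily):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

fro_sq = np.sum(A ** 2)                            # squared Frobenius norm
sigmas = np.linalg.svd(A, compute_uv=False)        # singular values of A

assert np.isclose(fro_sq, np.trace(A @ A.T))       # part (b)
assert np.isclose(fro_sq, np.sum(sigmas ** 2))     # part (c)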

4. Work through the basics of PCA with the students, following Eldén, Section 6.4.
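
For reference, here is a minimal sketch of the basic PCA pipeline discussed there (center the data, take the SVD, and project onto the leading right singular vectors). The toy data and variable names below are made up for illustration and are not taken from Eldén.

import numpy as np

# Toy data matrix: n samples (rows), p features (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# 1. Center each feature so the principal directions pass through the data mean.
Xc = X - X.mean(axis=0)

# 2. Thin SVD of the centered data.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# 3. Keep the top-k right singular vectors as principal directions and project the
#    data onto them to get the principal component scores.
k = 2
Z = Xc @ Vt[:k].T                      # n x k matrix of principal components

# Variance captured along each kept direction (one common convention: sigma_i^2 / (n - 1)).
explained = S[:k] ** 2 / (len(X) - 1)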

5. PCR. In class and above, we looked at PCA and the closely related idea of using the SVD to
perform dimensionality reduction. When solving a regression problem, one approach you can
use is called Principal Components Regression (PCR). The idea here is that you are given
training samples (x_i, y_i) where x_i ∈ R^p for i = 1, . . . , n. We are going to find a reduced-
dimension version of each x_i, denoted z_i ∈ R^k for some k < p, and then we are going to solve
least squares on the (z_i, y_i) pairs instead of on the (x_i, y_i) pairs. You will experiment with
PCR on the MNIST dataset, a well-known handwritten digit recognition dataset. A small
data subset is provided in mnist.mat. It contains two types of digits (1 and 4), and each
digit has 100 training images and 100 testing images. Each image x_i has 28 × 28 pixels, which
we can flatten to form a p = 784-dimensional feature vector, and its target is y_i ∈ {−1, 1}. Since
the raw data lies in a high-dimensional space, we use PCA to first compress it into a
lower-dimensional representation where the decision boundary is easy to draw.

Figure 1: Image samples in the training set

a) Let z_i be the 2-dimensional representation of x_i (found using the truncated SVD), and
find a weight w so that Zw ≈ y. What is ∥y − Zw∥₂² (training loss)? Given test
samples x′_i and y′_i, compute 2-dimensional z′_i's, and measure ∥y′ − Z′w∥₂² (test loss).
SOLUTION:

# Least-squares weights on the 2-D representations z (see the sketch below for how
# z and vh are computed from the training images); assumes numpy is imported as np.
w = np.linalg.inv(z.T @ z) @ z.T @ train_target[0]

# Classify each training image by the sign of its score z @ w.
y_pred = ((z @ w) >= 0).astype(np.int32)
y_pred[y_pred <= 0] = -1
print("training accuracy: {}".format((y_pred == train_target[0]).mean()))

# Project the test images onto the same two principal directions and classify them.
z_test = (vh[:2, :] @ test_data.T).T
y_pred_test = ((z_test @ w) >= 0).astype(np.int32)
y_pred_test[y_pred_test <= 0] = -1
print("testing accuracy: {}".format((y_pred_test == test_target[0]).mean()))
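
The snippet above assumes that z and vh have already been computed from the training images. A minimal sketch of that step is below; the placeholder train_data array stands in for the real (200, 784) matrix of flattened training images from mnist.mat (whose exact variable names are not shown in this handout).

import numpy as np

# Placeholder for the real data: in practice train_data would be the (200, 784)
# array of flattened 28 x 28 training images loaded from mnist.mat.
rng = np.random.default_rng(0)
train_data = rng.normal(size=(200, 784))

# Thin SVD of the training data; the rows of vh are the right singular vectors,
# i.e., the principal directions of the 784-dimensional pixel space.
u, s, vh = np.linalg.svd(train_data, full_matrices=False)

# Project each image onto the top two directions to get its 2-D representation z_i.
z = (vh[:2, :] @ train_data.T).T      # shape (200, 2)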

training accuracy: 0.95


testing accuracy: 0.96

b) Plot the classification decision boundary in 2d space. (Hint: the plot should look some-
thing like figure 2.)
SOLUTION:

Figure 2: Reference for what the plot should roughly look like. Here, the decision boundary is not required
to pass through the origin (i.e., we have added a bias term to the data matrix). Each point is one sample,
and the color is determined by its digit label. The axes are given by the first two principal component
directions, and the coordinates are given by the principal components.

# Assumes numpy as np and matplotlib.pyplot as plt are imported, and that z and w
# come from part (a). Sample 1000 x-coordinates spanning the first principal component.
point_num = 1000
zmin = z.min(axis=0)
zmax = z.max(axis=0)
point_x = (np.arange(point_num).astype(np.float32) /
           point_num * (zmax[0] - zmin[0]) + zmin[0])

# Decision boundary where the score is zero: w[0]*x + w[1]*y = 0 (no bias term here,
# so this particular boundary passes through the origin).
point_y = -1 * point_x * w[0] / w[1]
idx = np.logical_and(point_y < zmax[1], point_y >= zmin[1])

# The first 100 rows of z are one digit class and the last 100 are the other.
plt.scatter(z[:100, 0], z[:100, 1], c='b')
plt.scatter(z[100:, 0], z[100:, 1], c='y')
plt.scatter(point_x[idx], point_y[idx], s=1, color='red')
plt.savefig('so4.pdf')

