
CS 189 / 289A Introduction to Machine Learning

Fall 2024 Jennifer Listgarten, Saeed Saremi HW1


Due 09/11/24 11:59 PM PT

• Homework 1 consists of both written and coding questions.

• We prefer that you typeset your answers using LaTeX or other word processing software.
If you haven't yet learned LaTeX, one of the crown jewels of computer science, now is a
good time! Neatly handwritten and scanned solutions will also be accepted for the written
questions.
• In all of the questions, show your work, not just the final answer.
• Start early. This is a long assignment. Most of the material is prerequisite material not
covered in lecture; you are responsible for finding resources to understand it.

Deliverables:

1. Submit a PDF of your homework to the Gradescope assignment entitled “HW 1 Write-Up”.
Please start each question on a new page. If there are graphs, include those graphs in the
correct sections. Do not just stick the graphs in the appendix. We need each solution to be
self-contained on pages of its own.

• Replicate all of your code in an appendix. Begin code for each coding question on
a fresh page. Do not put code from multiple questions on the same page. When you
upload this PDF on Gradescope, make sure that you assign the relevant pages of your
code from the appendix to the correct questions.
2. Submit all the code needed to reproduce your results to the Gradescope assignment entitled
“Homework 1 Code”. Yes, you must submit your code twice: once in your PDF write-up, following
the directions described above so the readers can easily read it, and once in the format
described below for ease of reproducibility.
• You must set random seeds for all random utils to ensure reproducibility.
• Do NOT submit any data files that we provided.
• Please also include a short file named README listing your name, student ID, and
instructions on how to reproduce your results.
• Please take care that your code doesn’t take up inordinate amounts of time or memory.

1 Linear Algebra Review
1. First isomorphism theorem. The isomorphism theorems are an important class of results
with versions for various algebraic structures. Here, we are concerned with the first isomorphism
theorem for vector spaces, one of the most fundamental results in linear algebra.
Theorem. Let V, W be vector spaces, and let T : V → W be a linear map. Then the following
are true:

(a) ker T is a subspace of V.


(b) Im T is a subspace of W.
(c) Im T is isomorphic to V/ ker T .

Prove parts (a) and (b) of the theorem. (The interesting result is part (c), so, if you’re
inclined, try it out! We promise it’s a very rewarding proof :) If you are interested but
unfamiliar with the language, try looking up “isomorphism” and “quotient space.”)
2. First we review some basic concepts of rank. Recall that elementary matrix operations do not
change a matrix’s rank. Let A ∈ Rm×n and B ∈ Rn×p . Let In denote the n × n identity matrix.
   
(a) Perform elementary row and column operations1 to transform

            [ In   0  ]                [ B   In ]
            [ 0    AB ]      into      [ 0   A  ].
(b) Let’s find lower and upper bounds on rank(AB). Use part (a) to prove that rank A +
rank B − n ≤ rank(AB). Then use what you know about the relationship between the
column space (range) and/or row space of AB and the column/row spaces for A and B
to argue that rank(AB) ≤ min{rank A, rank B}.
(c) If a matrix A has rank r, then some r × r submatrix M of A has a nonzero determinant.
Use this fact to show the standard facts that the dimension of A’s column space is at
least r, and the dimension of A’s row space is at least r. (Note: You will not receive
credit for other ways of showing those same things.)
(d) It is a fact that rank(A⊤ A) = rank A; here’s one way to see that. We’ve already seen in
part (b) that rank(A⊤ A) ≤ rank A. Suppose that rank(A⊤ A) were strictly less than rank A.
What would that tell us about the relationship between the column space of A and the
null space of A⊤ ? What standard fact about the fundamental subspaces of A says that
relationship is impossible?
(e) Given a set of vectors S ⊆ Rn , let AS = {Av : v ∈ S } denote the subset of Rm found
by applying A to every vector in S . In terms of the ideas of the column space (range)
and row space of A: What is ARn , and why? (Hint: what are the definitions of column
space and row space?) What is A⊤ ARn , and why? (Your answer to the latter should be
purely in terms of the fundamental subspaces of A itself, not in terms of the fundamental
subspaces of A⊤ A.)
1 If you’re not familiar with these, https://stattrek.com/matrix-algebra/elementary-operations is a decent introduction.

3. Let A ∈ Rn×n be a symmetric matrix. Prove equivalence between these three different
definitions of positive semidefiniteness (PSD). Note that when we talk about PSD matrices in
this class, they are defined to be symmetric matrices. There are nonsymmetric matrices that
exhibit PSD properties, like the first definition below, but not all three.
(a) For all x ∈ Rn , x⊤ Ax ≥ 0.
(b) All the eigenvalues of A are nonnegative.
(c) There exists a matrix U ∈ Rn×n such that A = UU ⊤ .

Positive semidefiniteness will be denoted as A ⪰ 0.


4. The Frobenius inner product between two matrices of the same dimensions A, B ∈ Rm×n is

            ⟨A, B⟩ = trace(A⊤B) = Σ_{i=1}^{m} Σ_{j=1}^{n} A_ij B_ij ,

where trace M denotes the trace of M, which you should look up if you don’t already know
it. (The inner product is sometimes written ⟨A, B⟩F to be clear.) The Frobenius norm of a matrix is

            ∥A∥F = √⟨A, A⟩ = √( Σ_{i=1}^{m} Σ_{j=1}^{n} |A_ij|² ).

(A quick numerical check of these identities, and of the maximum singular value in problem 5, appears after problem 5.)

Prove the following. The Cauchy–Schwarz inequality, the cyclic property of the trace, and
the definitions in part 3 above may be helpful to you.
(a) x⊤ Ay = ⟨A, xy⊤ ⟩ for all x ∈ Rm , y ∈ Rn , A ∈ Rm×n .
(b) If A and B are symmetric PSD matrices, then trace (AB) ≥ 0.
(c) [OPTIONAL] If A, B ∈ Rn×n are real symmetric matrices with λmax (A) ≥ 0 and B being
PSD, then ⟨A, B⟩ ≤ nλmax (A)∥B∥F .
Hint: Construct a PSD matrix using λmax (A).
5. Let A ∈ Rm×n be an arbitrary matrix. The maximum singular value of A is defined to be
σmax (A) = √λmax (A⊤A) = √λmax (AA⊤). Prove that

            σmax (A) =         max          (u⊤Av).
                        u ∈ Rm , v ∈ Rn
                        ∥u∥=1, ∥v∥=1
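The following optional snippet is only a numerical sanity check of the definitions in problems 4 and 5 above, not a proof; the matrix sizes and random entries are arbitrary choices of ours, and it assumes NumPy is available.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 3))
    B = rng.standard_normal((4, 3))

    # Frobenius inner product: trace(A^T B) equals the entrywise sum of A_ij * B_ij.
    print(np.isclose(np.trace(A.T @ B), np.sum(A * B)))                       # True

    # Frobenius norm: sqrt(<A, A>) equals NumPy's built-in Frobenius norm.
    print(np.isclose(np.sqrt(np.trace(A.T @ A)), np.linalg.norm(A, 'fro')))   # True

    # Maximum singular value: sqrt(lambda_max(A^T A)) matches A's largest singular value.
    sigma_max = np.linalg.svd(A, compute_uv=False)[0]
    lam_max = np.max(np.linalg.eigvalsh(A.T @ A))
    print(np.isclose(sigma_max, np.sqrt(lam_max)))                            # True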

2 Matrix/Vector Calculus and Norms
 
1. Consider a 2 × 2 matrix A, written in full as

            [ A11  A12 ]
            [ A21  A22 ],

and two arbitrary 2-dimensional vectors x, y. Calculate the gradient of

            sin(A11²) + e^(A11 + A22) + x⊤Ay

with respect to the matrix A.
Hint: The gradient has the same dimensions as A. Use the chain rule.
2. Aside from norms on vectors, we can also impose norms on matrices. Besides the Frobenius
norm, the most common kind of norm on matrices is called the induced norm. Induced norms
are defined as

            ∥A∥p = sup_{x≠0} ∥Ax∥p / ∥x∥p ,

where the notation ∥ · ∥p on the right-hand side denotes the vector ℓp -norm. Please give the
closed-form (or the simplest) expressions for the following induced norms of A ∈ Rm×n . (A
small numerical illustration of this definition appears at the end of this section.)
(a) ∥A∥2 . Hint: use the singular value decomposition.
(b) ∥A∥∞
3. (a) Let α = Σ_{i=1}^{n} yi ln(1 + e^(βi)) for y, β ∈ Rn . What are the partial derivatives ∂α/∂βi ?
   (b) Given x ∈ Rn , A ∈ Rm×n . Write the partial derivative ∂(Ax)/∂x .
   (c) Given z ∈ Rm . Write the gradient ∇z (z⊤z).
   (d) Given x ∈ Rn , z ∈ Rm , and z = g(x). Write the gradient ∇x (z⊤z) in terms of ∂z/∂x and z.
   (e) Given x ∈ Rn , y, z ∈ Rm , A ∈ Rm×n , and z = Ax − y. Write the gradient ∇x (z⊤z).
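As a rough, optional illustration of the induced-norm definition in problem 2 (referenced there), the sketch below lower-bounds sup_{x≠0} ∥Ax∥p/∥x∥p by sampling random nonzero vectors and compares the estimate with NumPy's built-in matrix norms; the matrix and the sample count are arbitrary choices of ours.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 5))

    for p in (2, np.inf):
        # Crude lower bound on the supremum: evaluate the ratio at many random vectors x.
        xs = rng.standard_normal((10000, 5))
        ratios = np.linalg.norm(xs @ A.T, ord=p, axis=1) / np.linalg.norm(xs, ord=p, axis=1)
        print(p, ratios.max(), np.linalg.norm(A, ord=p))   # sampled estimate vs. built-in norm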

3 Linear Neural Networks
Let’s apply the multivariate chain rule to a “simple” type of neural network called a linear neural
network. They’re not very powerful, as they can learn only linear regression functions or decision
functions, but they’re a good stepping stone for understanding more complicated neural networks.
We are given an n × d design matrix X. Each row of X is a training point, so X represents n training
points with d features each. We are also given an n × k matrix Y. Each row of Y is a set of k labels
for the corresponding training point in X. Our goal is to learn a k × d matrix W of weights2 such
that
Y ≈ XW ⊤ .
If n is larger than d, typically there is no W that achieves equality, so we seek an approximate
answer. We do that by finding the matrix W that minimizes the cost function

RSS(W) = ∥XW ⊤ − Y∥2F . (1)

This is a classic least-squares linear regression problem; most of you have seen those before. But
we are solving k linear regression problems simultaneously, which is why Y and W are matrices
instead of vectors.

Linear neural networks. Instead of optimizing W over the space of k × d matrices directly, we
write the W we seek as a product of multiple matrices. This parameterization is called a linear
neural network.
W = µ(WL , WL−1 , . . . , W2 , W1 ) = WL WL−1 · · · W2 W1 .
Here, µ is called the matrix multiplication map (hence the Greek letter mu) and each W j is a real-
valued d j × d j−1 matrix. Recall that W is a k × d matrix, so dL = k and d0 = d. L is the number of
layers of “connections” in the neural network. You can also think of the network as having L + 1
layers of units: d0 = d units in the input layer, d1 units in the first hidden layer, dL−1 units in the
last hidden layer, and dL = k units in the output layer.
We collect all the neural network’s weights in a weight vector θ = (WL , WL−1 , . . . , W1 ) ∈ Rdθ , where
dθ = dL dL−1 + dL−1 dL−2 + . . . + d1 d0 is the total number of real-valued weights in the network. Thus
we can write µ(θ) to mean µ(WL , WL−1 , . . . , W1 ). But you should imagine θ as a column vector: we
take all the components of all the matrices WL , WL−1 , . . . , W1 and just write them all in one very
long column vector. Given a fixed weight vector θ, the linear neural network takes an input vector
x ∈ Rd0 and returns an output vector y = WL WL−1 · · · W2 W1 x = µ(θ)x ∈ RdL .
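As a concrete illustration (with made-up layer widths and random weights of our own choosing), the sketch below builds a three-layer linear network and checks that applying the layers one at a time agrees with multiplying the input by µ(θ) directly.

    import numpy as np

    rng = np.random.default_rng(0)
    d0, d1, d2, d3 = 4, 6, 5, 3                  # d0 = d (input), d3 = k (output), L = 3
    W1 = rng.standard_normal((d1, d0))
    W2 = rng.standard_normal((d2, d1))
    W3 = rng.standard_normal((d3, d2))

    def mu(Ws):
        """Matrix multiplication map: mu(W_L, ..., W_1) = W_L W_{L-1} ... W_1."""
        W = Ws[0]
        for Wj in Ws[1:]:
            W = W @ Wj
        return W

    theta = (W3, W2, W1)                         # weights listed from last layer to first
    x = rng.standard_normal(d0)

    y_layers = W3 @ (W2 @ (W1 @ x))              # apply the layers one at a time
    y_mu = mu(theta) @ x                         # multiply by mu(theta) = W3 W2 W1 directly
    print(np.allclose(y_layers, y_mu))           # True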
Now our goal is to find a weight vector θ that minimizes the composition RSS ◦ µ—that is, it
minimizes the cost function
J(θ) = RSS(µ(θ)).
We are substituting a linear neural network for W and optimizing the weights in θ instead of directly
optimizing the components of W. This makes the optimization problem harder to solve, and you
would never solve least-squares linear regression problems this way in practice; but again, it is a
good exercise to work toward understanding the behavior of “real” neural networks in which µ is
not a linear function.
2 The reason for the transpose on W ⊤ is that we think in terms of applying W to an individual training point. Indeed, if
Xi ∈ Rd and Yi ∈ Rk respectively denote the i-th rows of X and Y transposed to be column vectors, then we can write Yi ≈ WXi . For
historical reasons, most papers in the literature use design matrices whose rows are sample points, rather than columns.
We would like to use a gradient descent algorithm to find θ, so we will derive ∇θ J as follows.

1. The gradient G = ∇W RSS(W) is a k × d matrix whose entries are Gi j = ∂RSS(W)/∂Wi j ,
where RSS(W) is defined by Equation (1). The simple formula for ∇W RSS(W) in matrix
notation is

            ∇W RSS(W) = 2(WX⊤ − Y⊤)X.

Prove this fact by deriving a formula for each Gi j using summations, simplified as much
as possible. Hint: To break down RSS(W) into its component summations, start with the
relationship between the Frobenius norm and the trace of a matrix. (A numerical sanity
check of this formula is sketched at the end of this section.)
2. Directional derivatives are closely related to gradients. The notation RSS′∆W (W) denotes
the directional derivative of RSS(W) in the direction ∆W, and the notation µ′∆θ (θ) denotes
the directional derivative of µ(θ) in the direction ∆θ.3 Informally speaking, the directional
derivative RSS′∆W (W) tells us how much RSS(W) changes if we increase W by an infinites-
imal displacement ∆W ∈ Rk×d . (However, any ∆W we can actually specify is not actually
infinitesimal; RSS′∆W (W) is a local linearization of the relationship between W and RSS(W)
at W. To a physicist, RSS′∆W (W) tells us the initial velocity of change of RSS(W) if we start
changing W with velocity ∆W.)
Show how to write RSS′∆W (W) as a Frobenius inner product of two matrices, one related to
part 3.1.
3. In principle, we could take the gradient ∇θ µ(θ), but we would need a 3D array to express
it! As we don’t know a nice way to write it, we’ll jump directly to writing the directional
derivative µ′∆θ (θ). Here, ∆θ ∈ Rdθ is a weight vector whose matrices we will write ∆θ =
(∆WL , ∆WL−1 , . . . , ∆W1 ). Show that

            µ′∆θ (θ) = Σ_{j=1}^{L} W> j ∆W j W< j

where W> j = WL WL−1 · · · W j+1 , W< j = W j−1 W j−2 · · · W1 , and we use the convention that W>L
is the dL × dL identity matrix and W<1 is the d0 × d0 identity matrix.
Hint: although µ is not a linear function of θ, µ is linear in any single W j ; and any directional
derivative of the form µ′∆θ (θ) is linear in ∆θ (for a fixed θ).
4. Recall the chain rule for scalar functions, (d/dx) f (g(x))|x=x0 = (d/dy) f (y)|y=g(x0) · (d/dx) g(x)|x=x0 . There
is a multivariate version of the chain rule, which we hope you remember from some class
you’ve taken, and the multivariate chain rule can be used to chain directional derivatives.
3 “∆W” and “∆θ” are just variable names that remind us to think of these as small displacements of W or θ; the Greek letter delta
is not an operator nor a separate variable.


Write out the chain rule that expresses the directional derivative J′∆θ (θ)|θ=θ0 by composing
your directional derivatives for RSS and µ, evaluated at a weight vector θ0 . (Just write the
pure form of the chain rule without substituting the values of those directional derivatives;
we’ll substitute the values in the next part.)

5. Now substitute the values you derived in parts 3.2 and 3.3 into your expression for J′∆θ (θ) and
use it to show that

            ∇θ J(θ) = ( 2 (µ(θ)X⊤ − Y⊤) X W<L⊤ ,
                        . . . ,
                        2 W> j⊤ (µ(θ)X⊤ − Y⊤) X W< j⊤ ,
                        . . . ,
                        2 W>1⊤ (µ(θ)X⊤ − Y⊤) X ).

This gradient is a vector in Rdθ written in the same format as (WL , . . . , W j , . . . , W1 ). Note that
the values W> j and W< j here depend on θ.
Hint: you might find the cyclic property of the trace handy.
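The following optional check (referenced in part 3.1) compares the stated matrix formula ∇W RSS(W) = 2(WX⊤ − Y⊤)X against central finite differences on randomly generated X, Y, and W; it is a numerical sanity check under arbitrary sizes, not a substitute for the derivation.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 8, 3, 2
    X = rng.standard_normal((n, d))
    Y = rng.standard_normal((n, k))
    W = rng.standard_normal((k, d))

    def rss(W):
        return np.linalg.norm(X @ W.T - Y, 'fro') ** 2

    G = 2 * (W @ X.T - Y.T) @ X                  # matrix form of the gradient from part 3.1

    # Central finite differences, one entry of W at a time.
    eps = 1e-6
    G_fd = np.zeros_like(W)
    for i in range(k):
        for j in range(d):
            E = np.zeros_like(W)
            E[i, j] = eps
            G_fd[i, j] = (rss(W + E) - rss(W - E)) / (2 * eps)

    print(np.allclose(G, G_fd, atol=1e-4))       # True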

4 Probability Potpourri
1. Recall the covariance of two scalar random variables X and Y is defined as Cov(X, Y) =
E[(X − E[X])(Y − E[Y])]. For a multivariate random variable Z ∈ Rn , (i.e., Z is a column
vector where each element Zi is a scalar random variable), we define the covariance matrix Σ
such that Σi j = Cov(Zi , Z j ). Concisely, Σ = E[(Z − µ)(Z − µ)⊤ ], where µ is the mean value of
Z. Prove that the covariance matrix is always positive semidefinite (PSD).
Hint: Use linearity of expectation.
2. Suppose a pharmaceutical company is developing a diagnostic test for a rare disease that has
a prevalence of 1 in 1,000 in the population. Let x be the true positive rate of the test, and let
y be the false positive rate. Determine the minimum value that x must have, expressed as a
function of y, such that a patient who tests positive actually has the disease with probability
greater than 0.5.
3. An archery target is made of 3 concentric circles of radii 1/√3, 1 and √3 feet. Arrows striking
within the inner circle are awarded 4 points, arrows within the middle ring are awarded
3 points, and arrows within the outer ring are awarded 2 points. Shots outside the target are
awarded 0 points.
Consider a random variable X, the distance of the strike from the center (in feet), and let the
probability density function of X be

            f (x) = 2 / (π(1 + x²))    for x > 0,
            f (x) = 0                  otherwise.

What is the expected value of the score of a single strike?


4. Let X ∼ Pois(λ), Y ∼ Pois(µ). Given that X ⊥⊥ Y, derive an expression for P(X = k | X + Y = n)
where k = 0, . . . , n. What well-known probability distribution is this? What are its parameters?
(A Monte Carlo sketch of this conditional distribution appears at the end of this section.)
5. Consider a coin that may be biased, where the probability of the coin landing heads on any
single flip is θ. If the coin is flipped n times and heads is observed k times, what is the
maximum likelihood estimate (MLE) of θ?
6. Consider a family of distributions parameterized by θ ∈ R with the following probability
density function:

            fθ (x) = e^(θ−x)    when x ≥ θ,
            fθ (x) = 0          when x < θ.

(a) Prove that f is a valid probability density function by showing that it integrates to 1 for
all θ.
(b) Suppose that you observe n samples distributed according to f : x1 , x2 , . . . , xn . Find the
maximum likelihood estimate of θ.

5 The Multivariate Normal Distribution
The multivariate normal distribution with mean µ ∈ Rd and positive definite covariance matrix
Σ ∈ Rd×d , denoted N(µ, Σ), has the probability density function
!
1 1
f (x; µ, Σ) = p exp − (x − µ) Σ (x − µ) .
⊤ −1

(2π)d |Σ| 2

Here |Σ| denotes the determinant of Σ. You may use the following facts without proof.

• The volume under the normal PDF is 1:

            ∫_Rd f (x) dx = ∫_Rd (1 / √((2π)^d |Σ|)) exp( −(1/2) (x − µ)⊤ Σ⁻¹ (x − µ) ) dx = 1.

• The change-of-variables formula for integrals: let f be a smooth function from Rd → R, let
  A ∈ Rd×d be an invertible matrix, and let b ∈ Rd be a vector. Then, performing the change of
  variables x ↦ z = Ax + b,

            ∫_Rd f (x) dx = ∫_Rd f (A⁻¹z − A⁻¹b) |A⁻¹| dz.
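Below is a small optional check of the density formula above; it assumes SciPy is available for comparison, and the mean, covariance, and test point are arbitrary.

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)
    d = 3
    mu = rng.standard_normal(d)
    A = rng.standard_normal((d, d))
    Sigma = A @ A.T + np.eye(d)                  # an arbitrary positive definite covariance
    x = rng.standard_normal(d)

    # Density computed directly from the formula in the text.
    diff = x - mu
    f = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) \
        / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

    print(np.isclose(f, multivariate_normal(mean=mu, cov=Sigma).pdf(x)))   # True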

All throughout this question, we take X ∼ N(µ, Σ).

1. Use a suitable change of variables to show that E[X] = µ. You must utilize the definition of
expectation.
2. Use a suitable change of variables to show that Var(X) = Σ, where the variance of a vector-
valued random variable X is

Var(X) = Cov(X, X) = E[(X − µ) (X − µ)⊤ ] = E[XX ⊤ ] − µ µ⊤ .

Hints: Every symmetric, positive definite matrix Σ has a symmetric, positive definite square
root Σ1/2 such that Σ = Σ1/2 Σ1/2 . Note that Σ and Σ1/2 are invertible. After the change of
variables, you will have to find another variance Var(Z); if you’ve chosen the right change of
variables, you can solve that by solving the integral for each diagonal component of Var(Z)
and a second integral for each off-diagonal component. The diagonal components will require
integration by parts. You cannot assume anything about Var(Z); you must compute it
via integration. (A numerical illustration of the matrix square root appears at the end of
this section.)
3. Compute the moment generating function (MGF) of X: MX (λ) = E[e^(λ⊤X) ], where λ ∈ Rd .

Note: moment generating functions have several interesting and useful properties, one being
that MX characterizes the distribution of X: if MX = MY , then X and Y have the same
distribution.
Hints:

• You should try “completing the square” in the exponent of the Gaussian PDF.

• You should arrive at

            MX (λ) = exp( λ⊤µ + (1/2) λ⊤Σλ ).

4. Using the fact that MGFs determine distributions, given A ∈ Rk×d and b ∈ Rk identify the
distribution of AX + b (don’t worry about covariance matrices being invertible).

5. Show that there exists an affine transformation of X that is distributed as the standard multi-
variate Gaussian, N(0, Id ). (Assume Σ is invertible.)
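As an optional numerical illustration of the square-root fact quoted in the hint to problem 2 (and nothing more), the sketch below builds Σ1/2 from an eigendecomposition of an arbitrary positive definite Σ and checks that it is symmetric, positive definite, and squares back to Σ.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4))
    Sigma = A @ A.T + np.eye(4)                          # arbitrary symmetric positive definite

    # Symmetric PSD square root via the eigendecomposition Sigma = Q diag(w) Q^T.
    w, Q = np.linalg.eigh(Sigma)
    Sigma_half = Q @ np.diag(np.sqrt(w)) @ Q.T

    print(np.allclose(Sigma_half, Sigma_half.T))         # symmetric
    print(np.all(np.linalg.eigvalsh(Sigma_half) > 0))    # positive definite
    print(np.allclose(Sigma_half @ Sigma_half, Sigma))   # squares back to Sigma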

6 Real Analysis
1. Limit of a Sequence. A sequence {xn } is said to converge to a limit L if, for every measure
of closeness ϵ > 0, the terms xn beyond some point n0 ∈ N lie within ϵ of that limit.
More formally, if limn→∞ xn = L then ∀ϵ > 0, ∃n0 ∈ Z+ such that ∀n ≥ n0 :

            |xn − L| < ϵ

(a) Consider the sequence {xn } defined by the recurrence relation xn+1 = (1/2) xn . Treat x0 as
some constant that is the first element of the sequence. Prove that {xn } converges by
evaluating limn→∞ xn . You must use the formal definition of the limit of a sequence. (A
small numerical illustration of this definition appears at the end of this section.)
(b) [OPTIONAL] Consider a sequence {xn } of non-zero real numbers and suppose that

            L = limn→∞ n (1 − |xn+1| / |xn|)

exists. Prove that {|xn |} converges when L > 1 by evaluating limn→∞ |xn |.
Hint: Use the Bernoulli Inequality (1 − q/n) < exp(−q/n) and the approximation for the
Harmonic Series Σ_{k=1}^{n} 1/k ≈ ln(n) for sufficiently large n.

2. Taylor Series. Taylor series expansions are a method of approximating a function near a
point using polynomial terms. The Taylor expansion for a function f (x) at point a is given
by:

            f (x) = f (a) + f ′(a)(x − a) + ( f ′′(a)/2! )(x − a)² + ( f ′′′(a)/3! )(x − a)³ + · · ·

This can also be rewritten as f (x) = Σ_{n=0}^{∞} ( f (n)(a)/n! ) (x − a)n .
(a) Calculate the first three terms of the Taylor series for f (x) = ln(1 + x) centered at a = 0.
(b) [OPTIONAL] The gamma function is defined as

            Γ(z) = ∫₀^∞ t^(z−1) e^(−t) dt.

Calculate and use a first-order Taylor expansion of the gamma function centered at 1 to
approximate Γ(1.1). You should express your answer in terms of the Euler-Mascheroni
constant γ.
You may use the fact that Γ(x + 1) interpolates the factorial function without proof.
3. Consider a twice continuously differentiable function f : Rn → R. Suppose this function
admits a unique global optimum x∗ ∈ Rn . Suppose that for some spherical region X =
{x | ∥x − x∗ ∥2 ≤ D} around x∗ for some constant D, the Hessian matrix H of the function f (x)
is PSD and its maximum eigenvalue is 1. Prove that

            f (x) − f (x∗ ) ≤ D/2

for every x ∈ X. Hint: Look up Taylor’s Theorem with Remainder. Use the Mean Value Theorem
on the second-order term instead of the first-order term, which is what is usually done.
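The tiny sketch referenced in problem 1(a) is below; it is only a numerical illustration of the ϵ–n0 definition for the halving sequence, with an arbitrary starting value and tolerance, and it is not a proof.

    x0, L, eps = 5.0, 0.0, 1e-6      # conjectured limit L and an arbitrary tolerance
    x, n = x0, 0
    while abs(x - L) >= eps:         # search for an n0 with |x_n - L| < eps for n >= n0
        x, n = 0.5 * x, n + 1
    print(n, x)                      # prints 23 and a value below the tolerance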

7 Hands-on with data
In the following problem, you will use two simple datasets to walk through the steps of a standard
machine learning workflow: inspecting your data, choosing a model, implementing it, and verifying
its accuracy. We have provided two datasets in the form of numpy arrays: dataset_1.npy
and dataset_2.npy. You can load each using NumPy’s np.load method4 . You can plot figures
using Matplotlib’s plt.plot method5 .
Each dataset is a two-column array with the first column consisting of n scalar inputs X ∈ Rn×1 and
the second column consisting of n scalar labels Y ∈ Rn×1 . We denote each entry of X and Y with
subscripts:

            [ x1 ]          [ y1 ]
            [ x2 ]          [ y2 ]
       X =  [ .. ]     Y =  [ .. ]
            [ xn ]          [ yn ]

and assume that yi is a (potentially stochastic) function of xi .

(a) It is often useful to visually inspect your data and calculate simple statistics; this can detect
dataset corruptions or inform your method. For both datasets:

(i) Plot the data as a scatter plot.


(ii) Calculate the correlation coefficient between X and Y:

            ρX,Y = Cov(X, Y) / (σX σY ),

     in which Cov(X, Y) is the covariance between X and Y and σX is the standard deviation
     of X.
Your solution may make use of the NumPy library only for arithmetic operations, matrix-
vector or matrix-matrix multiplications, matrix inversion, and elementwise exponentiation. It
may not make use of library calls for calculating means, standard deviations, or the correlation
coefficient itself directly.
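A minimal sketch of part (a) is given below. It assumes the provided files are named dataset_1.npy and dataset_2.npy and that the first column holds X and the second holds Y, as described above; sums are written as dot products with a vector of ones so that only the allowed NumPy operations are used.

    import numpy as np
    import matplotlib.pyplot as plt

    data = np.load("dataset_1.npy")        # repeat the same steps for dataset_2.npy
    x, y = data[:, 0], data[:, 1]
    n = x.shape[0]

    plt.scatter(x, y, s=5)                 # (i) scatter plot of the raw data
    plt.show()

    ones = np.ones(n)
    mean_x = (ones @ x) / n                # means, written as dot products with ones
    mean_y = (ones @ y) / n
    dx, dy = x - mean_x, y - mean_y
    cov_xy = (dx @ dy) / n                 # covariance
    std_x = ((dx @ dx) / n) ** 0.5         # standard deviations
    std_y = ((dy @ dy) / n) ** 0.5
    print(cov_xy / (std_x * std_y))        # (ii) correlation coefficient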
(b) We would like to design a function that can predict yi given xi and then apply it to new inputs.
This is a recurring theme in machine learning, and you will soon learn about a general-purpose
framework for thinking about such problems. As a preview, we will now explore one of the
simplest instantiations of this idea using the class of linear functions:

Ŷ = Xw. (2)

The parameters of our function are denoted by w ∈ R. It is common to denote predicted
variants of quantities with a hat, so Ŷ is a predicted label whereas Y is a ground truth label.
4 https://numpy.org/doc
5 https://matplotlib.org/stable/users/explain/quick_start.html

We would like to find a w∗ that minimizes the squared error JSE between predictions and
labels:

            w∗ = argmin_w JSE (w) = argmin_w ∥Xw − Y∥22 .

Derive ∇w JSE (w) and set it equal to 0 to solve for w∗ . (Note that this procedure for finding an
optimum relies on the convexity of JSE . You do not need to show convexity here, but it is a
useful exercise to convince yourself this is valid.)
(c) Your solution w∗ should be a function of X and Y. Implement it and report its mean squared
error (MSE) for dataset 1. The mean squared error is the objective JSE from part (b) divided
by the number of datapoints:

            JMSE (w) = (1/n) ∥Xw − Y∥22 .
Also visually inspect the model’s quality by plotting a line plot of predicted ŷ for uniformly-
spaced x ∈ [0, 10]. Keep the scatter plot from part (a) in the background so that you can
compare the raw data to your linear function. Does the function provide a good fit of the data?
Why or why not?
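One possible way to wire up part (c) is sketched below. It assumes the filename dataset_1.npy and that your part (b) derivation produced the familiar normal-equations form for w∗ (verify this against your own work before using it).

    import numpy as np
    import matplotlib.pyplot as plt

    data = np.load("dataset_1.npy")
    X = data[:, [0]]                            # n x 1 design matrix
    Y = data[:, [1]]                            # n x 1 labels
    n = X.shape[0]

    # Closed-form least-squares fit, assuming the normal-equations form of w*.
    w = np.linalg.inv(X.T @ X) @ X.T @ Y

    mse = np.sum((X @ w - Y) ** 2) / n
    print("MSE:", mse)

    # Overlay the fitted line on the scatter plot from part (a).
    xs = np.linspace(0, 10, 200).reshape(-1, 1)
    plt.scatter(X, Y, s=5)
    plt.plot(xs, xs @ w, color="red")
    plt.show()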
(d) We are now going to experiment with constructing new features for our model. That is, instead
of considering models that are linear in the inputs, we will now consider models that are linear
in some (potentially nonlinear) transformation of the data:

            Ŷ = Φw =  [ ϕ(x1 )⊤ ]
                      [ ϕ(x2 )⊤ ]
                      [   ..    ]  w,
                      [ ϕ(xn )⊤ ]
where ϕ(xi ), w ∈ Rm . Repeat part (c), providing both the mean squared error of your predictor
and a plot of its predictions, for the following features on dataset 1:
 
            ϕ(xi ) = [ xi ]
                     [ 1  ].

How do the plotted function and mean squared error compare? (A single sentence will suffice.)
Hint: the general form of your solution for w∗ is still valid, but you will now need to use
features Φ where you once used raw inputs X.
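A small helper for building the feature matrix Φ is sketched below; the function name make_features is our own and not part of the assignment, and the same closed-form solve from part (c) applies with Φ in place of X.

    import numpy as np

    def make_features(x, degree):
        """Return the n x (degree + 1) matrix whose i-th row is [x_i^degree, ..., x_i, 1]."""
        x = x.reshape(-1)
        return np.column_stack([x ** p for p in range(degree, -1, -1)])

    # Part (d): affine features phi(x_i) = [x_i, 1]^T   ->  Phi = make_features(x, 1)
    # Part (e): quadratic features                      ->  Phi = make_features(x, 2)
    # Then reuse the least-squares formula with Phi in place of X, e.g.
    #     w = np.linalg.inv(Phi.T @ Phi) @ Phi.T @ Y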
(e) Now consider the quadratic features:
 
            ϕ(xi ) = [ xi² ]
                     [ xi  ]
                     [ 1   ].

Repeat part (c) with these features on dataset 1, once again providing short commentary on
any changes.
(f) Repeat parts (c)-(e) with dataset 2.

(g) Finally, we would like to understand which features Φ provide us with the best model. To
that end, you will implement a method known as k-fold cross validation. The following are
instructions for this method; deliverables for part (g) are at the end.
(i) Split dataset 2 randomly into k = 4 equal-sized subsets. Group the dataset into 4 distinct
training / validation splits by denoting each subset as the validation set and the remaining
subsets as the training set for that split. (A minimal fold-construction sketch appears at the
end of this part.)
(ii) On each of the 4 training / validation splits, fit linear models using the following 5 poly-
nomial feature sets:
 5
   xi 
 3  x4   4 
 
2
 xi   i3   xi 
   xi   2   xi   3 
 x  x  x 
ϕ1 (xi ) =  i  ϕ2 (xi ) =  xi  ϕ3 (xi ) =  i  ϕ4 (xi ) =  xi2  ϕ5 (xi ) =  i2 
   
1    xi   xi 
1  xi 
 
1
 xi 
 
1
1

This step will produce 20 distinct w∗ vectors: one for each dataset split and featurization
ϕ j.
(iii) For each feature set ϕ j , average the training and validation mean squared errors over all
training splits.
It is worth thinking about what this extra effort has bought us: by splitting the dataset into
subsets, we were able to use all available datapoints for model fitting while still having
held-out datapoints for evaluating any particular model.
Deliverables for part (g): Plot the training mean squared error and the validation mean
squared error on the same plot as a function of the largest exponent in the feature set. Use
a log scale for the y-axis. Which model does the training mean squared error suggest is best?
Which model does the validation mean squared error suggest is best?
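A minimal fold-construction sketch for step (i), referenced there, is below. It fixes a random seed per the submission guidelines; the seed, the filename dataset_2.npy, and the use of np.array_split (which tolerates n not being divisible by 4) are our own choices.

    import numpy as np

    data = np.load("dataset_2.npy")
    n, k = data.shape[0], 4

    rng = np.random.default_rng(189)                   # fixed seed for reproducibility
    folds = np.array_split(rng.permutation(n), k)      # k roughly equal-sized index subsets

    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        train, val = data[train_idx], data[val_idx]
        # Fit each feature set phi_j on `train`, record MSE on both `train` and `val`,
        # then average the errors over the 4 splits for each phi_j.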

8 Honor Code
1. List all collaborators. If you worked alone, then you must explicitly state so.
2. Declare and sign the following statement:
“I certify that all solutions in this document are entirely my own and that I have not looked
at anyone else’s solution. I have given credit to all external sources I consulted.”
Signature :
While discussions are encouraged, everything in your solution must be your (and only your)
creation. Furthermore, all external material (i.e., anything outside lectures and assigned read-
ings, including figures and pictures) should be cited properly. We wish to remind you that
the consequences of academic misconduct are particularly severe!

HW1, ©UCB CS 189 / 289A, Fall 2024. All Rights Reserved. This may not be publicly shared without explicit permission.
