
CS 189 / 289A Introduction to Machine Learning

Fall 2024 Jennifer Listgarten, Saeed Saremi HW1


Due 09/11/24 11:59 PM PT

• Homework 1 consists of both written and coding questions.

• We prefer that you typeset your answers using LaTeX or other word processing software.
If you haven't yet learned LaTeX, one of the crown jewels of computer science, now is a
good time! Neatly handwritten and scanned solutions will also be accepted for the written
questions.
• In all of the questions, show your work, not just the final answer.
• Start early. This is a long assignment. Most of the material is prerequisite material not
covered in lecture; you are responsible for finding resources to understand it.

Deliverables:

1. Submit a PDF of your homework to the Gradescope assignment entitled “HW 1 Write-Up”.
Please start each question on a new page. If there are graphs, include those graphs in the
correct sections. Do not just stick the graphs in the appendix. We need each solution to be
self-contained on pages of its own.

• Replicate all of your code in an appendix. Begin code for each coding question on
a fresh page. Do not put code from multiple questions on the same page. When you
upload this PDF on Gradescope, make sure that you assign the relevant pages of your
code from the appendix to the correct questions.
2. Submit all the code needed to reproduce your results to the Gradescope assignment entitled
“Homework 1 Code”. Yes, you must submit your code twice: once in your PDF write-up, following
the directions described above so the readers can easily read it, and once in the format
described below for ease of reproducibility.
• You must set random seeds for all random utils to ensure reproducibility.
• Do NOT submit any data files that we provided.
• Please also include a short file named README listing your name, student ID, and
instructions on how to reproduce your results.
• Please take care that your code doesn’t take up inordinate amounts of time or memory.

1 Linear Algebra Review
1. First isomorphism theorem. The isomorphism theorems are an important class of results
with versions for various algebraic structures. Here, we are concerned with the first isomorphism
theorem for vector spaces, one of the most fundamental results in linear algebra.
Theorem. Let V, W be vector spaces, and let T : V → W be a linear map. Then the following
are true:

(a) ker T is a subspace of V.


(b) Im T is a subspace of W.
(c) Im T is isomorphic to V/ ker T .

Prove parts (a) and (b) of the theorem. (The interesting result is part (c), so, if you’re
inclined, try it out! We promise it’s a very rewarding proof :) If you are interested but
unfamiliar with the language, try looking up “isomorphism” and “quotient space.”)
2. First we review some basic concepts of rank. Recall that elementary matrix operations do not
change a matrix’s rank. Let A ∈ Rm×n and B ∈ Rn×p . Let In denote the n × n identity matrix.
   
(a) Perform elementary row and column operations1 to transform

            [ In   0  ]                [ B   In ]
            [ 0    AB ]      into      [ 0   A  ].
(b) Let’s find lower and upper bounds on rank(AB). Use part (a) to prove that rank A +
rank B − n ≤ rank(AB). Then use what you know about the relationship between the
column space (range) and/or row space of AB and the column/row spaces for A and B
to argue that rank(AB) ≤ min{rank A, rank B}.
(c) If a matrix A has rank r, then some r × r submatrix M of A has a nonzero determinant.
Use this fact to show the standard facts that the dimension of A’s column space is at
least r, and the dimension of A’s row space is at least r. (Note: You will not receive
credit for other ways of showing those same things.)
(d) It is a fact that rank(A⊤ A) = rank A; here’s one way to see that. We’ve already seen in
part (b) that rank(A⊤ A) ≤ rank A. Suppose that rank(A⊤ A) were strictly less than rank A.
What would that tell us about the relationship between the column space of A and the
null space of A⊤ ? What standard fact about the fundamental subspaces of A says that
relationship is impossible?
(e) Given a set of vectors S ⊆ Rn , let AS = {Av : v ∈ S } denote the subset of Rm found
by applying A to every vector in S . In terms of the ideas of the column space (range)
and row space of A: What is ARn , and why? (Hint: what are the definitions of column
space and row space?) What is A⊤ ARn , and why? (Your answer to the latter should be
purely in terms of the fundamental subspaces of A itself, not in terms of the fundamental
subspaces of A⊤ A.)
1 If you’re not familiar with these, https://stattrek.com/matrix-algebra/elementary-operations is a decent introduction.

3. Let A ∈ Rn×n be a symmetric matrix. Prove equivalence between these three different
definitions of positive semidefiniteness (PSD). Note that when we talk about PSD matrices in
this class, they are defined to be symmetric matrices. There are nonsymmetric matrices that
exhibit PSD properties, like the first definition below, but not all three.
(a) For all x ∈ Rn , x⊤ Ax ≥ 0.
(b) All the eigenvalues of A are nonnegative.
(c) There exists a matrix U ∈ Rn×n such that A = UU ⊤ .

Positive semidefiniteness will be denoted as A ⪰ 0.


4. The Frobenius inner product between two matrices of the same dimensions A, B ∈ Rm×n is

            ⟨A, B⟩ = trace(A⊤B) = Σ_{i=1}^{m} Σ_{j=1}^{n} A_ij B_ij ,

where trace M denotes the trace of M, which you should look up if you don’t already know
it. (The inner product is sometimes written ⟨A, B⟩F to be clear.) The Frobenius norm of a matrix is

            ∥A∥F = √⟨A, A⟩ = √( Σ_{i=1}^{m} Σ_{j=1}^{n} |A_ij|² ).

(A quick numerical check of these identities, and of the maximum singular value in problem 5, appears after problem 5.)

Prove the following. The Cauchy–Schwarz inequality, the cyclic property of the trace, and
the definitions in part 3 above may be helpful to you.
(a) x⊤ Ay = ⟨A, xy⊤ ⟩ for all x ∈ Rm , y ∈ Rn , A ∈ Rm×n .
(b) If A and B are symmetric PSD matrices, then trace (AB) ≥ 0.
(c) [OPTIONAL] If A, B ∈ Rn×n are real symmetric matrices with λmax (A) ≥ 0 and B being
PSD, then ⟨A, B⟩ ≤ nλmax (A)∥B∥F .
Hint: Construct a PSD matrix using λmax (A).
5. Let A ∈ Rm×n be an arbitrary matrix. The maximum singular value of A is defined to be
σmax (A) = √λmax (A⊤A) = √λmax (AA⊤). Prove that

            σmax (A) =         max          (u⊤Av).
                        u ∈ Rm , v ∈ Rn
                        ∥u∥=1, ∥v∥=1
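The following optional snippet is only a numerical sanity check of the definitions in problems 4 and 5 above, not a proof; the matrix sizes and random entries are arbitrary choices of ours, and it assumes NumPy is available.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 3))
    B = rng.standard_normal((4, 3))

    # Frobenius inner product: trace(A^T B) equals the entrywise sum of A_ij * B_ij.
    print(np.isclose(np.trace(A.T @ B), np.sum(A * B)))                       # True

    # Frobenius norm: sqrt(<A, A>) equals NumPy's built-in Frobenius norm.
    print(np.isclose(np.sqrt(np.trace(A.T @ A)), np.linalg.norm(A, 'fro')))   # True

    # Maximum singular value: sqrt(lambda_max(A^T A)) matches A's largest singular value.
    sigma_max = np.linalg.svd(A, compute_uv=False)[0]
    lam_max = np.max(np.linalg.eigvalsh(A.T @ A))
    print(np.isclose(sigma_max, np.sqrt(lam_max)))                            # True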

2 Matrix/Vector Calculus and Norms
 
1. Consider a 2 × 2 matrix A, written in full as

            [ A11  A12 ]
            [ A21  A22 ],

and two arbitrary 2-dimensional vectors x, y. Calculate the gradient of

            sin(A11²) + e^(A11 + A22) + x⊤Ay

with respect to the matrix A.
Hint: The gradient has the same dimensions as A. Use the chain rule.
2. Aside from norms on vectors, we can also impose norms on matrices. Besides the Frobenius
norm, the most common kind of norm on matrices is called the induced norm. Induced norms
are defined as

            ∥A∥p = sup_{x≠0} ∥Ax∥p / ∥x∥p ,

where the notation ∥ · ∥p on the right-hand side denotes the vector ℓp -norm. Please give the
closed-form (or the simplest) expressions for the following induced norms of A ∈ Rm×n . (A
small numerical illustration of this definition appears at the end of this section.)
(a) ∥A∥2 . Hint: use the singular value decomposition.
(b) ∥A∥∞
3. (a) Let α = Σ_{i=1}^{n} yi ln(1 + e^(βi)) for y, β ∈ Rn . What are the partial derivatives ∂α/∂βi ?
   (b) Given x ∈ Rn , A ∈ Rm×n . Write the partial derivative ∂(Ax)/∂x .
   (c) Given z ∈ Rm . Write the gradient ∇z (z⊤z).
   (d) Given x ∈ Rn , z ∈ Rm , and z = g(x). Write the gradient ∇x (z⊤z) in terms of ∂z/∂x and z.
   (e) Given x ∈ Rn , y, z ∈ Rm , A ∈ Rm×n , and z = Ax − y. Write the gradient ∇x (z⊤z).
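As a rough, optional illustration of the induced-norm definition in problem 2 (referenced there), the sketch below lower-bounds sup_{x≠0} ∥Ax∥p/∥x∥p by sampling random nonzero vectors and compares the estimate with NumPy's built-in matrix norms; the matrix and the sample count are arbitrary choices of ours.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 5))

    for p in (2, np.inf):
        # Crude lower bound on the supremum: evaluate the ratio at many random vectors x.
        xs = rng.standard_normal((10000, 5))
        ratios = np.linalg.norm(xs @ A.T, ord=p, axis=1) / np.linalg.norm(xs, ord=p, axis=1)
        print(p, ratios.max(), np.linalg.norm(A, ord=p))   # sampled estimate vs. built-in norm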

3 Linear Neural Networks
Let’s apply the multivariate chain rule to a “simple” type of neural network called a linear neural
network. They’re not very powerful, as they can learn only linear regression functions or decision
functions, but they’re a good stepping stone for understanding more complicated neural networks.
We are given an n × d design matrix X. Each row of X is a training point, so X represents n training
points with d features each. We are also given an n × k matrix Y. Each row of Y is a set of k labels
for the corresponding training point in X. Our goal is to learn a k × d matrix W of weights2 such
that
Y ≈ XW ⊤ .
If n is larger than d, typically there is no W that achieves equality, so we seek an approximate
answer. We do that by finding the matrix W that minimizes the cost function

RSS(W) = ∥XW ⊤ − Y∥2F . (1)

This is a classic least-squares linear regression problem; most of you have seen those before. But
we are solving k linear regression problems simultaneously, which is why Y and W are matrices
instead of vectors.

Linear neural networks. Instead of optimizing W over the space of k × d matrices directly, we
write the W we seek as a product of multiple matrices. This parameterization is called a linear
neural network.
W = µ(WL , WL−1 , . . . , W2 , W1 ) = WL WL−1 · · · W2 W1 .
Here, µ is called the matrix multiplication map (hence the Greek letter mu) and each W j is a real-
valued d j × d j−1 matrix. Recall that W is a k × d matrix, so dL = k and d0 = d. L is the number of
layers of “connections” in the neural network. You can also think of the network as having L + 1
layers of units: d0 = d units in the input layer, d1 units in the first hidden layer, dL−1 units in the
last hidden layer, and dL = k units in the output layer.
We collect all the neural network’s weights in a weight vector θ = (WL , WL−1 , . . . , W1 ) ∈ Rdθ , where
dθ = dL dL−1 + dL−1 dL−2 + . . . + d1 d0 is the total number of real-valued weights in the network. Thus
we can write µ(θ) to mean µ(WL , WL−1 , . . . , W1 ). But you should imagine θ as a column vector: we
take all the components of all the matrices WL , WL−1 , . . . , W1 and just write them all in one very
long column vector. Given a fixed weight vector θ, the linear neural network takes an input vector
x ∈ Rd0 and returns an output vector y = WL WL−1 · · · W2 W1 x = µ(θ)x ∈ RdL .
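As a concrete illustration (with made-up layer widths and random weights of our own choosing), the sketch below builds a three-layer linear network and checks that applying the layers one at a time agrees with multiplying the input by µ(θ) directly.

    import numpy as np

    rng = np.random.default_rng(0)
    d0, d1, d2, d3 = 4, 6, 5, 3                  # d0 = d (input), d3 = k (output), L = 3
    W1 = rng.standard_normal((d1, d0))
    W2 = rng.standard_normal((d2, d1))
    W3 = rng.standard_normal((d3, d2))

    def mu(Ws):
        """Matrix multiplication map: mu(W_L, ..., W_1) = W_L W_{L-1} ... W_1."""
        W = Ws[0]
        for Wj in Ws[1:]:
            W = W @ Wj
        return W

    theta = (W3, W2, W1)                         # weights listed from last layer to first
    x = rng.standard_normal(d0)

    y_layers = W3 @ (W2 @ (W1 @ x))              # apply the layers one at a time
    y_mu = mu(theta) @ x                         # multiply by mu(theta) = W3 W2 W1 directly
    print(np.allclose(y_layers, y_mu))           # True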
Now our goal is to find a weight vector θ that minimizes the composition RSS ◦ µ—that is, it
minimizes the cost function
J(θ) = RSS(µ(θ)).
We are substituting a linear neural network for W and optimizing the weights in θ instead of directly
optimizing the components of W. This makes the optimization problem harder to solve, and you
would never solve least-squares linear regression problems this way in practice; but again, it is a
good exercise to work toward understanding the behavior of “real” neural networks in which µ is
not a linear function.
2 The reason for the transpose on W ⊤ is that we think in terms of applying W to an individual training point. Indeed, if
Xi ∈ Rd and Yi ∈ Rk respectively denote the i-th rows of X and Y transposed to be column vectors, then we can write Yi ≈ WXi . For
historical reasons, most papers in the literature use design matrices whose rows are sample points, rather than columns.
We would like to use a gradient descent algorithm to find θ, so we will derive ∇θ J as follows.

1. The gradient G = ∇W RSS(W) is a k × d matrix whose entries are Gi j = ∂RSS(W)/∂Wi j ,
where RSS(W) is defined by Equation (1). The simple formula for ∇W RSS(W) in matrix
notation is

            ∇W RSS(W) = 2(WX⊤ − Y⊤)X.

Prove this fact by deriving a formula for each Gi j using summations, simplified as much
as possible. Hint: To break down RSS(W) into its component summations, start with the
relationship between the Frobenius norm and the trace of a matrix. (A numerical sanity
check of this formula is sketched at the end of this section.)
2. Directional derivatives are closely related to gradients. The notation RSS′∆W (W) denotes
the directional derivative of RSS(W) in the direction ∆W, and the notation µ′∆θ (θ) denotes
the directional derivative of µ(θ) in the direction ∆θ.3 Informally speaking, the directional
derivative RSS′∆W (W) tells us how much RSS(W) changes if we increase W by an infinites-
imal displacement ∆W ∈ Rk×d . (However, any ∆W we can actually specify is not actually
infinitesimal; RSS′∆W (W) is a local linearization of the relationship between W and RSS(W)
at W. To a physicist, RSS′∆W (W) tells us the initial velocity of change of RSS(W) if we start
changing W with velocity ∆W.)
Show how to write RSS′∆W (W) as a Frobenius inner product of two matrices, one related to
part 3.1.
3. In principle, we could take the gradient ∇θ µ(θ), but we would need a 3D array to express
it! As we don’t know a nice way to write it, we’ll jump directly to writing the directional
derivative µ′∆θ (θ). Here, ∆θ ∈ Rdθ is a weight vector whose matrices we will write ∆θ =
(∆WL , ∆WL−1 , . . . , ∆W1 ). Show that

            µ′∆θ (θ) = Σ_{j=1}^{L} W> j ∆W j W< j

where W> j = WL WL−1 · · · W j+1 , W< j = W j−1 W j−2 · · · W1 , and we use the convention that W>L
is the dL × dL identity matrix and W<1 is the d0 × d0 identity matrix.
Hint: although µ is not a linear function of θ, µ is linear in any single W j ; and any directional
derivative of the form µ′∆θ (θ) is linear in ∆θ (for a fixed θ).
4. Recall the chain rule for scalar functions, (d/dx) f (g(x))|x=x0 = (d/dy) f (y)|y=g(x0) · (d/dx) g(x)|x=x0 . There
is a multivariate version of the chain rule, which we hope you remember from some class
you’ve taken, and the multivariate chain rule can be used to chain directional derivatives.
3 “∆W” and “∆θ” are just variable names that remind us to think of these as small displacements of W or θ; the Greek letter delta
is not an operator nor a separate variable.


Write out the chain rule that expresses the directional derivative J′∆θ (θ)|θ=θ0 by composing
your directional derivatives for RSS and µ, evaluated at a weight vector θ0 . (Just write the
pure form of the chain rule without substituting the values of those directional derivatives;
we’ll substitute the values in the next part.)

5. Now substitute the values you derived in parts 3.2 and 3.3 into your expression for J′∆θ (θ) and
use it to show that

            ∇θ J(θ) = ( 2 (µ(θ)X⊤ − Y⊤) X W<L⊤ ,
                        . . . ,
                        2 W> j⊤ (µ(θ)X⊤ − Y⊤) X W< j⊤ ,
                        . . . ,
                        2 W>1⊤ (µ(θ)X⊤ − Y⊤) X ).

This gradient is a vector in Rdθ written in the same format as (WL , . . . , W j , . . . , W1 ). Note that
the values W> j and W< j here depend on θ.
Hint: you might find the cyclic property of the trace handy.
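The following optional check (referenced in part 3.1) compares the stated matrix formula ∇W RSS(W) = 2(WX⊤ − Y⊤)X against central finite differences on randomly generated X, Y, and W; it is a numerical sanity check under arbitrary sizes, not a substitute for the derivation.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 8, 3, 2
    X = rng.standard_normal((n, d))
    Y = rng.standard_normal((n, k))
    W = rng.standard_normal((k, d))

    def rss(W):
        return np.linalg.norm(X @ W.T - Y, 'fro') ** 2

    G = 2 * (W @ X.T - Y.T) @ X                  # matrix form of the gradient from part 3.1

    # Central finite differences, one entry of W at a time.
    eps = 1e-6
    G_fd = np.zeros_like(W)
    for i in range(k):
        for j in range(d):
            E = np.zeros_like(W)
            E[i, j] = eps
            G_fd[i, j] = (rss(W + E) - rss(W - E)) / (2 * eps)

    print(np.allclose(G, G_fd, atol=1e-4))       # True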

4 Probability Potpourri
1. Recall the covariance of two scalar random variables X and Y is defined as Cov(X, Y) =
E[(X − E[X])(Y − E[Y])]. For a multivariate random variable Z ∈ Rn , (i.e., Z is a column
vector where each element Zi is a scalar random variable), we define the covariance matrix Σ
such that Σi j = Cov(Zi , Z j ). Concisely, Σ = E[(Z − µ)(Z − µ)⊤ ], where µ is the mean value of
Z. Prove that the covariance matrix is always positive semidefinite (PSD).
Hint: Use linearity of expectation.
2. Suppose a pharmaceutical company is developing a diagnostic test for a rare disease that has
a prevalence of 1 in 1,000 in the population. Let x be the true positive rate of the test, and let
y be the false positive rate. Determine the minimum value that x must have, expressed as a
function of y, such that a patient who tests positive actually has the disease with probability
greater than 0.5.
3. An archery target is made of 3 concentric circles of radii 1/√3, 1 and √3 feet. Arrows striking
within the inner circle are awarded 4 points, arrows within the middle ring are awarded
3 points, and arrows within the outer ring are awarded 2 points. Shots outside the target are
awarded 0 points.
Consider a random variable X, the distance of the strike from the center (in feet), and let the
probability density function of X be

            f (x) = 2 / (π(1 + x²))    for x > 0,
            f (x) = 0                  otherwise.

What is the expected value of the score of a single strike?


4. Let X ∼ Pois(λ), Y ∼ Pois(µ). Given that X ⊥⊥ Y, derive an expression for P(X = k | X + Y = n)
where k = 0, . . . , n. What well-known probability distribution is this? What are its parameters?
(A Monte Carlo sketch of this conditional distribution appears at the end of this section.)
5. Consider a coin that may be biased, where the probability of the coin landing heads on any
single flip is θ. If the coin is flipped n times and heads is observed k times, what is the
maximum likelihood estimate (MLE) of θ?
6. Consider a family of distributions parameterized by θ ∈ R with the following probability
density function:

            fθ (x) = e^(θ−x)    when x ≥ θ,
            fθ (x) = 0          when x < θ.

(a) Prove that f is a valid probability density function by showing that it integrates to 1 for
all θ.
(b) Suppose that you observe n samples distributed according to f : x1 , x2 , . . . , xn . Find the
maximum likelihood estimate of θ.

5 The Multivariate Normal Distribution
The multivariate normal distribution with mean µ ∈ Rd and positive definite covariance matrix
Σ ∈ Rd×d , denoted N(µ, Σ), has the probability density function
!
1 1
f (x; µ, Σ) = p exp − (x − µ) Σ (x − µ) .
⊤ −1

(2π)d |Σ| 2

Here |Σ| denotes the determinant of Σ. You may use the following facts without proof.

• The volume under the normal PDF is 1:

            ∫_Rd f (x) dx = ∫_Rd (1 / √((2π)^d |Σ|)) exp( −(1/2) (x − µ)⊤ Σ⁻¹ (x − µ) ) dx = 1.

• The change-of-variables formula for integrals: let f be a smooth function from Rd → R, let
  A ∈ Rd×d be an invertible matrix, and let b ∈ Rd be a vector. Then, performing the change of
  variables x ↦ z = Ax + b,

            ∫_Rd f (x) dx = ∫_Rd f (A⁻¹z − A⁻¹b) |A⁻¹| dz.
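Below is a small optional check of the density formula above; it assumes SciPy is available for comparison, and the mean, covariance, and test point are arbitrary.

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)
    d = 3
    mu = rng.standard_normal(d)
    A = rng.standard_normal((d, d))
    Sigma = A @ A.T + np.eye(d)                  # an arbitrary positive definite covariance
    x = rng.standard_normal(d)

    # Density computed directly from the formula in the text.
    diff = x - mu
    f = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) \
        / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

    print(np.isclose(f, multivariate_normal(mean=mu, cov=Sigma).pdf(x)))   # True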

All throughout this question, we take X ∼ N(µ, Σ).

1. Use a suitable change of variables to show that E[X] = µ. You must utilize the definition of
expectation.
2. Use a suitable change of variables to show that Var(X) = Σ, where the variance of a vector-
valued random variable X is

Var(X) = Cov(X, X) = E[(X − µ) (X − µ)⊤ ] = E[XX ⊤ ] − µ µ⊤ .

Hints: Every symmetric, positive definite matrix Σ has a symmetric, positive definite square
root Σ1/2 such that Σ = Σ1/2 Σ1/2 . Note that Σ and Σ1/2 are invertible. After the change of
variables, you will have to find another variance Var(Z); if you’ve chosen the right change of
variables, you can solve that by solving the integral for each diagonal component of Var(Z)
and a second integral for each off-diagonal component. The diagonal components will require
integration by parts. You cannot assume anything about Var(Z); you must compute it
via integration. (A numerical illustration of the matrix square root appears at the end of
this section.)
3. Compute the moment generating function (MGF) of X: MX (λ) = E[e^(λ⊤X) ], where λ ∈ Rd .

Note: moment generating functions have several interesting and useful properties, one being
that MX characterizes the distribution of X: if MX = MY , then X and Y have the same
distribution.
Hints:

• You should try “completing the square” in the exponent of the Gaussian PDF.

• You should arrive at

            MX (λ) = exp( λ⊤µ + (1/2) λ⊤Σλ ).

4. Using the fact that MGFs determine distributions, given A ∈ Rk×d and b ∈ Rk identify the
distribution of AX + b (don’t worry about covariance matrices being invertible).

5. Show that there exists an affine transformation of X that is distributed as the standard multi-
variate Gaussian, N(0, Id ). (Assume Σ is invertible.)
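As an optional numerical illustration of the square-root fact quoted in the hint to problem 2 (and nothing more), the sketch below builds Σ1/2 from an eigendecomposition of an arbitrary positive definite Σ and checks that it is symmetric, positive definite, and squares back to Σ.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4))
    Sigma = A @ A.T + np.eye(4)                          # arbitrary symmetric positive definite

    # Symmetric PSD square root via the eigendecomposition Sigma = Q diag(w) Q^T.
    w, Q = np.linalg.eigh(Sigma)
    Sigma_half = Q @ np.diag(np.sqrt(w)) @ Q.T

    print(np.allclose(Sigma_half, Sigma_half.T))         # symmetric
    print(np.all(np.linalg.eigvalsh(Sigma_half) > 0))    # positive definite
    print(np.allclose(Sigma_half @ Sigma_half, Sigma))   # squares back to Sigma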

6 Real Analysis
1. Limit of a Sequence. A sequence {xn } is said to converge to a limit L if, for every measure
of closeness ϵ > 0, the terms xn beyond some point n0 ∈ N lie within ϵ of that limit.
More formally, if limn→∞ xn = L then ∀ϵ > 0, ∃n0 ∈ Z+ such that ∀n ≥ n0 :

            |xn − L| < ϵ

(a) Consider the sequence {xn } defined by the recurrence relation xn+1 = (1/2) xn . Treat x0 as
some constant that is the first element of the sequence. Prove that {xn } converges by
evaluating limn→∞ xn . You must use the formal definition of the limit of a sequence. (A
small numerical illustration of this definition appears at the end of this section.)
(b) [OPTIONAL] Consider a sequence {xn } of non-zero real numbers and suppose that

            L = limn→∞ n (1 − |xn+1| / |xn|)

exists. Prove that {|xn |} converges when L > 1 by evaluating limn→∞ |xn |.
Hint: Use the Bernoulli Inequality (1 − q/n) < exp(−q/n) and the approximation for the
Harmonic Series Σ_{k=1}^{n} 1/k ≈ ln(n) for sufficiently large n.

2. Taylor Series. Taylor series expansions are a method of approximating a function near a
point using polynomial terms. The Taylor expansion for a function f (x) at point a is given
by:

            f (x) = f (a) + f ′(a)(x − a) + ( f ′′(a)/2! )(x − a)² + ( f ′′′(a)/3! )(x − a)³ + · · ·

This can also be rewritten as f (x) = Σ_{n=0}^{∞} ( f (n)(a)/n! ) (x − a)n .
(a) Calculate the first three terms of the Taylor series for f (x) = ln(1 + x) centered at a = 0.
(b) [OPTIONAL] The gamma function is defined as

            Γ(z) = ∫₀^∞ t^(z−1) e^(−t) dt.

Calculate and use a first-order Taylor expansion of the gamma function centered at 1 to
approximate Γ(1.1). You should express your answer in terms of the Euler-Mascheroni
constant γ.
You may use the fact that Γ(x + 1) interpolates the factorial function without proof.
3. Consider a twice continuously differentiable function f : Rn → R. Suppose this function
admits a unique global optimum x∗ ∈ Rn . Suppose that for some spherical region X =
{x | ∥x − x∗ ∥2 ≤ D} around x∗ for some constant D, the Hessian matrix H of the function f (x)
is PSD and its maximum eigenvalue is 1. Prove that

            f (x) − f (x∗ ) ≤ D/2

for every x ∈ X. Hint: Look up Taylor’s Theorem with Remainder. Use the Mean Value Theorem
on the second-order term instead of the first-order term, which is what is usually done.
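The tiny sketch referenced in problem 1(a) is below; it is only a numerical illustration of the ϵ–n0 definition for the halving sequence, with an arbitrary starting value and tolerance, and it is not a proof.

    x0, L, eps = 5.0, 0.0, 1e-6      # conjectured limit L and an arbitrary tolerance
    x, n = x0, 0
    while abs(x - L) >= eps:         # search for an n0 with |x_n - L| < eps for n >= n0
        x, n = 0.5 * x, n + 1
    print(n, x)                      # prints 23 and a value below the tolerance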

7 Hands-on with data
In the following problem, you will use two simple datasets to walk through the steps of a standard
machine learning workflow: inspecting your data, choosing a model, implementing it, and verifying
its accuracy. We have provided two datasets in the form of numpy arrays: dataset_1.npy
and dataset_2.npy. You can load each using NumPy’s np.load method4 . You can plot figures
using Matplotlib’s plt.plot method5 .
Each dataset is a two-column array with the first column consisting of n scalar inputs X ∈ Rn×1 and
the second column consisting of n scalar labels Y ∈ Rn×1 . We denote each entry of X and Y with
subscripts:

            [ x1 ]          [ y1 ]
            [ x2 ]          [ y2 ]
       X =  [ .. ]     Y =  [ .. ]
            [ xn ]          [ yn ]

and assume that yi is a (potentially stochastic) function of xi .

(a) It is often useful to visually inspect your data and calculate simple statistics; this can detect
dataset corruptions or inform your method. For both datasets:

(i) Plot the data as a scatter plot.


(ii) Calculate the correlation coefficient between X and Y:

            ρX,Y = Cov(X, Y) / (σX σY ),

     in which Cov(X, Y) is the covariance between X and Y and σX is the standard deviation
     of X.
Your solution may make use of the NumPy library only for arithmetic operations, matrix-
vector or matrix-matrix multiplications, matrix inversion, and elementwise exponentiation. It
may not make use of library calls for calculating means, standard deviations, or the correlation
coefficient itself directly.
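A minimal sketch of part (a) is given below. It assumes the provided files are named dataset_1.npy and dataset_2.npy and that the first column holds X and the second holds Y, as described above; sums are written as dot products with a vector of ones so that only the allowed NumPy operations are used.

    import numpy as np
    import matplotlib.pyplot as plt

    data = np.load("dataset_1.npy")        # repeat the same steps for dataset_2.npy
    x, y = data[:, 0], data[:, 1]
    n = x.shape[0]

    plt.scatter(x, y, s=5)                 # (i) scatter plot of the raw data
    plt.show()

    ones = np.ones(n)
    mean_x = (ones @ x) / n                # means, written as dot products with ones
    mean_y = (ones @ y) / n
    dx, dy = x - mean_x, y - mean_y
    cov_xy = (dx @ dy) / n                 # covariance
    std_x = ((dx @ dx) / n) ** 0.5         # standard deviations
    std_y = ((dy @ dy) / n) ** 0.5
    print(cov_xy / (std_x * std_y))        # (ii) correlation coefficient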
(b) We would like to design a function that can predict yi given xi and then apply it to new inputs.
This is a recurring theme in machine learning, and you will soon learn about a general-purpose
framework for thinking about such problems. As a preview, we will now explore one of the
simplest instantiations of this idea using the class of linear functions:

Ŷ = Xw. (2)

The parameters of our function are denoted by w ∈ R. It is common to denote predicted
variants of quantities with a hat, so Ŷ is a predicted label whereas Y is a ground truth label.
4 https://numpy.org/doc
5 https://matplotlib.org/stable/users/explain/quick_start.html

We would like to find a w∗ that minimizes the squared error JSE between predictions and
labels:

            w∗ = argmin_w JSE (w) = argmin_w ∥Xw − Y∥22 .

Derive ∇w JSE (w) and set it equal to 0 to solve for w∗ . (Note that this procedure for finding an
optimum relies on the convexity of JSE . You do not need to show convexity here, but it is a
useful exercise to convince yourself this is valid.)
(c) Your solution w∗ should be a function of X and Y. Implement it and report its mean squared
error (MSE) for dataset 1. The mean squared error is the objective JSE from part (b) divided
by the number of datapoints:

            JMSE (w) = (1/n) ∥Xw − Y∥22 .
Also visually inspect the model’s quality by plotting a line plot of predicted ŷ for uniformly-
spaced x ∈ [0, 10]. Keep the scatter plot from part (a) in the background so that you can
compare the raw data to your linear function. Does the function provide a good fit of the data?
Why or why not?
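One possible way to wire up part (c) is sketched below. It assumes the filename dataset_1.npy and that your part (b) derivation produced the familiar normal-equations form for w∗ (verify this against your own work before using it).

    import numpy as np
    import matplotlib.pyplot as plt

    data = np.load("dataset_1.npy")
    X = data[:, [0]]                            # n x 1 design matrix
    Y = data[:, [1]]                            # n x 1 labels
    n = X.shape[0]

    # Closed-form least-squares fit, assuming the normal-equations form of w*.
    w = np.linalg.inv(X.T @ X) @ X.T @ Y

    mse = np.sum((X @ w - Y) ** 2) / n
    print("MSE:", mse)

    # Overlay the fitted line on the scatter plot from part (a).
    xs = np.linspace(0, 10, 200).reshape(-1, 1)
    plt.scatter(X, Y, s=5)
    plt.plot(xs, xs @ w, color="red")
    plt.show()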
(d) We are now going to experiment with constructing new features for our model. That is, instead
of considering models that are linear in the inputs, we will now consider models that are linear
in some (potentially nonlinear) transformation of the data:

            Ŷ = Φw =  [ ϕ(x1 )⊤ ]
                      [ ϕ(x2 )⊤ ]
                      [   ..    ]  w,
                      [ ϕ(xn )⊤ ]
where ϕ(xi ), w ∈ Rm . Repeat part (c), providing both the mean squared error of your predictor
and a plot of its predictions, for the following features on dataset 1:
 
            ϕ(xi ) = [ xi ]
                     [ 1  ].

How do the plotted function and mean squared error compare? (A single sentence will suffice.)
Hint: the general form of your solution for w∗ is still valid, but you will now need to use
features Φ where you once used raw inputs X.
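A small helper for building the feature matrix Φ is sketched below; the function name make_features is our own and not part of the assignment, and the same closed-form solve from part (c) applies with Φ in place of X.

    import numpy as np

    def make_features(x, degree):
        """Return the n x (degree + 1) matrix whose i-th row is [x_i^degree, ..., x_i, 1]."""
        x = x.reshape(-1)
        return np.column_stack([x ** p for p in range(degree, -1, -1)])

    # Part (d): affine features phi(x_i) = [x_i, 1]^T   ->  Phi = make_features(x, 1)
    # Part (e): quadratic features                      ->  Phi = make_features(x, 2)
    # Then reuse the least-squares formula with Phi in place of X, e.g.
    #     w = np.linalg.inv(Phi.T @ Phi) @ Phi.T @ Y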
(e) Now consider the quadratic features:
 
            ϕ(xi ) = [ xi² ]
                     [ xi  ]
                     [ 1   ].

Repeat part (c) with these features on dataset 1, once again providing short commentary on
any changes.
(f) Repeat parts (c)-(e) with dataset 2.

(g) Finally, we would like to understand which features Φ provide us with the best model. To
that end, you will implement a method known as k-fold cross validation. The following are
instructions for this method; deliverables for part (g) are at the end.
(i) Split dataset 2 randomly into k = 4 equal-sized subsets. Group the dataset into 4 distinct
training / validation splits by denoting each subset as the validation set and the remaining
subsets as the training set for that split. (A minimal fold-construction sketch appears at the
end of this part.)
(ii) On each of the 4 training / validation splits, fit linear models using the following 5 poly-
nomial feature sets:
 5
   xi 
 3  x4   4 
 
2
 xi   i3   xi 
   xi   2   xi   3 
 x  x  x 
ϕ1 (xi ) =  i  ϕ2 (xi ) =  xi  ϕ3 (xi ) =  i  ϕ4 (xi ) =  xi2  ϕ5 (xi ) =  i2 
   
1    xi   xi 
1  xi 
 
1
 xi 
 
1
1

This step will produce 20 distinct w∗ vectors: one for each dataset split and featurization
ϕ j.
(iii) For each feature set ϕ j , average the training and validation mean squared errors over all
training splits.
It is worth thinking about what this extra effort has bought us: by splitting the dataset into
subsets, we were able to use all available datapoints for model fitting while still having
held-out datapoints for evaluating any particular model.
Deliverables for part (g): Plot the training mean squared error and the validation mean
squared error on the same plot as a function of the largest exponent in the feature set. Use
a log scale for the y-axis. Which model does the training mean squared error suggest is best?
Which model does the validation mean squared error suggest is best?
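A minimal fold-construction sketch for step (i), referenced there, is below. It fixes a random seed per the submission guidelines; the seed, the filename dataset_2.npy, and the use of np.array_split (which tolerates n not being divisible by 4) are our own choices.

    import numpy as np

    data = np.load("dataset_2.npy")
    n, k = data.shape[0], 4

    rng = np.random.default_rng(189)                   # fixed seed for reproducibility
    folds = np.array_split(rng.permutation(n), k)      # k roughly equal-sized index subsets

    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        train, val = data[train_idx], data[val_idx]
        # Fit each feature set phi_j on `train`, record MSE on both `train` and `val`,
        # then average the errors over the 4 splits for each phi_j.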

8 Honor Code
1. List all collaborators. If you worked alone, then you must explicitly state so.
2. Declare and sign the following statement:
“I certify that all solutions in this document are entirely my own and that I have not looked
at anyone else’s solution. I have given credit to all external sources I consulted.”
Signature :
While discussions are encouraged, everything in your solution must be your (and only your)
creation. Furthermore, all external material (i.e., anything outside lectures and assigned read-
ings, including figures and pictures) should be cited properly. We wish to remind you that
the consequences of academic misconduct are particularly severe!

HW1, ©UCB CS 189 / 289A, Fall 2024. All Rights Reserved. This may not be publicly shared without explicit permission.
