HW01 Sol - Math Recap
The machine learning lecture relies on your knowledge of undergraduate mathematics, especially linear
algebra and probability theory. You should think of this homework as a test to see if you meet the
prerequisites for taking this course.
Homework
Reading
We strongly recommend that you review the following documents to refresh your knowledge. You should
already be familiar with most of their content from your previous studies.
Linear Algebra
Notation. We use the following notation in this lecture:
Problem 1: Consider the expression
$$f(x, y, Z) = x^T A y + Bx - x^T C Z D - y^T E^T y + F.$$
What should be the dimensions (shapes) of the matrices A, B, C, D, E, F for the expression above to be a valid mathematical expression?
Problem 2: Let $x \in \mathbb{R}^N$, $M \in \mathbb{R}^{N \times N}$. Express the function $f(x) = \sum_{i=1}^{N} \sum_{j=1}^{N} x_i x_j M_{ij}$ using only matrix-vector multiplications. Show your work and briefly explain your steps.
$$f(x) = \sum_{i=1}^{N} \sum_{j=1}^{N} x_i x_j M_{ij} = \sum_{i=1}^{N} x_i \, (Mx)_i = x^T M x$$
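As a quick numerical sanity check of this identity, here is a minimal sketch in Python (using numpy; the random x and M are arbitrary test values) that compares the double sum with the matrix-vector form.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
x = rng.standard_normal(N)
M = rng.standard_normal((N, N))

# Double sum, exactly as in the problem statement.
f_sum = sum(x[i] * x[j] * M[i, j] for i in range(N) for j in range(N))

# Matrix-vector form x^T M x.
f_mat = x @ M @ x

print(np.isclose(f_sum, f_mat))  # True
```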
Consider the system of linear equations
$$Ax = b, \qquad A \in \mathbb{R}^{M \times N}, \; b \in \mathbb{R}^{M}. \tag{1}$$
a) Under what conditions does the system of linear equations have a unique solution x for any choice of b?
If $M < N$ or $\operatorname{rank}(A) < N$, the solution x would not always be unique. If $M > N$ or $\operatorname{rank}(A) < M$, a solution x would not exist for every $b \in \mathbb{R}^M$. Hence, we need $M = N$ and A of full rank, so that A is invertible.
b) Assume that M = N = 4 and that A has the following eigenvalues: {−1, 0, 4, 4}. Does Equation 1
have a unique solution x for any choice of b? Justify your answer.
No, because A has an eigenvalue 0 and therefore does not have full rank.
By definition, B is the inverse of A. Therefore, A is invertible, i.e. the determinant of A is not equal
to zero, i.e. none of the eigenvalues of A are equal to zero.
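As a numerical illustration of part b) above, the sketch below (Python with numpy; the 4×4 matrix is a hypothetical example constructed to have exactly the eigenvalues {−1, 0, 4, 4}) confirms that such a matrix is rank-deficient, so Ax = b cannot have a unique solution for every b.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical symmetric 4x4 matrix with eigenvalues {-1, 0, 4, 4}.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # random orthogonal eigenbasis
A = Q @ np.diag([-1.0, 0.0, 4.0, 4.0]) @ Q.T

print(np.linalg.matrix_rank(A))            # 3 < 4: A is singular
print(np.round(np.linalg.eigvalsh(A), 6))  # one eigenvalue is (numerically) zero

# Because of the zero eigenvalue, Ax = b has no solution for some b
# and infinitely many solutions whenever b lies in the column space of A.
```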
Problem 5: A symmetric matrix $A \in \mathbb{R}^{N \times N}$ is positive semi-definite (PSD) if and only if for any $x \in \mathbb{R}^N$ it holds that $x^T A x \ge 0$. Prove that a symmetric matrix A is PSD if and only if it has no negative eigenvalues.
(⇐) Assume that A has no negative eigenvalues. Since A is symmetric, it has an orthonormal basis of eigenvectors $v_1, \dots, v_N$ with eigenvalues $\lambda_1, \dots, \lambda_N$, so every $x \in \mathbb{R}^N$ can be written as $x = \sum_{i=1}^{N} w_i v_i$. Then
$$x^T A x = \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j v_i^T A v_j = \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j \lambda_j v_i^T v_j = \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j \lambda_j \delta_{ij} = \sum_{i=1}^{N} w_i^2 \lambda_i \ge 0,$$
since the eigenvalues $\lambda_i \ge 0$. Here $\delta_{ij}$ denotes the Kronecker delta, i.e. $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ if $i \ne j$.
(⇒) Conversely, assume that A is PSD and let v be an eigenvector of A with eigenvalue $\lambda$. Then
$$0 \le v^T A v = v^T \lambda v = \lambda \|v\|_2^2,$$
and since $\|v\|_2^2 > 0$, it follows that $\lambda \ge 0$.
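The equivalence can also be illustrated numerically. The following sketch (Python with numpy; the random symmetric matrix is an arbitrary test case) compares the smallest eigenvalue with the smallest sampled quadratic form.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4

# Arbitrary random symmetric matrix (not necessarily PSD).
S = rng.standard_normal((N, N))
A = (S + S.T) / 2

min_eig = np.linalg.eigvalsh(A).min()

# Sample many x and inspect the sign of the quadratic form x^T A x.
xs = rng.standard_normal((100_000, N))
min_quad = np.einsum('ni,ij,nj->n', xs, A, xs).min()

# If min_eig >= 0, every quadratic form is >= 0; if min_eig < 0,
# some sampled x yields a negative value (with high probability).
print(min_eig, min_quad)
```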
Problem 6: Let $A \in \mathbb{R}^{M \times N}$. Prove that the matrix $B = A^T A$ is positive semi-definite for any A.
B is symmetric, since $B^T = (A^T A)^T = A^T A = B$, and for any $x \in \mathbb{R}^N$ we have $x^T B x = x^T A^T A x = (Ax)^T (Ax) = \|Ax\|_2^2 \ge 0$. Hence B is PSD.
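A quick numerical check of this fact (a sketch in Python using numpy; the rectangular A is an arbitrary random test matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))    # arbitrary rectangular matrix, M = 6, N = 4

B = A.T @ A                        # symmetric N x N matrix
eigvals = np.linalg.eigvalsh(B)

print(eigvals)                     # all eigenvalues are >= 0 (up to round-off)
print(np.all(eigvals >= -1e-10))   # True
```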
Calculus
Problem 7: Consider the following function $f : \mathbb{R} \to \mathbb{R}$,
$$f(x) = \frac{1}{2} a x^2 + b x + c,$$
and the optimization problem
$$\min_{x \in \mathbb{R}} f(x).$$
a) Under what conditions does this optimization problem have (i) a unique solution, (ii) infinitely many solutions or (iii) no solution? Justify your answer.
We obtain a solution by setting the derivative to zero, i.e. $f'(x) = ax + b = 0$, and checking whether the second derivative is positive, i.e. $f''(x) = a > 0$. Hence, we obtain (i) a unique solution if $a > 0$; (ii) infinitely many solutions if $a = 0$ and $b = 0$, since f is then constant and every $x \in \mathbb{R}$ is a minimizer; and (iii) no solution if $a < 0$, or if $a = 0$ and $b \ne 0$, since f is then unbounded from below.
b) Assume that the optimization problem has a unique solution. Write down the closed-form expression for $x^*$ that minimizes the objective function, i.e. find $x^* = \arg\min_{x \in \mathbb{R}} f(x)$.
This is an important question that we will encounter in multiple disguises throughout the lecture. Since we know that $a > 0$, we can solve it by simply setting the derivative to zero:
$$f'(x) = ax + b = 0 \quad \Rightarrow \quad x^* = -\frac{b}{a}.$$
We now consider the multivariate generalization of the previous problem. Let $g : \mathbb{R}^N \to \mathbb{R}$,
$$g(x) = \frac{1}{2} x^T A x + b^T x + c,$$
and consider the optimization problem
$$\min_{x \in \mathbb{R}^N} g(x).$$
a) Compute the Hessian $\nabla^2 g(x)$ of the objective function. Under what conditions does this optimization problem have a unique solution?
The Hessian is defined as the matrix of second partial derivatives (see Section 4.2 of the Stanford Linear Algebra Review and Reference).
$$\frac{\partial^2 g(x)}{\partial x_l \, \partial x_k} = \frac{\partial^2}{\partial x_l \, \partial x_k} \left( \frac{1}{2} \sum_i \sum_j x_i A_{ij} x_j + \sum_i b_i x_i + c \right) = \frac{\partial}{\partial x_l} \left( \frac{1}{2} \sum_i A_{ik} x_i + \frac{1}{2} \sum_j A_{kj} x_j + b_k \right) = \frac{\partial}{\partial x_l} \left( \sum_j A_{kj} x_j + b_k \right) = A_{kl},$$
where the symmetry of A was used to combine the two sums. Hence
$$\nabla^2 g(x) = A.$$
If the Hessian $\nabla^2 g(x)$ is positive definite for all x, the function is strictly convex and therefore has at most one minimum; for our quadratic g, positive definiteness of A also guarantees that a minimum exists (see part c)). Hence, g(x) has a unique minimum if A is positive definite. Note that if we extend the definition of PSD to non-symmetric matrices, we would obtain
$$\nabla^2 g(x) = \frac{1}{2}(A + A^T),$$
which is the symmetric part of the matrix A and is PSD if A is PSD. The rest of the exercise would work exactly the same way.
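As a numerical check of ∇²g(x) = A, the sketch below (Python with numpy; A, b, c are arbitrary test values, with A symmetrized) approximates the Hessian of g by central finite differences and compares it to A. Since g is quadratic, the finite-difference approximation is exact up to round-off.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 4
S = rng.standard_normal((N, N))
A = (S + S.T) / 2                  # arbitrary symmetric test matrix
b = rng.standard_normal(N)
c = 0.7

g = lambda x: 0.5 * x @ A @ x + b @ x + c

def numerical_hessian(f, x, eps=1e-4):
    """Central finite-difference approximation of the Hessian of f at x."""
    n = len(x)
    H = np.zeros((n, n))
    for k in range(n):
        for l in range(n):
            e_k, e_l = eps * np.eye(n)[k], eps * np.eye(n)[l]
            H[k, l] = (f(x + e_k + e_l) - f(x + e_k - e_l)
                       - f(x - e_k + e_l) + f(x - e_k - e_l)) / (4 * eps**2)
    return H

x0 = rng.standard_normal(N)
print(np.allclose(numerical_hessian(g, x0), A, atol=1e-4))  # True
```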
b) Why is it necessary for a matrix A to be PSD for the optimization problem to be well-defined?
What happens if A has a negative eigenvalue?
Consider points of the form $x = a v$, where v is an eigenvector of A with eigenvalue $\lambda < 0$ and $a \in \mathbb{R}$. Then
$$g(av) = \frac{1}{2} \lambda a^2 \|v\|_2^2 + a\, b^T v + c.$$
As we take $a \to \infty$, the first term will dominate at some point $a > a_0$, since $a\, b^T v + c \in O(a)$, with $O(g(x)) = \{ f \mid \exists c > 0, \exists x_0 > 0, \forall x > x_0 : |f(x)| \le c\, |g(x)| \}$. Because $\lambda < 0$, the function will thus keep decreasing and therefore does not have a minimum.
In summary, if A (and therefore the Hessian) is not PSD, the problem does not have a solution, i.e. g does not have a minimum. Hence, the problem is not well-defined.
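This can be seen numerically as well. The sketch below (Python with numpy; a hypothetical 2×2 matrix with eigenvalues 2 and −1) evaluates g along the direction of the eigenvector with the negative eigenvalue, and the values decrease without bound.

```python
import numpy as np

# Hypothetical symmetric matrix with eigenvalues 2 and -1 (not PSD).
A = np.array([[2.0, 0.0],
              [0.0, -1.0]])
b = np.array([1.0, 1.0])
c = 0.0

g = lambda x: 0.5 * x @ A @ x + b @ x + c

lam, V = np.linalg.eigh(A)
v = V[:, np.argmin(lam)]          # eigenvector belonging to the negative eigenvalue

for a in [1, 10, 100, 1000]:
    print(a, g(a * v))            # g(a v) keeps decreasing towards -infinity
```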
c) Assume that the matrix A is positive definite (PD). Write down the closed-form expression for $x^*$ that minimizes the objective function, i.e. find $x^* = \arg\min_{x \in \mathbb{R}^N} g(x)$.
We solve this by setting the gradient to zero (as in the previous exercise). For this, we first calculate the gradient:
$$\frac{\partial g(x)}{\partial x_k} = \frac{\partial}{\partial x_k} \left( \frac{1}{2} \sum_i \sum_j x_i A_{ij} x_j + \sum_i b_i x_i + c \right) = \frac{1}{2} \sum_i A_{ik} x_i + \frac{1}{2} \sum_j A_{kj} x_j + b_k = \sum_j A_{kj} x_j + b_k,$$
i.e.
$$\nabla g(x) = Ax + b.$$
Setting the gradient to zero gives $Ax + b = 0$. Since A is PD, it is invertible, and therefore
$$x^* = -A^{-1} b.$$
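A quick check of the closed-form minimizer (a sketch in Python using numpy and scipy; A, b, c are arbitrary test values, with A constructed to be positive definite): the solution of Ax + b = 0 agrees with a generic numerical minimizer.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
N = 3
S = rng.standard_normal((N, N))
A = S @ S.T + N * np.eye(N)        # positive definite test matrix
b = rng.standard_normal(N)
c = 1.0

g = lambda x: 0.5 * x @ A @ x + b @ x + c

x_closed = -np.linalg.solve(A, b)       # x* = -A^{-1} b
x_numeric = minimize(g, np.zeros(N)).x  # generic numerical minimizer (BFGS)

print(np.allclose(x_closed, x_numeric, atol=1e-4))  # True
```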
Probability Theory
Notation. We use the following notation in our lecture:
• For conciseness and to avoid clutter, we use p(x) to denote multiple things
1. If X is a discrete random variable, p(x) denotes the probability mass function (PMF) of X at point x (usually denoted as $p_X(x)$ or $p(X = x)$ in the statistics literature).
2. If X is a continuous random variable, p(x) denotes the probability density function (PDF) of X at point x (usually denoted as $f_X(x)$ in the statistics literature).
3. If $A \subseteq \Omega$ is an event, p(A) denotes the probability of this event (usually denoted as $\Pr(A)$ or $P(A)$ in the statistics literature).
You will mostly encounter (1) and (2) throughout the lecture. Usually, the meaning is clear from the
context.
• Given the distribution p(x), we may be interested in computing the expected value $\mathbb{E}_{p(x)}[f(x)]$ or, equivalently, $\mathbb{E}_X[f(x)]$. Usually, it is clear with respect to which distribution we are computing the expectation, so we omit the subscript and simply write $\mathbb{E}[f(x)]$.
• $x \sim p$ means that x is distributed (sampled) according to the distribution p. For example, $x \sim \mathcal{N}(\mu, \sigma^2)$ (or equivalently $p(x) = \mathcal{N}(\mu, \sigma^2)$) means that x is distributed according to the normal distribution with mean $\mu$ and variance $\sigma^2$.
Problem 10: Exponential families include many of the most common distributions, such as the normal, exponential, Bernoulli, categorical, etc. You are given the general form of the PDF (PMF in the discrete case) $p_\theta(x)$ (also written as $p(x \mid \theta)$) of the distributions from the exponential family below:
$$p(x \mid \theta) = h(x)\, c(\theta) \exp\left[ \sum_{i=1}^{k} w_i(\theta)\, t_i(x) \right], \quad \theta \in \Theta,$$
where $\Theta$ is the parameter space, $h(x) \ge 0$ and the $t_i(x)$ only depend on x, and similarly, $c(\theta) \ge 0$ and the $w_i(\theta)$ only depend on the (possibly vector-valued) parameter $\theta$.
Your task is to express the Binomial distribution as an exponential family distribution. Also express the Beta distribution as an exponential family distribution. Show that the product of the Beta and the Binomial distribution is also a member of the exponential family.
The Binomial distribution with parameters n (number of trials) and $\theta$ (probability of success) has the PMF:
$$p_{\text{Binomial}}(X = x \mid \theta) = \binom{n}{x} \theta^x (1 - \theta)^{n - x}.$$
To bring it into the exponential family form, we first take its logarithm and then exponentiate again:
$$\log p_{\text{Binomial}}(X = x \mid \theta) = \log \binom{n}{x} + \log \theta^x + \log (1 - \theta)^{n - x} = \log \binom{n}{x} + x \log \theta + (n - x) \log(1 - \theta) = \log \binom{n}{x} + x \log \frac{\theta}{1 - \theta} + n \log(1 - \theta).$$
Exponentiating again gives
$$p_{\text{Binomial}}(X = x \mid \theta) = \binom{n}{x} (1 - \theta)^n \exp\left[ x \log \frac{\theta}{1 - \theta} \right].$$
This is now in the exponential family form, where:
• $h(x) = \binom{n}{x}$
• $c(\theta) = (1 - \theta)^n$
• $w_1(\theta) = \log \frac{\theta}{1 - \theta}$
• $t_1(x) = x$
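The decomposition above can be verified numerically. The sketch below (Python with scipy; n and θ are arbitrary test values) rebuilds the PMF from h, c, w₁, t₁ and compares it to scipy's Binomial PMF.

```python
import numpy as np
from scipy.stats import binom
from scipy.special import comb

n, theta = 10, 0.3                      # arbitrary test parameters

h = lambda x: comb(n, x)                # binomial coefficient
c = (1 - theta) ** n
w1 = np.log(theta / (1 - theta))
t1 = lambda x: x

x = np.arange(n + 1)
pmf_expfam = h(x) * c * np.exp(w1 * t1(x))

print(np.allclose(pmf_expfam, binom.pmf(x, n, theta)))  # True
```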
The Beta distribution with parameters $\alpha$ and $\beta$ has the PDF:
$$p_{\text{Beta}}(X = x \mid \alpha, \beta) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)} = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}.$$
Taking the logarithm and exponentiating again as before, we finally obtain:
$$p_{\text{Beta}}(X = x \mid \alpha, \beta) = \exp\left[ \log \frac{1}{B(\alpha, \beta)} + \alpha \log x + \beta \log(1 - x) - \log[x(1 - x)] \right] = \frac{1}{x(1 - x)} \frac{1}{B(\alpha, \beta)} \exp\left[ \alpha \log x + \beta \log(1 - x) \right].$$
This is in the exponential family form with:
• $h(x) = \frac{1}{x(1 - x)}$
• $c(\theta) = c(\alpha, \beta) = \frac{1}{B(\alpha, \beta)}$
• $w_1(\alpha, \beta) = \alpha$, $t_1(x) = \log x$
• $w_2(\alpha, \beta) = \beta$, $t_2(x) = \log(1 - x)$
It is important to note that the form is not strictly unique. Depending on how you expand the expression, you may get different versions of the same formula, all of which will also be in the exponential family form.
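The Beta decomposition can be checked in the same way (a sketch in Python with scipy; α and β are arbitrary test values):

```python
import numpy as np
from scipy.stats import beta as beta_dist
from scipy.special import beta as beta_fn

alpha, beta_param = 2.5, 4.0            # arbitrary test parameters

h = lambda x: 1.0 / (x * (1 - x))
c = 1.0 / beta_fn(alpha, beta_param)
w = np.array([alpha, beta_param])       # w_1 = alpha, w_2 = beta
t = lambda x: np.stack([np.log(x), np.log(1 - x)])

x = np.linspace(0.05, 0.95, 10)
pdf_expfam = h(x) * c * np.exp(w @ t(x))

print(np.allclose(pdf_expfam, beta_dist.pdf(x, alpha, beta_param)))  # True
```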
Finally, the product of the Binomial and Beta distributions,
$$p_{\text{Binomial}}(x \mid \theta)\, p_{\text{Beta}}(\theta \mid \alpha, \beta) = \binom{n}{x} \frac{1}{B(\alpha, \beta)}\, \theta^{x + \alpha - 1} (1 - \theta)^{n - x + \beta - 1} = \frac{1}{\theta(1 - \theta)} \binom{n}{x} \frac{1}{B(\alpha, \beta)} \exp\left[ (x + \alpha) \log \theta + (n - x + \beta) \log(1 - \theta) \right],$$
viewed as a distribution over $\theta$, is again of the exponential family form, with $h(\theta) = \frac{1}{\theta(1 - \theta)}$, $c(x, \alpha, \beta) = \binom{n}{x} \frac{1}{B(\alpha, \beta)}$, weights $x + \alpha$ and $n - x + \beta$, and sufficient statistics $\log \theta$ and $\log(1 - \theta)$.
Problem 11: Prove or disprove the following statement: $p(a \mid b, c) = p(a \mid c)$ if and only if $p(a \mid b) = p(a)$, i.e. two random variables are conditionally independent given a third one if and only if they are independent.
The statement is false; we can disprove it by showing that either one of the directions is not true in general.
We have a coin and do not know whether it is fair, i.e., it lands on tails (T) with probability 0.5, or unfair, i.e., it lands on tails with probability 1. The random variable C denotes whether the coin is fair ($C = F$) or unfair ($C = U$); both events have equal probability $p(C = F) = p(C = U) = 0.5$. We perform two coin tosses A and B. When we know which coin we have, the coin tosses A and B are of course independent, i.e., $p(a \mid b, c) = p(a \mid c)$. However, if we do not observe C, then
$$p(A = T) = p(A = T \mid C = F)\, p(C = F) + p(A = T \mid C = U)\, p(C = U) = \frac{1}{2} \cdot \frac{1}{2} + 1 \cdot \frac{1}{2} = \frac{3}{4},$$
and
$$p(A = T \mid B = T) = \frac{p(A = T, B = T)}{p(B = T)} = \frac{p(A = T, B = T \mid C = F)\, p(C = F) + p(A = T, B = T \mid C = U)\, p(C = U)}{p(B = T)}$$
$$= \frac{p(A = T \mid B = T, C = F)\, p(B = T \mid C = F)\, p(C = F)}{p(B = T)} + \frac{p(A = T \mid B = T, C = U)\, p(B = T \mid C = U)\, p(C = U)}{p(B = T)}$$
$$= \frac{1/2 \cdot 1/2 \cdot 1/2}{3/4} + \frac{1 \cdot 1 \cdot 1/2}{3/4} = \frac{5/8}{3/4} = \frac{5}{6}.$$
Since $p(A = T \mid B = T) = \frac{5}{6} \ne \frac{3}{4} = p(A = T)$, the tosses A and B are not independent, even though they are conditionally independent given C.
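The coin example can be reproduced exactly with Python's fractions module (a small sketch; the event encoding below is ours):

```python
from fractions import Fraction as F

# p(coin type) and p(tails | coin type) for the fair (F) and unfair (U) coin.
p_coin = {'F': F(1, 2), 'U': F(1, 2)}
p_tails = {'F': F(1, 2), 'U': F(1, 1)}

# Marginal p(A = T): sum over the unobserved coin type.
p_A_T = sum(p_tails[c] * p_coin[c] for c in p_coin)

# Joint p(A = T, B = T): the tosses are independent given the coin type.
p_AT_BT = sum(p_tails[c] * p_tails[c] * p_coin[c] for c in p_coin)

# p(B = T) equals p(A = T) by symmetry.
p_A_T_given_B_T = p_AT_BT / p_A_T

print(p_A_T)            # 3/4
print(p_A_T_given_B_T)  # 5/6
```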
Let the random variables A and B denote two independent dice rolls, and let $C = A + B$ denote the sum of these dice rolls. Clearly, A and B are independent and therefore $p(a \mid b) = p(a)$. However, when we observe their sum, the two become dependent. E.g., if we observe $C = 3$, then $p(A = 1 \mid C = 3) = 1/2$ and $p(A = 2 \mid C = 3) = 1/2$. However, if we additionally observe $B = 2$, then $p(A = 1 \mid B = 2, C = 3) = 1$.
This fact is quite interesting. It means that two independent random variables can become dependent
when observing a different variable, which is known as the explaining away effect (or selection bias,
Berkson’s paradox) in the literature. For example, if students are admitted to a university either by
having good school grades or by showing excellent athletic performance, then these two attributes will
be negatively correlated in the student population, even though they are independent in general.
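The dice example can likewise be checked by brute-force enumeration of all 36 equally likely outcomes (a small sketch in Python):

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # all 36 pairs (a, b)

given_c3 = [(a, b) for a, b in outcomes if a + b == 3]
p_A1_given_C3 = sum(1 for a, b in given_c3 if a == 1) / len(given_c3)

given_c3_b2 = [(a, b) for a, b in given_c3 if b == 2]
p_A1_given_C3_B2 = sum(1 for a, b in given_c3_b2 if a == 1) / len(given_c3_b2)

print(p_A1_given_C3)     # 0.5
print(p_A1_given_C3_B2)  # 1.0
```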
Problem 12: Consider the following bivariate distribution p(x, y) of two discrete random variables X and Y:

p(x, y)   y1      y2      y3
x1        0.02    0.01    0.05
x2        0.04    0.09    0.10
x3        0.14    0.06    0.05
x4        0.12    0.01    0.08
x5        0.20    0.02    0.01

Compute the marginal distributions p(x) and p(y), as well as the conditional distributions p(x | Y = y1) and p(y | X = x3).
The marginal distribution p(x) is given by:
$$p(X = x_1) = \sum_{i=1}^{3} p(X = x_1, Y = y_i) = 0.02 + 0.01 + 0.05 = 0.08$$
$$p(X = x_2) = \sum_{i=1}^{3} p(X = x_2, Y = y_i) = 0.04 + 0.09 + 0.1 = 0.23$$
$$p(X = x_3) = \sum_{i=1}^{3} p(X = x_3, Y = y_i) = 0.14 + 0.06 + 0.05 = 0.25$$
$$p(X = x_4) = \sum_{i=1}^{3} p(X = x_4, Y = y_i) = 0.12 + 0.01 + 0.08 = 0.21$$
$$p(X = x_5) = \sum_{i=1}^{3} p(X = x_5, Y = y_i) = 0.2 + 0.02 + 0.01 = 0.23$$
We can check that the sum $\sum_{i=1}^{5} p(X = x_i) = 0.08 + 0.23 + 0.25 + 0.21 + 0.23 = 1$, indeed.
The marginal distribution p(y) is given by:
$$p(Y = y_1) = \sum_{i=1}^{5} p(X = x_i, Y = y_1) = 0.02 + 0.04 + 0.14 + 0.12 + 0.2 = 0.52$$
$$p(Y = y_2) = \sum_{i=1}^{5} p(X = x_i, Y = y_2) = 0.01 + 0.09 + 0.06 + 0.01 + 0.02 = 0.19$$
$$p(Y = y_3) = \sum_{i=1}^{5} p(X = x_i, Y = y_3) = 0.05 + 0.1 + 0.05 + 0.08 + 0.01 = 0.29$$
We can check that the sum $\sum_{i=1}^{3} p(Y = y_i) = 0.52 + 0.19 + 0.29 = 1$, indeed.
The conditional distribution p(x | Y = y1) is given by:
$$p(X = x_1 \mid Y = y_1) = \frac{p(X = x_1, Y = y_1)}{p(Y = y_1)} = \frac{0.02}{0.52}$$
$$p(X = x_2 \mid Y = y_1) = \frac{p(X = x_2, Y = y_1)}{p(Y = y_1)} = \frac{0.04}{0.52}$$
$$p(X = x_3 \mid Y = y_1) = \frac{p(X = x_3, Y = y_1)}{p(Y = y_1)} = \frac{0.14}{0.52}$$
$$p(X = x_4 \mid Y = y_1) = \frac{p(X = x_4, Y = y_1)}{p(Y = y_1)} = \frac{0.12}{0.52}$$
$$p(X = x_5 \mid Y = y_1) = \frac{p(X = x_5, Y = y_1)}{p(Y = y_1)} = \frac{0.2}{0.52}$$
We can check that the sum $\sum_{i=1}^{5} p(X = x_i \mid Y = y_1) = 1$, indeed, so p(x | Y = y1) is a valid distribution.
The conditional distribution p(y | X = x3) is given by:
$$p(Y = y_1 \mid X = x_3) = \frac{p(Y = y_1, X = x_3)}{p(X = x_3)} = \frac{0.14}{0.25}$$
$$p(Y = y_2 \mid X = x_3) = \frac{p(Y = y_2, X = x_3)}{p(X = x_3)} = \frac{0.06}{0.25}$$
$$p(Y = y_3 \mid X = x_3) = \frac{p(Y = y_3, X = x_3)}{p(X = x_3)} = \frac{0.05}{0.25}$$
We can check that the sum $\sum_{i=1}^{3} p(Y = y_i \mid X = x_3) = 1$, indeed, so p(y | X = x3) is a valid distribution.
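All of these marginals and conditionals can be reproduced in a few lines of numpy from the joint table (a small sketch; the array layout, rows x1..x5 and columns y1..y3, matches the table above):

```python
import numpy as np

# Joint distribution p(x, y): rows x1..x5, columns y1..y3.
P = np.array([[0.02, 0.01, 0.05],
              [0.04, 0.09, 0.10],
              [0.14, 0.06, 0.05],
              [0.12, 0.01, 0.08],
              [0.20, 0.02, 0.01]])

p_x = P.sum(axis=1)              # marginal p(x): [0.08 0.23 0.25 0.21 0.23]
p_y = P.sum(axis=0)              # marginal p(y): [0.52 0.19 0.29]

p_x_given_y1 = P[:, 0] / p_y[0]  # conditional p(x | Y = y1)
p_y_given_x3 = P[2, :] / p_x[2]  # conditional p(y | X = x3)

print(p_x, p_y)
print(p_x_given_y1.sum(), p_y_given_x3.sum())  # both sum to 1
```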
Problem 13: You are given the joint PDF p(a, b, c) of three continuous random variables. Show how the following expressions can be obtained using the rules of probability:
1. p(a)
2. p(c | a, b)
3. p(b | c)
1. $p(a) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} p(a, b, c) \, db \, dc$
2. $p(c \mid a, b) = \dfrac{p(a, b, c)}{p(a, b)} = \dfrac{p(a, b, c)}{\int_{-\infty}^{\infty} p(a, b, c) \, dc}$
3. $p(b \mid c) = \dfrac{p(b, c)}{p(c)} = \dfrac{\int_{-\infty}^{\infty} p(a, b, c) \, da}{\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} p(a, b, c) \, da \, db}$
Problem 14: In this problem, there are two bowls. The first bowl holds three pineapples and three oranges, while the second bowl holds three pineapples and five oranges. Additionally, there is a biased coin, which lands on "tails" with probability 0.7 and on "heads" with probability 0.3. If the coin lands on "heads", a piece of fruit is randomly selected from the first bowl; if it lands on "tails", the fruit is chosen from the second bowl. Your friend flips the coin (which you can't see), selects a piece of fruit from the corresponding bowl, and hands you a pineapple. Determine the probability that the pineapple was selected from the second bowl.
Let $B_1$ and $B_2$ denote the events that the fruit is drawn from the first bowl (heads, $P(B_1) = 0.3$) and from the second bowl (tails, $P(B_2) = 0.7$), respectively. We know that
$$P(\text{pineapple} \mid B_1) = \frac{3}{6}, \quad P(\text{orange} \mid B_1) = \frac{3}{6}, \quad P(\text{pineapple} \mid B_2) = \frac{3}{8}, \quad P(\text{orange} \mid B_2) = \frac{5}{8},$$
and that
$$P(\text{tails} \mid \text{pineapple}) = P(B_2 \mid \text{pineapple}) = \frac{P(\text{pineapple} \mid B_2) \cdot P(B_2)}{P(\text{pineapple} \mid B_1) \cdot P(B_1) + P(\text{pineapple} \mid B_2) \cdot P(B_2)} = \frac{3/8 \cdot 0.7}{3/6 \cdot 0.3 + 3/8 \cdot 0.7} \approx 0.63636.$$
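The same computation in Python (a small sketch using the fractions module):

```python
from fractions import Fraction as F

p_B1, p_B2 = F(3, 10), F(7, 10)          # heads -> bowl 1, tails -> bowl 2
p_pine_B1, p_pine_B2 = F(3, 6), F(3, 8)  # pineapple probability per bowl

posterior_B2 = (p_pine_B2 * p_B2) / (p_pine_B1 * p_B1 + p_pine_B2 * p_B2)

print(posterior_B2)         # 7/11
print(float(posterior_B2))  # 0.6363...
```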
Problem 15: (Iterated Expectations) Consider two random variables X, Y with joint distribution p(x, y). Show that
$$\mathbb{E}_X[x] = \mathbb{E}_Y[\mathbb{E}_X[x \mid y]].$$
Here, $\mathbb{E}_X[x \mid y]$ denotes the expected value of x under the conditional distribution $p(x \mid y)$.
• $\mathbb{E}_X[x] = \int_{-\infty}^{\infty} x\, p(x) \, dx$
• $\mathbb{E}_Y[y] = \int_{-\infty}^{\infty} y\, p(y) \, dy$
• $\mathbb{E}_X[x \mid y] = \int_{-\infty}^{\infty} x\, p(x \mid y) \, dx$
Using these we get that
$$\mathbb{E}_Y[\mathbb{E}_X[x \mid y]] = \int \mathbb{E}_X[x \mid y]\, p(y) \, dy = \int \left( \int x\, p(x \mid y) \, dx \right) p(y) \, dy = \int \int x\, p(x \mid y)\, p(y) \, dx \, dy$$
$$= \int \int x\, p(x, y) \, dx \, dy = \int x \left( \int p(x, y) \, dy \right) dx = \int x\, p(x) \, dx = \mathbb{E}_X[x].$$
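A Monte Carlo illustration of this tower property (a sketch in Python with numpy, using an arbitrary hierarchical model: $Y \sim \mathcal{N}(0, 1)$ and $X \mid Y = y \sim \mathcal{N}(y, 1)$):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000

y = rng.standard_normal(n)       # Y ~ N(0, 1)
x = y + rng.standard_normal(n)   # X | Y = y  ~  N(y, 1)

lhs = x.mean()                   # Monte Carlo estimate of E_X[x]
rhs = y.mean()                   # E_X[x | y] = y here, so this estimates E_Y[E_X[x | y]]

print(lhs, rhs)                  # both are close to the true value 0
```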
Problem 16: Let $X \sim \mathcal{N}(\mu, \sigma^2)$, and $f(x) = ax + bx^2 + c$. What is $\mathbb{E}[f(X)]$?
By linearity of expectation, $\mathbb{E}[f(X)] = a\, \mathbb{E}[X] + b\, \mathbb{E}[X^2] + c = a\mu + b(\mu^2 + \sigma^2) + c$, where we used $\mathbb{E}[X^2] = \operatorname{Var}[X] + \mathbb{E}[X]^2 = \sigma^2 + \mu^2$.
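A quick Monte Carlo check of this formula (a sketch in Python with numpy; a, b, c, µ, σ are arbitrary test values):

```python
import numpy as np

rng = np.random.default_rng(6)
a, b, c = 2.0, -1.5, 0.3
mu, sigma = 1.2, 0.8

x = rng.normal(mu, sigma, size=2_000_000)
mc_estimate = (a * x + b * x**2 + c).mean()
closed_form = a * mu + b * (mu**2 + sigma**2) + c

print(mc_estimate, closed_form)  # the two values agree up to Monte Carlo error
```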
Problem 17: Let $p(x) = \mathcal{N}(x \mid \mu, \Sigma)$, and $g(x) = Ax$ (where $A \in \mathbb{R}^{N \times N}$). What are the values of the following expressions:
• $\mathbb{E}[g(x)]$,
• $\mathbb{E}[g(x) g(x)^T]$,
• $\mathbb{E}[g(x)^T g(x)]$,
• the covariance matrix $\operatorname{Cov}[g(x)]$.
We have:
• $\mathbb{E}[g(x)] = A\, \mathbb{E}[x] = A\mu$
• $\mathbb{E}[g(x) g(x)^T] = A\, \mathbb{E}[x x^T]\, A^T = A (\Sigma + \mu \mu^T) A^T$
• $\mathbb{E}[g(x)^T g(x)] = \operatorname{tr}\left( \mathbb{E}[g(x) g(x)^T] \right) = \operatorname{tr}(A \Sigma A^T) + \mu^T A^T A \mu$
• $\operatorname{Cov}[g(x)] = \mathbb{E}[g(x) g(x)^T] - \mathbb{E}[g(x)]\, \mathbb{E}[g(x)]^T = A \Sigma A^T$
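These identities can be verified by sampling (a sketch in Python with numpy; µ, Σ and A are arbitrary test values):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 3
mu = rng.standard_normal(N)
S = rng.standard_normal((N, N))
Sigma = S @ S.T + N * np.eye(N)   # positive definite covariance
A = rng.standard_normal((N, N))

x = rng.multivariate_normal(mu, Sigma, size=500_000)
g = x @ A.T                       # g(x) = A x for every sample

# E[g(x)] = A mu  and  Cov[g(x)] = A Sigma A^T (deviations shrink with more samples).
print(np.abs(g.mean(axis=0) - A @ mu).max())
print(np.abs(np.cov(g, rowvar=False) - A @ Sigma @ A.T).max())

# E[g(x)^T g(x)] = tr(A Sigma A^T) + mu^T A^T A mu
print((g * g).sum(axis=1).mean(),
      np.trace(A @ Sigma @ A.T) + mu @ A.T @ A @ mu)
```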