Homework1_solution
Linear algebra
Q1. To find the eigenvectors and eigenvalues, we first find the characteristic polynomial:
\[
p_A(\lambda) = \det\begin{pmatrix} 1-\lambda & 2 & 3 \\ 3 & 1-\lambda & 2 \\ 2 & 3 & 1-\lambda \end{pmatrix} = -\lambda^3 + 3\lambda^2 + 15\lambda + 18.
\]
The eigenvalues of $A$ are exactly the roots of $p_A$, which we compute to be $6$ and $-\frac{3}{2} \pm \frac{\sqrt{3}}{2}i$.
To find the eigenvector $v$ for the eigenvalue $\lambda$, we need to find a unit vector solution to the linear system of equations $Av = \lambda v$. For $\lambda_1 = 6$, we find that this is
\[
v_1 = \frac{1}{\sqrt{3}}(1, 1, 1)^T,
\]
for $\lambda_2 = -\frac{3}{2} + \frac{\sqrt{3}}{2}i$ we find
\[
v_2 = \frac{1}{2\sqrt{3}}(-1 - \sqrt{3}i, -1 + \sqrt{3}i, 2)^T,
\]
and for $\lambda_3 = -\frac{3}{2} - \frac{\sqrt{3}}{2}i$ we find
\[
v_3 = \frac{1}{2\sqrt{3}}(-1 + \sqrt{3}i, -1 - \sqrt{3}i, 2)^T.
\]
Next we recall that the determinant and the trace are just the product and the sum of the
eigenvalues, respectively, hence
\[
\det(A) = 6\left(-\frac{3}{2} + \frac{\sqrt{3}}{2}i\right)\left(-\frac{3}{2} - \frac{\sqrt{3}}{2}i\right) = 18,
\]
\[
\operatorname{tr}(A) = 6 + \left(-\frac{3}{2} + \frac{\sqrt{3}}{2}i\right) + \left(-\frac{3}{2} - \frac{\sqrt{3}}{2}i\right) = 3.
\]
(As a sanity check, recall that tr(A) is also the sum of the diagonal elements of A.)
The inverse of $A$ is given by $\operatorname{adj}(A)^T/\det(A)$, where $\operatorname{adj}(A)$ here denotes the cofactor matrix: the matrix whose $(i, j)$ entry is $(-1)^{i+j}$ times the determinant of the $2 \times 2$ submatrix which arises by deleting the $i$th row and $j$th column from $A$. In particular, we have
\[
\operatorname{adj}(A) = \begin{pmatrix} -5 & 1 & 7 \\ 7 & -5 & 1 \\ 1 & 7 & -5 \end{pmatrix},
\]
hence
\[
A^{-1} = \frac{1}{18}\begin{pmatrix} -5 & 7 & 1 \\ 1 & -5 & 7 \\ 7 & 1 & -5 \end{pmatrix}.
\]
Finally, we compute the desired matrix norms. Since $A$ is circulant, it is normal, so its Frobenius norm is the square root of the sum of the squared moduli of its eigenvalues, hence
\[
\|A\|_F = \sqrt{6^2 + \left|-\frac{3}{2} + \frac{\sqrt{3}}{2}i\right|^2 + \left|-\frac{3}{2} - \frac{\sqrt{3}}{2}i\right|^2} = \sqrt{42}.
\]
Similarly, since $A$ is normal, the operator norm is the largest modulus of an eigenvalue of $A$, hence
\[
\|A\|_{op} = \max\left\{6, \left|-\frac{3}{2} + \frac{\sqrt{3}}{2}i\right|, \left|-\frac{3}{2} - \frac{\sqrt{3}}{2}i\right|\right\} = 6.
\]
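As a quick numerical sanity check (optional, not part of the required solution), all of the quantities above can be verified with NumPy:

import numpy as np

A = np.array([[1., 2., 3.],
              [3., 1., 2.],
              [2., 3., 1.]])
print(np.linalg.eigvals(A))             # 6 and -3/2 +/- (sqrt(3)/2)i
print(np.linalg.det(A), np.trace(A))    # 18.0, 3.0
print(np.linalg.inv(A) * 18)            # matches adj(A)^T
print(np.linalg.norm(A, 'fro')**2)      # 42.0, so ||A||_F = sqrt(42)
print(np.linalg.norm(A, 2))             # 6.0, the operator norm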
Q3. To begin, notice that the first two columns of A are linearly independent and that the third column is the sum of the first two. Since the rank is just the dimension of the column space, we see rank(A) = 2.
To find the nullspace, we first note that the third column being the sum of the first two
means that (1, 1, −1)T is in the nullspace of A. But the rank-nullity theorem tells us that the
nullspace of A must have dimension 3 − rank(A) = 1, hence
\[
\operatorname{null}(A) = \left\{ \begin{pmatrix} \alpha \\ \alpha \\ -\alpha \end{pmatrix} : \alpha \in \mathbb{R} \right\}.
\]
To find the column space, we can simply take the span of the first two columns, hence
\[
\operatorname{col}(A) = \operatorname{span}\left\{ \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \right\} = \left\{ \begin{pmatrix} \alpha + \beta \\ 2\alpha + \beta \\ 3\alpha + \beta \end{pmatrix} : \alpha, \beta \in \mathbb{R} \right\}.
\]
Now we find the projection matrices. Projection onto null(A) is just projection onto the
subspace spanned by v = (1, 1, −1)T , hence
\[
P_{\operatorname{null}(A)} = \frac{vv^T}{v^Tv} = \frac{1}{3}\begin{pmatrix} 1 & 1 & -1 \\ 1 & 1 & -1 \\ -1 & -1 & 1 \end{pmatrix}.
\]
Projection onto col(A) is just projection onto the subspace spanned by the vectors (1, 2, 3)T
and (1, 1, 1)T . Hence, we can set
\[
C = \begin{pmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \end{pmatrix}
\]
and get
\[
P_{\operatorname{col}(A)} = C(C^TC)^{-1}C^T = \frac{1}{6}\begin{pmatrix} 5 & 2 & -1 \\ 2 & 2 & 2 \\ -1 & 2 & 5 \end{pmatrix}.
\]
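Again as an optional check, the rank and both projection matrices can be verified numerically; here $A$ is the matrix described above, with columns $(1, 2, 3)^T$, $(1, 1, 1)^T$, and their sum:

import numpy as np

A = np.array([[1., 1., 2.],
              [2., 1., 3.],
              [3., 1., 4.]])
print(np.linalg.matrix_rank(A))          # 2

v = np.array([[1.], [1.], [-1.]])        # basis for null(A)
P_null = (v @ v.T) / (v.T @ v)
C = A[:, :2]                             # basis for col(A)
P_col = C @ np.linalg.inv(C.T @ C) @ C.T

print(np.allclose(P_null, np.array([[1, 1, -1], [1, 1, -1], [-1, -1, 1]]) / 3))  # True
print(np.allclose(P_col, np.array([[5, 2, -1], [2, 2, 2], [-1, 2, 5]]) / 6))     # True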
Matrix calculus
Q1. If $a \leq 0$ then the integrand does not decay at infinity, so the integral diverges to $\infty$. If $a > 0$ then we simply complete the square to get
\begin{align*}
\int_{-\infty}^{\infty} \exp\left(-\frac{1}{2}ax^2 + \mu x\right)dx
&= \int_{-\infty}^{\infty} \exp\left(-\frac{a}{2}\left(x - \frac{\mu}{a}\right)^2 + \frac{\mu^2}{2a}\right)dx \\
&= \exp\left(\frac{\mu^2}{2a}\right)\int_{-\infty}^{\infty} \exp\left(-\frac{a}{2}\left(x - \frac{\mu}{a}\right)^2\right)dx \\
&= \exp\left(\frac{\mu^2}{2a}\right)\sqrt{\frac{2\pi}{a}}.
\end{align*}
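As an optional numerical check of this identity (the values of $a$ and $\mu$ below are arbitrary):

import numpy as np
from scipy.integrate import quad

a, mu = 2.0, 0.7
val, _ = quad(lambda x: np.exp(-0.5*a*x**2 + mu*x), -np.inf, np.inf)
print(val)                                       # numerical integral
print(np.exp(mu**2/(2*a)) * np.sqrt(2*np.pi/a))  # closed form; the two agree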
Q2. First consider $Z \sim N(0, I_d)$. For $a \in \mathbb{R}^d$, we use the independence of the coordinates $Z_1, \ldots, Z_d$ of $Z$ to compute:
\[
E[\exp(a^TZ)] = E\left[\exp\left(\sum_{i=1}^d a_iZ_i\right)\right] = \prod_{i=1}^d E\left[\exp(a_iZ_i)\right].
\]
Note that each factor can be computed with the help of Q1, since
\[
E\left[\exp(a_iZ_i)\right] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} \exp\left(-\frac{1}{2}x^2 + a_ix\right)dx = \exp\left(\frac{a_i^2}{2}\right).
\]
Thus, we have
\[
E[\exp(a^TZ)] = \prod_{i=1}^d \exp\left(\frac{a_i^2}{2}\right) = \exp\left(\frac{\|a\|_2^2}{2}\right).
\]
Next we set X = AZ + µ. By rearranging and applying the calculation for Z, we conclude:
\begin{align*}
E[\exp(a^TX)] &= E[\exp(a^T(AZ + \mu))] \\
&= E[\exp(a^TAZ)]\exp(a^T\mu) \\
&= \exp\left(\frac{\|a^TA\|_2^2}{2}\right)\exp(a^T\mu) \\
&= \exp\left(\frac{\|a^TA\|_2^2}{2} + a^T\mu\right).
\end{align*}
In fact, if we write $\Sigma = AA^T$ then we have shown that $X \sim N(\mu, \Sigma)$ has
\[
E[\exp(a^TX)] = \exp\left(\frac{a^T\Sigma a}{2} + a^T\mu\right)
\]
for all $a \in \mathbb{R}^d$.
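As an optional Monte Carlo check of this moment generating function (the choices of $A$, $\mu$, and $a$ below are arbitrary; $a$ is kept small so that $\exp(a^TX)$ has manageable variance):

import numpy as np

rng = np.random.default_rng(0)
d = 3
A = rng.normal(size=(d, d))
mu = rng.normal(size=d)
Sigma = A @ A.T
a = np.array([0.3, -0.2, 0.1])

Z = rng.normal(size=(1_000_000, d))
X = Z @ A.T + mu                           # X = AZ + mu, row by row
print(np.exp(X @ a).mean())                # empirical E[exp(a^T X)]
print(np.exp(a @ Sigma @ a / 2 + a @ mu))  # closed form derived above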
Q3. From the form of M found in Q2 and the results of Q4 below, we can compute:
\[
\nabla_a M(\mu, a) = (\Sigma a + \mu)\exp\left(\frac{a^T\Sigma a}{2} + a^T\mu\right)
\]
and
\[
\nabla_\mu M(\mu, a) = a\exp\left(\frac{a^T\Sigma a}{2} + a^T\mu\right).
\]
Q4. We compute these by writing out the definition of each function as a summation and taking
the partial derivative in each coordinate:
– If $f(u) = \sum_{i=1}^d a_iu_i$, then $\frac{\partial f}{\partial u_k} = a_k$ for all $k = 1, \ldots, d$, so
\[
\nabla_u f(u) = (a_1, \ldots, a_d) = a.
\]
– If $f(u) = \sum_{i=1}^d u_i^2$, then $\frac{\partial f}{\partial u_k} = 2u_k$ for all $k = 1, \ldots, d$, so
\[
\nabla_u f(u) = (2u_1, \ldots, 2u_d) = 2u.
\]
– If $f(u) = \sum_{i=1}^d \sum_{j=1}^d u_iu_jA_{ij}$ with $A$ symmetric, then $\frac{\partial f}{\partial u_k} = 2\sum_{i=1}^d u_iA_{ik}$ for all $k = 1, \ldots, d$, so
\[
\nabla_u f(u) = \left(2\sum_{i=1}^d u_iA_{i1}, \ldots, 2\sum_{i=1}^d u_iA_{id}\right) = 2Au
\]
(a numerical spot-check of this gradient follows the list).
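The last gradient can be verified against central finite differences; the following optional sketch uses an arbitrary symmetric matrix and test point.

import numpy as np

rng = np.random.default_rng(1)
d = 4
B = rng.normal(size=(d, d))
A = (B + B.T) / 2                      # symmetric, as assumed in the last bullet
u = rng.normal(size=d)

f = lambda u: u @ A @ u                # f(u) = sum_ij u_i u_j A_ij
grad = 2 * A @ u                       # the gradient claimed above

eps = 1e-6
fd = np.array([(f(u + eps*np.eye(d)[k]) - f(u - eps*np.eye(d)[k])) / (2*eps)
               for k in range(d)])
print(np.allclose(grad, fd, atol=1e-5))    # True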
Q5. By using the results of Q4 and the chain rule, we can compute
\[
\nabla_u f(u) = X^T(Xu - y) + \lambda u = (X^TX + \lambda I)u - X^Ty
\]
and
\[
\nabla_u^2 f(u) = X^TX + \lambda I.
\]
Since
\[
\lambda_{\min}(X^TX + \lambda I) \geq \lambda > 0,
\]
the Hessian is positive definite, so the inverse $(X^TX + \lambda I)^{-1}$ is always well-defined and $f$ is strictly convex; hence its unique stationary point must be a global minimizer. Then we see that $u \in \mathbb{R}^d$ satisfies $\nabla_u f(u) = 0$ if and only if $u = (X^TX + \lambda I)^{-1}X^Ty$. Therefore, the unique global minimizer of $f$ is
\[
\hat{u} = (X^TX + \lambda I)^{-1}X^Ty.
\]
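As an optional numerical check, assuming $f$ is the ridge objective $f(u) = \frac{1}{2}\|Xu - y\|_2^2 + \frac{\lambda}{2}\|u\|_2^2$ (which matches the gradient and Hessian above; the data below are random):

import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 50, 5, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

u_hat = np.linalg.solve(X.T @ X + lam*np.eye(d), X.T @ y)
grad = X.T @ (X @ u_hat - y) + lam*u_hat
print(np.allclose(grad, 0))              # True: u_hat is the stationary point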
Probability and statistics
Q1. We can easily check that the vector $(X^T, a^TX)^T$ has a multivariate Gaussian distribution, with
\[
\begin{pmatrix} X \\ a^TX \end{pmatrix} \sim N\left(\begin{pmatrix} \mu \\ a^T\mu \end{pmatrix}, \begin{pmatrix} \Sigma & \Sigma a \\ a^T\Sigma & a^T\Sigma a \end{pmatrix}\right).
\]
Therefore, by the Gaussian conditioning formula, we get
\[
(X \mid Y = y) \sim N\left(\mu + \Sigma a\,\frac{y - a^T\mu}{a^T\Sigma a},\ \Sigma - \frac{\Sigma aa^T\Sigma}{a^T\Sigma a}\right).
\]
In particular, we have
\[
E[X \mid Y = y] = \mu + \Sigma a\,\frac{y - a^T\mu}{a^T\Sigma a}
\]
and
\[
\operatorname{Cov}(X \mid Y = y) = \Sigma - \frac{\Sigma aa^T\Sigma}{a^T\Sigma a}.
\]
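A rough, optional Monte Carlo check of the conditional mean formula: we approximate conditioning on $Y = y$ by keeping samples with $a^TX$ in a narrow band around $y$ (all numerical values below are arbitrary).

import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
a = np.array([1.0, 2.0])
y = 0.5

X = rng.multivariate_normal(mu, Sigma, size=2_000_000)
keep = np.abs(X @ a - y) < 0.01          # crude slice around {a^T X = y}
print(X[keep].mean(axis=0))              # empirical conditional mean
print(mu + Sigma @ a * (y - a @ mu) / (a @ Sigma @ a))   # formula above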
Q2. The calculations are straightforward applications of independence and of known properties of Bernoulli random variables:
– Since Var(X1 ) = p(1 − p),
\[
\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\sum_{i=1}^n \operatorname{Var}(X_i) = \frac{p(1-p)}{n}.
\]
– Using the formula for the probability mass function of a Binomial random variable, we get
\[
f_n(p, q) = P\left(\sum_{i=1}^n X_i = nq\right) = \binom{n}{nq}p^{nq}(1-p)^{n-nq} = \frac{n!}{(nq)!\,(n(1-q))!}\,p^{nq}(1-p)^{n(1-q)}.
\]
Therefore, using Stirling's approximation $\log(m!) \sim m\log m - m$,
\begin{align*}
\frac{\log(n!)}{n} - \frac{\log((nq)!)}{n} - \frac{\log((n(1-q))!)}{n}
&\sim \log n - 1 \\
&\quad - q\log n - q\log q + q \\
&\quad - (1-q)\log n - (1-q)\log(1-q) + (1-q) \\
&= -q\log q - (1-q)\log(1-q),
\end{align*}
since the $\log n$ terms and the constants cancel (a numerical check of this limit follows the list).
– The inverse of the MLE is just an average of IID random variables, so we can compute:
" n # n
−1 1X 1X
E[λ̂ ] = E xi = E [xi ] = λ−1
n n
i=1 i=1
and
\[
\operatorname{Var}(\hat{\lambda}^{-1}) = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^n x_i\right) = \frac{1}{n^2}\sum_{i=1}^n \operatorname{Var}(x_i) = \frac{1}{n}\cdot\lambda^{-2}.
\]
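As an optional numerical check of the entropy limit from the second bullet, using scipy.special.gammaln, which computes $\log\Gamma$, so that gammaln(m+1) equals $\log(m!)$:

import numpy as np
from scipy.special import gammaln

q = 0.3
H = -q*np.log(q) - (1-q)*np.log(1-q)         # binary entropy, in nats
for n in [10, 100, 10_000, 1_000_000]:
    k = round(n*q)
    val = (gammaln(n+1) - gammaln(k+1) - gammaln(n-k+1)) / n
    print(n, val, H)                          # val approaches H as n grows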
The likelihood ratio test is the test that rejects for extreme values of $T$. Noticing that $T$ is just a monotone transformation of the sample average, we can equivalently write the likelihood ratio test as the test which rejects for extreme values of the sample average $\frac{1}{n}\sum_{i=1}^n x_i$.
In order to choose the rejection threshold $t \in \mathbb{R}$ to maintain significance $\alpha$ for the null hypothesis $H_0: \mu = 0$, we must solve
\[
\alpha = P_{H_0}\left(\frac{1}{n}\sum_{i=1}^n x_i \geq t\right) = P\left(N(0, 1) \geq t\sqrt{n}\right).
\]
If we write $\Phi$ for the standard Gaussian CDF, then the unique solution is $t = \Phi^{-1}(1 - \alpha)/\sqrt{n}$.
Therefore, the likelihood ratio test for $H_0: \mu = 0$ versus $H_1: \mu = 1$ at significance $\alpha$ is the test which rejects when $\frac{1}{n}\sum_{i=1}^n x_i \geq \Phi^{-1}(1 - \alpha)/\sqrt{n}$.
To construct a symmetric confidence interval for $\mu$ at level $\alpha$ we need to choose $t \in \mathbb{R}$ such that
\[
\alpha = P\left(\left|\frac{1}{n}\sum_{i=1}^n x_i - \mu\right| \geq t\right).
\]
Note that the distribution of $\frac{1}{n}\sum_{i=1}^n x_i - \mu$ does not depend on $\mu$. Thus, we can choose $t \in \mathbb{R}$ such that
\[
\alpha = P\left(|N(0, 1)| \geq t\sqrt{n}\right),
\]
and we see that the unique solution is $t = \Phi^{-1}(1 - \alpha/2)/\sqrt{n}$. Therefore, a symmetric confidence interval for $\mu$ at significance $\alpha$ is given by
\[
\left(\frac{1}{n}\sum_{i=1}^n x_i - \frac{1}{\sqrt{n}}\,\Phi^{-1}\!\left(1 - \frac{\alpha}{2}\right),\ \frac{1}{n}\sum_{i=1}^n x_i + \frac{1}{\sqrt{n}}\,\Phi^{-1}\!\left(1 - \frac{\alpha}{2}\right)\right).
\]
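An optional simulation confirming the size of the test and the coverage of the interval (assuming, as the computation above implies, that $x_1, \ldots, x_n \sim N(\mu, 1)$ IID):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n, alpha, trials = 25, 0.05, 100_000
x = rng.normal(0.0, 1.0, size=(trials, n))   # samples under H0: mu = 0
xbar = x.mean(axis=1)

print((xbar >= norm.ppf(1 - alpha)/np.sqrt(n)).mean())            # ~ 0.05: size
print((np.abs(xbar) <= norm.ppf(1 - alpha/2)/np.sqrt(n)).mean())  # ~ 0.95: coverage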
Q1. For each $n$, we create a matrix $X \in \mathbb{R}^{b \times n}$ with IID entries that are uniform on $[-1, 1]$, and we create $A = XX^T$. In order to extract the strict upper triangle of $A$ as a vector, we use the np.triu_indices function.
[2]: import numpy as np

     ns = [2, 4, 6, 8, 10, 50, 100]
     b = 100
     alphas = np.zeros(shape=(len(ns), b*(b-1)//2))
     for i, n in enumerate(ns):
         print('n=%d' % n, end='\t')
         X = np.random.uniform(-1, 1, size=(b, n))
         A = X @ X.T
         # k=1 selects the strict upper triangle: the b(b-1)/2 inner products
         # between distinct rows of X, excluding the diagonal entries ||x_i||^2
         alphas[i,:] = A[np.triu_indices(b, k=1)]
         print('shape of alpha:', alphas[i,:].shape)
n=2 shape of alpha: (4950,)
n=4 shape of alpha: (4950,)
n=6 shape of alpha: (4950,)
n=8 shape of alpha: (4950,)
n=10 shape of alpha: (4950,)
n=50 shape of alpha: (4950,)
n=100 shape of alpha: (4950,)
Q2. Now we plot the histogram of the rescaled inner products, and the corresponding Gaussian
density.
[3]: fig, axs = plt.subplots(2, 4, sharex=True, sharey=True, figsize=(12,5))
     xs = np.linspace(-4, 4, 200)
     for i, n in enumerate(ns):
         ax = axs.flat[i]
         ax.hist(3/np.sqrt(n)*alphas[i,:], bins=50, density=True, label='histogram')
         ax.plot(xs, np.exp(-xs**2/2)/np.sqrt(2*np.pi), label='N(0,1) density')
         ax.title.set_text('n=%d' % n)
     axs[1,3].axis('off')
     axs[1,2].legend(bbox_to_anchor=(2.15, 0.95))
     plt.show()
To see why we need to rescale by $3/\sqrt{n}$, let us see how to apply the central limit theorem to this problem. We are interested in the random variable
\[
V_n = x^Ty = \sum_{i=1}^n x_iy_i,
\]
where $x, y$ are independent copies of a random variable which is uniformly distributed on $[-1, 1]^n$. In other words, $x_1, \ldots, x_n, y_1, \ldots, y_n$ are IID random variables which are uniformly distributed on $[-1, 1]$. Since we have
\[
E[x_1y_1] = 0 \quad\text{and}\quad \operatorname{Var}(x_1y_1) = E[x_1^2]\,E[y_1^2] = \frac{1}{9},
\]
it follows from the central limit theorem applied to the IID products $x_1y_1, \ldots, x_ny_n$ that
\[
\frac{3}{\sqrt{n}}\,V_n \to N(0, 1)
\]
in distribution as $n \to \infty$.
As another note, you may wonder why a second mode appears in the histogram for large values
of n. This is because our approximation of the distribution of Vn does not consist of independent
samples! When n is much larger than b our samples are still approximately independent so the CLT
holds, but when n and b are of comparable size then we can observe some non-Gaussian behavior.