
STAT 154 / 254 Homework #1 Solutions

Linear algebra
Q1. To find the eigenvectors and eigenvalues, we first find the characteristic polynomial:
 
$$p_A(\lambda) = \det\begin{pmatrix} 1-\lambda & 2 & 3 \\ 3 & 1-\lambda & 2 \\ 2 & 3 & 1-\lambda \end{pmatrix} = -\lambda^3 + 3\lambda^2 + 15\lambda + 18.$$

The eigenvalues of $A$ are exactly the roots of $p_A$, which we compute to be $6$ and $-\frac{3}{2} \pm \frac{\sqrt{3}}{2}\,i$.

To find the eigenvector $v$ for the eigenvalue $\lambda$, we need to find a unit vector solution to the linear system of equations $Av = \lambda v$. For $\lambda_1 = 6$, we find that this is
$$v_1 = \frac{1}{\sqrt{3}}(1, 1, 1)^T,$$
for $\lambda_2 = -\frac{3}{2} + \frac{\sqrt{3}}{2}i$ we find
$$v_2 = \frac{1}{2\sqrt{3}}\left(-1 - \sqrt{3}i,\ -1 + \sqrt{3}i,\ 2\right)^T,$$

and for $\lambda_3 = -\frac{3}{2} - \frac{\sqrt{3}}{2}i$ we find
$$v_3 = \frac{1}{2\sqrt{3}}\left(-1 + \sqrt{3}i,\ -1 - \sqrt{3}i,\ 2\right)^T.$$
Next we recall that the determinant and the trace are just the product and the sum of the
eigenvalues, respectively, hence
$$\det(A) = 6\left(-\frac{3}{2} + \frac{\sqrt{3}}{2}i\right)\left(-\frac{3}{2} - \frac{\sqrt{3}}{2}i\right) = 18,$$
$$\operatorname{tr}(A) = 6 + \left(-\frac{3}{2} + \frac{\sqrt{3}}{2}i\right) + \left(-\frac{3}{2} - \frac{\sqrt{3}}{2}i\right) = 3.$$

(As a sanity check, recall that tr(A) is also the sum of the diagonal elements of A.)
The inverse of $A$ is given by $\operatorname{adj}(A)/\det(A)$, where $\operatorname{adj}(A)$ is the adjugate matrix: the transpose of the cofactor matrix $C$, whose $i,j$ entry is $(-1)^{i+j}$ times the determinant of the $2\times 2$ submatrix obtained by deleting the $i$th row and $j$th column from $A$. In particular, we have
$$C = \begin{pmatrix} -5 & 1 & 7 \\ 7 & -5 & 1 \\ 1 & 7 & -5 \end{pmatrix},$$
hence
$$A^{-1} = \frac{\operatorname{adj}(A)}{\det(A)} = \frac{C^T}{18} = \frac{1}{18}\begin{pmatrix} -5 & 7 & 1 \\ 1 & -5 & 7 \\ 7 & 1 & -5 \end{pmatrix}.$$
Finally, we compute the desired matrix norms. Since $A$ is normal (it is circulant), its Frobenius norm is the square root of the sum of the squared moduli of its eigenvalues, hence
$$\|A\|_F = \sqrt{6^2 + \left|-\tfrac{3}{2} + \tfrac{\sqrt{3}}{2}i\right|^2 + \left|-\tfrac{3}{2} - \tfrac{\sqrt{3}}{2}i\right|^2} = \sqrt{42}.$$

Similarly, for a normal matrix the operator norm is the largest modulus of an eigenvalue, hence
$$\|A\|_{\mathrm{op}} = \max\left\{6,\ \left|-\tfrac{3}{2} + \tfrac{\sqrt{3}}{2}i\right|,\ \left|-\tfrac{3}{2} - \tfrac{\sqrt{3}}{2}i\right|\right\} = 6.$$
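As an optional numerical sanity check (not part of the original solution, just a numpy sketch of the computations above):

import numpy as np

A = np.array([[1., 2., 3.],
              [3., 1., 2.],
              [2., 3., 1.]])

print(np.sort_complex(np.linalg.eigvals(A)))        # approx 6 and -1.5 +/- (sqrt(3)/2) i
print(np.isclose(np.linalg.det(A), 18.0))           # det(A) = 18
print(np.isclose(np.trace(A), 3.0))                 # tr(A) = 3
print(np.allclose(np.linalg.inv(A) * 18,            # A^{-1} = (1/18)[[-5,7,1],[1,-5,7],[7,1,-5]]
                  [[-5, 7, 1], [1, -5, 7], [7, 1, -5]]))
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(42)))  # Frobenius norm
print(np.isclose(np.linalg.norm(A, 2), 6.0))              # operator (spectral) norm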

Q2. We want a vector $x = (x_1, x_2, x_3)^T$ such that
$$\begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = x_1\begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix} + x_2\begin{pmatrix} 2 \\ 1 \\ 3 \end{pmatrix} + x_3\begin{pmatrix} 3 \\ 2 \\ 1 \end{pmatrix}.$$
In matrix notation, this is exactly e1 = Ax, where A is as in Q1. Since we already showed
that A is invertible and we computed A−1 , we conclude
 
$$x = \frac{1}{18}\begin{pmatrix} -5 \\ 1 \\ 7 \end{pmatrix}.$$
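A quick numerical confirmation (again a hedged numpy sketch, with $A$ as in Q1):

import numpy as np

A = np.array([[1., 2., 3.], [3., 1., 2.], [2., 3., 1.]])
x = np.linalg.solve(A, np.array([1., 0., 0.]))        # solve A x = e_1
print(np.allclose(x, np.array([-5., 1., 7.]) / 18))   # matches (1/18)(-5, 1, 7)^T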

Q3. To begin, notice that the first two columns of A are linearly independent and that the third
column is the sum of the first two. Since the rank is just the dimension of the columnspace,
we see rank(A) = 2.
To find the nullspace, we first note that the third column being the sum of the first two
means that (1, 1, −1)T is in the nullspace of A. But the rank-nullity theorem tells us that the
nullspace of A must have dimension 3 − rank(A) = 1, hence
  
$$\operatorname{null}(A) = \left\{ \begin{pmatrix} \alpha \\ \alpha \\ -\alpha \end{pmatrix} : \alpha \in \mathbb{R} \right\}.$$

To find the column space, we can simply take the span of the first two columns, hence
      
1 1  α+β 
col(A) = span   2 , 1
    = 2α + β : α, β ∈ R .

3 1 3α + β
 

Now we find the projection matrices. Projection onto null(A) is just projection onto the
subspace spanned by v = (1, 1, −1)T , hence
 
$$P_{\operatorname{null}(A)} = \frac{vv^T}{v^Tv} = \frac{1}{3}\begin{pmatrix} 1 & 1 & -1 \\ 1 & 1 & -1 \\ -1 & -1 & 1 \end{pmatrix}.$$

Projection onto $\operatorname{col}(A)$ is just projection onto the subspace spanned by the vectors $(1, 2, 3)^T$ and $(1, 1, 1)^T$. Hence, we can set
$$C = \begin{pmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \end{pmatrix}$$
and get
$$P_{\operatorname{col}(A)} = C(C^TC)^{-1}C^T = \frac{1}{6}\begin{pmatrix} 5 & 2 & -1 \\ 2 & 2 & 2 \\ -1 & 2 & 5 \end{pmatrix}.$$
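A numerical sketch of Q3 follows. Note that the Q3 matrix itself is not reproduced in these solutions; below we assume it has columns $(1,2,3)^T$, $(1,1,1)^T$, and their sum, which is all the reasoning above uses.

import numpy as np

c1, c2 = np.array([1., 2., 3.]), np.array([1., 1., 1.])
A = np.column_stack([c1, c2, c1 + c2])                 # assumed Q3 matrix

print(np.linalg.matrix_rank(A))                        # rank(A) = 2
print(np.allclose(A @ np.array([1., 1., -1.]), 0))     # (1, 1, -1)^T lies in null(A)

v = np.array([1., 1., -1.])
P_null = np.outer(v, v) / (v @ v)                      # projection onto span{v}
print(np.allclose(P_null, np.array([[1, 1, -1], [1, 1, -1], [-1, -1, 1]]) / 3))

C = np.column_stack([c1, c2])
P_col = C @ np.linalg.inv(C.T @ C) @ C.T               # projection onto col(A)
print(np.allclose(P_col, np.array([[5, 2, -1], [2, 2, 2], [-1, 2, 5]]) / 6))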

Matrix calculus
Q1. If $a \le 0$ then the integrand does not decay as $|x| \to \infty$, so the integral equals $\infty$. If $a > 0$ then we simply complete the square to get
\begin{align*}
\int_{-\infty}^{\infty} \exp\left(-\frac{1}{2}ax^2 + \mu x\right)dx
&= \int_{-\infty}^{\infty} \exp\left(-\frac{1}{2}a\left(x - \frac{\mu}{a}\right)^2 + \frac{\mu^2}{2a}\right)dx \\
&= \exp\left(\frac{\mu^2}{2a}\right)\int_{-\infty}^{\infty} \exp\left(-\frac{1}{2}a\left(x - \frac{\mu}{a}\right)^2\right)dx \\
&= \exp\left(\frac{\mu^2}{2a}\right)\sqrt{\frac{2\pi}{a}}.
\end{align*}
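A quick numerical check of this closed form (a sketch using only numpy; the test values of $a$, $\mu$, the grid width, and the step size are arbitrary choices):

import numpy as np

a, mu = 2.0, 0.7                          # arbitrary test values with a > 0
x = np.linspace(-40, 40, 400_001)         # wide grid; the integrand is negligible at the endpoints
integrand = np.exp(-0.5 * a * x**2 + mu * x)
numeric = np.sum(integrand) * (x[1] - x[0])          # simple Riemann-sum approximation
closed_form = np.exp(mu**2 / (2 * a)) * np.sqrt(2 * np.pi / a)
print(numeric, closed_form, np.isclose(numeric, closed_form))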

Q2. First consider Z ∼ N (0, I d ). For a ∈ Rd , we use the independence of the coordinates
Z1 , . . . , Zd of Z to compute:
$$E[\exp(a^T Z)] = E\left[\exp\left(\sum_{i=1}^d a_i Z_i\right)\right] = \prod_{i=1}^d E\left[\exp(a_i Z_i)\right].$$

Note that each factor can be computed with the help of Q1, since
$$E\left[\exp(a_i Z_i)\right] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} \exp\left(-\frac{1}{2}x^2 + a_i x\right)dx = \exp\left(\frac{a_i^2}{2}\right).$$
Thus, we have
$$E[\exp(a^T Z)] = \prod_{i=1}^d \exp\left(\frac{a_i^2}{2}\right) = \exp\left(\frac{\|a\|_2^2}{2}\right).$$
Next we set X = AZ + µ. By rearranging and applying the calculation for Z, we conclude:
\begin{align*}
E[\exp(a^T X)] &= E[\exp(a^T(AZ + \mu))] \\
&= E[\exp(a^T A Z)]\exp(a^T\mu) \\
&= \exp\left(\frac{\|a^T A\|_2^2}{2}\right)\exp(a^T\mu) \\
&= \exp\left(\frac{\|a^T A\|_2^2}{2} + a^T\mu\right).
\end{align*}
In fact, if we write $\Sigma = AA^T$ then we have shown that $X \sim N(\mu, \Sigma)$ has
$$E[\exp(a^T X)] = \exp\left(\frac{a^T\Sigma a}{2} + a^T\mu\right)$$
for all $a \in \mathbb{R}^d$.
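A Monte Carlo sketch of this moment generating function identity (the dimension, $A$, $\mu$, $a$, and sample size below are arbitrary test values, not from the original problem):

import numpy as np

rng = np.random.default_rng(0)
d = 3
A = rng.normal(size=(d, d))
mu = rng.normal(size=d)
a = 0.3 * rng.normal(size=d)            # keep a small so exp(a^T X) has manageable variance
Sigma = A @ A.T

Z = rng.normal(size=(1_000_000, d))     # rows are IID N(0, I_d)
X = Z @ A.T + mu                        # each row is A z + mu, so X ~ N(mu, Sigma)

mc = np.mean(np.exp(X @ a))
closed_form = np.exp(0.5 * a @ Sigma @ a + a @ mu)
print(mc, closed_form)                  # the two should agree closely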
Q3. From the form of M found in Q2 and the results of Q4 below, we can compute:
$$\nabla_a M(\mu, a) = (\Sigma a + \mu)\exp\left(\frac{a^T\Sigma a}{2} + a^T\mu\right)$$
and
$$\nabla_\mu M(\mu, a) = a\exp\left(\frac{a^T\Sigma a}{2} + a^T\mu\right).$$

Q4. We compute these by writing out the definition of each function as a summation and taking
the partial derivative in each coordinate:
– If $f(u) = \sum_{i=1}^d a_i u_i$, then $\frac{\partial f}{\partial u_k} = a_k$ for all $k = 1, \ldots, d$, so
$$\nabla_u f(u) = (a_1, \ldots, a_d) = a.$$

– If $f(u) = \sum_{i=1}^d u_i^2$, then $\frac{\partial f}{\partial u_k} = 2u_k$ for all $k = 1, \ldots, d$, so
$$\nabla_u f(u) = (2u_1, \ldots, 2u_d) = 2u.$$

– If $f(u) = \sum_{i=1}^d \sum_{j=1}^d u_i u_j A_{ij}$, then (using the symmetry of $A$) $\frac{\partial f}{\partial u_k} = 2\sum_{i=1}^d u_i A_{ik}$ for all $k = 1, \ldots, d$, so
$$\nabla_u f(u) = \left(2\sum_{i=1}^d u_i A_{i1}, \ldots, 2\sum_{i=1}^d u_i A_{id}\right) = 2Au.$$
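These gradient formulas can be checked with finite differences (a sketch; the particular $a$, $A$, $u$ are arbitrary, and $A$ is taken symmetric as in the quadratic-form formula above):

import numpy as np

rng = np.random.default_rng(1)
d = 5
a = rng.normal(size=d)
B = rng.normal(size=(d, d))
A = (B + B.T) / 2                      # symmetric test matrix
u = rng.normal(size=d)

def num_grad(f, u, h=1e-6):
    """Central finite-difference approximation of the gradient of f at u."""
    g = np.zeros_like(u)
    for k in range(len(u)):
        e = np.zeros_like(u); e[k] = h
        g[k] = (f(u + e) - f(u - e)) / (2 * h)
    return g

print(np.allclose(num_grad(lambda v: a @ v, u), a, atol=1e-4))              # gradient = a
print(np.allclose(num_grad(lambda v: v @ v, u), 2 * u, atol=1e-4))          # gradient = 2u
print(np.allclose(num_grad(lambda v: v @ A @ v, u), 2 * A @ u, atol=1e-4))  # gradient = 2Au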

Q5. By using the results of Q4 and the chain rule, we can compute
$$\nabla_u f(u) = 2X^T(Xu - y) + 2\lambda u = 2(X^TX + \lambda I)u - 2X^Ty$$
and
$$\nabla_u^2 f(u) = 2(X^TX + \lambda I).$$
Since
$$\lambda_{\min}(X^TX + \lambda I) \ge \lambda > 0,$$
we see that $f$ is strictly convex, hence its unique stationary point must be a global minimizer. Then we see that $u \in \mathbb{R}^d$ satisfies $\nabla_u f(u) = 0$ if and only if $u = (X^TX + \lambda I)^{-1}X^Ty$, and that the inverse is always well-defined. Therefore, we have
$$\arg\min_{u \in \mathbb{R}^d}\left\{\|Xu - y\|_2^2 + \lambda\|u\|_2^2\right\} = (X^TX + \lambda I)^{-1}X^Ty$$
for any $\lambda > 0$.
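A numerical sketch of this ridge-regression formula: the same minimizer can be obtained from an augmented ordinary least squares problem, since $\|Xu - y\|_2^2 + \lambda\|u\|_2^2 = \left\|\begin{pmatrix}X \\ \sqrt{\lambda} I\end{pmatrix}u - \begin{pmatrix}y \\ 0\end{pmatrix}\right\|_2^2$ (the dimensions and random data below are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 50, 4, 0.5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

u_closed = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # (X^T X + lambda I)^{-1} X^T y

# Equivalent augmented least-squares problem
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])
y_aug = np.concatenate([y, np.zeros(d)])
u_lstsq, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(u_closed, u_lstsq))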

Probability and statistics
Q1. We can easily check that the vector $(X^T, a^TX)^T$ has a multivariate Gaussian distribution, with
$$\begin{pmatrix} X \\ a^T X \end{pmatrix} \sim N\left(\begin{pmatrix} \mu \\ a^T\mu \end{pmatrix}, \begin{pmatrix} \Sigma & \Sigma a \\ a^T\Sigma & a^T\Sigma a \end{pmatrix}\right).$$
Therefore, by the Gaussian conditioning formula, we get
$$(X \mid Y = y) \sim N\left(\mu + \Sigma a\,\frac{y - a^T\mu}{a^T\Sigma a},\ \Sigma - \frac{\Sigma a a^T\Sigma}{a^T\Sigma a}\right).$$
In particular, we have
$$E[X \mid Y = y] = \mu + \Sigma a\,\frac{y - a^T\mu}{a^T\Sigma a}$$
and
$$\operatorname{Cov}(X \mid Y = y) = \Sigma - \frac{\Sigma a a^T\Sigma}{a^T\Sigma a}.$$
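A crude Monte Carlo sketch of the conditional-mean formula: we sample $X$, keep only the draws whose $Y = a^TX$ lands in a narrow window around $y$, and compare the empirical mean of the retained draws to the formula (the specific $\mu$, $\Sigma$, $a$, $y$, window width, and sample size are arbitrary tuning choices):

import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
a = np.array([0.7, -0.2])
y = 1.3

X = rng.multivariate_normal(mu, Sigma, size=2_000_000)
Y = X @ a
keep = np.abs(Y - y) < 0.02                 # crude conditioning on Y being near y

cond_mean_formula = mu + Sigma @ a * (y - a @ mu) / (a @ Sigma @ a)
print(X[keep].mean(axis=0))                 # empirical E[X | Y ~ y]
print(cond_mean_formula)                    # should be close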
Q2. The calculations are straightforward applications of the independence and of known properties
of Bernoulli random variables:
– Since $\operatorname{Var}(X_1) = p(1-p)$,
$$\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\sum_{i=1}^n \operatorname{Var}(X_i) = \frac{p(1-p)}{n}.$$

– Using the formula for the probability mass function of a Binomial random variable, we get
$$f_n(p, q) = P\left(\sum_{i=1}^n X_i = nq\right) = \binom{n}{nq} p^{nq}(1-p)^{n-nq} = \frac{n!}{(nq)!\,(n(1-q))!}\, p^{nq}(1-p)^{n(1-q)}.$$

– By the previous part, we have
$$\frac{1}{n}\log f_n(p, q) = \frac{\log(n!)}{n} + q\log p + (1-q)\log(1-p) - \frac{\log((nq)!)}{n} - \frac{\log((n(1-q))!)}{n}.$$
Now recall Stirling's approximation, that we have
$$x! \sim \sqrt{2\pi x}\left(\frac{x}{e}\right)^x,$$
hence
$$\log(x!) \sim \frac{1}{2}\log(2\pi x) + x\log x - x$$
as x → ∞. This shows that the terms involving factorials are asymptotically equivalent
to
$$\frac{\log(n!)}{n} \sim \log n - 1,\qquad \frac{\log((nq)!)}{n} \sim q\log n + q\log q - q,\qquad \frac{\log((n(1-q))!)}{n} \sim (1-q)\log n + (1-q)\log(1-q) - (1-q).$$

Therefore,
\begin{align*}
\frac{\log(n!)}{n} - \frac{\log((nq)!)}{n} - \frac{\log((n(1-q))!)}{n}
&\sim \log n - 1 \\
&\quad - q\log n - q\log q + q \\
&\quad - (1-q)\log n - (1-q)\log(1-q) + (1-q) \\
&\sim -q\log q - (1-q)\log(1-q).
\end{align*}

Combining these expressions, we have shown
$$\frac{1}{n}\log f_n(p, q) \to q\log\frac{p}{q} + (1-q)\log\frac{1-p}{1-q}$$
as $n \to \infty$. (Note that this is exactly the negative of the KL divergence between Bernoulli($q$) and Bernoulli($p$).)
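A numerical sketch of this limit, using log-factorials via the standard-library math.lgamma (the particular $p$, $q$, $n$ below are arbitrary, chosen so that $nq$ is an integer):

from math import lgamma, log

p, q, n = 0.4, 0.25, 200_000          # arbitrary test values with n*q an integer
k = int(round(n * q))                 # k = nq

log_fn = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
          + k * log(p) + (n - k) * log(1 - p))
limit = q * log(p / q) + (1 - q) * log((1 - p) / (1 - q))
print(log_fn / n, limit)              # the two should be close for large n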
Q3. We make the following calculations.
– First, we simply use the formula for $p_\lambda$ to get
$$\ell(\lambda) = \log\prod_{i=1}^n p_\lambda(x_i) = \sum_{i=1}^n \log(p_\lambda(x_i)) = \sum_{i=1}^n (\log\lambda - \lambda x_i) = n\log\lambda - \lambda\sum_{i=1}^n x_i.$$

– The derivative of $\ell$ is just
$$\ell'(\lambda) = \frac{n}{\lambda} - \sum_{i=1}^n x_i,$$
so $\ell'(\lambda) = 0$ is equivalent to $\lambda^{-1} = \frac{1}{n}\sum_{i=1}^n x_i$. This shows the MLE is
$$\hat\lambda = \frac{n}{\sum_{i=1}^n x_i}.$$

– The inverse of the MLE is just an average of IID random variables, so we can compute:
$$E[\hat\lambda^{-1}] = E\left[\frac{1}{n}\sum_{i=1}^n x_i\right] = \frac{1}{n}\sum_{i=1}^n E[x_i] = \lambda^{-1}$$
and
$$\operatorname{Var}(\hat\lambda^{-1}) = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^n x_i\right) = \frac{1}{n^2}\sum_{i=1}^n \operatorname{Var}(x_i) = \frac{1}{n}\cdot\lambda^{-2}.$$
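A simulation sketch of these last two identities (the rate $\lambda$, sample size, and number of repetitions below are arbitrary):

import numpy as np

rng = np.random.default_rng(4)
lam, n, reps = 2.0, 50, 200_000

x = rng.exponential(scale=1 / lam, size=(reps, n))   # Exp(lambda) has scale 1/lambda
inv_mle = x.mean(axis=1)                             # 1 / lambda_hat, one value per repetition

print(inv_mle.mean(), 1 / lam)                       # E[1/lambda_hat] = 1/lambda
print(inv_mle.var(), 1 / (n * lam**2))               # Var(1/lambda_hat) = 1/(n lambda^2)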

Q4. The likelihood ratio is the following statistic:
$$T(x_1, \ldots, x_n) = \frac{\prod_{i=1}^n \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}x_i^2\right)}{\prod_{i=1}^n \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}(x_i - 1)^2\right)} = \prod_{i=1}^n \exp\left(-x_i + \frac{1}{2}\right) = \exp\left(-\sum_{i=1}^n x_i + \frac{n}{2}\right),$$

and the likelihood ratio test is the test that rejects for extreme values of $T$. Noticing that $T$ is just a monotone transformation of the sample average, we can equivalently write the likelihood ratio test as the test which rejects for extreme values of the sample average $\frac{1}{n}\sum_{i=1}^n x_i$.

In order to choose the rejection threshold $t \in \mathbb{R}$ to maintain $\alpha$ significance for the null hypothesis $H_0: \mu = 0$, we must solve
$$\alpha = P_{H_0}\left(\frac{1}{n}\sum_{i=1}^n x_i \ge t\right) = P\left(N(0,1) \ge t\sqrt{n}\right).$$
If we write $\Phi$ for the standard Gaussian CDF, then the unique solution is $t = \Phi^{-1}(1-\alpha)/\sqrt{n}$. Therefore, the likelihood ratio test for $H_0: \mu = 0$ versus $H_1: \mu = 1$ at significance $\alpha$ is the test which rejects when $\frac{1}{n}\sum_{i=1}^n x_i \ge \Phi^{-1}(1-\alpha)/\sqrt{n}$.
To construct a symmetric confidence interval for $\mu$ at level $\alpha$ we need to choose $t \in \mathbb{R}$ such that
$$\alpha = P\left(\left|\frac{1}{n}\sum_{i=1}^n x_i - \mu\right| \ge t\right).$$
That is, the miscoverage probability must be exactly $\alpha$ (so the coverage is $1-\alpha$). Notice that


$$\frac{1}{n}\sum_{i=1}^n x_i - \mu = \frac{1}{n}\sum_{i=1}^n (x_i - \mu) \sim N\left(0, \frac{1}{n}\right),$$
and that this distribution does not depend on $\mu$. Thus, we can choose $t \in \mathbb{R}$ such that
$$\alpha = P\left(|N(0,1)| \ge t\sqrt{n}\right),$$

and we see that the unique solution is $t = \Phi^{-1}(1 - \alpha/2)/\sqrt{n}$. Therefore, a symmetric confidence interval for $\mu$ at significance $\alpha$ is given by
$$\left(\frac{1}{n}\sum_{i=1}^n x_i - \frac{1}{\sqrt{n}}\Phi^{-1}\left(1 - \frac{\alpha}{2}\right),\ \frac{1}{n}\sum_{i=1}^n x_i + \frac{1}{\sqrt{n}}\Phi^{-1}\left(1 - \frac{\alpha}{2}\right)\right).$$
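A simulation sketch checking the level of this test and the coverage of this interval (the sample size, $\alpha$, true mean, and number of repetitions are arbitrary; $\Phi^{-1}$ is taken from scipy.stats.norm.ppf, an extra dependency beyond the numpy used elsewhere):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
n, alpha, reps = 25, 0.05, 100_000

x0 = rng.normal(loc=0.0, size=(reps, n))             # data under H0: mu = 0
xbar0 = x0.mean(axis=1)
reject = xbar0 >= norm.ppf(1 - alpha) / np.sqrt(n)
print(reject.mean(), alpha)                          # type-I error rate should be about alpha

mu = 0.3                                             # arbitrary true mean for the coverage check
x = rng.normal(loc=mu, size=(reps, n))
xbar = x.mean(axis=1)
half_width = norm.ppf(1 - alpha / 2) / np.sqrt(n)
cover = np.abs(xbar - mu) < half_width
print(cover.mean(), 1 - alpha)                       # coverage should be about 1 - alpha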

Inner products and the central limit theorem


[1]: import numpy as np
import matplotlib.pyplot as plt

Q1. For each $n$, we create a matrix $X \in \mathbb{R}^{b\times n}$ with IID entries that are uniform on $[-1, 1]$, and we create $A = XX^T$. In order to extract the strictly upper triangular entries of $A$ (the inner products of distinct rows of $X$) as a vector, we use the triu_indices(...) function.
[2]: ns = [2, 4, 6, 8, 10, 50, 100]
     b = 100

     alphas = np.zeros(shape=(len(ns), b*(b-1)//2))
     for i, n in enumerate(ns):
         print('n=%d' % n, end='\t')
         X = np.random.uniform(-1, 1, size=(b, n))
         A = X @ X.T
         # strictly upper triangular entries (k=1 excludes the diagonal ||x_i||^2 terms)
         alphas[i, :] = A[np.triu_indices(b, k=1)]
         print('shape of alpha:', alphas[i, :].shape)

n=2 shape of alpha: (4950,)
n=4 shape of alpha: (4950,)
n=6 shape of alpha: (4950,)
n=8 shape of alpha: (4950,)
n=10 shape of alpha: (4950,)
n=50 shape of alpha: (4950,)
n=100 shape of alpha: (4950,)
Q2. Now we plot the histogram of the rescaled inner products, and the corresponding Gaussian
density.
[3]: fig, axs = plt.subplots(2, 4, sharex=True, sharey=True, figsize=(12,5))

     for i, n in enumerate(ns):
         alpha_rescaled = alphas[i, :]*3/np.sqrt(n)
         ax = axs[i//4, i%4]
         ax.hist(alpha_rescaled, density=True, bins=40,
                 label='empirical inner\nproducts (rescaled)')

         z = np.linspace(alpha_rescaled.min(), alpha_rescaled.max(), num=100)
         ax.plot(z, np.exp(-0.5*np.power(z, 2))/np.sqrt(2*np.pi),
                 label='Gaussian density')

         ax.title.set_text('n=%d' % n)

     axs[1,3].axis('off')
     axs[1,2].legend(bbox_to_anchor=(2.15, 0.95))
     plt.show()


To see why we need to rescale by $3/\sqrt{n}$, let us see how to apply the central limit theorem to this problem. We are interested in the random variable
$$V_n = x^T y = \sum_{i=1}^n x_i y_i,$$

where $x, y$ are independent copies of a random variable which is uniformly distributed on $[-1, 1]^n$. In other words, $x_1, \ldots, x_n, y_1, \ldots, y_n$ are IID random variables which are uniformly distributed on $[-1, 1]$. Since we have
$$E[x_1 y_1] = 0 \quad\text{and}\quad \operatorname{Var}(x_1 y_1) = \frac{1}{9},$$
it follows that we should have
$$\frac{3}{\sqrt{n}}\, V_n \to N(0, 1)$$
as n → ∞.
As another note, you may wonder why a second mode appears in the histogram for large values
of n. This is because our approximation of the distribution of Vn does not consist of independent
samples! When n is much larger than b our samples are still approximately independent so the CLT
holds, but when n and b are of comparable size then we can observe some non-Gaussian behavior.
