
Homework 1

Quanliang Liu
9085925288

Instructions: This is a background self-test on the type of math we will encounter in class. If you find many
questions intimidating, we suggest you drop 760 and take it again in the future when you are more prepared.
You can use this latex file as a template to develop your homework. Submit your homework on time as a
single pdf file to Gradescope. There is no need to submit the latex source or any code.

1 Vectors and Matrices [6 pts]


Consider the matrix $X$ and the vectors $y$ and $z$ below:
\[
X = \begin{bmatrix} 3 & 2 \\ -7 & -5 \end{bmatrix} \qquad
y = \begin{bmatrix} 2 \\ 1 \end{bmatrix} \qquad
z = \begin{bmatrix} 1 \\ -1 \end{bmatrix}
\]

1. Compute $y^T X z$.
$y^T X z = 0$
2. Is $X$ invertible? If so, give the inverse; if not, explain why not.
Yes. Since $\det(X) = (3)(-5) - (2)(-7) = -1 \neq 0$, $X$ is invertible, and the inverse is:
\[
X^{-1} = \frac{1}{\det(X)} \begin{bmatrix} -5 & -2 \\ 7 & 3 \end{bmatrix} = \begin{bmatrix} 5 & 2 \\ -7 & -3 \end{bmatrix}
\]
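As an optional sanity check (code is not required for submission), both answers can be verified numerically. A minimal sketch, assuming numpy is available:

import numpy as np

X = np.array([[3, 2], [-7, -5]])
y = np.array([2, 1])
z = np.array([1, -1])

print(y @ X @ z)         # 0, matching the hand computation
print(np.linalg.inv(X))  # [[ 5.  2.]
                         #  [-7. -3.]]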

2 Calculus [3 pts]
1. If $y = e^{-x} + \arctan(z)\, x^{6/z} - \ln \frac{x}{x+1}$, what is the partial derivative of $y$ with respect to $x$?
\[
\frac{\partial y}{\partial x} = -e^{-x} + \frac{6 \arctan(z)}{z}\, x^{(6-z)/z} - \frac{1}{x(x+1)}
\]
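The derivative can be double-checked symbolically. A minimal sketch, assuming sympy is available (not part of the required submission):

import sympy as sp

x, z = sp.symbols('x z', positive=True)
y = sp.exp(-x) + sp.atan(z) * x**(6 / z) - sp.log(x / (x + 1))

# Should simplify to an expression equivalent to
# -exp(-x) + 6*atan(z)*x**(6/z)/(x*z) - 1/(x*(x + 1))
print(sp.simplify(sp.diff(y, x)))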

3 Probability and Statistics [10 pts]


Consider a sequence of data S = (1, 1, 1, 0, 1) created by flipping a coin x five times, where 0 denotes that
the coin turned up heads and 1 denotes that it turned up tails.
1. (2.5 pts) What is the probability of observing this data, assuming it was generated by flipping a biased
coin with p( x = 1) = 0.6?
$P(S) = 0.6^4 \times (1 - 0.6) = 0.05184$
2. (2.5 pts) Note that the probability of this data sample could be greater if the value of p( x = 1) was not
0.6, but instead some other value. What is the value that maximizes the probability of S? Please justify
your answer.
If the coin has probability $p(x = 1) = a$, then the probability of $S$ is $p(S) = a^4(1 - a)$. The goal is to find the $a$ that maximizes $p(S)$. Setting the derivative $p'(S) = 4a^3(1 - a) - a^4 = a^3(4 - 5a)$ to zero gives $a = 4/5$; since $p(S)$ vanishes at the endpoints $a = 0$ and $a = 1$ and is positive in between, $a = 0.8$ maximizes the probability of $S$.
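The maximizer can also be confirmed numerically. An optional sketch, assuming numpy: evaluate $p(S) = a^4(1-a)$ on a fine grid and take the argmax.

import numpy as np

a = np.linspace(0, 1, 100001)
likelihood = a**4 * (1 - a)      # p(S) for S = (1, 1, 1, 0, 1)
print(a[np.argmax(likelihood)])  # 0.8, agreeing with the calculus argument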
3. (5 pts) Consider the following joint probability table where both A and B are binary random variables:


A B P( A, B)
0 0 0.3
0 1 0.1
1 0 0.1
1 1 0.5

(a) What is $P(A = 0 \mid B = 1)$?
\[
P(A = 0 \mid B = 1) = \frac{P(A = 0, B = 1)}{P(B = 1)} = \frac{0.1}{0.1 + 0.5} = \frac{1}{6}
\]
(b) What is $P(A = 1 \lor B = 1)$?
\[
P(A = 1 \lor B = 1) = 1 - P(A = 0, B = 0) = 1 - 0.3 = 0.7
\]
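Both answers follow mechanically from the table. A minimal sketch in Python (plain dictionary, no external dependencies; optional, since code is not submitted):

# Joint table P(A, B), keyed by (a, b)
P = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.5}

p_b1 = P[(0, 1)] + P[(1, 1)]  # P(B = 1) = 0.6
print(P[(0, 1)] / p_b1)       # P(A = 0 | B = 1) = 0.1 / 0.6 = 1/6
print(1 - P[(0, 0)])          # P(A = 1 or B = 1) = 1 - 0.3 = 0.7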

4 Big-O Notation [6 pts]


For each pair ( f , g) of functions below, list which of the following are true: f (n) = O( g(n)), g(n) = O( f (n)),
both, or neither. Briefly justify your answers.
1. $f(n) = \ln(n)$, $g(n) = \log_2(n)$.
Since $\ln(n) = \log_2(n) \cdot \ln(2)$, the functions differ only by a constant factor. Therefore, both $f(n) = O(g(n))$ and $g(n) = O(f(n))$ are true. Conclusion: both are true.

2. $f(n) = \log_2 \log_2(n)$, $g(n) = \log_2(n)$.
Since $\log_2(\log_2(n))$ grows much more slowly than $\log_2(n)$, $f(n) = O(g(n))$ holds; however, $g(n) = O(f(n))$ is false because $\log_2(n)$ grows faster. Conclusion: only $f(n) = O(g(n))$ is true.
3. $f(n) = n!$, $g(n) = 2^n$.
Since $n!/2^n$ increases without bound (going from $n$ to $n+1$ multiplies the ratio by $(n+1)/2 > 1$ for all $n \geq 2$), $f(n) = O(g(n))$ is false, while $g(n) = O(f(n))$ is true. Conclusion: only $g(n) = O(f(n))$ is true.
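The three conclusions can be illustrated (not proved) by printing the relevant ratios for growing $n$. A sketch using only the Python standard library:

import math

for n in [4, 8, 16, 32, 64]:
    print(n,
          math.log(n) / math.log2(n),              # constant ln(2): both directions hold
          math.log2(math.log2(n)) / math.log2(n),  # tends to 0: only f = O(g)
          2**n / math.factorial(n))                # tends to 0: only g = O(f)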

5 Probability and Random Variables


5.1 Probability [12.5 pts]
State true or false. Here Ω denotes the sample space and Ac denotes the complement of the event A.
1. For any A, B ⊆ Ω, P( A| B) P( A) = P( B| A) P( B).
False
2. For any A, B ⊆ Ω, P( A ∪ B) = P( A) + P( B) − P( B ∩ A).
True
3. For any $A, B, C \subseteq \Omega$ such that $P(B \cup C) > 0$, $\frac{P(A \cup B \cup C)}{P(B \cup C)} \geq P(A \mid B \cup C)\, P(B)$.
True. Since $A \cup B \cup C \supseteq A \cap (B \cup C)$ and $P(B) \leq 1$, we have $P(A \cup B \cup C) \geq P(A \cap (B \cup C))\, P(B)$; dividing both sides by $P(B \cup C)$ gives exactly the stated inequality.
4. For any $A, B \subseteq \Omega$ such that $P(B) > 0$ and $P(A^c) > 0$, $P(B \mid A^c) + P(B \mid A) = 1$.
False
5. If A and B are independent events, then Ac and Bc are independent.
True

5.2 Discrete and Continuous Distributions [12.5 pts]


Match the distribution name to its probability density / mass function.


 
Matching: (a) Gamma → (j); (b) Multinomial → (i); (c) Laplace → (h); (d) Poisson → (l); (e) Dirichlet → (k).

(f) $f(x; \Sigma, \mu) = \frac{1}{\sqrt{(2\pi)^k \det(\Sigma)}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$

(g) $f(x; n, \alpha) = \binom{n}{x} \alpha^x (1 - \alpha)^{n - x}$ for $x \in \{0, \dots, n\}$; 0 otherwise

(h) $f(x; b, \mu) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right)$

(i) $f(x; n, \alpha) = \frac{n!}{\prod_{i=1}^k x_i!} \prod_{i=1}^k \alpha_i^{x_i}$ for $x_i \in \{0, \dots, n\}$ and $\sum_{i=1}^k x_i = n$; 0 otherwise

(j) $f(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$ for $x \in (0, +\infty)$; 0 otherwise

(k) $f(x; \alpha) = \frac{\Gamma(\sum_{i=1}^k \alpha_i)}{\prod_{i=1}^k \Gamma(\alpha_i)} \prod_{i=1}^k x_i^{\alpha_i - 1}$ for $x_i \in (0, 1)$ and $\sum_{i=1}^k x_i = 1$; 0 otherwise

(l) $f(x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}$ for $x \in \mathbb{Z}_+$; 0 otherwise

5.3 Mean and Variance [10 pts]


1. Consider a random variable which follows a Binomial distribution: X ∼ Binomial(n, p).
(a) What is the mean of the random variable?
µ = np
(b) What is the variance of the random variable?
$\text{Var}(X) = np(1 - p)$
2. Let X be a random variable and E[ X ] = 1, Var( X ) = 1. Compute the following values:
(a) E[5X ]
5
(b) Var(5X )
25
(c) Var( X + 5)
1
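These identities ($E[aX] = aE[X]$, $\text{Var}(aX) = a^2 \text{Var}(X)$, $\text{Var}(X + c) = \text{Var}(X)$) can be checked by simulation. A sketch assuming numpy; the Gaussian choice below is arbitrary, since any $X$ with $E[X] = 1$ and $\text{Var}(X) = 1$ would do:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(1, 1, size=1_000_000)  # one choice of X with E[X] = 1, Var(X) = 1
print((5 * X).mean())  # about 5
print((5 * X).var())   # about 25
print((X + 5).var())   # about 1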

5.4 Mutual and Conditional Independence [12 pts]


1. (3 pts) If X and Y are independent random variables, show that E[ XY ] = E[ X ]E[Y ].
By definition, if two random variables X and Y are independent, their joint probability distribution
factors as:
P( X = x, Y = y) = P( X = x ) P(Y = y)
Thus, the expected value of the product of $X$ and $Y$ is:
\[
E[XY] = \sum_x \sum_y x y\, P(X = x, Y = y)
\]
Using the independence property:
\[
E[XY] = \sum_x \sum_y x y\, P(X = x) P(Y = y)
\]
This can be rewritten as:
\[
E[XY] = \left( \sum_x x P(X = x) \right) \left( \sum_y y P(Y = y) \right)
\]


Hence, we have:
E[ XY ] = E[ X ]E[Y ]

2. (3 pts) If X and Y are independent random variables, show that Var( X + Y ) = Var( X ) + Var(Y ).
Hint: Var( X + Y ) = Var( X ) + 2Cov( X, Y ) + Var(Y )
Expanding the variance of a sum of two random variables:

Var( X + Y ) = Var( X ) + Var(Y ) + 2Cov( X, Y )

Since X and Y are independent, their covariance is zero:

Cov( X, Y ) = E[ XY ] − E[ X ]E[Y ] = 0

Thus, the variance simplifies to:

Var( X + Y ) = Var( X ) + Var(Y )

3. (6 pts) If we roll two dice that behave independently of each other, will the result of the first die tell us
something about the result of the second die?
No. Since the two dice behave independently, the result of the first die tells us nothing about the result of the second die.
If, however, the first die’s result is a 1, and someone tells you about a third event — that the sum of
the two results is even — then given this information is the result of the second die independent of the
first die?
No. Given that the first die's result is 1 (an odd number), the sum can be even only if the second die is also odd, i.e., shows 1, 3, or 5. The result of the second die is therefore no longer independent of the first: conditioning on the third event restricts the second die to odd values.
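A simulation makes both claims concrete (an optional sketch, assuming numpy): unconditionally the second die is uniform regardless of the first, but conditioned on an even sum and a first roll of 1, the second die is supported only on {1, 3, 5}.

import numpy as np

rng = np.random.default_rng(0)
d1 = rng.integers(1, 7, size=1_000_000)
d2 = rng.integers(1, 7, size=1_000_000)

# Unconditionally: empirical distribution of d2 given d1 = 1 is uniform on 1..6
print(np.bincount(d2[d1 == 1], minlength=7)[1:] / (d1 == 1).sum())

# Conditioned on d1 = 1 and an even sum: d2 is uniform on {1, 3, 5} only
mask = (d1 == 1) & ((d1 + d2) % 2 == 0)
print(np.bincount(d2[mask], minlength=7)[1:] / mask.sum())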

5.5 Central Limit Theorem [3 pts]


Prove the following result.

1. Let $X_1, \dots, X_n$ be iid with $X_i \sim N(0, 1)$, and let $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$. Then the distribution of $\bar{X}$ satisfies
\[
\sqrt{n}\, \bar{X} \xrightarrow{n \to \infty} N(0, 1)
\]

Given that Xi ∼ N (0, 1), the mean and variance of each Xi are:

\[
E[X_i] = 0 \quad \text{and} \quad \text{Var}(X_i) = 1
\]

Step 1. The expectation of the sample mean $\bar{X}$ is:
\[
E[\bar{X}] = \frac{1}{n} \sum_{i=1}^n E[X_i] = \frac{1}{n} \cdot 0 = 0
\]

Step 2. Since the $X_i$'s are independent and identically distributed (iid), the variance of $\bar{X}$ is:
\[
\text{Var}(\bar{X}) = \frac{1}{n^2} \sum_{i=1}^n \text{Var}(X_i) = \frac{1}{n^2} \cdot n = \frac{1}{n}
\]

Step 3. Consider the scaled sample mean $\sqrt{n}\, \bar{X}$. Its expectation is:


\[
E[\sqrt{n}\, \bar{X}] = \sqrt{n} \cdot E[\bar{X}] = \sqrt{n} \cdot 0 = 0
\]
The variance of $\sqrt{n}\, \bar{X}$ is:
\[
\text{Var}(\sqrt{n}\, \bar{X}) = n \cdot \text{Var}(\bar{X}) = n \cdot \frac{1}{n} = 1
\]
Moreover, $\sqrt{n}\, \bar{X}$ is a linear combination of independent Gaussians, so it is itself Gaussian; with mean 0 and variance 1, it is exactly $N(0, 1)$ for every $n$. In particular,
\[
\sqrt{n}\, \bar{X} \xrightarrow{d} N(0, 1)
\]

Thus, we have shown that:


\[
\sqrt{n}\, \bar{X} \xrightarrow{n \to \infty} N(0, 1)
\]

This completes the proof.
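The result can also be observed empirically. A minimal simulation sketch, assuming numpy (with a fixed $n$ standing in for the limit):

import numpy as np

rng = np.random.default_rng(0)
n, reps = 1000, 10_000
Xbar = rng.standard_normal((reps, n)).mean(axis=1)  # 10,000 sample means
scaled = np.sqrt(n) * Xbar
print(scaled.mean(), scaled.var())  # approximately 0 and 1, consistent with N(0, 1)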

6 Linear Algebra
6.1 Norms [5 pts]
Draw the regions corresponding to vectors $x \in \mathbb{R}^2$ with the following norms:

1. $\|x\|_1 \leq 1$ (Recall that $\|x\|_1 = \sum_i |x_i|$)

2. $\|x\|_2 \leq 1$ (Recall that $\|x\|_2 = \sqrt{\sum_i x_i^2}$)


3. $\|x\|_\infty \leq 1$ (Recall that $\|x\|_\infty = \max_i |x_i|$)
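The three regions are the diamond ($\ell_1$ ball), the unit disk ($\ell_2$ ball), and the unit square ($\ell_\infty$ ball). One way to render them (a sketch, assuming numpy and matplotlib; the drawings themselves are the required answer):

import numpy as np
import matplotlib.pyplot as plt

# Shade the subset of the square [-1.5, 1.5]^2 where each norm is <= 1
g = np.linspace(-1.5, 1.5, 601)
x1, x2 = np.meshgrid(g, g)
balls = {
    'L1 ball (diamond)': np.abs(x1) + np.abs(x2),
    'L2 ball (disk)': np.sqrt(x1**2 + x2**2),
    'Linf ball (square)': np.maximum(np.abs(x1), np.abs(x2)),
}
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (title, norm) in zip(axes, balls.items()):
    ax.contourf(x1, x2, (norm <= 1).astype(float), levels=[0.5, 1.5])
    ax.set_title(title)
    ax.set_aspect('equal')
plt.show()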

 
For $M = \begin{bmatrix} 5 & 0 & 0 \\ 0 & 7 & 0 \\ 0 & 0 & 3 \end{bmatrix}$, calculate the following norms.
4. $\|M\|_2$ (L2 norm)
$\|M\|_2 = 7$ (the largest singular value of $M$)
5. $\|M\|_F$ (Frobenius norm)
$\|M\|_F = \sqrt{5^2 + 7^2 + 3^2} = \sqrt{25 + 49 + 9} = \sqrt{83}$
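Both matrix norms can be confirmed numerically. An optional sketch, assuming numpy:

import numpy as np

M = np.diag([5.0, 7.0, 3.0])
print(np.linalg.norm(M, 2))      # 7.0, the largest singular value
print(np.linalg.norm(M, 'fro'))  # sqrt(83), about 9.1104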

6.2 Geometry [10 pts]


Prove the following. Provide all steps.
1. The smallest Euclidean distance from the origin to some point $x$ in the hyperplane $w^T x + b = 0$ is $\frac{|b|}{\|w\|_2}$. You may assume $w \neq 0$.
Any point $x$ on the hyperplane satisfies $w^T x = -b$. By the Cauchy-Schwarz inequality,
\[
|b| = |w^T x| \leq \|w\|_2 \|x\|_2,
\]
so every point on the hyperplane has $\|x\|_2 \geq \frac{|b|}{\|w\|_2}$. This lower bound is attained: the point $x^* = -\frac{b}{\|w\|_2^2} w$ lies on the hyperplane, since $w^T x^* = -b$, and has $\|x^*\|_2 = \frac{|b|}{\|w\|_2}$. Thus, the smallest Euclidean distance from the origin to the hyperplane is
\[
d = \frac{|b|}{\|w\|_2}
\]

2. The Euclidean distance between two parallel hyperplanes $w^T x + b_1 = 0$ and $w^T x + b_2 = 0$ is $\frac{|b_1 - b_2|}{\|w\|_2}$. (Hint: you can use the result from the last question to help you prove this one.)
Since the hyperplanes are parallel, they share the same normal vector w. The distance between these
hyperplanes is the perpendicular distance between a point on one hyperplane to the other hyperplane.
Choose a point $x_1$ on the first hyperplane $w^T x + b_1 = 0$; since $x_1$ satisfies the hyperplane equation,
\[
w^T x_1 = -b_1
\]


The distance from the point $x_1$ (which lies on the first hyperplane) to the second hyperplane $w^T x + b_2 = 0$ is given by the point-to-hyperplane formula:
\[
d = \frac{|w^T x_1 + b_2|}{\|w\|_2}
\]

Substituting $w^T x_1 = -b_1$ into the formula, we get:
\[
d = \frac{|(-b_1) + b_2|}{\|w\|_2} = \frac{|b_1 - b_2|}{\|w\|_2}
\]
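A numeric spot-check of both distance formulas (a sketch, assuming numpy; the vector $w$ and offsets $b_1$, $b_2$ below are arbitrary illustrative values):

import numpy as np

w, b1, b2 = np.array([3.0, 4.0]), 2.0, -8.0  # ||w|| = 5

# Closest point on w^T x + b1 = 0 to the origin: x* = -b1 * w / ||w||^2
x_star = -b1 * w / np.dot(w, w)
print(np.linalg.norm(x_star), abs(b1) / np.linalg.norm(w))  # both 0.4

# Distance between the two parallel hyperplanes
print(abs(b1 - b2) / np.linalg.norm(w))  # 2.0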

7 Programming Skills [10 pts]


Sampling from a distribution. For each question, submit a scatter plot (you will have 2 plots in total). Make
sure the axes for all plots have the same ranges.
1. Make a scatter plot by drawing 100 items from a two-dimensional Gaussian $N((1, -1)^T, 2I)$, where $I$ is the identity matrix in $\mathbb{R}^{2 \times 2}$.

  
2. Make a scatter plot by drawing 100 items from a mixture distribution
\[
0.3\, N\!\left((5, 0)^T, \begin{bmatrix} 1 & 0.25 \\ 0.25 & 1 \end{bmatrix}\right) + 0.7\, N\!\left((-5, 0)^T, \begin{bmatrix} 1 & -0.25 \\ -0.25 & 1 \end{bmatrix}\right)
\]
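A minimal sketch for both plots, assuming numpy and matplotlib (the mixture is sampled by choosing a component per point with probability 0.3/0.7, then drawing from the chosen Gaussian):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Plot 1: 100 draws from N((1, -1)^T, 2I)
s1 = rng.multivariate_normal([1, -1], 2 * np.eye(2), size=100)

# Plot 2: 100 draws from the two-component Gaussian mixture
C1 = np.array([[1.0, 0.25], [0.25, 1.0]])
C2 = np.array([[1.0, -0.25], [-0.25, 1.0]])
pick_first = rng.random(100) < 0.3
s2 = np.where(pick_first[:, None],
              rng.multivariate_normal([5, 0], C1, size=100),
              rng.multivariate_normal([-5, 0], C2, size=100))

for i, s in enumerate([s1, s2], start=1):
    plt.figure()
    plt.scatter(s[:, 0], s[:, 1], s=10)
    plt.xlim(-10, 10)  # same axis ranges for both plots
    plt.ylim(-10, 10)
    plt.title(f'Plot {i}')
plt.show()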
