
EE512: Machine Learning

Homework 1
Due 26th Feb 2019, before class. Spring 2019

Name:
Roll Number:

1. Make sure to read the questions carefully before attempting them.

2. You are allowed to discuss among yourselves and with TAs for general advice, but you must
submit your own work.
3. Plagiarism will not be tolerated.

4. The write-ups must be very clear, to-the-point, and presentable. We will not just be looking for
correct answers; rather, the highest grades will be awarded to write-ups that demonstrate a clear
understanding of the material. Write your solutions as if you were explaining your answer to a
colleague. Style matters and will be a factor in the grade.

5. Codes and their results must be submitted with the homework in hard copy. Ideally, we would like you
to submit a well-documented printout of a Jupyter notebook.
Suggested readings:
1. Elements of Statistical Learning (by Hastie, Tibshirani, and Friedman): Chapter 1 (pages 1-8).
2. Learning from Data (by Abu-Mostafa, Magdon-Ismail, Lin): Chapter 1. This covers much of the same
ground we did in our "First look at generalization", and does so in a very conversational manner.
3. "Introduction to Statistical Learning Theory" by Bousquet, Boucheron, and Lugosi: Sections 1-3 (first
13 pages). Ignore Section 2.1 for now. The beginning of this paper provides an overview of using
concentration bounds to bound the performance of empirical risk minimization in the context of a
finite set of hypotheses. You can download the paper from the course web page.
Problems:
1. Suppose that we have some number m of coins. Each coin has the same probability of landing on heads,
denoted p. Suppose that we pick up each of the m coins in turn and for each coin do n independent
coin tosses. Note that the probability of obtaining exactly k heads out of n tosses for any given coin
is given by the binomial distribution:
 
P[k \mid n, p] = \binom{n}{k}\, p^k (1 - p)^{n-k}
For each series of coin tosses, we will record the results via the empirical estimates given by

\hat{p}_i = \frac{\text{number of times coin } i \text{ lands on heads}}{n}
(a) Assume that n = 10. If all the coins have p = 0.05, compute a formula for the exact probability
that at least one coin will have p̂_i = 0. (This may be easier to calculate by instead computing the
probability that this does not occur.) Give a table containing the values of this probability for
the cases of m = 1, m = 1,000, and m = 1,000,000. Repeat for p = 0.75.
(b) Now assume that n = 10, m = 2, and that p = 0.5 for both coins. Compute (exactly, via a
formula) and then plot/sketch

P\left[\max_i |\hat{p}_i - p| > \epsilon\right]

as a function of ε ∈ [0, 1]. (Note that if n = 10, p̂_i can take only 11 discrete values, so your plot
will have discrete jumps at certain values as ε changes.) On the same plot, show the bound that
results from applying Hoeffding's inequality together with the union bound.
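
A minimal Python sketch for part (a), based on the complement suggested above; the function name and the loop over (p, m) values are my own choices, and you should verify the formula independently:

def prob_at_least_one_zero(p, n, m):
    """P[at least one of m coins records zero heads in n tosses].

    For one coin, P[p_hat_i = 0] = (1 - p)**n; by independence across coins,
    P[no coin records zero heads] = (1 - (1 - p)**n)**m, and we return the complement.
    """
    q = (1.0 - p) ** n
    return 1.0 - (1.0 - q) ** m

n = 10
for p in (0.05, 0.75):
    for m in (1, 1_000, 1_000_000):
        print(f"p = {p}, m = {m}: {prob_at_least_one_zero(p, n, m):.6g}")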

2. (a) Suppose that X is a Gaussian random variable with zero mean and variance σ²:

f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-x^2 / 2\sigma^2}
Find a tail bound for X using the Chernoff bounding method (see page 9 of the notes on
concentration inequalities). In other words, fill in the right hand side below with an expression
that depends on t (and σ²):
P[X > t] ≤ ???
To make this bound as good as possible, optimize over your choice of λ. Expressions for the
moment generating function of a Gaussian random variable are easy to come by (e.g., in the
"Normal distribution" entry in Wikipedia).
(b) Suppose that X1 , X2 , ..., Xm are iid Gaussian random variables with mean 0 and variance σ 2 .
Using your answer for part (a) and the union bound, find a bound for
 
P\left[\max_{i=1,\ldots,m} X_i > t\right] \le\ ???
(c) For Xi as in part (b), complete the following sentence: With probability at least 0.9,

\max_{i=1,\ldots,m} X_i \le\ ???

(d) Using Python, create histograms for the random variable

Z = \max_{i=1,\ldots,m} X_i, \qquad X_i \sim \mathrm{Normal}(0, 1)

for m = 10^β for β = 3, 4, 5, 6. The code in hist-example.py should help you get started. Discuss
in the context of your answer to part (c). Turn in plots of your histograms along with your
comments.
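
For part (d), a minimal sketch of the kind of histogram code one might write; it does not reproduce hist-example.py, and the number of trials and the plotting choices are assumptions:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)        # fix the seed for repeatable results
n_trials = 1000                       # samples of Z per histogram

fig, axes = plt.subplots(1, 4, figsize=(16, 3))
for ax, beta in zip(axes, (3, 4, 5, 6)):
    m = 10 ** beta
    # one sample of Z per trial: the max of m iid N(0, 1) draws
    # (the beta = 6 case takes a little while)
    Z = np.array([rng.standard_normal(m).max() for _ in range(n_trials)])
    ax.hist(Z, bins=40)
    ax.set_title(f"m = 10^{beta}")
    ax.set_xlabel("Z = max X_i")
plt.tight_layout()
plt.show()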
3. In this problem we will explore nearest neighbor classification in Python.

The file knn-example.py provides a good start at this. You should be able to run this in the IPython
environment simply by typing run knn-example.py. This uses the NumPy, Matplotlib, and scikit-learn
Python packages. These should come included in the standard Anaconda distribution, but if you
don't have them you will need to install them first.

The file begins by loading the appropriate packages and fixing the random seed so that your results will
be repeatable. It then generates a simple 2-dimensional dataset with n datapoints from two possible
classes. Next it builds a k-nearest neighbor classifier. Finally, it plots the results. Before going further,
spend some time with this and try to understand what the code is doing.

In this problem I would like you to design a k-nearest neighbor classifier for several different
values of n. In particular, I would like you to consider n = 100, 500, 1000, 5000. For each of these values
of n, experiment with different choices of k and decide what the "best" choice of k is for each of these
values of n (either based on the visual results, or using some quantitative method of your own devising).
Provide a table showing your choices of k, and include a plot of the resulting classifier for each value
of n.
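
One possible quantitative way to pick k is cross-validated accuracy, sketched below with scikit-learn. The synthetic dataset (make_moons) and the candidate k values are stand-ins: knn-example.py generates its own data, which you should use instead.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

np.random.seed(0)  # fix the seed for repeatable results

for n in (100, 500, 1000, 5000):
    # stand-in two-class 2-D dataset; swap in the data from knn-example.py
    X, y = make_moons(n_samples=n, noise=0.3, random_state=0)
    best_k, best_score = None, -np.inf
    for k in (1, 3, 5, 11, 21, 51):
        # 5-fold cross-validated accuracy as the "quantitative method"
        score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
        if score > best_score:
            best_k, best_score = k, score
    print(f"n = {n}: best k = {best_k} (CV accuracy {best_score:.3f})")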
4. Consider a binary classification problem involving a single (scalar) feature x and suppose that X|Y = 0
and X|Y = 1 are continuous random variables with densities given by
f_{X|Y}(x \mid Y = 0) = g_0(x) = \frac{1}{\sqrt{2}}\, e^{-\sqrt{2}\,|x|} \qquad\text{and}\qquad f_{X|Y}(x \mid Y = 1) = g_1(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}
respectively.
(a) Plot g0 and g1 .
(b) Suppose that π0 = P[Y = 0] = 1/2 and hence π1 = P[Y = 1] = 1/2. Derive the optimal classification
rule in terms of minimizing the probability of error. Relate this rule to the plot of g0 and g1.
(c) Calculate the Bayes risk for this classification problem (i.e., calculate the probability of error
for the classification rule derived above). You can use Python to compute integrals of the
Gaussian density.
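
For part (c), a possible numerical check: with equal priors, the probability of error of the optimal rule equals one half of the integral of min(g0, g1) over the real line. The sketch below assumes SciPy is available and makes no attempt to tune the quadrature:

import numpy as np
from scipy.integrate import quad

def g0(x):
    # class-0 density from the problem statement
    return np.exp(-np.sqrt(2.0) * abs(x)) / np.sqrt(2.0)

def g1(x):
    # class-1 (standard normal) density
    return np.exp(-x ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

# With pi_0 = pi_1 = 1/2, the Bayes risk is (1/2) * integral of min(g0, g1).
# Splitting the integral at the points where g0 = g1 would improve accuracy.
risk, err = quad(lambda x: 0.5 * min(g0(x), g1(x)), -np.inf, np.inf)
print(f"estimated Bayes risk ~ {risk:.4f} (quad error estimate {err:.1e})")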
5. Suppose that our probability model for (X, Y ), where X takes values in Rd and Y takes values in
{0, 1}, is given by
P[Y = 0] = π0,   P[Y = 1] = π1 = 1 − π0
and the conditional densities
f_{X|Y}(x \mid Y = 0) = g_0(x) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}}\, e^{-\frac{1}{2}(x-\mu_0)^T \Sigma^{-1} (x-\mu_0)}

f_{X|Y}(x \mid Y = 1) = g_1(x) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}}\, e^{-\frac{1}{2}(x-\mu_1)^T \Sigma^{-1} (x-\mu_1)}
That is, X|Y = 0 and X|Y = 1 are multivariate normal random variables with the same covariance
matrix Σ and means µ0, µ1, respectively. (Recall that covariance matrices are symmetric and have
positive eigenvalues.)

(a) Find the Bayes classification rule (in terms of the πi, µi, and Σ).
(b) Find w ∈ Rd and b ∈ R such that your rule can be expressed as
h^*(x) = \begin{cases} 1 & w^T x + b \ge 0 \\ 0 & \text{otherwise} \end{cases}

(This question is easier than it looks. It is really just a matter of manipulating the expressions
above. Note that you can work with the log of the functions, since if f(x) > 0 and g(x) > 0,

f(x) \ge g(x) \iff \log f(x) \ge \log g(x).)
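
As a small worked step toward part (b) (a sketch of the setup only; completing the algebra to read off w and b is the exercise): the minimum-error rule predicts 1 when π1 g1(x) ≥ π0 g0(x), and taking logs of both sides gives

\log \pi_1 - \tfrac{1}{2}(x - \mu_1)^T \Sigma^{-1}(x - \mu_1) \;\ge\; \log \pi_0 - \tfrac{1}{2}(x - \mu_0)^T \Sigma^{-1}(x - \mu_0),

since the common normalizing constant cancels. Expanding both quadratics, the x^T Σ^{-1} x terms cancel as well, which is why the resulting rule is linear in x.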
