Homework 1
Due: 26th Feb 2019, Before Class (Spring 2019)
Name:
Roll Number:
2. You are allowed to discuss among yourselves and with TAs for general advice, but you must
submit your own work.
3. Plagiarism will not be tolerated.
4. The write-ups must be very clear, to-the-point, and presentable. We will not just be looking for
correct answers; rather, the highest grades will be awarded to write-ups that demonstrate a clear
understanding of the material. Write your solutions as if you were explaining your answer to a
colleague. Style matters and will be a factor in the grade.
5. Codes and their results must be submitted with the homework in hard copy. Ideally, we would like you
to submit a very well documented printout of a Jupyter Notebook.
Suggested readings:
1. Elements of Statistical Learning (by Hastie, Tibshirani, and Friedman): Chapter 1 (pages 1-8).
2. Learning from Data (by Abu-Mostafa, Magdon-Ismail, Lin): Chapter 1. This covers much of the same
ground we did in our “First look at generalization”, and does so in a very conversational manner.
3. “Introduction to Statistical Learning Theory” by Bousquet, Boucheron, and Lugosi: Sections 1-3 (first
13 pages). Ignore Section 2.1 for now. The beginning of this paper provides an overview of using
concentration bounds to bound the performance of empirical risk minimization in the context of a
finite set of hypotheses. You can download the paper from the course web page.
Problems:
1. Suppose that we have some number m of coins. Each coin has the same probability of landing on heads,
denoted p. Suppose that we pick up each of the m coins in turn and for each coin do n independent
coin tosses. Note that the probability of obtaining exactly k heads out of n tosses for any given coin
is given by the binomial distribution:
P[k | n, p] = (n choose k) p^k (1 − p)^(n−k)
For each series of coin tosses, we will record the results via the empirical estimates given by
p̂_i = k_i/n, where k_i is the number of heads observed for coin i. Plot your empirical estimate of
P[max_{i=1,...,m} |p̂_i − p| > ε] as a function of ε ∈ [0, 1]. (Note that if n = 10, p̂_i can take only 11
discrete values, so your plot will have discrete jumps at certain values as ε changes.) On the same
plot, show the bound that results from applying Hoeffding's inequality together with the union bound.
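A minimal Python sketch of one way to carry out this simulation is given below. The specific values of m, n, p, the number of repetitions, and the grid of ε values are placeholders (the intended experimental parameters are not specified here), and the Hoeffding/union bound curve is left for you to fill in with the expression you derive.

    import numpy as np
    import matplotlib.pyplot as plt

    # Placeholder parameters -- substitute whatever values the experiment calls for.
    m, n, p = 1000, 10, 0.5          # number of coins, tosses per coin, P(heads)
    num_trials = 2000                # repetitions used to estimate the probability
    eps = np.linspace(0, 1, 200)     # grid of epsilon values

    rng = np.random.default_rng(0)
    exceed_counts = np.zeros_like(eps)

    for _ in range(num_trials):
        # Toss each of the m coins n times and form the empirical estimates p_hat_i.
        heads = rng.binomial(n, p, size=m)
        p_hat = heads / n
        max_dev = np.max(np.abs(p_hat - p))
        # Record, for each epsilon, whether max_i |p_hat_i - p| exceeded it.
        exceed_counts += (max_dev > eps)

    empirical_prob = exceed_counts / num_trials

    # Replace this placeholder with the bound you obtain from Hoeffding's
    # inequality combined with the union bound (as a function of eps).
    union_bound = np.ones_like(eps)

    plt.step(eps, empirical_prob, where="post", label="empirical estimate")
    plt.plot(eps, union_bound, label="Hoeffding + union bound")
    plt.xlabel("epsilon")
    plt.ylabel("P[max_i |p_hat_i - p| > epsilon]")
    plt.legend()
    plt.show()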
2. (a) Suppose that X is a Gaussian random variable with zero mean and variance σ²:
fX(x) = (1/√(2πσ²)) e^(−x²/(2σ²))
Find a tail bound for X using the Chernoff bounding method (see page 9 of the notes on
concentration inequalities). In other words, fill in the right hand side below with an expression
that depends on t (and σ²):
P[X > t] ≤ ???
To make this bound as good as possible, optimize over your choice of λ. Expressions for the
moment generating function of a Gaussian random variable are easy to come by (e.g., in the
“Normal distribution” entry in Wikipedia).
(b) Suppose that X1, X2, ..., Xm are iid Gaussian random variables with mean 0 and variance σ².
Using your answer for part (a) and the union bound, find a bound for
P[max_{i=1,...,m} Xi > t] ≤ ???
(c) For Xi as in part (b), complete the following sentence: With probability at least 0.9,
max_{i=1,...,m} Xi ≤ ???
(d) Use Python to generate a histogram of samples of
Z = max_{i=1,...,m} Xi,   where Xi ∼ Normal(0, 1),
for m = 10^β with β = 3, 4, 5, 6. The code in hist-example.py should help you get started. Discuss
in the context of your answer to part (c). Turn in plots of your histograms along with your
comments.
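The following is a minimal sketch of one way to produce these histograms, written without reference to hist-example.py (whose contents are not reproduced here); the number of samples drawn per histogram is a placeholder.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    num_samples = 500   # placeholder: number of draws of Z per value of m

    fig, axes = plt.subplots(2, 2, figsize=(10, 8))
    for ax, beta in zip(axes.ravel(), [3, 4, 5, 6]):
        m = 10**beta
        # Each draw of Z is the maximum of m independent standard normal samples.
        Z = np.array([rng.standard_normal(m).max() for _ in range(num_samples)])
        ax.hist(Z, bins=30)
        ax.set_title(f"m = 10^{beta}")
        ax.set_xlabel("Z = max_i X_i")
    fig.tight_layout()
    plt.show()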
3. In this problem we will explore nearest neighbor classification in Python.
The file knn-example.py provides a good start at this. You should be able to run this in the IPython
environment simply by typing run knn-example.py. This uses the NumPy, Matplotlib, and scikit-learn
Python packages. These should come included in the standard Anaconda distribution, but if you
don't have them you will need to install them first.
The file begins by loading the appropriate packages and fixing the random seed so that your results will
be repeatable. It then generates a simple 2-dimensional dataset with n datapoints from two possible
classes. Next it builds a k-nearest neighbor classifier. Finally, it plots the results. Before going further,
spend some time with this and try to understand what the code is doing.
In this problem I would like you to design a k-nearest neighbor classifier for several different
values of n. In particular, I would like you to consider n = 100, 500, 1000, 5000. For each of these values
of n, experiment with different choices of k and decide what the “best” choice of k is for each of these
values of n (either based on the visual results, or using some quantitative method of your own devising).
Provide a table showing your choices of k, and include a plot of the resulting classifier for each value
of n.
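As one possible starting point, the sketch below picks k by 5-fold cross-validated accuracy for each n. It does not use knn-example.py (whose contents and dataset are not reproduced here), so the make_moons dataset is only a stand-in for the data that file generates.

    import numpy as np
    from sklearn.datasets import make_moons               # stand-in dataset
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # One quantitative way to choose the "best" k: cross-validated accuracy.
    for n in [100, 500, 1000, 5000]:
        X, y = make_moons(n_samples=n, noise=0.3, random_state=0)
        best_k, best_score = None, -np.inf
        for k in range(1, 51, 2):                          # odd k avoids ties in binary voting
            score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
            if score > best_score:
                best_k, best_score = k, score
        print(f"n = {n:5d}: best k = {best_k:2d} (CV accuracy = {best_score:.3f})")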
4. Consider a binary classification problem involving a single (scalar) feature x and suppose that X|Y = 0
and X|Y = 1 are continuous random variables with densities given by
fX|Y(x | Y = 0) = g0(x) = (1/√2) e^(−√2 |x|)   and   fX|Y(x | Y = 1) = g1(x) = (1/√(2π)) e^(−x²/2)
respectively.
(a) Plot g0 and g1 .
(b) Suppose that π0 = P[Y = 0] = 1/2 and hence π1 = P[Y = 1] = 1/2. Derive the optimal classification
rule in terms of minimizing the probability of error. Relate this rule to the plot of g0 and g1 .
(c) Calculate the Bayes risk for this classification problem (i.e., calculate the probability of error
for the classification rule derived above). You can use Python to compute integrals of the
Gaussian density.
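For parts (a) and (c), a minimal sketch along the following lines may help; the integration interval at the end is only a placeholder, since the actual regions to integrate over come out of the rule you derive in part (b).

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.integrate import quad

    # Part (a): plot the two class-conditional densities.
    def g0(x):
        return (1 / np.sqrt(2)) * np.exp(-np.sqrt(2) * np.abs(x))

    def g1(x):
        return (1 / np.sqrt(2 * np.pi)) * np.exp(-x**2 / 2)

    x = np.linspace(-4, 4, 500)
    plt.plot(x, g0(x), label="g0(x) = f(x | Y = 0)")
    plt.plot(x, g1(x), label="g1(x) = f(x | Y = 1)")
    plt.xlabel("x")
    plt.legend()
    plt.show()

    # Part (c): the probability of error is a sum of integrals of g0 and g1 over the
    # regions where each class is misclassified.  Example of a numerical integral
    # over a placeholder interval [a, b] -- replace with your decision regions.
    a, b = 0.0, 1.0
    prob, _ = quad(g1, a, b)
    print(prob)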
5. Suppose that our probability model for (X, Y ), where X takes values in Rd and Y takes values in
{0, 1}, is given by
P[Y = 0] = π0,    P[Y = 1] = π1 = 1 − π0
and the conditional densities
fX|Y(x | Y = 0) = g0(x) = (1/√((2π)^d det(Σ))) exp(−(1/2)(x − µ0)^T Σ^(−1) (x − µ0))
fX|Y(x | Y = 1) = g1(x) = (1/√((2π)^d det(Σ))) exp(−(1/2)(x − µ1)^T Σ^(−1) (x − µ1))
That is, X|Y = 0 and X|Y = 1 are multivariate normal random variables with the same covariance
matrix Σ and means µ0, µ1, respectively. (Recall that covariance matrices are symmetric and have
positive eigenvalues.)
(a) Find the Bayes classification rule (in terms of the πi , µi , and Σ).
(b) Find w ∈ R^d and b ∈ R such that your rule can be expressed as
h*(x) = 1 if w^T x + b ≥ 0,  and  h*(x) = 0 otherwise.
(This question is easier than it looks. It is really just a matter of manipulating the expressions
above. Note that you can work with the log of the functions, since if f (x) > 0 and g(x) > 0,