10-701/15-781, Machine Learning: Homework 1: Aarti Singh Carnegie Mellon University
Aarti Singh
Carnegie Mellon University
• The assignment is due at 10:30 am (beginning of class) on Mon, Sept 27, 2010.
• Separate your answers into five parts, one for each TA, and put them into 5 piles at the table in front
of the class. Don’t forget to put both your name and a TA’s name on each part.
• If you have a question about any part, please direct it to the respective TA who designed that
part.
For each of the tasks above, please specify what type of machine learning problem it is (regression,
classification, density estimation, etc.). Identify what the training data, features, and labels (if any) will be, and what
the output of the algorithm would be.
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
1. Prove that P (A ∩ B ∩ C) = P (A|B, C)P (B|C)P (C)
2. Suppose we play a game where I present to you three doors, one of which has a prize behind. The
doors are closed and so you choose a door which you think conceals the prize. After you make your
choice, I open one of the two doors you didn’t pick, and reveal that the prize wasn’t there (note that
I can always do this). Then I give you the choice whether to stick with your current door, or switch
to the remaining un-opened door. What should you do to have the highest probability of winning the
prize? (Hint: consider the event that your initial door conceals the prize, then consider the probability
that you win given that you decide to switch or not).
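A simulation is one way to sanity-check an answer to the door game before writing the proof. The sketch below is an illustration only, not part of the assignment: the function names `play` and `win_rate`, the trial count, and the seed are assumptions introduced here. It estimates the winning frequency under each of the two strategies.

```python
import random


def play(switch: bool, rng: random.Random) -> bool:
    """Play one round of the three-door game; return True if the player wins."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the player's pick nor the prize.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Move to the single door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize


def win_rate(switch: bool, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate the probability of winning for a fixed strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    print("stick :", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

The simulated frequencies only approximate the true probabilities; the written answer should still follow the hint and condition on whether the initial door conceals the prize.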
2.2 Total Probability [5 points]
Suppose that I have two six-sided dice, one is fair and the other one is loaded – having:
$$P(x) = \begin{cases} \frac{1}{2} & x = 6 \\ \frac{1}{10} & x \in \{1, 2, 3, 4, 5\} \end{cases}$$
I will toss a coin to decide which die to roll. If the coin flip is heads I will roll the fair die, otherwise the
loaded one. The probability that the coin flip is heads is p ∈ (0, 1).
1. What is the expectation of the die roll (in terms of p)?
2. What is the variance of the die roll (in terms of p)?
A tool commonly used in statistics and machine learning is the so-called “mixture model”, which may
be seen as a generalization of the above scenario. For some sample space we have several distributions
$P_i(X) = P(X \mid C = i)$, $i = 1, \ldots, k$ (e.g., the two dice from above). We also have a distribution over
these “components”, $P(C = i)$ (e.g., the coin toss, where $C$ is a binary RV).
3. Show the form of $P(X)$ in terms of $P_i(X)$ and $P(C)$.
4. Show the form of $E(X)$ in terms of $E(X \mid C)$. Make your answer as compact as possible.
5. Show the form of $\mathrm{Var}(X)$ in terms of $\mathrm{Var}(X \mid C)$ and $E(X \mid C)$. Make your answer as compact as possible.
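Closed-form answers to the questions above can be checked numerically on the two-dice example. The sketch below is a hedged illustration: the helper names `mean`, `variance`, `mixture`, and `check` are assumptions made here, and the decompositions it compares against are the standard laws of total expectation and total variance, not a method prescribed by the assignment. Exact rational arithmetic avoids floating-point noise.

```python
from fractions import Fraction


def mean(pmf):
    """Expectation of a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())


def variance(pmf):
    """Variance of a discrete distribution given as {value: probability}."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


def mixture(weights, components):
    """Mixture pmf: P(X = x) = sum_i P(C = i) * P_i(X = x)."""
    out = {}
    for w, pmf in zip(weights, components):
        for x, p in pmf.items():
            out[x] = out.get(x, 0) + w * p
    return out


# The two dice from the problem statement.
fair = {x: Fraction(1, 6) for x in range(1, 7)}
loaded = {x: Fraction(1, 10) for x in range(1, 6)}
loaded[6] = Fraction(1, 2)


def check(p):
    """Return (mixture mean, total-expectation mean, mixture var, total-variance var)."""
    weights = [p, 1 - p]  # heads selects the fair die
    comps = [fair, loaded]
    mix = mixture(weights, comps)
    cond_means = [mean(c) for c in comps]
    total_mean = sum(w * m for w, m in zip(weights, cond_means))
    within = sum(w * variance(c) for w, c in zip(weights, comps))
    between = sum(w * (m - total_mean) ** 2 for w, m in zip(weights, cond_means))
    return mean(mix), total_mean, variance(mix), within + between


if __name__ == "__main__":
    print(check(Fraction(1, 3)))
```

Calling `check` with several values of p and comparing the first pair and the second pair of outputs gives a quick test of a hand-derived formula.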
A great deal of current work in machine learning is concerned with data that lie in a high-dimensional
space (for example text documents, which may be seen as vectors in the lattice points of $\mathbb{R}^d$, where $d$ is the
number of words in the language). In this problem we will see that we must be careful when porting familiar
concepts from the loving embrace of $\mathbb{R}^3$ into higher dimensions.
Consider the $d$-dimensional Gaussian distribution:
$$x \sim \mathcal{N}(0_{d \times 1}, I_{d \times d})$$
where $z \sim \mathcal{N}(\mu, \sigma^2)$. Here $g$ denotes a function and $g'$ denotes the derivative of $g$. Note that this is
called “Stein’s Lemma.”)
3. Compute the mean and variance of $T_d$.
4. Show that $P(d - 10\sqrt{2d} \le T_d \le d + 10\sqrt{2d}) \ge 0.99$ for suitably large $d$. To assist with this problem
you may use Chebyshev’s inequality, which is:
$$P(|X - EX| \ge \epsilon) \le \frac{\mathrm{Var}(X)}{\epsilon^2}$$
5. Prove that your answer above implies that $P(\sqrt{d} - 10 \le \sqrt{T_d} \le \sqrt{d} + 10) \ge 0.99$.
Observe that although the interval in the last expression moves out towards infinity as $d$ grows, its length is bounded.
We see, then, that in high dimensions the norm of a Gaussian vector is highly likely to be contained in
this small interval. In other words, in high dimensions the Gaussian distribution is more like a
“shell” than a sphere.
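The shell behaviour can be observed empirically. The sketch below is an illustration under stated assumptions: it takes the quantity of interest to be the Euclidean norm of a standard Gaussian vector, as the paragraph above describes, and the function names `gaussian_norm` and `fraction_in_shell`, the sample sizes, and the seed are choices made here. It uses only the standard library.

```python
import math
import random


def gaussian_norm(d: int, rng: random.Random) -> float:
    """Euclidean norm of one draw from a d-dimensional standard Gaussian."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))


def fraction_in_shell(d: int, samples: int = 200,
                      half_width: float = 10.0, seed: int = 0) -> float:
    """Fraction of sampled norms that fall in [sqrt(d) - half_width, sqrt(d) + half_width]."""
    rng = random.Random(seed)
    center = math.sqrt(d)
    hits = sum(abs(gaussian_norm(d, rng) - center) <= half_width
               for _ in range(samples))
    return hits / samples


if __name__ == "__main__":
    for d in (10, 100, 1000):
        print(d, fraction_in_shell(d))
```

Shrinking `half_width` shows how tightly the norms cluster around the square root of d as the dimension grows.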
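which is a function of $\theta$ under some fixed sample $\{x_1, x_2, \ldots, x_n\}$. The MLE estimate $\hat{\theta}_{\mathrm{mle}}$ is then defined
as follows:
1. $\hat{\theta}_{\mathrm{mle}} \in \Theta$.
2. $\forall \theta \in \Theta$, $l(\theta) \le l(\hat{\theta}_{\mathrm{mle}})$.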
If we have access to some prior distribution p(θ) over Θ, be it from past experiences or domain knowledge
or simply belief, we can think about the posterior distribution over Θ:
$$q(\theta) := \frac{\prod_{i=1}^{n} f(x_i \mid \theta)\, p(\theta)}{z(x_1, x_2, \ldots, x_n)}, \quad \text{where } z(x_1, x_2, \ldots, x_n) := \int_{\Theta} \prod_{i=1}^{n} f(x_i \mid \theta)\, p(\theta)\, d\theta.$$
The Poisson distribution is useful for modeling the number of events occurring within a unit of time, such
as the number of packets arriving at a server per minute. The probability mass function of a Poisson
distribution is as follows:
$$P(k \mid \lambda) := \frac{\lambda^k e^{-\lambda}}{k!},$$
where $\lambda > 0$ is the parameter of the distribution and $k \in \{0, 1, 2, \ldots\}$ is the discrete random variable modeling
the number of events encountered per unit time.
1. [2 points] Let $\{k_1, k_2, \ldots, k_n\}$ be an i.i.d. sample drawn from a Poisson distribution with parameter $\lambda$.
Derive the MLE estimate $\hat{\lambda}_{\mathrm{mle}}$ of $\lambda$ based on this sample.
2. [4 points] Let $K$ be a random variable following a Poisson distribution with parameter $\lambda$. What are its
mean $E[K]$ and variance $\mathrm{Var}[K]$? Since $\hat{\lambda}_{\mathrm{mle}}$ depends on the sample used for estimation, it is also a
random variable. Derive the mean and the variance of $\hat{\lambda}_{\mathrm{mle}}$, and compare them with $E[K]$ and $\mathrm{Var}[K]$.
What do you find?
3. [2 pt] Suppose you believe the Gamma distribution
$$p(\lambda) := \frac{\lambda^{\alpha - 1} e^{-\lambda/\beta}}{\Gamma(\alpha)\,\beta^{\alpha}}$$
is a good prior for $\lambda$, where $\Gamma(\cdot)$ is the Gamma function, and you also know the values of the two
hyper-parameters¹ $\alpha > 1$ and $\beta > 0$. Derive the MAP estimate $\hat{\lambda}_{\mathrm{map}}$.
4. [2 pt] What happens to $\hat{\lambda}_{\mathrm{map}}$ when the sample size $n$ goes to zero or infinity? How do they relate to
the prior distribution and $\hat{\lambda}_{\mathrm{mle}}$?
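A derived estimator can be checked against a brute-force maximization of the objective it is supposed to maximize. The sketch below is only a numerical aid under stated assumptions: the sample values, the hyper-parameter values, the grid bounds, and the function names are all hypothetical choices made here, and a grid search is a stand-in for the analytical derivation the questions ask for.

```python
import math


def poisson_log_likelihood(lam, sample):
    """Sum of log P(k | lam) over the sample, using the Poisson pmf."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in sample)


def gamma_log_prior(lam, alpha, beta):
    """Log of the Gamma prior density p(lam) with shape alpha and scale beta."""
    return ((alpha - 1) * math.log(lam) - lam / beta
            - math.lgamma(alpha) - alpha * math.log(beta))


def argmax_on_grid(objective, lo=0.001, hi=20.0, steps=20000):
    """Return the grid point in [lo, hi] that maximizes the objective."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(grid, key=objective)


def estimates(sample, alpha, beta):
    """Numerical MLE and MAP estimates of lam for the given sample and prior."""
    mle = argmax_on_grid(lambda lam: poisson_log_likelihood(lam, sample))
    map_estimate = argmax_on_grid(
        lambda lam: poisson_log_likelihood(lam, sample)
        + gamma_log_prior(lam, alpha, beta))
    return mle, map_estimate


if __name__ == "__main__":
    # Hypothetical counts and hyper-parameters, chosen only for illustration.
    print(estimates([2, 4, 3, 5, 1, 3], alpha=2.0, beta=1.0))
```

Repeating the call with longer or shorter samples gives an empirical feel for the limiting behaviour asked about in the last question.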
where $\Lambda$ is the inverse of the covariance matrix, or the so-called precision matrix. Let $\{x_1, x_2, \ldots, x_n\}$ be
an i.i.d. sample from a $p$-dimensional Gaussian distribution.
1. [4 points] Suppose that $n \gg p$. Derive the MLE estimates $\hat{\mu}_{\mathrm{mle}}$ and $\hat{\Lambda}_{\mathrm{mle}}$.
2. [4 points] Suppose you believe the following distribution²
$$W(\Lambda \mid V, \nu) := \frac{|\Lambda|^{(\nu - p - 1)/2}}{Z(V, \nu)} \exp\left(-\frac{\mathrm{tr}(V^{-1}\Lambda)}{2}\right)$$
with $\mathrm{tr}(\cdot)$ being the trace of a square matrix and $Z(V, \nu)$ the normalization term. You also know the
values of the hyper-parameters $\mu_0 \in \mathbb{R}^p$, $s > 0$, $\nu > p + 1$, and $V \in \mathbb{R}^{p \times p}$ being positive definite. Derive
the MAP estimates $\hat{\mu}_{\mathrm{map}}$ and $\hat{\Lambda}_{\mathrm{map}}$.
3. [2 points] Again, what happens to $\hat{\mu}_{\mathrm{map}}$ and $\hat{\Lambda}_{\mathrm{map}}$ when $n$ goes to zero or infinity? How do they relate
to the prior distribution and the MLE estimates?
It is known that MLEs do not always exist. Even if they do, they may not be unique.
¹ It is common to refer to parameters in a prior distribution as hyper-parameters.
² It is sometimes referred to as the Gaussian-Wishart prior.
1. [2 points] Give an example where MLEs do not exist. Please specify the family of distributions being
considered, and the kind of samples on which MLEs are not well-defined.
2. [2 points] Give an example where MLEs exist but are not unique. Please specify the family of distri-
butions being considered, and the kind of samples from which multiple MLEs can be found.
3. [1 pt] By finding the two examples as described above, hopefully you have gained some intuition on
the properties of the log-likelihood that are crucial to the existence and uniqueness of MLE. What are
those properties?
5 Naive Bayes vs Logistic Regression [Jayant, 25 points]
In this problem you will implement Naive Bayes and Logistic Regression, then compare their performance on a
document classification task. The data for this task is taken from the 20 Newsgroups data set³, and is available
from http://www.cs.cmu.edu/~aarti/Class/10701/hws/hw1-data.tar.gz. The included README.txt
describes the data set and file format.
Our Naive Bayes model will use the bag-of-words assumption. This model assumes that each word in a
document is drawn independently from a multinomial distribution over possible words. (A multinomial
distribution is a generalization of a Bernoulli distribution to multiple values.) Although this model ignores
the ordering of words in a document, it works surprisingly well for a number of tasks. We number the words
in our vocabulary from 1 to m, where m is the total number of distinct words in all of the documents.
Documents from class $y$ are drawn from a class-specific multinomial distribution parameterized by $\theta_y$. $\theta_y$ is
a vector, where $\theta_{y,i}$ is the probability of drawing word $i$ and $\sum_{i=1}^{m} \theta_{y,i} = 1$. Therefore, the class-conditional
probability of drawing document $x$ from our Naive Bayes model is $P(X = x \mid Y = y) = \prod_{i=1}^{m} (\theta_{y,i})^{\mathrm{count}_i(x)}$,
where $\mathrm{count}_i(x)$ is the number of times word $i$ appears in $x$.
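In practice the product in the class-conditional probability is usually evaluated in log space, since multiplying many small probabilities underflows quickly. The sketch below illustrates this for a single class; it is not the assignment's reference code. The function name `log_class_conditional`, the toy parameter vector, and the toy document are assumptions made here, and words are indexed from 0 rather than from 1 as in the text.

```python
import math
from collections import Counter


def log_class_conditional(doc_counts, theta):
    """log P(X = x | Y = y) = sum_i count_i(x) * log(theta[i]) for one class y.

    doc_counts maps a word index to its count in the document;
    theta is the word-probability vector for the class.
    """
    total = 0.0
    for word, count in doc_counts.items():
        if theta[word] == 0.0:
            # A zero-probability word makes the whole product zero.
            return float("-inf")
        total += count * math.log(theta[word])
    return total


if __name__ == "__main__":
    theta_y = [0.5, 0.3, 0.2]      # hypothetical parameters for one class
    doc = Counter([0, 0, 1, 2])    # word 0 twice, words 1 and 2 once each
    print(math.exp(log_class_conditional(doc, theta_y)))
```

To classify, the same quantity would be computed for every class and combined with the log of the class prior.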
1. [6 points] Provide high-level descriptions of the Naive Bayes and Logistic Regression algorithms. Be
sure to describe how to estimate the model parameters and how to classify a new example.
2. [4 points] Imagine that a certain word is never observed in the training data, but occurs in a test
instance. What will happen when our Naive Bayes classifier predicts the probability of this test
instance? Explain why this situation is undesirable. Will logistic regression have a similar problem?
Why or why not?
Add-one smoothing is one way to avoid this problem with our Naive Bayes classifier. This technique
pretends that every word occurs one additional time in the training data, which eliminates zero counts
in the estimated parameters of the model. For a set of documents $C = \{x^1, \ldots, x^n\}$, the add-one smoothing
parameter estimate is $\hat{\theta}_i = \frac{1 + \sum_{j=1}^{n} \mathrm{count}_i(x^j)}{D + m}$, where $D$ is the total number of words in $C$
(i.e., $D = \sum_{i=1}^{m} \sum_{j=1}^{n} \mathrm{count}_i(x^j)$). Empirically, add-one smoothing often improves classification performance
when data counts are sparse.
3. [12 points] Implement Logistic Regression and Naive Bayes. Use add-one smoothing when estimating
the parameters of your Naive Bayes classifier. For logistic regression, we found that a step size around
0.0001 worked well. Train both models on the provided training data and predict the labels of the test
data. Report the training and test error of both models. Submit your code along with your homework.
4. [3 points] Which model performs better on this task? Why do you think this is the case?
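The add-one smoothing estimate described in this section can be written in a few lines. The sketch below is a minimal illustration for the documents of a single class, not a full solution: the function name `add_one_estimate`, the 0-based word indices, and the toy documents are assumptions introduced here, and reading the provided data files is left out entirely.

```python
def add_one_estimate(documents, m):
    """Add-one smoothed word probabilities for the documents of one class.

    documents is a list of documents, each a list of word indices in [0, m);
    m is the vocabulary size.
    """
    counts = [0] * m
    for doc in documents:
        for word in doc:
            counts[word] += 1
    total_words = sum(counts)  # D in the text
    return [(1 + c) / (total_words + m) for c in counts]


if __name__ == "__main__":
    # Two hypothetical training documents over a 3-word vocabulary;
    # word 2 never appears, yet its estimate stays strictly positive.
    docs = [[0, 0, 1], [1, 0]]
    print(add_one_estimate(docs, m=3))
```

Because the numerators sum to D + m, the smoothed estimates still form a valid probability vector.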