CS 419 Endsem Solutions
25-Apr-2014 (2:00pm-5:00pm)
Important Instructions
• Fill in the blanks in the questions, in place1, with answers that are as concise and legible as possible.
• The blanks in the questions are of sufficient size to accommodate the expected answer. Hence, answers that go well beyond the blank, and/or those that are not legible, and/or those that are written in a very tiny font size, will not be evaluated.
• For the sake of precision, next to every blank a keyword in [[. . . ]] format is written that indicates the type of answer I am expecting.
• Marks for questions/blanks are mentioned. Note that these marks are atomic. For example, if I mention 4 marks, then you will get 4 only if all the answers for the corresponding blanks are absolutely correct, and 0 otherwise. So please be very careful in writing your final answers. Sometimes I may mention, for example, 2+2 marks after a question consisting of 2 blanks. This means each blank is worth 2 (atomic) marks.
• You should NOT carry anything with you other than pens/pencils. If you are caught copying or showing your answers to others, or using any other unfair means, then you will get an FR in the course and your case will be reported to the appropriate disciplinary committee.
1 There is no separate answer sheet. You will only be given this question paper and a rough sheet. You should return the question paper containing your answers and keep the rough sheet with you.
Fill in the blanks
1. Consider a machine learning application for which the following background knowledge is available
from the domain experts:
B1 “The output variable is definitely a linear function of the two input variables x1 , x2 .”
B2 “Moreover, it is more likely that it is a linear function of x1 alone.”
• Now suppose you were to do probabilistic modeling of this problem. Then you would use a linear regression model, so that the information B1 is utilized, and further employ a suitable prior (over the parameters) so that the information B2 is utilized (a sketch of one such prior is given after this question).
[1+1 marks]
• Now suppose you were to do deterministic modeling for the same problem, which leads to a prediction function that depends on a few training examples. Then you would use the support vector regression formalism so that the information B1 is utilized, and further employ a suitable hierarchy (over models) so that the information B2 is utilized.
[1+2 marks]
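For the first blank, one concrete (hypothetical) way to encode B2 is a zero-mean Gaussian prior on the weights whose variance for the coefficient of x2 is much smaller than that for x1; the MAP estimate then reduces to a ridge regression with per-coordinate penalties. A minimal sketch, with made-up data and hyperparameters:

```python
import numpy as np

# Toy data in which y really is (almost) a linear function of x1 alone.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                  # columns: x1, x2
y = 3.0 * X[:, 0] + 0.05 * rng.normal(size=50)

# Zero-mean Gaussian prior on w: a large variance for w1 and a small one
# for w2 encodes the belief B2 that the function likely depends on x1 alone.
prior_var = np.array([10.0, 0.01])
noise_var = 0.05 ** 2

# MAP estimate: (X^T X + noise_var * diag(1/prior_var)) w = X^T y.
Lam = np.diag(noise_var / prior_var)
w_map = np.linalg.solve(X.T @ X + Lam, X.T @ y)
print(w_map)   # w2 is shrunk towards 0 far more strongly than w1
```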
2. Consider the Gaussian mixture model with n components, denoted by GMMn. Assume that parameter selection is done using the EM algorithm discussed in the lecture. Let us denote the distribution in GMM2 selected using the EM algorithm by gmm2 and that in GMM3 by gmm3. Then, the likelihood of the training data computed using the gmm2 distribution is not comparable to that computed with gmm3.
[2 marks]
Explanation: This is because the EM algorithm need not necessarily maximize the likelihood.
3. In the context of the above problem, now assume that it so happens that the likelihood of the training data is exactly the same for both gmm2 and gmm3. Given this, if you are forced to choose one of gmm2 or gmm3 as the predictive distribution, then you will pick gmm2, the simpler model.
[2 marks]
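As an illustration, a small sketch with synthetic data using scikit-learn's EM-based mixture fitting; since EM only reaches a local optimum, neither number is guaranteed to be the maximum likelihood of its model family, which is why the comparison in question 2 is not meaningful:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 1-D data drawn from two well-separated Gaussians.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-3, 1, 200),
                    rng.normal(3, 1, 200)]).reshape(-1, 1)

gmm2 = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM on GMM2
gmm3 = GaussianMixture(n_components=3, random_state=0).fit(X)  # EM on GMM3

# Average per-sample log-likelihood of the training data under each fit.
# Because EM only finds a local optimum, these two numbers by themselves do
# not tell us which model is better; if they happen to be equal, the simpler
# model gmm2 is the natural choice.
print(gmm2.score(X), gmm3.score(X))
```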
4. Consider the following binary classification2 training data
D = { ([1, 0]^T, +1), ([−1, 0]^T, +1), ([0, 1]^T, −1), ([0, −1]^T, −1) }
and the homogeneous quadratic model, which is the set of all functions of the form g(x) = w^T φ(x), where w = [w1, w2, w3]^T is the model parameter and φ(x) = [x1^2, x2^2, x1 x2]^T. Then, the optimization problem corresponding to the hard-margin SVM3, discussed in the lecture, for choosing the optimal (homogeneous) quadratic discriminator is:
min_{w1,w2,w3} (1/2) ||w||_2^2,
s.t. w1 ≥ 1, w2 ≤ −1.
Note that you need to fill the above two blanks with expressions involving w alone4 .
[3 marks]
Solve the above optimization problem for the optimal w. The equation5 for the discriminating quadratic surface with this optimal w is x1^2 − x2^2 = 0.
[2marks]

2 Labels are +1 or −1.
3 Hard-margin SVM is the same as the SVM presented in Murphy's book where all slack variables are set to zero, i.e., ξi = 0 ∀ i.
4 The expression should not involve φ or x etc.
5 Note that your expression should not involve φ or x or w etc.
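For a quick numerical check of this answer, the tiny QP above can be handed to a generic solver (a sketch; the analytic solution w = (1, −1, 0) is immediate since the objective decouples across coordinates and w3 is unconstrained):

```python
import numpy as np
from scipy.optimize import minimize

# Hard-margin SVM in the feature space phi(x) = (x1^2, x2^2, x1*x2):
# minimize 0.5*||w||^2 subject to w1 >= 1 and w2 <= -1.
res = minimize(
    fun=lambda w: 0.5 * np.dot(w, w),
    x0=np.zeros(3),
    method="SLSQP",
    constraints=[
        {"type": "ineq", "fun": lambda w: w[0] - 1.0},   # w1 >= 1
        {"type": "ineq", "fun": lambda w: -1.0 - w[1]},  # w2 <= -1
    ],
)
print(res.x)  # approximately [1, -1, 0], i.e. the surface x1^2 - x2^2 = 0
```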
5. A coin, with unknown probability of heads, was tossed 5 times and it came up heads only twice. Assume two Beta-Bernoulli6 models are available: one with hyperparameters a = 3, b = 3, denoted by M1, and the other with hyperparameters a = 1, b = 1, denoted by M2. Let m̂i denote the distribution in Mi that is chosen according to the maximum likelihood principle. Then the likelihood of the training data with m̂1 is 0.03456 and that with m̂2 is 0.03456.
[1mark]
Let mi denote the distribution in Mi that is chosen according to the MAP principle. Then the likelihood of the training data with m1 is 0.033870176 and that with m2 is 0.03456. Hence the likelihood with m1 is < that with m2.
[2marks]
The likelihood of the training data with the BAM corresponding to M1 is 0.033529751 and that
corresponding to M2 is 0.034271435. Among these two numbers, the former is < the latter.
[2marks]
The marginal likelihood of M1 is 1/42 ≈ 0.023810 and that of M2 is 1/60 ≈ 0.016667. Hence, the maximum marginal likelihood principle will select M1.
[2marks]
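All of the numbers in this question can be reproduced with a few lines of code (a sketch; here "BAM" is taken to be the distribution with the posterior-mean parameter, and all likelihoods are of the observed sequence of 2 heads and 3 tails):

```python
from math import gamma

N1, N0 = 2, 3                      # 2 heads, 3 tails

def lik(p):
    # Likelihood of the observed sequence for head-probability p.
    return p**N1 * (1 - p)**N0

def B(a, b):
    # Beta function.
    return gamma(a) * gamma(b) / gamma(a + b)

for name, (a, b) in [("M1", (3, 3)), ("M2", (1, 1))]:
    p_mle = N1 / (N1 + N0)                          # MLE (ignores the prior)
    p_map = (N1 + a - 1) / (N1 + N0 + a + b - 2)    # MAP estimate
    p_bam = (N1 + a) / (N1 + N0 + a + b)            # posterior mean (BAM)
    marginal = B(a + N1, b + N0) / B(a, b)          # marginal likelihood
    print(name, lik(p_mle), lik(p_map), lik(p_bam), marginal)
```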
6. Let M denote a model consisting of all distributions with pdf/pmf given by fψ for the various
values of the parameters ψ ∈ Ψ. Now consider this definition: the model M is said to belong to
the exponential family iff there exist the following:
• a, perhaps modified, parameterization of the pdf/pmf in terms of parameters θ ∈ Θ ⊂ R^d. In other words, consider g : Ψ ↦ Θ and θ ≡ g(ψ). Then the new parameterized pdf/pmf is given by f̂θ(x) ≡ fψ(x) ∀ x ∈ X ⊂ R^n.
• a function h : R^n ↦ R+,
• a function φ : R^n ↦ R^d,
such that fψ(x) ≡ f̂θ(x) = (1/Z(θ)) h(x) exp{θ^T φ(x)}, where Z(θ) is simply the normalization factor7.
It turns out that many models familiar to you belong to this family:
Multinoulli model: Let ψi denote the probability that X takes value i, for all i = 1, . . . , 3. Let
I(x, i) denote 0 if x ≠ i and 1 if x = i. Once this multinoulli's pmf is written in the
exponential form,
θ = [log(ψ1/ψ3), log(ψ2/ψ3)]^T,  Θ = { θ = g(ψ) | ψ1 + ψ2 + ψ3 = 1, ψi ≥ 0 ∀ i = 1, 2, 3 },
φ(x) = [I(x, 1), I(x, 2)]^T,  Z(θ) = 1 + exp(θ1) + exp(θ2),  h(x) = 1.
Alternative answers are possible with vectors of size 3 etc., which some of you have written correctly.
[3marks]

6 The pdf of the Beta distribution is given by p(x) = (Γ(a+b)/(Γ(a)Γ(b))) x^(a−1) (1 − x)^(b−1), where a > 0, b > 0. Recall that the Gamma function satisfies Γ(a + 1) = aΓ(a).
7 For continuous random variables Z(θ) is given by ∫_X h(x) exp{θ^T φ(x)} dx, and for discrete random variables by Σ_{x∈X} h(x) exp{θ^T φ(x)}.
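As a quick sanity check of the multinoulli answer, the exponential form above reproduces the original pmf for any valid ψ (a small sketch with made-up probabilities):

```python
import numpy as np

psi = np.array([0.2, 0.3, 0.5])           # hypothetical outcome probabilities
theta = np.log(psi[:2] / psi[2])          # natural parameters (2-dimensional)
Z = 1.0 + np.exp(theta).sum()             # partition function

for x in (1, 2, 3):
    phi = np.array([float(x == 1), float(x == 2)])
    # exp(theta^T phi(x)) / Z(theta) should equal psi[x-1].
    print(x, np.exp(theta @ phi) / Z, psi[x - 1])
```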
Gaussian model: Let µ ∈ R denote its mean and let σ 2 denote its variance. Once this Gaussian’s
pdf is written in the exponential form,
θ = [µ/σ^2, −1/(2σ^2)]^T,  Θ = R × R−,
φ(x) = [x, x^2]^T,  Z(θ) = √(−π/θ2) exp(−θ1^2/(4θ2)),  h(x) = 1.
Alternative answers are possible with vectors of size 3 etc., which some of you have written correctly.
[3marks]
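Similarly, the Gaussian answer can be checked numerically (a sketch with hypothetical µ and σ²; the value exp(θ^T φ(x))/Z(θ) should match the usual Gaussian pdf):

```python
import numpy as np
from scipy.stats import norm

mu, sigma2 = 1.5, 0.7                                  # hypothetical mean and variance
theta = np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])
Z = np.sqrt(-np.pi / theta[1]) * np.exp(-theta[0] ** 2 / (4.0 * theta[1]))

for x in (-1.0, 0.0, 2.0):
    phi = np.array([x, x * x])
    print(np.exp(theta @ phi) / Z, norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))
```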
More commonly, each entry of φ(x) is called a sufficient statistic for x (and hence φ(x) is the vector of sufficient statistics for x), and Z : Θ ↦ R is called the partition function. Interestingly, it turns out that log(Z(θ)) is a convex function of θ, and the conjugate prior turns out to be exponential again8. You may prove these at leisure after this examination (see the sketch below). In fact, owing to these two facts, the expressions related to MLE, MAP and BAM turn out to be extremely elegant.
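For the convexity claim, a brief sketch of the standard argument (differentiating log Z(θ) under the integral/sum):

```latex
\nabla_\theta \log Z(\theta) = \mathbb{E}_{\hat f_\theta}\!\left[\phi(X)\right],
\qquad
\nabla^2_\theta \log Z(\theta) = \mathrm{Cov}_{\hat f_\theta}\!\left[\phi(X)\right] \succeq 0,
```

and a positive semi-definite Hessian everywhere implies that log Z is convex in θ.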
Now let F denote a particular model that belongs to the exponential family, with θ ∈ Θ representing
its model parameters. Consider a binary classification problem, with class labels represented
by +1 and -1. Assume that the class-conditionals are modeled using F and the class prior is
modeled using the Bernoulli model. Let θ+1 and θ−1 be the optimal parameters chosen according
to MLE for the class-conditionals of classes +1 and −1 respectively. Let α be the parameter selected by MLE for the class prior, i.e., the prior probability of class +1. Then the equation of
the discriminating surface is given by:
(θ+1 − θ−1)^T φ(x) + log( αZ(θ−1) / ((1−α)Z(θ+1)) ) = 0.
[2marks]
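This expression follows by equating the two class posteriors under the stated generative model (note that h(x) cancels):

```latex
p(+1 \mid x) = p(-1 \mid x)
\;\iff\;
\frac{\alpha\, h(x)\, e^{\theta_{+1}^\top \phi(x)}}{Z(\theta_{+1})}
  = \frac{(1-\alpha)\, h(x)\, e^{\theta_{-1}^\top \phi(x)}}{Z(\theta_{-1})}
\;\iff\;
(\theta_{+1}-\theta_{-1})^\top \phi(x)
  + \log\frac{\alpha\, Z(\theta_{-1})}{(1-\alpha)\, Z(\theta_{+1})} = 0 .
```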
Observe that there exists a transformation ζ : X ↦ R^(d+1) such that the form of the distribution p(y/x) with the generative model based on the exponential family model described above is exactly the same as that with logistic regression over the transformed data ζ(x). This transformation is given by ζ(x) = [φ(x)^T, 1]^T. Also, the relation between w, the parameter of logistic regression, and θ+1, θ−1, α is given by
w = [θ+1^T − θ−1^T,  log( αZ(θ−1) / ((1−α)Z(θ+1)) )]^T.
[2marks]
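Spelling this out: dividing the class-+1 posterior by the sum of both, the generative posterior takes exactly the logistic (sigmoid) form in the transformed data,

```latex
p(y=+1 \mid x)
= \frac{\alpha\, \hat f_{\theta_{+1}}(x)}
       {\alpha\, \hat f_{\theta_{+1}}(x) + (1-\alpha)\, \hat f_{\theta_{-1}}(x)}
= \sigma\!\Big( (\theta_{+1}-\theta_{-1})^\top \phi(x)
     + \log\tfrac{\alpha Z(\theta_{-1})}{(1-\alpha) Z(\theta_{+1})} \Big)
= \sigma\!\left( w^\top \zeta(x) \right),
```

with w and ζ as given above.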
Non-linear logistic regression can be easily generalized to multi-class classification. The only
difference is that there will be one w for each class9 . Now consider the generative model, HMM,
where emission distributions are modeled by F (parameterized by θ) and π, A represent the vector of initial state probabilities and the state-transition probability matrix respectively. Provide
expressions with the corresponding non-linear logistic regression for:
ζ(x) = ζ(x1, . . . , xT) = [φ(x1)^T . . . φ(xT)^T 1]^T
and
w_y = w_{y1,...,yT} = [θ_{y1}^T . . . θ_{yT}^T  log( π(y1) A(y1, y2) · · · A(yT−1, yT) / (Z(θ_{y1}) · · · Z(θ_{yT})) )]^T.
[3marks]

8 This is sometimes called self-conjugacy.
9 As in linear logistic regression, if there are k classes, then instead of k ws we can use k − 1. To keep the notation simple, let us use k ws here.
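To see why this works, take the log of the HMM joint distribution; the h(x_t) terms are common to every state sequence and cancel in the normalization, leaving a softmax over w_y^T ζ(x):

```latex
\log p(y, x)
= \sum_{t=1}^{T} \theta_{y_t}^\top \phi(x_t)
  + \log \frac{\pi(y_1)\, A(y_1,y_2)\cdots A(y_{T-1},y_T)}
              {Z(\theta_{y_1})\cdots Z(\theta_{y_T})}
  + \sum_{t=1}^{T} \log h(x_t)
= w_y^\top \zeta(x) + \sum_{t=1}^{T} \log h(x_t),
```

so p(y / x) = exp(w_y^T ζ(x)) / Σ_{y'} exp(w_{y'}^T ζ(x)).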
It may at first appear that there are too many ws, one for each state sequence. But a closer look reveals that they are related, as given by your expression above (in the blank), and essentially only (π, A, θ) are the free variables. An alternative to the above, popularly called Conditional Random Fields (CRFs), is to assume that p(y/ζ(x)) itself factorizes, say as p(y/ζ(x)) = p(y1/ζ(x)) p(y2/y1, ζ(x)) . . . p(yT/yT−1, ζ(x)). The advantage with CRFs is that the number of parameters is itself low (and hence parameter learning is less messy). If the number of states is k, then the number of w parameters with a CRF is k^2 + k (k weight vectors for p(y1/ζ(x)) and k for each of the k possible values of yt−1 in p(yt/yt−1, ζ(x))). An alternative answer of k^2 − 1 is possible for the last blank, using k − 1 weight vectors per conditional as in footnote 9.
[1mark]
7. Consider maximum likelihood parameter selection for a model with parameters θ when the training set contains missing feature values10,11. Let x_o^i denote the observed part and x_h^i the missing (hidden) part of the i-th training example, for i = 1, . . . , m, and let X_h^i denote the set of values x_h^i can take. Then the log-likelihood of the training data, to be maximized over θ, is:
Σ_{i=1}^{m} log( ∫_{X_h^i} pθ(x_o^i, x_h^i) dx_h^i )
[2marks]
Now suppose you want to employ the EM algorithm for parameter selection. Let us assume t iterations of it have been performed and the parameter after these iterations is θt. The qt+1 distribution you would then choose is given by: q_{t+1}^i(x_h^i) = p_{θt}(x_h^i / x_o^i), for each i = 1, . . . , m.
Hint: Recall that log is a concave function and hence satisfies the so-called Jensen's inequality log(E[Z]) ≥ E[log(Z)], where Z is any random variable12 such that the involved expectations are finite.
[3marks]
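The hint leads to this choice as follows: for any distribution q^i over the missing values, Jensen's inequality gives a lower bound on each term of the log-likelihood, and the bound is tight at θ = θt exactly when q^i is the posterior over the missing values:

```latex
\log \int_{X_h^i} p_\theta(x_o^i, x_h^i)\, dx_h^i
= \log \mathbb{E}_{q^i}\!\left[ \frac{p_\theta(x_o^i, X_h^i)}{q^i(X_h^i)} \right]
\;\ge\;
\mathbb{E}_{q^i}\!\left[ \log \frac{p_\theta(x_o^i, X_h^i)}{q^i(X_h^i)} \right],
```

with equality at θ = θt iff q^i(x_h^i) ∝ p_{θt}(x_o^i, x_h^i), i.e., q^i(x_h^i) = p_{θt}(x_h^i / x_o^i).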
10 You would have observed that some real-world datasets in the UCI repository (the online repository from which you downloaded the datasets for your practical assignments) do have missing feature values.
11 Here is an example of such 3-dimensional data: D = { [2.5, ?, 3.4]^T, [5, 3.3, 8]^T, [0.1, ?, ?]^T, [?, 3, 1]^T }. '?' represents a missing datum.
12 In lectures we used a special case of Jensen's inequality where Z is discrete. Note that when Z is Bernoulli, Jensen's inequality provides the definition of a concave function.
After this examination, at leisure, write down the entire EM algorithm for this missing value
problem.
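As a starting point, here is a minimal sketch of such an EM algorithm under one specific (hypothetical) modeling choice: the data are assumed to come from a single multivariate Gaussian, so the E-step computes the conditional distribution of each missing block given the observed entries, and the M-step updates (µ, Σ) from the expected sufficient statistics.

```python
import numpy as np

def em_gaussian_missing(X, n_iter=50):
    """EM for a multivariate Gaussian when some entries of X are missing (np.nan)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    miss = np.isnan(X)
    # Initialise from the observed entries of each column.
    mu = np.nanmean(X, axis=0)
    Sigma = np.diag(np.nanvar(X, axis=0) + 1e-6)
    for _ in range(n_iter):
        Xhat = X.copy()
        C = np.zeros((d, d))              # accumulated conditional covariances
        for i in range(n):
            h = miss[i]                   # hidden (missing) coordinates of example i
            if not h.any():
                continue
            o = ~h
            if not o.any():               # fully missing example
                Xhat[i] = mu
                C += Sigma
                continue
            Soo = Sigma[np.ix_(o, o)]
            Sho = Sigma[np.ix_(h, o)]
            Shh = Sigma[np.ix_(h, h)]
            K = Sho @ np.linalg.inv(Soo)
            # E-step: conditional mean and covariance of the missing block.
            Xhat[i, h] = mu[h] + K @ (X[i, o] - mu[o])
            C[np.ix_(h, h)] += Shh - K @ Sho.T
        # M-step: update parameters from the expected sufficient statistics
        # (a small ridge keeps Sigma invertible on tiny datasets).
        mu = Xhat.mean(axis=0)
        Sigma = (Xhat - mu).T @ (Xhat - mu) / n + C / n + 1e-6 * np.eye(d)
    return mu, Sigma

# Example run on the data from footnote 11 ('?' encoded as np.nan).
D = np.array([[2.5, np.nan, 3.4],
              [5.0, 3.3, 8.0],
              [0.1, np.nan, np.nan],
              [np.nan, 3.0, 1.0]])
print(em_gaussian_missing(D, n_iter=20))
```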