Quiz3 2024
Question 1: Write T or F for True/False in the box next to each question given below, with a brief
(1-2 sentences at most) explanation in the provided space in the box below the question. Marks will
be awarded only when the answer (T/F) and explanation both are correct. (3 x 2 = 6 marks)
1.1 To predict the label using a generative classification model, comparing the probabilities 𝑝(𝑦 = 𝑘|𝒙) for different values of 𝑘 is equivalent to comparing the class-conditional probability densities 𝑝(𝒙|𝑦 = 𝑘) for different values of 𝑘. Answer: F
𝑝(𝑦 = 𝑘|𝒙) ∝ 𝑝(𝑦 = 𝑘)𝑝(𝒙|𝑦 = 𝑘), so the posterior also incorporates the class prior (class marginal) distribution 𝑝(𝑦 = 𝑘), which comparing only the class-conditionals would ignore.
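For example (hypothetical numbers, purely for illustration): if 𝑝(𝑦 = 1) = 0.9, 𝑝(𝑦 = 2) = 0.1, 𝑝(𝒙|𝑦 = 1) = 0.2, and 𝑝(𝒙|𝑦 = 2) = 0.3, then comparing the class-conditionals alone favors class 2, but 𝑝(𝑦 = 1|𝒙) ∝ 0.9 × 0.2 = 0.18 > 0.1 × 0.3 = 0.03 ∝ 𝑝(𝑦 = 2|𝒙), so the posterior favors class 1.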
1.2 A Gaussian prior 𝑝(𝒘) = 𝒩(𝒘|𝒘0, 𝜆−1𝑰) on the weight vector 𝒘 ∈ ℝ𝐷 will cause a regularization effect and encourage the entries in 𝒘 to take small values. Answer: F
This prior corresponds to a regularizer of the form 𝜆‖𝒘 − 𝒘0‖2, which encourages each entry of the vector 𝒘 to be close to the corresponding entry in the vector 𝒘0. Only when 𝒘0 is the zero vector would the statement above be true; in general it is false.
1.3 Even though the MAP estimate is the mode of the posterior distribution, to compute the MAP estimate, it is not necessary to compute the posterior distribution. Answer: T
Recall that the posterior is 𝑝(𝜃|𝑦) = 𝑝(𝜃)𝑝(𝑦|𝜃)/𝑝(𝑦). Because the denominator (the marginal likelihood 𝑝(𝑦)) is independent of 𝜃, maximizing the posterior only requires maximizing the numerator 𝑝(𝜃)𝑝(𝑦|𝜃) (or log 𝑝(𝜃) + log 𝑝(𝑦|𝜃)), so we do not need to compute the full posterior for the maximization.
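As a minimal sketch (a hypothetical one-parameter model with a Bernoulli likelihood and a standard Gaussian prior, values chosen only for illustration), the MAP estimate can be found by maximizing the unnormalized log-posterior, without ever computing 𝑝(𝑦):

import numpy as np
from scipy.optimize import minimize_scalar

# Toy observations; likelihood is Bernoulli with success probability sigmoid(theta)
y = np.array([1, 0, 1, 1, 0, 1])

def neg_log_joint(theta):
    p = 1.0 / (1.0 + np.exp(-theta))                      # sigmoid(theta)
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    log_prior = -0.5 * theta ** 2                         # log N(theta | 0, 1), up to a constant
    return -(log_lik + log_prior)                         # evidence p(y) never appears

theta_map = minimize_scalar(neg_log_joint).x              # mode of the (unnormalized) posterior
print(theta_map)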
Question 2: Answer the following questions concisely in the space provided below the question.
2.1 Consider the RBF kernel 𝑘(𝒙𝑖, 𝒙𝑗) = exp(−𝛾‖𝒙𝑖 − 𝒙𝑗‖2) where 𝒙𝑖 and 𝒙𝑗 are 𝐷-dimensional
inputs. Consider two cases: (1) when bandwidth hyperparameter 𝛾 is set as very-very large,
and (2) when 𝛾 is set as very-very small. For each of these two cases, answer (with brief
justification) whether the resulting kernel function would be practically useful. (4 marks)
(1) When 𝛾 is very-very large, 𝑘(𝒙𝑖 , 𝒙𝑗 ) will be nonzero (will equal 1) only when 𝒙𝑖 and 𝒙𝑗 are
nearly identical. For all other pairs of inputs, the kernel will give 0 similarity.
(2) When 𝛾 is very-very small, 𝑘(𝒙𝑖 , 𝒙𝑗 ) will be close to 1 for all pairs of inputs, thus treating
all pairs of inputs as equally similar to each other.
Clearly, neither of these two extreme cases is desirable.
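A minimal numerical sketch (toy 2-D inputs chosen only for illustration) of these two extremes:

import numpy as np

def rbf(xi, xj, gamma):
    # RBF kernel k(xi, xj) = exp(-gamma * ||xi - xj||^2)
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

x1, x2 = np.array([0.0, 0.0]), np.array([0.5, 0.5])

print(rbf(x1, x2, gamma=1e6))   # ~0: with a huge gamma, distinct inputs look totally dissimilar
print(rbf(x1, x1, gamma=1e6))   # 1.0: only (near-)identical inputs get nonzero similarity
print(rbf(x1, x2, gamma=1e-6))  # ~1: with a tiny gamma, all pairs of inputs look equally similar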
2.2 Briefly explain why using kernels with the landmarks approach or the random features
approach is faster at test time than using kernels in the standard manner? (3 marks)
When using the landmarks or random features approach, we use the kernel to construct an 𝐿-dimensional feature representation 𝜓(𝒙𝑛) and train a linear model on these representations to get a weight vector 𝒘 that is 𝐿-dimensional. Thus, for a test input 𝒙∗, the cost of computing the prediction 𝒘⊤𝜓(𝒙∗) is also 𝑂(𝐿). In contrast, when using the kernel in the standard manner, this cost is 𝑂(𝑁), which can be very high if the number of training inputs is very large.
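A minimal sketch (a hypothetical landmark-based feature map plugged into ridge regression; names and values are illustrative) contrasting the two prediction costs:

import numpy as np

def rbf(A, B, gamma=1.0):
    # Pairwise RBF kernel matrix between the rows of A and the rows of B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
N, D, L = 1000, 5, 20
X, y = rng.normal(size=(N, D)), rng.normal(size=N)
Z = X[rng.choice(N, L, replace=False)]            # L landmarks chosen from the training inputs

# Landmarks: psi(x) = [k(x, z_1), ..., k(x, z_L)], then a linear (here ridge) model on psi
Psi = rbf(X, Z)                                   # N x L feature matrix
w = np.linalg.solve(Psi.T @ Psi + 1e-3 * np.eye(L), Psi.T @ y)

x_star = rng.normal(size=(1, D))
pred_fast = rbf(x_star, Z) @ w                    # needs only L kernel evaluations: O(L)

# Standard kernel method (kernel ridge regression): needs k(x_star, x_n) for all N inputs: O(N)
alpha = np.linalg.solve(rbf(X, X) + 1e-3 * np.eye(N), y)
pred_standard = rbf(x_star, X) @ alpha
print(pred_fast, pred_standard)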
2.3 Given a dataset 𝑿 as the 𝑁 × 𝐷 input matrix with 𝑁 inputs and 𝐷 features, write down the 𝐾-means hard-clustering problem for this dataset in the form of an equivalent matrix factorization problem, clearly specifying the meanings of the variables involved in the matrix factorization, their dimensions, and constraints on them, if any. (4 marks)
{𝒁̂, 𝝁̂} = argmin𝒁,𝝁 ‖𝑿 − 𝒁𝝁‖2 (squared Frobenius norm). Here 𝒁 is the 𝑁 × 𝐾 matrix whose row 𝑛 (𝒛𝑛) is a one-hot vector denoting which cluster the input 𝒙𝑛 belongs to, and 𝝁 is the 𝐾 × 𝐷 matrix whose row 𝑘 (𝝁𝑘) is the mean of the 𝑘th cluster.
Constraints on 𝒛𝑛 : Must be a one-hot vector
Constraints on 𝝁𝑘 : None
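A minimal sketch (toy data and a fixed hard assignment, purely for illustration) verifying that the matrix-factorization objective equals the usual 𝐾-means sum of squared distances:

import numpy as np

rng = np.random.default_rng(0)
N, D, K = 6, 2, 3
X = rng.normal(size=(N, D))
labels = np.array([0, 1, 2, 0, 1, 2])            # a hard cluster assignment for each input

Z = np.eye(K)[labels]                            # N x K one-hot assignment matrix
mu = np.stack([X[labels == k].mean(axis=0) for k in range(K)])  # K x D matrix of cluster means

frob_obj = np.sum((X - Z @ mu) ** 2)             # ||X - Z mu||^2 (squared Frobenius norm)
kmeans_obj = sum(np.sum((X[n] - mu[labels[n]]) ** 2) for n in range(N))
print(np.isclose(frob_obj, kmeans_obj))          # True: the two objectives coincide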
2.4 Why is it difficult to compute the predictive distribution of a logistic regression model, which, by definition, is given by 𝑝(𝑦∗ = 1|𝒙∗, 𝑿, 𝒚) = ∫ 𝑝(𝑦∗ = 1|𝒘, 𝒙∗)𝑝(𝒘|𝑿, 𝒚)𝑑𝒘? Suggest a method to approximate it and clearly show the necessary equations. (3 marks)
It is difficult because the integral here is not tractable: it involves integrating 𝑝(𝑦∗ = 1|𝒘, 𝒙∗), which is a sigmoid function, against the posterior 𝑝(𝒘|𝑿, 𝒚), and even if the latter is Gaussian (as in the Laplace approximation), the integral is still intractable. To approximate the integral, one way is to use a Monte-Carlo approximation where we draw 𝑆 i.i.d. samples 𝒘(1), 𝒘(2), …, 𝒘(𝑆) from the posterior and approximate the predictive distribution as
𝑝(𝑦∗ = 1|𝒙∗, 𝑿, 𝒚) ≈ (1/𝑆) ∑_{𝑠=1}^{𝑆} 𝑝(𝑦∗ = 1|𝒘(𝑠), 𝒙∗)
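A minimal sketch (assuming a Gaussian, e.g. Laplace, approximation to the posterior; the mean w_map and covariance H_inv below are hypothetical values, not learned from data):

import numpy as np

rng = np.random.default_rng(0)
D, S = 3, 1000

# Hypothetical Gaussian (e.g. Laplace) approximation of the posterior p(w | X, y)
w_map = np.array([0.5, -1.0, 2.0])        # posterior mean (illustrative values)
H_inv = 0.1 * np.eye(D)                   # posterior covariance (illustrative values)

x_star = np.array([1.0, 0.2, -0.3])
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Monte-Carlo approximation: average the sigmoid over S posterior samples of w
w_samples = rng.multivariate_normal(w_map, H_inv, size=S)   # S x D
p_pred = np.mean(sigmoid(w_samples @ x_star))
print(p_pred)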
2.5 Show that, for generative classification with uniform class marginal and Gaussian class conditionals 𝒩(𝒙|𝝁𝑘, 𝚺), the posterior probability of input 𝒙 belonging to class 𝑘 satisfies 𝑝(𝑦 = 𝑘|𝒙) ∝ exp(𝒘𝑘⊤𝒙 + 𝑏𝑘), and write down the expressions for 𝒘𝑘 and 𝑏𝑘. (5 marks)
Since the class marginal is uniform, the posterior probability is
𝑝(𝑦 = 𝑘|𝒙) = 𝑝(𝑦 = 𝑘)𝑝(𝒙|𝑦 = 𝑘)/𝑝(𝒙) ∝ 𝑝(𝒙|𝑦 = 𝑘)
Since the class conditional 𝑝(𝒙|𝑦 = 𝑘) = 𝒩(𝒙|𝝁𝑘, 𝚺), we have
𝑝(𝒙|𝑦 = 𝑘) ∝ exp(−(1/2)(𝒙 − 𝝁𝑘)⊤𝚺−1(𝒙 − 𝝁𝑘)) ∝ exp(𝝁𝑘⊤𝚺−1𝒙 − (1/2)𝝁𝑘⊤𝚺−1𝝁𝑘),
where in the last expression (after the proportionality sign) we have dropped the terms that are not specific to class 𝑘 (the quadratic term 𝒙⊤𝚺−1𝒙 and the normalization constant). Thus
𝑝(𝑦 = 𝑘|𝒙) ∝ 𝑝(𝒙|𝑦 = 𝑘) ∝ exp(𝝁𝑘⊤𝚺−1𝒙 − (1/2)𝝁𝑘⊤𝚺−1𝝁𝑘),
which is clearly of the form exp(𝒘𝑘⊤𝒙 + 𝑏𝑘) with 𝒘𝑘 = (𝝁𝑘⊤𝚺−1)⊤ = 𝚺−1𝝁𝑘 and 𝑏𝑘 = −(1/2)𝝁𝑘⊤𝚺−1𝝁𝑘.
Side note (not required for the answer): The above implies that this generative classification model has a similar form to the softmax classification model, although here the weights are obtained in a generative manner (from the estimated class means and shared covariance), not via gradient descent as in softmax classification.
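A minimal sketch (toy two-class data; all names and values are illustrative) of computing 𝒘𝑘 and 𝑏𝑘 from estimated class means and a shared covariance:

import numpy as np

rng = np.random.default_rng(0)
D = 2
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, D))   # inputs from class 0
X1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, D))   # inputs from class 1

mu = np.stack([X0.mean(axis=0), X1.mean(axis=0)])           # K x D matrix of class means
Sigma = np.cov(np.vstack([X0 - mu[0], X1 - mu[1]]).T)       # shared covariance estimate
Sigma_inv = np.linalg.inv(Sigma)

# w_k = Sigma^{-1} mu_k   and   b_k = -(1/2) mu_k^T Sigma^{-1} mu_k
W = mu @ Sigma_inv                                          # row k holds w_k^T
b = -0.5 * np.einsum('kd,de,ke->k', mu, Sigma_inv, mu)

x = np.array([1.5, 1.8])
scores = W @ x + b                                          # log p(y = k | x) up to a constant
print(scores.argmax())                                      # predicted class (uniform class marginal)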