AI60201 2024 Endsem Solutions
Subject No. : AI60201 Subject : Graphical and Generative Models for Machine Learning
Department/Center/School: Centre of Excellence in Artificial Intelligence
i) Explain why directed graphical models and undirected graphical models do not
represent the same set of probability distributions.
- You need to show at least one graphical model whose DGM and moralized UGM versions do not have the same set of conditional independence relations. Typical examples include the head-to-head collider (v-structure) or the "square" (four-cycle) graph; a small simulation of the collider case is sketched below.
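A minimal simulation sketch of the collider case (the binary variables and probabilities are illustrative assumptions): in X -> Z <- Y the DGM encodes X ⊥ Y marginally, while the moralized UGM adds an X-Y edge and can no longer express that independence.

```python
# Collider X -> Z <- Y: X and Y are marginally independent in the DGM, but the
# moralized UGM (X - Z - Y plus X - Y) cannot encode this independence.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.integers(0, 2, n)          # X ~ Bernoulli(0.5)
y = rng.integers(0, 2, n)          # Y ~ Bernoulli(0.5), independent of X
z = x ^ y                          # Z: deterministic collider of X and Y

# Marginal independence: P(X=1 | Y=1) ≈ P(X=1)
print(x.mean(), x[y == 1].mean())                            # both ≈ 0.5
# Conditioning on the collider induces dependence ("explaining away")
print(x[z == 0].mean(), x[(z == 0) & (y == 1)].mean())       # ≈ 0.5 vs ≈ 1.0
```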
ii) Explain how message-passing inference can be used for inference of any
intermediate latent state of a sequence of observations that follows a HMM.
- You are expected to draw the HMM, show the message paths, and define the message formulas (sum-product); a forward-backward sketch is given below.
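A minimal sum-product (forward-backward) sketch in Python, assuming a known transition matrix A, emission matrix B and initial distribution pi (all values below are illustrative):

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """Posterior p(z_t | x_1..x_T) for every t via alpha/beta messages."""
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))               # alpha_t(k) = p(x_1..x_t, z_t = k)
    beta = np.zeros((T, K))                # beta_t(k)  = p(x_{t+1}..x_T | z_t = k)
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta                   # proportional to p(z_t, x_1..x_T)
    return gamma / gamma.sum(axis=1, keepdims=True)

# Illustrative 2-state HMM with 2 observation symbols
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
print(forward_backward([0, 1, 0], pi, A, B))   # posterior of each intermediate state
```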
- You should write the formulation of the reparameterization trick and explain why it is needed for training a VAE (backpropagation cannot be carried through an intermediate sampling step).
- A DPMM is not the same as the CRP (though they are related); writing only the CRP equations will bring partial marks. For topic models, we have documents, each having its own distribution over a shared set of topics. This requires a hierarchical DPMM: the topics are the mixture components, drawn from the base distribution (a Dirichlet).
ii) Explain the approach of variational inference with necessary equations. Show how it can
be applied for inference in Hidden Markov Models. [4+6=10 marks]
Q3. Draw a Bayesian Network for the given model in plate notation. Derive the Gibbs Sampling updates for Z1, Z2, Z3. State the full algorithm for inference (including burn-in, sample collection, etc.). Assume the parameters are known. [2+4+4=10 marks]
- Π lies outside any plate; a big plate contains Z2, and within it a smaller plate contains Z3 and X3. Arrows: Π -> Z2, Z2 -> Z3, Z3 -> X3. The sampling loop itself (burn-in, thinning, sample collection) is sketched below.
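A minimal sketch of the requested algorithm structure, assuming the full conditionals have already been derived; the sample_Z* callables are placeholders, not actual solutions:

```python
import numpy as np

def gibbs(init, sample_Z1, sample_Z2, sample_Z3,
          n_iter=5000, burn_in=1000, thin=10):
    z1, z2, z3 = init
    samples = []
    for it in range(n_iter):
        z1 = sample_Z1(z2, z3)      # Z1 ~ p(Z1 | Z2, Z3, X, params)
        z2 = sample_Z2(z1, z3)      # Z2 ~ p(Z2 | Z1, Z3, X, params)
        z3 = sample_Z3(z1, z2)      # Z3 ~ p(Z3 | Z1, Z2, X, params)
        if it >= burn_in and (it - burn_in) % thin == 0:
            samples.append((z1, z2, z3))    # keep post-burn-in, thinned draws
    return samples                          # Monte Carlo approximation of the posterior

# Dummy conditionals so the sketch runs; the real ones follow from the plate model.
rng = np.random.default_rng(0)
dummy = lambda *args: rng.normal()
print(len(gibbs((0.0, 0.0, 0.0), dummy, dummy, dummy)))     # 400 collected samples
```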
- The decoder represents the function X = f(z) s.t. X = [6z²+4, 8z²-8]. We start with z = 1.2 and calculate the corresponding X (call it X0). This X0 is now passed to the encoder, which estimates u(X) = (X2-X1)/2 = z²-6 and S(X) = (4X1-3X2)/2 = 20.
- Now, to calculate the loss, we need to see how well the decoder can reconstruct X0 from the encoder's output. We need to sample Z ~ N(u(X0), S(X0)) [i.e. Z = u(X0) + e·S(X0)], then compute X = f(Z), and then the loss ||X-X0||² + KL(N(u(X0), S(X0)) || N(0,1)). To calculate the new value of Z, we use the reparameterization trick, for which we are provided with 5 values of e. We calculate the loss for each of them and then take the average (a numeric sketch is given below).
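A numeric sketch of this computation, assuming S(X) is used directly as the standard deviation in Z = u + e·S; the five epsilon values are hypothetical stand-ins for the values given in the paper:

```python
import numpy as np

f = lambda z: np.array([6 * z**2 + 4, 8 * z**2 - 8])     # decoder X = f(z)
u = lambda X: (X[1] - X[0]) / 2                           # encoder mean
S = lambda X: (4 * X[0] - 3 * X[1]) / 2                   # encoder scale

X0 = f(1.2)                                               # X0 = [12.64, 3.52]
mu, s = u(X0), S(X0)                                      # mu = -4.56, s = 20

def kl_gauss(mu, s):
    """KL( N(mu, s^2) || N(0, 1) ) for scalar Gaussians."""
    return 0.5 * (mu**2 + s**2 - 1.0) - np.log(s)

eps = [-1.0, -0.5, 0.0, 0.5, 1.0]                         # assumed epsilon values
losses = []
for e in eps:
    z = mu + e * s                                        # reparameterization trick
    x = f(z)
    losses.append(np.sum((x - X0) ** 2) + kl_gauss(mu, s))
print(np.mean(losses))                                    # averaged VAE loss
```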
Q5. i) Develop a Normalizing Flow Model that takes an input vector Z (1×D) and converts it into a vector of the form [a1Z1, ..., adZd, b/Zd+1, ..., b/ZD] at each stage. The original input is Z ~ N(0, I), where 0 is the D-dim zero vector and I is the D×D identity matrix. Derive the PDF of the output vector X after 2 stages of the above transformation.
- Need to derive the expression for X in terms of Z. This will help to calculate p(X) in terms of p(Z), which is known, via the change-of-variables formula (sketched below). For (ii), need to write the joint PDF p(X1)*p(X2)*...*p(XN), and then maximize it w.r.t. the parameters.
[5+5=10 marks]
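A change-of-variables sketch for Q5(i) (the dimension and the values of a, b, d below are illustrative assumptions): each stage has a diagonal Jacobian, so the log-determinants simply add across the two stages.

```python
import numpy as np
from scipy.stats import norm

def forward(z, a, b, d):
    """One stage: x_i = a_i z_i for i < d, x_i = b / z_i for i >= d."""
    x = np.concatenate([a * z[:d], b / z[d:]])
    # log|det dx/dz| = sum_i log|a_i| + sum_i log(b / z_i^2)
    log_det = np.sum(np.log(np.abs(a))) + np.sum(np.log(b) - 2 * np.log(np.abs(z[d:])))
    return x, log_det

def log_px_two_stages(z, a, b, d):
    """log p(X) after two stages: log p(Z) minus the accumulated log-dets."""
    log_pz = norm.logpdf(z).sum()          # Z ~ N(0, I)
    x1, ld1 = forward(z, a, b, d)
    x2, ld2 = forward(x1, a, b, d)
    return x2, log_pz - ld1 - ld2

D, d = 4, 2
a = np.array([2.0, 0.5])                   # assumed scaling coefficients a_1..a_d
b = 3.0                                    # assumed constant b
z = np.random.default_rng(0).normal(size=D)
x, log_px = log_px_two_stages(z, a, b, d)
print(x, log_px)
```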
Q6. Consider a GAN whose generator has the same architecture as the decoder of the VAE
in Q4. The discriminator is a logistic regression classifier, whose weight vector is [2 1] with
bias 0. The dataset has 5 observations: [(1, -2), (2, -4), (-2, 4), (5, -10), (-3, 6)].
i) Drawing 5 samples from the generator using noise values Z = {-0.5, 0.8, 1.3, -2.2, 0.1}, evaluate the GAN objective function.
ii) Suggest new weights for the generator so that the objective function improves w.r.t. the generator. Similarly, suggest new weights for the discriminator so that the objective function improves w.r.t. the discriminator.
iii) Explain the GAN objective function from the perspective of J-S divergence.
[4+4+2=10 marks]
[I have given a max of 6 marks for doing the first part correctly]
- The "decoder" from Q4 needs to be used, not the "encoder". For each value of Z, it gives us X = f(Z), whose formula is again [6z²+4, 8z²-8]. Thus we get 5 values of X (drawn from Xgen). We also have 5 samples from Xdata. To calculate the GAN objective function, we calculate the average values of log(D(x)) and log(1-D(x)) using these, where D(x) = 1/(1+exp(-w·x)) with w = [2 1]. This gives us the GAN objective function value (a numeric sketch follows below).
- We can easily see that the true data is of the form (z, -2z) while we are generating [6z²+4, 8z²-8]. So, to improve the objective w.r.t. the generator, we must change the weights of the decoder accordingly. Similarly, we find that the discriminator is totally confused (D(x) = 0.5) for Xdata, since w·x = 2z - 2z = 0 for every real point. So its weights have to be changed so that it has a high response for Xdata (D(x) ≈ 1).
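A numeric sketch of the Q6(i) computation, evaluating V = E_data[log D(x)] + E_gen[log(1-D(x))] with the Q4 decoder as the generator (stable log-sigmoid forms are used to avoid overflow):

```python
import numpy as np

decoder = lambda z: np.array([6 * z**2 + 4, 8 * z**2 - 8])  # generator = Q4 decoder
w = np.array([2.0, 1.0])                                     # discriminator weights, bias 0

# log D(x) = log sigmoid(w·x) and log(1 - D(x)) = log sigmoid(-w·x), computed stably
logD   = lambda x: -np.logaddexp(0.0, -(w @ x))
log1mD = lambda x: -np.logaddexp(0.0,  (w @ x))

X_data = np.array([[1, -2], [2, -4], [-2, 4], [5, -10], [-3, 6]], dtype=float)
Z = [-0.5, 0.8, 1.3, -2.2, 0.1]
X_gen = [decoder(z) for z in Z]

V = np.mean([logD(x) for x in X_data]) + np.mean([log1mD(x) for x in X_gen])
print(V)        # GAN objective value; note D(x) = 0.5 on every real point
```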
Q7. Consider a labelled dataset as given below. Two models have been used to generate
10 observations each, with class labels. Using different measures of generative model
evaluation, compare the two models against the training data. Use a suitably trained
Bayesian Classifier for classification purposes. [10 marks]
TrainID 1 2 3 4 5 6 7 8 9 10
X 21.5 24.8 29.4 27.6 -12.2 -15.7 -13.6 -11.1 0.5 -0.7
Y A A A A B B B B C C
M1ID 1 2 3 4 5 6 7 8 9 10
X 12.1 18.3 16.7 14.5 13.6 -7.4 -1.2 -4.6 -6.8 -2.0
Y A A A A A B B B B B
M2ID 1 2 3 4 5 6 7 8 9 10
X 4.8 1.2 3.4 -2.4 -3.2 -1.6 1.8 0.3 -0.5 -1.2
Y A A A B B B C C C C
- We can use the standard criteria like diversity, sharpness and inception score, as discussed in class. For sharpness, we need a classifier trained on the original data to classify the generated samples, but the classifier should be probabilistic (a Bayesian classifier is specified). For each of the 20 generated observations, the classifier will give a probability distribution over class labels, whose entropy can be calculated. The average entropy value is then calculated separately for the M1 samples and the M2 samples, and compared for sharpness (lower entropy: higher sharpness). For diversity, the generated class-label distributions should be compared to the original class-label distribution w.r.t. cross-entropy. Other criteria (e.g. based on mean/variance, etc.) are acceptable as long as they are applied consistently. A minimal sketch is given below.
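A minimal sketch of the comparison (1-D class-conditional Gaussian Bayes classifier; MAP labels are used for the generated class-label distribution, which is one consistent choice among several acceptable ones):

```python
import numpy as np

train = {'A': [21.5, 24.8, 29.4, 27.6],
         'B': [-12.2, -15.7, -13.6, -11.1],
         'C': [0.5, -0.7]}
M1 = [12.1, 18.3, 16.7, 14.5, 13.6, -7.4, -1.2, -4.6, -6.8, -2.0]
M2 = [4.8, 1.2, 3.4, -2.4, -3.2, -1.6, 1.8, 0.3, -0.5, -1.2]

classes = sorted(train)
means  = np.array([np.mean(train[c]) for c in classes])
stds   = np.array([np.std(train[c]) for c in classes]) + 1e-6   # guard against zero variance
priors = np.array([len(train[c]) for c in classes], dtype=float)
priors /= priors.sum()

def posterior(x):
    """p(class | x) under the class-conditional Gaussian Bayes classifier."""
    lik = np.exp(-0.5 * ((x - means) / stds) ** 2) / stds
    p = priors * lik
    return p / p.sum()

def avg_entropy(xs):
    """Average predictive entropy (lower = sharper generated samples)."""
    return np.mean([-np.sum(p * np.log(p + 1e-12)) for p in map(posterior, xs)])

def label_dist(xs):
    """Empirical class-label distribution of generated samples (MAP labels)."""
    counts = np.zeros(len(classes))
    for x in xs:
        counts[np.argmax(posterior(x))] += 1
    return counts / counts.sum()

for name, xs in [('M1', M1), ('M2', M2)]:
    q = label_dist(xs)
    print(name, 'sharpness (avg entropy):', round(avg_entropy(xs), 3),
          'diversity (cross-entropy vs train labels):',
          round(float(-priors @ np.log(q + 1e-12)), 3))
```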
Q8. i) Consider an online clustering problem, where the data arrives one by one and may be placed either in an existing cluster or in a new cluster, according to the Chinese Restaurant Process. Considering that the mixture components are N(u, 25) with a base distribution of N(0, 25) on 'u' and α = 2, demonstrate how the dataset below will be clustered according to the CRP. How can we control the number of clusters formed?
21.5 -15.7 0.5 -12.2 27.6 24.8 -13.6 -11.1 29.4 -0.7
- For online clustering, we cannot use Gibbs Sampling. But for each datapoint, we will calculate its clustering distribution (the probability of joining an existing cluster or creating a new one). For these calculations we need the CRP formula: P(Zn+1 = k) ∝ (nk/(n+α))·N(x; uk, 25) for an existing cluster k, and P(Zn+1 = new) ∝ (α/(n+α))·N(x; 0, 25+25) for a new cluster (the base distribution N(0, 25) convolved with the component N(u, 25)). 'uk' for each cluster is estimated as the mean of the datapoints already assigned to the cluster. The number of clusters formed is controlled through α: a larger α makes opening a new cluster more likely. A step-by-step sketch is given below.
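A step-by-step sketch of the online CRP assignment with the given data (MAP assignments are used here for a deterministic demonstration; sampling from the assignment distribution is also acceptable):

```python
import numpy as np
from scipy.stats import norm

data = [21.5, -15.7, 0.5, -12.2, 27.6, 24.8, -13.6, -11.1, 29.4, -0.7]
alpha, comp_var, base_var = 2.0, 25.0, 25.0

clusters = []                                   # each cluster = list of its points
for n, x in enumerate(data):                    # n = number of points already seated
    scores = [(len(c) / (n + alpha)) * norm.pdf(x, np.mean(c), np.sqrt(comp_var))
              for c in clusters]                # join existing cluster k
    scores.append((alpha / (n + alpha)) *
                  norm.pdf(x, 0.0, np.sqrt(comp_var + base_var)))   # open a new cluster
    k = int(np.argmax(scores))
    if k == len(clusters):
        clusters.append([x])
    else:
        clusters[k].append(x)
print(clusters)       # increasing alpha makes the "new cluster" option more likely
```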
- The main thing to note here is that the (x, y, z) vector denotes the locations where we have measurements, so we calculate the mean function and covariance function accordingly, e.g. Σ(x, x') = exp(-||x-x'||²) where x, x' are 3D location vectors (a small sketch follows).
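A small sketch of the covariance construction over 3-D measurement locations (the example points are illustrative):

```python
import numpy as np

def kernel(x, xp):
    """Squared-exponential covariance: Σ(x, x') = exp(-||x - x'||^2)."""
    return np.exp(-np.sum((x - xp) ** 2))

locs = np.array([[0.0, 0.0, 0.0],               # assumed 3-D measurement locations
                 [1.0, 0.0, 0.0],
                 [0.0, 2.0, 1.0]])
K = np.array([[kernel(a, b) for b in locs] for a in locs])
print(K)                                        # GP covariance matrix over the locations
```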