AI60201 Module 3 & 4 Problems

The document contains a series of questions related to various machine learning models, including binary image generation, variational autoencoders, Deep Boltzmann Machines, Normalizing Flows, GANs, and clustering with Dirichlet Processes. Each question requires mathematical derivations, probability distributions, and parameter estimations based on given observations and model architectures. The document also discusses advanced concepts such as Gibbs Sampling, Chinese Restaurant Processes, and Gaussian Processes in the context of real-valued observations.

Q1. I want to generate a 3x3 binary image by sequentially generating pixels row-wise.

The first pixel X(1,1) follows Bernoulli(0.5). Each subsequent pixel X(i,j) follows a Bernoulli distribution
with parameter h(i,j) equal to 0.5 times the mean of the previously generated pixel values {X(<=i,<=j)},
i.e. those above and to the left of it.

i) Write a general expression for p(X), where X is any 3x3 binary image, according to this
model.
ii) Calculate the probability of generating an image where the central pixel (i.e. X(2,2)) is
different from the remaining 8 pixels (which are all equal).
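A minimal brute-force sketch of this model (assuming the sub-rectangle {X(<=i,<=j)} excludes the pixel currently being generated) that evaluates p(X) for any image and the event asked for in (ii):

```python
import numpy as np

def image_prob(X):
    """p(X) for a 3x3 0/1 array X under the row-wise sequential model.
    h(i,j) = 0.5 * mean of the already-generated pixels in the sub-rectangle
    above/left of (i,j); the first pixel uses h = 0.5."""
    p = 1.0
    for i in range(3):
        for j in range(3):
            if i == 0 and j == 0:
                h = 0.5
            else:
                prev = [X[a, b] for a in range(i + 1) for b in range(j + 1)
                        if (a, b) != (i, j)]
                h = 0.5 * np.mean(prev)
            p *= h if X[i, j] == 1 else 1 - h
    return p

# Part (ii): central pixel differs from the eight (equal) surrounding pixels.
total = 0.0
for centre, rest in [(1, 0), (0, 1)]:
    X = np.full((3, 3), rest)
    X[1, 1] = centre
    total += image_prob(X)
print(total)
```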

Q2. A variational autoencoder can generate data X as 2x2 matrices as follows: i) Two random variables
Z1, Z2 are sampled from N(a,1) and N(b,1) independently, ii) these are passed to a neural network
with 1 hidden layer which produces 4 output values – X(1,1), X(1,2), X(2,1), X(2,2). The decoder
network architecture is given below.

i) What is the probability distribution over the space of 2x2 real-valued matrices that is
induced by the autoencoder?
ii) I have N observations from the VAE: [X1, X2, …, XN] (all 2x2 matrices). Assuming all edges
between Z and V have equal weight ‘w1’, and all edges between V and X also have equal
weight ‘w2’, estimate the parameters a, b, w1, w2.
iii) Now consider another encoder network with 1 hidden layer of 3 nodes and 2 output
nodes. The edge weights are again constrained in the same way as in the decoder (as
described in (ii)). The output nodes represent the means of the Gaussian-distributed
codes (the variance is fixed to 1). Given the N observations, explain how the encoder and
decoder parameters will be estimated, with the necessary derivations.

Solution Sketch:

i) Z1~N(a,1), Z2~N(b,1). V1=w1*(Z1+Z2), V2=w1*Z1, V3=w1*Z2, so V1~N(w1(a+b), 2w1²),
V2~N(w1a, w1²), V3~N(w1b, w1²). Also, X11=w2*V1, X12=w2*V2, X21=w2*V2, X22=w2*(V1+V3)
= w1w2*(Z1+2Z2).

So the marginals are X11~N(w1w2(a+b), 2w1²w2²), X12~N(w1w2a, w1²w2²), X21~N(w1w2a, w1²w2²),
X22~N(w1w2(a+2b), 5w1²w2²). Note that the four components are not independent: all of them are linear
functions of (Z1, Z2), and X12 = X21 exactly, so the induced p(X) is a degenerate (rank-2) jointly Gaussian
distribution over 2x2 matrices whose marginals are the ones above.

ii) From the observations, calculate sample means [m11, m12, m21, m22] and sample
variances [s11, s12, s21, s22]. Considering N large enough, we can have equations like
w1w2(a+b) = m11, w1w2a = m12 = m21, w1w2(a+2b) = m22 etc. Solving these, we can find
a, b, and w1w2. We can estimate w1, w2 individually if we have some prior on them.
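A short moment-matching sketch for (ii), assuming the observations are available as an array `Xs` of shape (N, 2, 2); only the products w1w2·a, w1w2·b and |w1w2| are identifiable without a prior on w1 or w2:

```python
import numpy as np

def estimate_from_moments(Xs):
    """Xs: array of shape (N, 2, 2) holding the N observed matrices."""
    m = Xs.mean(axis=0)                    # sample means m11, m12, m21, m22
    s = Xs.var(axis=0)                     # sample variances s11, s12, s21, s22
    w1w2a = 0.5 * (m[0, 1] + m[1, 0])      # X12 and X21 share mean w1*w2*a
    w1w2b = m[0, 0] - w1w2a                # since m11 = w1*w2*(a+b)
    w1w2 = np.sqrt(0.5 * (s[0, 1] + s[1, 0]))   # Var(X12) = w1^2*w2^2; sign not identifiable
    a, b = w1w2a / w1w2, w1w2b / w1w2
    # w1 and w2 individually cannot be separated from the data alone; a prior on
    # one of them is needed to split the product w1*w2.
    return a, b, w1w2
```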

iii) We first develop the augmented dataset [(X1, e1, f1), …, (XN, eN, fN)] by sampling the
noise ei~N(0,1), fi~N(0,1) (for the reparameterization trick). The loss function L(Xi) has two
parts: a reconstruction error L1(Xi) and a KL divergence L2(Xi). For input Xi to the encoder,
we can easily derive expressions for the two outputs, say c(Xi) and d(Xi). So the code
variables have distributions N(c(Xi),1) and N(d(Xi),1), while according to the model these
are N(a,1) and N(b,1). So L2(Xi) = ½*(a-c(Xi))² + ½*(b-d(Xi))² [using the formula for KL
divergence between two unit-variance Gaussians]. Again, the code for Xi as calculated by the encoder is
[c(Xi)+ei, d(Xi)+fi], which is equivalent to sampling from N(c(Xi),1) and N(d(Xi),1). The decoder then
calculates X’ where X’(1,1)=w2*w1*(c(Xi)+d(Xi)+ei+fi), X’(1,2)=X’(2,1)=w2*w1*(c(Xi)+ei), and
X’(2,2)=w2*w1*(c(Xi)+ei+2*d(Xi)+2*fi). Accordingly, we have L1(Xi)=||X’-Xi||². We now
need to calculate the gradient of L=Σi(L1(Xi)+L2(Xi)) with respect to each parameter in both the
encoder and the decoder, and run gradient descent.
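A minimal end-to-end sketch of this optimization, using PyTorch autograd in place of hand-derived gradients. The fully connected, weight-shared linear encoder below (hypothetical scalar weights u1, u2) is an assumed architecture, since the encoder figure is not reproduced here:

```python
import torch

def train_vae(X, steps=2000, lr=1e-2):
    """X: tensor of shape (N, 2, 2). Returns fitted (a, b, w1, w2)."""
    a  = torch.zeros((), requires_grad=True)   # prior mean of Z1
    b  = torch.zeros((), requires_grad=True)   # prior mean of Z2
    w1 = torch.ones((), requires_grad=True)    # shared decoder weight, Z -> V
    w2 = torch.ones((), requires_grad=True)    # shared decoder weight, V -> X
    u1 = torch.ones((), requires_grad=True)    # shared encoder weight, X -> hidden (assumed)
    u2 = torch.ones((), requires_grad=True)    # shared encoder weight, hidden -> code means (assumed)
    opt = torch.optim.SGD([a, b, w1, w2, u1, u2], lr=lr)
    for _ in range(steps):
        # Assumed fully connected, weight-shared linear encoder: each hidden unit sees
        # u1*sum(X) and both code means equal u2*(sum of the 3 hidden units). With this
        # sharing the two code means coincide; the figure's architecture may differ.
        s = X.sum(dim=(1, 2))
        c = d = u2 * 3 * (u1 * s)                          # c(Xi), d(Xi)
        e, f = torch.randn_like(c), torch.randn_like(c)    # reparameterization noise
        z1, z2 = c + e, d + f
        # Decoder exactly as in the problem statement.
        v1, v2, v3 = w1 * (z1 + z2), w1 * z1, w1 * z2
        Xr = torch.stack([w2 * v1, w2 * v2, w2 * v2, w2 * (v1 + v3)], dim=1).view(-1, 2, 2)
        recon = ((Xr - X) ** 2).sum(dim=(1, 2))            # L1(Xi)
        kl = 0.5 * (a - c) ** 2 + 0.5 * (b - d) ** 2       # L2(Xi)
        loss = (recon + kl).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return a.item(), b.item(), w1.item(), w2.item()
```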

Q3. Consider a Deep Boltzmann Machine to model 4D real-valued observations. There are 3 hidden
layers, with 2, 2, 1 hidden nodes, where the 2 nodes on the lowest hidden layer are real-valued in (0,1),
while the other 3 are binary. The edge potential functions are defined as usual, i.e. S(x,y)=exp(-x*y*w)
where ‘w’ is the edge weight.

i) Calculate the density p(X) over the 4D real space as defined by the model (assuming the
edge weights are all 1).
ii) We are provided with N observations [X1, ….. XN], each of which is a 4D vector. How do we
estimate the suitable model parameters (edge weights)?

Solution Sketch:

In addition to the observation X=[X1,X2,X3,X4] there are latent variables [Y1,Y2] (real), [Z1,Z2] (binary),
and V (binary). With all edge weights equal to 1,

p(X, Y, Z, V) ∝ S(X1,Y1)*…*S(X4,Y1)*S(X1,Y2)*…*S(X4,Y2)*S(Y1,Z1)*S(Y1,Z2)*S(Y2,Z1)*S(Y2,Z2)*S(Z1,V)*S(Z2,V)

= exp(-(X1+X2+X3+X4)*(Y1+Y2) - (Y1+Y2)*(Z1+Z2) - (Z1+Z2)*V),

where the proportionality constant is the partition function obtained by integrating/summing this expression
over all variables. To obtain p(X), we marginalize the latent variables: Y1, Y2 are eliminated by integration
over (0,1), while Z1, Z2, V are summed over {0,1}.

For edge weight estimation, we need the approach of computing gradients using contrastive
divergence, as discussed. For this, we need to sample different variables through Gibbs Sampling, for
which we need to derive the distributions, such as p(V|Z1,Z2), p(Z1|V,Y1,Y2), p(Y1|Z1,Z2,X1,X2,X3,X4)
etc. These are easy to calculate based on the edge potential functions.
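A small numerical sketch of the marginalization in (i), assuming all edge weights are 1 and using a simple midpoint grid for the integrals over Y1, Y2; the result is still only proportional to p(X) until the global partition function is computed as well:

```python
import itertools
import numpy as np

def unnorm_joint(X, Y, Z, V, w=1.0):
    """exp(-w*(sum(X)*(Y1+Y2) + (Y1+Y2)*(Z1+Z2) + (Z1+Z2)*V)), the unnormalized joint."""
    sX, sY, sZ = sum(X), sum(Y), sum(Z)
    return np.exp(-w * (sX * sY + sY * sZ + sZ * V))

def unnorm_pX(X, grid=100):
    """Marginal of a 4D observation X, up to the global partition function:
    Y1, Y2 integrated over (0,1) by a midpoint rule, Z1, Z2, V summed over {0,1}."""
    ys = (np.arange(grid) + 0.5) / grid
    total = 0.0
    for y1 in ys:
        for y2 in ys:
            for z1, z2, v in itertools.product([0, 1], repeat=3):
                total += unnorm_joint(X, (y1, y2), (z1, z2), v)
    return total / grid ** 2

print(unnorm_pX([0.5, 1.0, -0.3, 0.2]))
```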

Q4. Consider a Normalizing Flow model, where we start with a 2x2 matrix Z where each entry follows
N(0,1). In each step, we carry out the operation Z(i+1) = A(i)*Z(i)+B(i) where A(i) is an invertible 2x2
matrix, and B(i) is any 2x2 matrix.

i) In the special case that A(i)=i*I (I is the 2x2 identity) and B(i)=[[i 0], [0 i]] at each step, calculate
the distributions of Z(2), Z(3), etc.
ii) Given any observation X (a 2x2 matrix), calculate the corresponding Z in the general case
of A and B.
iii) Given a set of observations of 2x2 real matrices (X1, …. , XN), and assuming A(i)=a(i)*I,
B(i)=[[b(i) 0], [0 b(i)]], discuss how to estimate the parameters by maximum likelihood.
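A sketch of the likelihood computation for part (iii), assuming A(i)=a(i)*I and B(i)=b(i)*I as stated, so each step scales every entry of Z by a(i) and shifts only the diagonal entries by b(i). Maximizing this over the a(i), b(i) (e.g. by a grid search or gradient ascent) gives the maximum-likelihood estimate:

```python
import numpy as np

def inverse_flow(X, a, b):
    """Invert K steps Z(i+1) = a(i)*Z(i) + b(i)*I for a 2x2 observation X."""
    Z = np.array(X, dtype=float)
    for ai, bi in zip(reversed(a), reversed(b)):
        Z = (Z - bi * np.eye(2)) / ai
    return Z

def log_likelihood(Xs, a, b):
    """Change-of-variables log-likelihood under the N(0,1) base on each entry.
    Each step scales all 4 entries by a(i), so its log|det Jacobian| is 4*log|a(i)|."""
    ll = 0.0
    for X in Xs:
        Z0 = inverse_flow(X, a, b)
        ll += -0.5 * np.sum(Z0 ** 2) - 2 * np.log(2 * np.pi)   # log N(vec(Z0); 0, I)
        ll -= 4 * np.sum(np.log(np.abs(a)))
    return ll
```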
Q5. Consider a variable X0 ~ U(4,5). It is subjected to diffusion q(Xt|X0) according to a schedule (β1, β2
…..) where 0<βi<1. Calculate the marginal distribution of the diffusion stages X1, X2, …. Based on these,
calculate the denoising distributions q(Xt|Xt+1). Discuss how the denoising distribution can take a
sample from N(0,1) and convert it into a sample from U(4,5).
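A quick simulation sketch, assuming the standard DDPM forward step X_t = sqrt(1-β_t)·X_{t-1} + sqrt(β_t)·ε (the problem only says "diffusion according to a schedule"), which has marginal q(X_t|X_0) = N(sqrt(ᾱ_t)·X_0, 1-ᾱ_t) with ᾱ_t = Π(1-β_i); it illustrates how the U(4,5) start is pushed towards N(0,1):

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Simulate X_t = sqrt(1-beta_t)*X_{t-1} + sqrt(beta_t)*eps at every stage."""
    xs = [x0]
    for beta in betas:
        xs.append(np.sqrt(1 - beta) * xs[-1] + np.sqrt(beta) * rng.standard_normal(x0.shape))
    return xs

rng = np.random.default_rng(0)
x0 = rng.uniform(4, 5, size=100_000)      # samples from U(4,5)
betas = np.full(100, 0.1)                 # a hypothetical schedule
xs = forward_diffusion(x0, betas, rng)
# With alpha_bar_t = prod(1-beta_i), q(X_t|X_0) = N(sqrt(alpha_bar_t)*X_0, 1-alpha_bar_t),
# so as alpha_bar_t -> 0 the marginal approaches N(0,1) regardless of the U(4,5) start.
print(xs[-1].mean(), xs[-1].var())
```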

Q6. The generator of a GAN takes as input a random variable Z~N(0,1), and maps it to a 2D vector X
by a simple linear transformation with parameters (w1,w2). Find the generator’s distribution pGEN.

Now, we want to distinguish between vectors based on which component is larger using a binary label.
How do we make the generator into a conditional generator?

The data distribution (pDATA) is a GMM where the two modes are (100,1) and (1,100) with variance
20 each. We now need to build a discriminator based on logistic regression. Taking a random initial
value of the discriminator’s parameter, calculate the GAN objective by taking N samples from pGEN
and pDATA. Optimize the discriminator’s parameter w.r.t. these samples, and re-calculate the GAN
objective.

For generating Gaussian samples, use samples drawn from N(0,1): [0.54, 1.8, -2.26, 0.86, 0.32, -1.31,
-0.43, 0.34, 3.58, 2.77, -1.35, 3.03, 0.73, -0.06, 0.71, -0.21, -0.12, 1.49, 1.41, 1.42]
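A numerical sketch of the discriminator step. The linear generator z ↦ (w1·z, w2·z), the initial parameter values, the learning rate, and the plain gradient-ascent loop are illustrative choices, not the unique intended ones:

```python
import numpy as np

rng = np.random.default_rng(0)

z = np.array([0.54, 1.8, -2.26, 0.86, 0.32, -1.31, -0.43, 0.34, 3.58, 2.77,
              -1.35, 3.03, 0.73, -0.06, 0.71, -0.21, -0.12, 1.49, 1.41, 1.42])

# Generator: an assumed linear map z -> (w1*z, w2*z) with arbitrary weights.
w1, w2 = 1.0, 2.0
x_gen = np.stack([w1 * z, w2 * z], axis=1)                  # samples from pGEN

# Data: GMM with means (100,1) and (1,100), variance 20 per component (as stated).
means = np.array([[100.0, 1.0], [1.0, 100.0]])
comp = rng.integers(0, 2, size=len(z))
x_data = means[comp] + np.sqrt(20) * rng.standard_normal((len(z), 2))

def discriminator(x, theta):
    """Logistic-regression discriminator D(x) = sigmoid(theta . [x1, x2, 1])."""
    s = x @ theta[:2] + theta[2]
    return 1.0 / (1.0 + np.exp(-s))

def gan_objective(theta):
    """V(D) = E_data[log D(x)] + E_gen[log(1 - D(x))], estimated from the samples."""
    eps = 1e-12
    return (np.mean(np.log(discriminator(x_data, theta) + eps)) +
            np.mean(np.log(1 - discriminator(x_gen, theta) + eps)))

theta = np.array([0.01, -0.01, 0.0])            # arbitrary initial discriminator parameter
print("initial objective:", gan_objective(theta))

# Optimize the discriminator by simple gradient ascent on V(D).
for _ in range(500):
    d_data = discriminator(x_data, theta)
    d_gen = discriminator(x_gen, theta)
    X1 = np.hstack([x_data, np.ones((len(x_data), 1))])
    X0 = np.hstack([x_gen, np.ones((len(x_gen), 1))])
    grad = X1.T @ (1 - d_data) / len(X1) - X0.T @ d_gen / len(X0)
    theta += 0.01 * grad
print("optimized objective:", gan_objective(theta))
```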

Q7. We are interested in the task of 3x3 binary image generation. The class label indicates the number
of white pixels in the image (0-9). The class distribution p(Y) and the class-conditional distribution p(X|Y)
are specified for each class. Now a generative model f is developed, which also produces binary images.
How will you evaluate the model performance? If the class label is instead based on a weighted sum of the
pixels, how will this approach change?

Q8. Consider the following dataset of 3D real vectors. We wish to cluster them based on a Dirichlet
Process, for which we consider a Gaussian base distribution N(0, I) [I: 3x3 identity matrix] for the
Gaussian component mean, and a Gamma base distribution Γ(6, 2) on σ², where σ²I is the Gaussian
component covariance. The DP concentration parameter is α=2. Consider any arbitrary initial clustering
of the vectors. Demonstrate one full pass of Gibbs Sampling based on the CRP, which includes updating
the cluster indices and the Gaussian component parameters.

ID 1 2 3 4 5 6 7 8 9 10 11 12
X1 1.9 -4.6 0.2 7.5 -1.3 1.5 2.4 9.0 -3.0 -6.2 4.8 5.8
X2 2.7 -8.7 0.8 4.5 4.9 2.4 3.6 5.8 -7.2 -9.9 1.7 3.1
X3 4.1 4.3 2.1 -2.7 8.6 3.6 4.0 -1.2 6.2 3.2 -5.2 -2.2
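A simplified sketch of one such Gibbs pass, in the style of Neal's Algorithm 8 with a single auxiliary component. Γ(6, 2) is read here as shape 6 and rate 2, and the non-conjugate σ² update is approximated on a grid; both are assumptions, not the only valid choices:

```python
import numpy as np
from scipy.stats import multivariate_normal, gamma

rng = np.random.default_rng(0)
X = np.array([[1.9, 2.7, 4.1], [-4.6, -8.7, 4.3], [0.2, 0.8, 2.1], [7.5, 4.5, -2.7],
              [-1.3, 4.9, 8.6], [1.5, 2.4, 3.6], [2.4, 3.6, 4.0], [9.0, 5.8, -1.2],
              [-3.0, -7.2, 6.2], [-6.2, -9.9, 3.2], [4.8, 1.7, -5.2], [5.8, 3.1, -2.2]])
alpha = 2.0

def sample_params():
    """Draw (mu, sigma2) from the base: mu ~ N(0, I), sigma2 ~ Gamma(shape 6, rate 2)."""
    return rng.standard_normal(3), gamma.rvs(6, scale=1 / 2, random_state=rng)

z = rng.integers(0, 2, size=len(X))                        # arbitrary initial clustering
params = {k: sample_params() for k in np.unique(z)}

# --- One Gibbs pass over the cluster indices ---
for i in range(len(X)):
    counts = {k: np.sum(z == k) - (k == z[i]) for k in params}
    params = {k: v for k, v in params.items() if counts[k] > 0}   # drop emptied clusters
    new_k = max(list(params) + [-1]) + 1
    params[new_k] = sample_params()                         # auxiliary (potential new) cluster
    ks, w = list(params), []
    for k in ks:
        prior = alpha if k == new_k else counts.get(k, 0)
        mu, s2 = params[k]
        w.append(prior * multivariate_normal.pdf(X[i], mu, s2 * np.eye(3)))
    w = np.array(w) / np.sum(w)
    z[i] = ks[rng.choice(len(ks), p=w)]
    if z[i] != new_k:
        del params[new_k]

# --- Update component parameters given the new assignments ---
for k in list(params):
    Xk = X[z == k]
    _, s2 = params[k]
    # mu | data, sigma2 is conjugate: posterior N(m_n, v_n * I).
    v_n = 1.0 / (1.0 + len(Xk) / s2)
    m_n = v_n * Xk.sum(axis=0) / s2
    mu = m_n + np.sqrt(v_n) * rng.standard_normal(3)
    # The Gamma prior on sigma2 is not conjugate; approximate its posterior on a grid.
    grid = np.linspace(0.1, 30, 300)
    logp = (gamma.logpdf(grid, 6, scale=1 / 2) +
            np.array([np.sum(multivariate_normal.logpdf(Xk, mu, g * np.eye(3))) for g in grid]))
    p = np.exp(logp - logp.max()); p /= p.sum()
    params[k] = (mu, rng.choice(grid, p=p))

print("assignments after one pass:", z)
```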

Q9. Distance-Dependent Chinese Restaurant Process: All observations are real-valued vectors. Each
new observation joins a cluster depending not only on the size of the cluster (as in the usual CRP), but also
on its mean Euclidean distance from all the observations in that cluster. The DDCRP score between an
observation and a cluster is defined by these two quantities, and the assignment distribution is created by
applying a softmax function to these scores. Demonstrate this process on the above dataset.
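A sketch of the sequential assignment, using the Q8 dataset. The problem does not pin down how size and mean distance are combined, so the score below (log of the cluster size minus the mean distance, and log α for a new cluster) is only one assumed choice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[1.9, 2.7, 4.1], [-4.6, -8.7, 4.3], [0.2, 0.8, 2.1], [7.5, 4.5, -2.7],
              [-1.3, 4.9, 8.6], [1.5, 2.4, 3.6], [2.4, 3.6, 4.0], [9.0, 5.8, -1.2],
              [-3.0, -7.2, 6.2], [-6.2, -9.9, 3.2], [4.8, 1.7, -5.2], [5.8, 3.1, -2.2]])

def ddcrp_assign(X, alpha=2.0):
    """Sequentially assign points to clusters with an assumed score
    score(x, cluster) = log(size) - mean ||x - member||, and log(alpha) for a new cluster."""
    clusters = [[0]]                                   # first point starts its own cluster
    for i in range(1, len(X)):
        scores = []
        for c in clusters:
            mean_dist = np.mean([np.linalg.norm(X[i] - X[j]) for j in c])
            scores.append(np.log(len(c)) - mean_dist)
        scores.append(np.log(alpha))                   # option of opening a new cluster
        s = np.array(scores)
        p = np.exp(s - s.max()); p /= p.sum()          # softmax over the scores
        k = rng.choice(len(p), p=p)
        if k == len(clusters):
            clusters.append([i])
        else:
            clusters[k].append(i)
    return clusters

print(ddcrp_assign(X))
```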
Q10. In case of Chinese Restaurant Process, calculate p(Z2|Z1=1). In case of DPMM, calculate
p(Z2|Z1=1, X2, X1).

Solution Sketch: p(Z2|Z1=1) = ∫p(Z2|π)p(π)dπ [Z1=1 holds by definition, i.e. it is a sure event]. In other
words, p(Z2=1|Z1=1) = E[π1], where π1 ~ Beta(1, α), so p(Z2=1|Z1=1) = 1/(1+α) and p(Z2=2|Z1=1) = α/(1+α).

In the case of a DPMM, p(Z2=1|Z1=1, X2, X1) ∝ p(X2|Z2=1, Z1=1, X1)*p(Z2=1|Z1=1, X1) =
f(X2, φ1)*(1/(1+α)), where φ1 denotes the parameter of the first component (or its posterior predictive,
if φ1 is integrated out given X1).

Q11. We are looking to estimate real-valued observations ‘y’ at points with 2D vector representations
‘x’, i.e. y=f(x). We consider a Gaussian Process prior over f, i.e. f ~ GP(u(x), C(x,x’)) with mean
function u(x)=x and covariance function C(x,x’) = exp(-||x-x’||²). Estimate y at
(0,0) based on the following observations of ‘f’. Show how the uncertainty varies as we use more
observations for this estimate.

ID 1 2 3 4 5
X1 3 2 -4 5 -6
X2 4 6 -2 -5 3
Y 0.17 0.14 0.18 0.12 0.13
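A compact sketch of the GP posterior at (0,0), adding one observation at a time to show how the predictive variance changes. Assumption: the GP is taken as zero-mean, since the stated mean function u(x)=x is vector-valued while y is scalar, and a tiny jitter stands in for noise-free observations:

```python
import numpy as np

X = np.array([[3, 4], [2, 6], [-4, -2], [5, -5], [-6, 3]], dtype=float)
y = np.array([0.17, 0.14, 0.18, 0.12, 0.13])
x_star = np.array([0.0, 0.0])

def kernel(a, b):
    """Covariance function from the problem: C(x, x') = exp(-||x - x'||^2)."""
    return np.exp(-np.sum((a - b) ** 2))

def gp_posterior(Xobs, yobs, x_star, jitter=1e-8):
    K = np.array([[kernel(p, q) for q in Xobs] for p in Xobs]) + jitter * np.eye(len(Xobs))
    k_star = np.array([kernel(x_star, p) for p in Xobs])
    mean = k_star @ np.linalg.solve(K, yobs)
    var = kernel(x_star, x_star) - k_star @ np.linalg.solve(K, k_star)
    return mean, var

# Show how the predictive uncertainty changes as more observations are used.
for n in range(1, len(X) + 1):
    m, v = gp_posterior(X[:n], y[:n], x_star)
    print(f"using {n} observations: mean={m:.4f}, variance={v:.4f}")
```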
