Practice Exam Solutions


10-423/10-623 Gen AI Name:

Fall 2024 Andrew ID:


Practice Questions Room:
11/10/24 Seat:
Time Limit: NA Exam Number:

Instructions:
• Verify your name and Andrew ID above.
• This exam contains 26 pages (including this cover page).
The total number of points is 107.
• Clearly mark your answers in the allocated space. If you have made a mistake, cross
out the invalid parts of your solution, and circle the ones which should be graded.
• Look over the exam first to make sure that none of the 26 pages are missing.
• No electronic devices may be used during the exam.
• Please write all answers in pen or darkly in pencil.
• You have NA to complete the exam. Good luck!

Question Points
1. AutoDiff / RNN-LMs 11
2. Transformers and LLMs 13
3. Learning neural language models / Decoding 6
4. Pre-training, fine-tuning / Modern Transformers 13
5. Vision Transformers 9
6. Generative Adversarial Networks (GANs) 5
7. Variational Autoencoders (VAEs) 7
8. Diffusion Models 7
9. In-context Learning 6
10. RLHF 10
11. Text-to-image generation / CLIP 6
12. Prompt2Prompt 10
13. Scaling Laws 4
Total: 107
10-423/10-623 Gen AI Practice Questions - Page 2 of 26 -

1 AutoDiff / RNN-LMs (11 points)


1.1. (1 point) Select one: What is the primary purpose of using the chain rule of
probability in language models?
⃝ To convert raw text into a sequence of tokens
⃝ To generate new tokens from a fixed-length vector
⃝ To improve the accuracy of word embeddings
⃝ To decompose the joint probability of a sequence of words into a product
of conditional probabilities
⃝ To calculate the probability of each token independently of the others
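As a small illustration of the chain-rule factorization that question 1.1 refers to, the sketch below scores a three-token sequence as a product of conditional probabilities. The table `COND` and its values are invented for this example and do not come from the course.

```python
import math

# Hypothetical conditional probabilities p(token | prefix) for a toy model.
COND = {
    ((), "the"): 0.5,
    (("the",), "cat"): 0.2,
    (("the", "cat"), "sat"): 0.1,
}

def sequence_log_prob(tokens):
    """log p(w_1..w_T) = sum over t of log p(w_t | w_1..w_{t-1})."""
    total = 0.0
    for t, tok in enumerate(tokens):
        prefix = tuple(tokens[:t])
        total += math.log(COND[(prefix, tok)])
    return total

# Joint probability of the whole sequence: 0.5 * 0.2 * 0.1
joint = math.exp(sequence_log_prob(["the", "cat", "sat"]))
```

Summing log-probabilities rather than multiplying raw probabilities is a common choice because it avoids underflow on long sequences.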
1.2. (1 point) Select one: Which of the following best describes the main purpose of
backpropagation in neural networks?
⃝ Backpropagation is used to compute the output of the neural network
during inference.
⃝ Backpropagation is used to update the weights of the network by propa-
gating errors from the output layer to the input layer using the chain rule
of calculus.
⃝ Backpropagation is a technique to regularize the model and prevent over-
fitting.
⃝ Backpropagation ensures that the neural network outputs discrete labels
instead of continuous values.

1.3. (2 points) Select multiple: Which of the following are true about Recurrent
Neural Networks (RNNs) in language modeling?
□ RNNs can process sequences of variable length.
□ RNNs maintain context through hidden states.
□ RNNs do not suffer from vanishing and exploding gradient problems.
□ RNNs are capable of capturing long-term dependencies in data, but may
struggle with them without architecture modifications.
□ RNNs use fixed-size input windows to handle sequential data.
1.4. (3 points) Select all that apply:
⃝ One motivation of LSTMs is to improve the vanishing gradient problem
seen with RNNs.
⃝ It is not possible for LSTMs to experience the vanishing gradient problem.
⃝ It is possible for LSTMs to experience the vanishing gradient problem.
⃝ It is not possible for LSTMs to experience the exploding gradient problem.
⃝ It is possible for LSTMs to experience the exploding gradient problem.
1.5. (1 point) Select all that apply: Vanishing gradients can cause weights to shrink
uncontrollably, leading to numerical instability. Which of the following techniques
is commonly used to mitigate vanishing gradients?
□ activation functions such as ReLU
□ batch normalization
□ layer normalization
□ gradient clipping
□ for an RNN: use LSTM or GRU as the recurrent units instead of simple
RNNs

1.6. Consider the sigmoid activation function:

(a) (1 point) Short answer: What would the gradient of the sigmoid be for a
very large input?

(b) (1 point) Short answer: Is this a potential problem when training an RNN?
Explain.

(c) (1 point) Short answer: How does ReLU activation (ReLU(z) = max(0, z))
compare to sigmoid activation? Does it completely solve the problem above (if
any)?
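To experiment with question 1.6 numerically, the following sketch evaluates the derivative of the sigmoid, σ'(z) = σ(z)(1 − σ(z)), at a few inputs, next to the derivative of ReLU. The chosen inputs and the convention that the ReLU derivative is 0 at z = 0 are assumptions made for this example.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_grad(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of max(0, z); taken to be 0 at z == 0 (one common convention).
    return 1.0 if z > 0 else 0.0

# Gradient of the sigmoid at a small, a moderate, and a large input.
grads = {z: sigmoid_grad(z) for z in (0.0, 5.0, 20.0)}
```

Comparing `grads[20.0]` with `relu_grad(20.0)`, and then `relu_grad(-3.0)`, shows how each activation behaves at large positive and at negative inputs.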

2 Transformers and LLMs (13 points)


2.1. (1 point) Select one: Which of the following statements about Transformer Lan-
guage Models is true?
⃝ Transformer computation graphs grow linearly with the number of input
tokens
⃝ Residual connections in Transformers help in solving the vanishing gradi-
ent problem
⃝ Layer normalization in Transformers helps mitigate internal covariate
shift during training
⃝ Transformers use a single attention head to process input sequences
2.2. (2 points) Select one: What is the primary reason for using multi-head attention
instead of single-head attention in Transformer models?
⃝ To allow the model to compute attention more efficiently.
⃝ To capture different aspects or patterns of the input sequences by using
multiple attention heads.
⃝ To reduce the overall complexity of the Transformer model.
⃝ To enable the model to focus solely on the most important tokens in the
sequence.
2.3. (2 points) Select one: What is the main reason that adding residual connections
to a transformer is useful?
⃝ Improve the long-term memory of the model by helping to remember
information from past timestamps.
⃝ Instead of having to learn a complex transformation, they allow the model
to learn an additive modification of the input.
⃝ Instead of having to learn a complex transformation, they allow the model
to learn a multiplicative modification of the input.
⃝ To fix the issue that attention is position invariant by learning embeddings
related to the position of each word in the input.

2.4. For the questions below, write the correct option from the list of models below -
1. An encoder model
2. A decoder model
3. A sequence to sequence model
(a) (1 point) Which of the above models would you use to complete prompts with
generated text?

(b) (1 point) Which of the above models would you use for summarizing text?

(c) (1 point) Which of the above models would you use for classifying text inputs
according to certain labels?

2.5. (1 point) Select the correct answer: What is one of the main computational
challenges in scaling Transformers for large language models?
⃝ The fixed size of the positional encoding limits the model’s ability to
process long sequences.
⃝ The quadratic complexity of the self-attention mechanism with respect to
sequence length.
⃝ Overfitting due to the large number of parameters.
⃝ The inability to parallelize model training.

2.6. (1 point) Select the correct answer: In self-attention, each token of the se-
quence:
⃝ pays attention to itself
⃝ pays attention to all words in the sequence
⃝ pays attention to all other tokens in the sequence except itself
⃝ pays attention to a few tokens in its neighbourhood
2.7. (2 points) Select all that apply: Which of the following statements is true about
LLM training?
□ Since LSTMs generate one token at a time, their training across time
steps cannot be parallelized.
□ Since transformers generate one token at a time, their training across time
steps cannot be parallelized.
□ If a sentence is padded with pad tokens, the transformer learns to generate
the pad tokens.
□ None of the above
2.8. (1 point) True or False: Computing attention values in a local neighbourhood
over the input is the same as applying a convolution filter over the input
⃝ True
⃝ False

3 Learning neural language models / Decoding (6 points)


3.1. (1 point) Select one: Which of the following statements is incorrect?
⃝ RNNs process tokens sequentially, while Transformers process tokens in
parallel.
⃝ RNNs suffer from vanishing gradients for long dependencies, while Trans-
formers can capture long-range dependencies using attention mechanisms.
⃝ RNNs have difficulty with parallelization, whereas Transformers are highly
parallelizable, allowing faster training.
⃝ RNN language models inherently capture the sequential order of informa-
tion, while transformer models do not have this built-in capability.
3.2. (2 points) Select multiple: Which of the following statements is/are correct?
□ Neural language models are typically trained using a maximum likelihood
estimation (MLE) approach.
□ Greedy decoding selects the token with the highest probability at each
step, and always returns the sequence with highest probability.
□ Beam search decoding explores multiple possible sequences at each step
and retains the top beam-size sequences based on their cumulative
probabilities (product of probabilities).
□ During training, language models are optimized directly for decoding
accuracy using beam search.
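The decoding strategies named in question 3.2 can be compared on a toy model. The sketch below is a minimal beam search over a hand-written two-step distribution; setting the beam size to 1 gives greedy decoding. The table `MODEL` and its probabilities are hypothetical values chosen for illustration.

```python
# Hypothetical next-token distributions p(token | prefix) for a two-step model.
MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search(model, beam_size, length):
    """Keep the `beam_size` most probable prefixes at every step."""
    beams = [((), 1.0)]  # (prefix, cumulative probability)
    for _ in range(length):
        candidates = []
        for prefix, prob in beams:
            for token, p in model[prefix].items():
                candidates.append((prefix + (token,), prob * p))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]  # most probable sequence that survived pruning

greedy_seq, greedy_prob = beam_search(MODEL, beam_size=1, length=2)
beam_seq, beam_prob = beam_search(MODEL, beam_size=2, length=2)
```

Comparing `greedy_prob` with `beam_prob` on this table shows whether a locally best first token necessarily leads to the globally most probable sequence.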
3.3. (3 points) Select all that are true:
⃝ Storing previously computed keys and values across timesteps can help
avoid redundant calculations, potentially improving efficiency.
⃝ Because attention computation for each timestep is dependent on previous
timesteps, scaled dot-product attention cannot easily be parallelized.
⃝ Subword tokenization can help with computational tractability in
comparison with other tokenization methods, while eliminating OOV words.
⃝ As a transformer LM computes attention based on previous tokens, its
computation graph grows linearly as the number of input tokens grows.

4 Pre-training, fine-tuning / Modern Transformers (13 points)


4.1. (1 point) Select one: Which of the following is NOT one of the motivations for
LoRA?
□ Large language models are intrinsically low-dimensional.
□ Adapters add latency at test time.
□ Prefix tuning results in improvements to the model that do not
monotonically increase with the number of parameters.
□ None of the above
4.2. (1 point) Select one: Which of the following best describes the rank in LoRA?
□ The number of layers in the model.
□ The dimension of the low-rank matrices A and B.
□ The batch size during training.
□ The learning rate during fine-tuning.
4.3. (1 point) Select all that apply: Compared to full fine-tuning, LoRA typically:
□ Increases the total number of trained parameters.
□ Increases the computation time during training.
□ Decreases the total number of trained parameters.
□ Requires more data to train.
□ Decreases the GPU memory required during training.
□ Improves interpretability of the final model.
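For a concrete sense of the quantities in questions 4.2 and 4.3, the sketch below counts trainable parameters for a single weight matrix. It assumes the commonly described LoRA formulation in which a frozen d × k matrix W is adapted by a product BA, with B of shape d × r and A of shape r × k. The sizes d, k, and r are hypothetical.

```python
def full_finetune_params(d, k):
    # Full fine-tuning updates every entry of the d x k weight matrix.
    return d * k

def lora_params(d, k, r):
    # LoRA trains only B (d x r) and A (r x k); W itself stays frozen.
    return r * (d + k)

# Hypothetical layer size and rank, chosen only for illustration.
d, k, r = 4096, 4096, 8
full = full_finetune_params(d, k)
lora = lora_params(d, k, r)
ratio = full / lora
```

The count covers one matrix only; a whole-model comparison depends on which layers receive LoRA updates.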
4.4. (1 point) In the context of parameter efficient fine-tuning, how do adapter modules
achieve parameter efficiency while still allowing for efficient fine-tuning?

4.5. (2 points) Select all that are true: Which of the following are true regarding
prefix-tuning?
□ Prefix-tuning involves prepending task-specific vectors to the input.
□ Prefix-tuning is typically used to train a single set of parameters that are
shared across multiple tasks.
□ Prefix-tuning requires storing a separate tuned copy of the model for
each task, leading to increased storage overhead as the number of tasks
increases.
□ None of the above
4.6. (2 points) Select one: What is the primary purpose of Rotary Position Embed-
dings (RoPE) in a transformer model?
⃝ To improve the efficiency of inference for a transformer model.
⃝ To encode positional information in a way that is rotation invariant.
⃝ To reduce the dimensionality of the input data.
⃝ To improve the model’s ability to retain long-range dependencies.
4.7. (2 points) Select all that are true: What is the primary advantage of using
convolutional layers in neural networks, especially for image-related tasks?
⃝ Convolutional layers reduce the number of parameters by sharing weights
across the input.
⃝ Convolutional layers ensure the model is invariant to transformations like
rotation and scaling.
⃝ Convolutional layers allow the model to learn feature hierarchies by fo-
cusing on local spatial regions.
⃝ Convolutional layers remove noise from the input data during training.

4.8. (3 points) Select one: Which of the following is not true about auto-encoders?
⃝ Encoders use a non linear activation function to compute an encoding
of the input, which the decoder uses to compute a reconstruction of the
input.
⃝ The output of an autoencoder is exactly the same as its input.
⃝ Autoencoders are an unsupervised learning technique.
⃝ Autoencoders can be trained by using the same backpropagation algo-
rithm used for a 2-layer neural network, where the loss is propagated
back from the reconstruction error to update both the encoder and de-
coder weights.

5 Vision Transformers (9 points)


5.1. (2 points) Select all that are true: In a convolutional neural network:
□ Backpropagation cannot be applied over a max-pool layer as the max
function is not differentiable.
□ Backpropagation cannot be applied over a downsampling layer as the
downsampling function is not differentiable.
□ A 4×4 convolutional kernel over RGB inputs has 16 weights.
□ A 4×4 convolutional kernel over RGB inputs has 48 weights.
5.2. (2 points) Select One: We have the following image and want to apply the
following 3 × 3 kernel, with a padding of 1 and stride of 2. After applying this kernel,
what is the value at position (0, 1) (0-indexed)?

    1 2 2 1 1
    2 4 4 2 3
    1 3 3 1 4
    1 2 3 1 3
    0 1 4 2 0

Figure 1: Image Matrix

    0  1  0
    1 −4  1
    0  1  0

Figure 2: Kernel Matrix

⃝ 1
⃝ 2
⃝ -1
⃝ 0
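One way to check a hand computation for question 5.2 is a direct implementation of a strided, padded 2D convolution. The sketch below assumes zero padding and the usual deep-learning convention of sliding the kernel without flipping it. The question states neither, although flipping makes no difference for this symmetric kernel.

```python
def conv2d(image, kernel, padding, stride):
    """Strided 2D convolution with zero padding (kernel is not flipped)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    # Build the zero-padded copy of the image.
    padded = [[0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for i in range(h):
        for j in range(w):
            padded[i + padding][j + padding] = image[i][j]
    out_h = (h + 2 * padding - kh) // stride + 1
    out_w = (w + 2 * padding - kw) // stride + 1
    out = []
    for oi in range(out_h):
        row = []
        for oj in range(out_w):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += padded[oi * stride + di][oj * stride + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

IMAGE = [
    [1, 2, 2, 1, 1],
    [2, 4, 4, 2, 3],
    [1, 3, 3, 1, 4],
    [1, 2, 3, 1, 3],
    [0, 1, 4, 2, 0],
]
KERNEL = [
    [0, 1, 0],
    [1, -4, 1],
    [0, 1, 0],
]

result = conv2d(IMAGE, KERNEL, padding=1, stride=2)
value_at_0_1 = result[0][1]
```

With a 5 × 5 input, padding 1, and stride 2, the output is 3 × 3, so `result[0][1]` corresponds to the window centred on image position (0, 2).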

5.3. (1 point) In a Vision Transformer (ViT) model, which is an encoder-only
transformer network, the extra learnable [class] embedding is analogous to the [CLS]
token in a BERT model. It is used to aggregate the information from the entire
image into a single vector, which can be used for many downstream tasks such as
object detection, etc.
True or False: The use of the extra learnable [class] embedding at the beginning of
the input sequence, in principle, will lead to a better-performing model than if the
extra learnable [class] embedding was used at the very end of the input sequence.
⃝ True
⃝ False
5.4. Consider an application of attention to an image.
(a) (2 points) Select all that apply: For a flattened sequence $X \in \mathbb{R}^{N \times C}$ of
$N = H \times W$ pixels of an image with height $H$, width $W$, and $C$-dimensional
features, the attention scores are given by $A = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{D}}\right)$, where $Q, K$ are
$D$-dimensional linear projections of the features: $Q = XW_q$, $K = XW_k$, with
$W_q, W_k \in \mathbb{R}^{C \times D}$. If we do not scale the attention scores by $\sqrt{D}$:
□ We might get instability issues during training.
□ The attention scores will not be normalized.
□ We will not use information from nearby pixels.
□ The dot product will have a larger variance.
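The role of the $\sqrt{D}$ factor in part (a) can be made concrete with a short variance calculation. This is a sketch under an added assumption that the question does not state: the components $q_i$ and $k_i$ of a query and a key are independent with zero mean and unit variance.

```latex
\operatorname{Var}\!\left(q^\top k\right)
  = \sum_{i=1}^{D} \operatorname{Var}(q_i k_i)
  = \sum_{i=1}^{D} \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2]
  = D,
\qquad
\operatorname{Var}\!\left(\frac{q^\top k}{\sqrt{D}}\right) = 1 .
```

Under that assumption the unscaled scores have standard deviation $\sqrt{D}$, so the softmax inputs grow in magnitude as $D$ grows.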
(b) (2 points) Select the correct answer: We now subsample the image of the
previous question in the spatial dimensions by a max pooling layer with a
stride of 2, taking the maximum over a window of size 2 × 2, to obtain
the flattened sequence $\hat{X}$. We redefine the keys as $K = \hat{X}W_k$. The
self-attention can then be written as $S = A \times V$, where $V = \hat{X}W_v$ with $W_v \in \mathbb{R}^{D \times D}$.
If we use the subsampled version $\hat{X}$ when defining the keys and values, we
reduce the computational cost of the self-attention by a factor of:
⃝ 1/2
⃝ 1/4
⃝ 1/16
⃝ 1

6 Generative Adversarial Networks (GANs) (5 points)


6.1. (1 point) True or False: In the training process of a Generative Adversarial Net-
work (GAN), if the input noise vector z to the generator Gθ is completely random
and lacks any discernible pattern, it becomes impossible to perform backpropaga-
tion through the generator Gθ .
⃝ True
⃝ False
6.2. (1 point) Select One: Which of the following statements best describes the role
of the discriminator in a Generative Adversarial Network (GAN)?
⃝ The discriminator generates new data samples from the latent space.
⃝ The discriminator updates the generator’s parameters to improve the
quality of generated samples.
⃝ The discriminator distinguishes between real and generated data samples,
providing feedback to the generator.
⃝ The discriminator minimizes the difference between real and generated
data distributions.
6.3. (2 points) Select all that apply: What could serve as input to the generator in
a GAN?
□ noise
□ noise and label
□ text description
□ image
6.4. (1 point) True or False: If the discriminator overfits on the training data, the
generator will produce samples that closely resemble the training images.
⃝ True
⃝ False

7 Variational Autoencoders (VAEs) (7 points)


7.1. (0 points) Select all that apply: Which of the following are true statements
about KL divergence?
□ KL divergence is symmetric: KL(p||q) = KL(q||p).
□ KL(p||q) is minimized when distributions p, q are equivalent, i.e. q(x) =
p(x) for all x ∈ X.
□ Maximizing ELBO is the same as minimizing KL divergence.
□ 0 ≤ KL(p||q) ≤ 1 for all probability distributions p and q.
□ None of the above.
7.2. (1 point) True or False: In Variational Autoencoders (VAEs), we sample from the
latent distribution to generate the output. Since sampling prevents backpropagation
through the network, neural networks cannot be used to build VAEs.
⃝ True
⃝ False
7.3. (2 points) Select all that apply: Which of the following statements about
KL-divergence are correct?
□ KL-divergence measures the proximity of two distributions
□ KL-divergence is maximized when two distributions are identical
□ KL-divergence is not symmetric: KL(q||p) ≠ KL(p||q)
□ KL-divergence is always non-negative
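The properties listed in questions 7.1 and 7.3 can be probed numerically. The sketch below computes the KL divergence between two discrete distributions; the distributions `p` and `q` are arbitrary values chosen for illustration, and the function assumes q is nonzero wherever p is nonzero.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Arbitrary example distributions over two outcomes.
p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)
```

Inspecting `kl_pq`, `kl_qp`, and `kl_pp` gives one data point each for the symmetry, sign, and identical-distribution statements.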
7.4. (1 point) Short answer: Explain the intuitive meanings of the two terms in the VAE
loss function: $L = \sum_i \left( -\mathbb{E}_{q_\theta(z|x^{(i)})}\left[\log p_\phi(x^{(i)}|z)\right] + KL\left(q_\theta(z|x^{(i)}) \,\|\, p(z)\right) \right)$.

7.5. (1 point) Select all that apply: Why must the noise scaling factor $\alpha_t$ in Denoising
Diffusion Probabilistic Models (DDPMs) follow a schedule?
□ To ensure smooth addition of noise, preventing excessive corruption of
data early on.
□ To allow the model to gradually learn how to denoise at different noise
levels.
□ To ensure the reverse process can recover the original data distribution
effectively.
□ To control the amount of noise added during each step, ensuring balanced
corruption across all timesteps.
□ None of the above.
7.6. (1 point) Select One: Suppose you are training a neural network p(x) to
approximate a known but hard-to-define probability distribution q(x). We choose as our
objective function $KL(q \,\|\, p) = \mathbb{E}_q\left[\log \frac{q(x)}{p(x)}\right]$. Which of the following describes how
to select the optimal p(x) to minimize this objective function?
⃝ Select p(x) to be as large as possible in order to minimize $\frac{q(x)}{p(x)}$.
⃝ Select p(x) to be as small as possible in order to maximize $\frac{q(x)}{p(x)}$.
⃝ Select p(x) to be as close as possible to q(x) in order to make $\frac{q(x)}{p(x)} = 1$.
⃝ Select p(x) to be the uniform distribution so this term is independent of
the parameters of the network and can be ignored.
7.7. (1 point) Select One: Let function f(x, N) be the function which takes an image
x and adds Gaussian noise to the image N times such that the output is the Nth
step of forward diffusion in a DDPM model. What is the time complexity of an
optimal implementation of this function? Assume h and w are the image height
and width.
⃝ O(h*w)
⃝ O(log(N)*h*w)
⃝ O(N*h*w)
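For question 7.7, it may help to see the closed form that standard DDPM derivations give for the forward process, $x_N = \sqrt{\bar{\alpha}_N}\,x_0 + \sqrt{1 - \bar{\alpha}_N}\,\epsilon$ with $\bar{\alpha}_N = \prod_{t=1}^{N}(1 - \beta_t)$. The sketch below is one possible implementation, not the course's. The names `betas`, `alpha_bar`, and `forward_diffuse` are hypothetical, and the $\beta_t$ notation is assumed here rather than taken from the exam.

```python
import math
import random

def alpha_bar(betas, n):
    """Cumulative product of (1 - beta_t) for t = 1..n (a single scalar)."""
    prod = 1.0
    for beta in betas[:n]:
        prod *= 1.0 - beta
    return prod

def forward_diffuse(x0, betas, n, rng=random):
    """Sample x_n directly from x_0 with a single pass over the pixels."""
    a_bar = alpha_bar(betas, n)
    scale = math.sqrt(a_bar)
    sigma = math.sqrt(1.0 - a_bar)
    return [[scale * px + sigma * rng.gauss(0.0, 1.0) for px in row] for row in x0]

# Hypothetical 2 x 2 "image" and a short noise schedule.
x0 = [[1.0, 2.0], [3.0, 4.0]]
betas = [0.1, 0.2, 0.3]
x2 = forward_diffuse(x0, betas, 2, random.Random(0))
```

The image is touched once regardless of N. The scalar `alpha_bar` loop runs over N schedule entries here, but it could be precomputed and cached once per schedule.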

8 Diffusion Models (7 points)


8.1. (1 point) Select all that apply: Which of the following will follow a Gaussian
distribution?
□ The sum of two Gaussians
□ The difference of two Gaussians
□ The conditional of a Gaussian with a Gaussian mean
□ The conditional of a Gaussian with a nonlinear mean
□ None of the above
8.2. (1 point) Select all that apply: Which of the following can NOT be used in the
implementation of a diffusion model such as the DDPM or LDM?
□ Rotational positional embeddings
□ Sinusoidal positional embeddings
□ A neural network
□ A UNet model
□ Self-attention
□ A large language model (LLM)
□ A variational autoencoder (VAE)
□ None of the above
8.3. (5 points) Short Answer: Consider the noise-based parameterization of a
diffusion model: in this case, we use a neural network to estimate the noise ($\epsilon$) added
to the initial data point ($x_0$) to produce the current state ($x_t$). This noise-based
parameterization has been empirically observed to yield images of higher quality
compared to other parameterization methods.
How does this noise-based parameterization differ from the alternatives, and why
might this formulation lead to the highest quality images?

9 In-context Learning (6 points)


9.1. (2 points) Select all that are true:
□ In-context learning can involve providing an AI model with examples of the
desired task within the input prompt.
□ In-context learning requires fine-tuning the model on a specific dataset
before it can perform the task.
□ In-context learning allows the model to adapt to new tasks without the
need for additional training.
□ The performance of in-context learning depends on the model’s ability to
understand and generalize from the given examples.
□ In-context learning is limited to tasks that the model has been explicitly
trained on during its initial training phase.
□ The choice and quality of the examples provided in the input prompt can
significantly impact the model’s performance in in-context learning.
9.2. (1 point) True or False: The performance of In-Context Learning is generally
better in larger language models because they are trained on more diverse tasks.
⃝ True
⃝ False
9.3. (2 points) True or False: In-Context Learning can be thought of as a form of
meta-learning, where the model implicitly learns to adapt to tasks through examples
within the prompt.
⃝ True
⃝ False
9.4. (1 point) Short answer: Explain why In-Context Learning can struggle with tasks
that require multi-step reasoning.

10 RLHF (10 points)


10.1. (1 point) True or False: In the RLHF process, the reward model typically has
more parameters than the language model being fine-tuned.
⃝ True
⃝ False
10.2. (1 point) Select one: Which of the following best describes the training process
for the reward model in RLHF?
⃝ It is trained to minimize the perplexity of human-written responses
⃝ It is trained to maximize the likelihood of all possible responses
⃝ It is trained so that higher-ranking responses receive larger rewards than
lower-ranking ones
⃝ It is trained to directly optimize the parameters of the language model
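To make the ranking idea in question 10.2 concrete, the sketch below implements one commonly used pairwise objective for reward models, a Bradley-Terry style loss $-\log \sigma(r_{\text{preferred}} - r_{\text{rejected}})$. This is an assumption about the form of the loss; the course's exact objective may differ, and the function name is hypothetical.

```python
import math

def pairwise_ranking_loss(reward_preferred, reward_rejected):
    """-log sigmoid(r_preferred - r_rejected), computed in a stable way."""
    margin = reward_preferred - reward_rejected
    # -log sigmoid(m) = log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Loss when the preferred response scores higher, equal, or lower.
loss_correct_order = pairwise_ranking_loss(2.0, 0.0)
loss_tied = pairwise_ranking_loss(0.0, 0.0)
loss_wrong_order = pairwise_ranking_loss(0.0, 2.0)
```

The loss depends only on the difference between the two rewards, so it constrains their ordering rather than their absolute values.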
10.3. (1 point) Select one: During the RLHF process, what is the primary objective
function being optimized when training the policy (language model)?
⃝ To minimize the perplexity of human-labeled responses
⃝ To maximize the expected reward from the reward model
⃝ To minimize the KL divergence between model outputs
⃝ To maximize the likelihood of all possible responses
10.4. (1 point) Select all that apply: What does the state space consist of in RLHF?
□ The vocabulary of possible next tokens
□ The set of all possible prompts
□ All possible sequences of tokens
□ The parameters of the language model

10.5. (2 points) Select the correct answer: Which of the following best describes the
primary goal of chain-of-thought prompting?
⃝ To improve the model’s ability to generate coherent and engaging stories
⃝ To enhance the model’s performance on tasks that require multi-step rea-
soning and problem-solving
⃝ To reduce the computational resources required for training large language
models
⃝ To increase the model’s capacity to understand and respond to emotional
cues in natural language
10.6. (2 points) Select all that are true: In the context of Reinforcement Learning
with Human Feedback (RLHF), what steps are involved in the process?
□ Training a reward model based on human rankings of model-generated
responses.
□ Using reinforcement learning to train the model with rewards as “ground
truth” for desirable outputs.
□ Collecting prompts and generating multiple responses for each to be ranked
by humans.
□ Applying unsupervised learning techniques to automatically generate
training data without human intervention.
10.7. (2 points) Select all that are true: Which of the following statements accurately
describe aspects of Instruction Fine-Tuning?
□ It aims to reduce the perplexity of a large training corpus by predicting
the most likely next words in a sequence.
□ It involves fine-tuning Language Models (LMs) on a dataset specifically
created to align the model’s output with human expectations for given
tasks.
□ It uses sources of prompts and responses to build a “chat agent” training
dataset.
□ It incorporates multiple names such as chat fine-tuning, alignment, and
behavioral fine-tuning.

11 Text-to-image generation / CLIP (6 points)


11.1. Select all that apply: In Homework 4 (and very briefly in lecture) we saw an
example of a text-to-image model that used “classifier free guidance”. In this
technique, motivated by the desire to add a temperature parameter to diffusion models,
two diffusion models $\epsilon_\theta(x_t, t, \tau_\theta(y))$ and $\epsilon_\theta(x_t, t)$ are jointly trained, where $y$
represents a text string to condition upon, $t$ a timestep, and $x_t$ a partially noised image.
“Classifier free guidance” adds a parameter $w$ which interpolates between the two
models' outputs at inference time such that
$\tilde{\epsilon}_\theta(x_t, t, \tau_\theta(y)) = (1 - w)\,\epsilon_\theta(x_t, t, \tau_\theta(y)) + w\,\epsilon_\theta(x_t, t)$.
Which of the following would be the effect of generating a single image using this
$\tilde{\epsilon}_\theta(x_t, t, \tau_\theta(y))$ (as compared to generating images using $\epsilon_\theta(x_t, t, \tau_\theta(y))$)? Assume $w$
is positive and less than 1.
□ Less diversity in generated images
□ Higher quality generated images
□ Inference time doubled
□ Worse text-image alignment
11.2. (2 points) Select all that apply: A Text-To-Image Latent Diffusion Model (LDM)
pipeline includes the following components.
□ image autoencoder
□ text encoder
□ UNet
□ tokenizer
□ discriminator
11.3. (1 point) True or False: CLIP trains an image encoder and a text
encoder separately, and uses them jointly only at test time to predict the
correct pairings of (text, image).
⃝ True
⃝ False
11.4. (2 points) Select all that apply: Which of the following statements about
classifier-free guidance are correct?
□ Classifier-free guidance can improve the quality of generated samples.
□ Larger guidance scale always leads to higher quality of generated samples.
□ Larger guidance scale increases the diversity of generated samples.
□ Larger guidance scale decreases the diversity of generated samples.

11.5. (1 point) True or False: In the LDM framework, the latent representations from
the VAE can be directly fed to the diffusion UNet.
⃝ True
⃝ False

12 Prompt2Prompt (10 points)


12.1. Fill in the blank: In Prompt-to-Prompt editing, controlling the cross-attention
maps allows for ________ while maintaining the overall structure of the
source image.
⃝ generating entirely new images
⃝ local edits to specific objects or attributes
⃝ adding random noise to the image
⃝ removing all semantic information from the image
12.2. Fill in the blank: In the Prompt-to-Prompt method, changes to the image are
primarily achieved by modifying ________.
⃝ the latent noise vector
⃝ the cross-attention maps
⃝ the input image resolution
⃝ the number of diffusion steps
12.3. True or False: Using Prompt-to-Prompt to edit real images (such as those
captured by a digital camera or cell phone) is impossible.
⃝ True
⃝ False
12.4. (1 point) Select the correct answer: In the context of attention replacement
editing, why might a mapping matrix be required to be sparse (mostly zeros)?
⃝ To reduce computational complexity
⃝ To emphasize the modifications between the original and edited prompts
⃝ To prevent overfitting during training
⃝ To maintain attention on unchanged tokens while updating the changed
ones

12.5. (2 points) Select the correct answer: How would attention re-weighting affect
the generation of an edited image in diffusion models?
⃝ It adjusts the emphasis on specific features in the image
⃝ It alters the style of the image without affecting the content
⃝ It changes the random seed influencing the image generation
⃝ It increases the resolution of the generated image
12.6. (3 points) Select the correct answer: The trade-off in Prompt-to-Prompt image
editing between fidelity to the edited prompt and the source image is controlled by:
⃝ Adjusting the image contrast settings
⃝ The number of injection timesteps
⃝ Varying the amount of Gaussian noise added
⃝ The depth of the U-Net used
12.7. (4 points) Select the correct answer: The process of adjusting attention weights
in image editing via text prompts is analogous to which of the following?
⃝ Fine-tuning the hyperparameters of a neural network
⃝ Adjusting the focus in a photographic image
⃝ Changing the storyline in a movie script
⃝ Resizing an image while maintaining aspect ratio

13 Scaling Laws (4 points)


13.1. Select One: What is the relationship between compute budget and data filtering,
as suggested by recent scaling laws?
⃝ More compute always requires more data filtering
⃝ As compute increases, less data filtering is needed
⃝ Data filtering is independent of compute budget
⃝ More compute requires exponentially more data filtering
13.2. Select One: What was the key finding regarding the balance between model size
and training data in the Hoffmann et al. study?
⃝ Increasing model size is more important than increasing training data
⃝ The optimal ratio is to increase parameters 8x for every 5x increase in
training data
⃝ Previous models were using too little data relative to their parameter
count
⃝ Convergence is not critical for good performance
13.3. (2 points) Short answer: Why is the MoE model more efficient?

13.4. (2 points) Short answer: Why is the MoE model difficult to train?
