Practice Exam Solutions
Instructions:
• Verify your name and Andrew ID above.
• This exam contains 26 pages (including this cover page).
The total number of points is 107.
• Clearly mark your answers in the allocated space. If you have made a mistake, cross
out the invalid parts of your solution, and circle the ones which should be graded.
• Look over the exam first to make sure that none of the 26 pages are missing.
• No electronic devices may be used during the exam.
• Please write all answers in pen or darkly in pencil.
• You have NA to complete the exam. Good luck!
Question Points
1. AutoDiff / RNN-LMs 11
2. Transformers and LLMs 13
3. Learning neural language models / Decoding 6
4. Pre-training, fine-tuning / Modern Transformers 13
5. Vision Transformers 9
6. Generative Adversarial Networks (GANs) 5
7. Variational Autoencoders (VAEs) 7
8. Diffusion Models 7
9. In-context Learning 6
10. RLHF 10
11. Text-to-image generation / CLIP 6
12. Prompt2Prompt 10
13. Scaling Laws 4
Total: 107
1.3. (2 points) Select multiple: Which of the following are true about Recurrent
Neural Networks (RNNs) in language modeling?
□ RNNs can process sequences of variable length.
□ RNNs maintain context through hidden states.
□ RNNs do not suffer from vanishing and exploding gradient problems.
□ RNNs are capable of capturing long-term dependencies in data, but may struggle with them without architecture modifications.
□ RNNs use fixed-size input windows to handle sequential data.
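For reference, a minimal PyTorch sketch (toy sizes, illustrative only) of how an RNN cell handles variable-length input by carrying context in its hidden state:

    import torch
    import torch.nn as nn

    rnn_cell = nn.RNNCell(input_size=8, hidden_size=16)   # toy dimensions

    def encode(sequence):                                  # sequence: (T, 8) for any length T
        h = torch.zeros(1, 16)                             # hidden state carries the context
        for x_t in sequence:                               # one recurrence step per token
            h = rnn_cell(x_t.unsqueeze(0), h)
        return h

    print(encode(torch.randn(5, 8)).shape)    # works for length 5 ...
    print(encode(torch.randn(12, 8)).shape)   # ... and length 12: no fixed-size window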
1.4. (3 points) Select all that apply:
⃝ One motivation of LSTMs is to improve the vanishing gradient problem
seen with RNNs.
⃝ It is not possible for LSTMs to experience the vanishing gradient problem.
⃝ It is possible for LSTMs to experience the vanishing gradient problem.
⃝ It is not possible for LSTMs to experience the exploding gradient problem.
⃝ It is possible for LSTMs to experience the exploding gradient problem.
1.5. (1 point) Select all that apply: Vanishing gradients cause gradient magnitudes to shrink toward zero as they are propagated back through many layers or time steps, stalling learning. Which of the following techniques are commonly used to mitigate vanishing gradients?
□ activation functions such as ReLU
□ batch normalization
□ layer normalization
□ gradient clipping
□ for an RNN: use LSTM or GRU as the recurrent units instead of simple RNNs
(a) (1 point) Short answer: What would the gradient of the sigmoid be for a
very large input?
(b) (1 point) Short answer: Is this a potential problem when training an RNN?
Explain.
(c) (1 point) Short answer: How does ReLU activation (ReLU(z) = max(0, z)) compare to sigmoid activation? Does it completely solve the problem above (if any)?
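For intuition on (a)-(c), a small numerical check (values are illustrative): the sigmoid's gradient σ'(z) = σ(z)(1 − σ(z)) is essentially zero for large |z|, one source of vanishing gradients when backpropagating through many RNN time steps; ReLU avoids saturation for positive inputs but has zero gradient for negative ones, so it does not remove the problem entirely.

    import torch

    z = torch.tensor([20.0, -20.0], requires_grad=True)

    torch.sigmoid(z).sum().backward()
    print(z.grad)             # ~[2e-9, 2e-9]: saturated, gradient effectively vanishes

    z.grad = None
    torch.relu(z).sum().backward()
    print(z.grad)             # [1., 0.]: no saturation for z > 0, but zero gradient for z < 0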
2.4. For the questions below, write the correct option from the list of models below:
1. An encoder model
2. A decoder model
3. A sequence to sequence model
(a) (1 point) Which of the above models would you use to complete prompts with
generated text?
(b) (1 point) Which of the above models would you use for summarizing text?
(c) (1 point) Which of the above models would you use for classifying text inputs
according to certain labels?
2.5. (1 point) Select the correct answer: What is one of the main computational
challenges in scaling Transformers for large language models?
⃝ The fixed size of the positional encoding limits the model’s ability to
process long sequences.
⃝ The quadratic complexity of the self-attention mechanism with respect to
sequence length.
⃝ Overfitting due to the large number of parameters.
⃝ The inability to parallelize model training.
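As a quick illustration of the quadratic cost (toy sizes assumed): the self-attention score matrix is T × T for a length-T sequence, so doubling the sequence length quadruples the number of scores to compute and store.

    import torch

    d = 64
    for T in (128, 256, 512):
        Q, K = torch.randn(T, d), torch.randn(T, d)
        scores = Q @ K.T / d ** 0.5          # (T, T) attention score matrix
        print(T, scores.numel())             # 16384, 65536, 262144: grows as T^2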
2.6. (1 point) Select the correct answer: In self-attention, each token of the se-
quence:
⃝ pays attention to itself
⃝ pays attention to all words in the sequence
⃝ pays attention to all other tokens in the sequence except itself
⃝ pays attention to a few tokens in its neighbourhood
2.7. (2 points) Select all that apply: Which of the following statements are true about LLM training?
□ Since LSTMs generate one token at a time, their training across time steps cannot be parallelized.
□ Since transformers generate one token at a time, their training across time steps cannot be parallelized.
□ If a sentence is padded with pad tokens, the transformer learns to generate the pad tokens.
□ None of the above
2.8. (1 point) True or False: Computing attention values in a local neighbourhood over the input is the same as applying a convolution filter over the input.
⃝ True
⃝ False
4.5. (2 points) Select all that are true: Which of the following are true regarding Prefix-Tuning?
□ Prefix-tuning involves pre-pending task-specific vectors to the input.
□ Prefix-tuning is typically used to train a single set of parameters that are shared across multiple tasks.
□ Prefix-tuning requires storing a separate tuned copy of the model for each task, leading to increased storage overhead as the number of tasks increases.
□ None of the above
4.6. (2 points) Select one: What is the primary purpose of Rotary Position Embed-
dings (RoPE) in a transformer model?
⃝ To improve the efficiency of inference for a transformer model.
⃝ To encode positional information in a way that is rotation invariant.
⃝ To reduce the dimensionality of the input data.
⃝ To improve the model’s ability to retain long-range dependencies.
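For reference, a simplified sketch of the idea behind RoPE (a toy version, not any model's exact implementation): pairs of query/key dimensions are rotated by angles proportional to the token's position, so dot products between rotated vectors depend only on the relative offset between positions.

    import torch

    def rope(x, pos, base=10000.0):
        # Pair dimension i with dimension i + d/2 and rotate each pair by a position-dependent angle.
        half = x.shape[-1] // 2
        freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
        cos, sin = torch.cos(pos * freqs), torch.sin(pos * freqs)
        x1, x2 = x[..., :half], x[..., half:]
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    q = torch.randn(8)
    # Both dot products below involve the same relative offset (3), so they are equal:
    print(torch.dot(rope(q, 2), rope(q, 5)))
    print(torch.dot(rope(q, 6), rope(q, 9)))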
4.7. (2 points) Select all that are true: What is the primary advantage of using
convolutional layers in neural networks, especially for image-related tasks?
⃝ Convolutional layers reduce the number of parameters by sharing weights
across the input.
⃝ Convolutional layers ensure the model is invariant to transformations like
rotation and scaling.
⃝ Convolutional layers allow the model to learn feature hierarchies by fo-
cusing on local spatial regions.
⃝ Convolutional layers remove noise from the input data during training.
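A quick parameter count illustrating the weight-sharing point (arbitrary sizes): the same small filters are reused at every spatial location, so a convolutional layer has far fewer parameters than a fully connected layer mapping between tensors of comparable size.

    import torch.nn as nn

    conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)   # filters shared across locations
    fc = nn.Linear(3 * 32 * 32, 16 * 30 * 30)                         # one weight per input-output pair

    count = lambda m: sum(p.numel() for p in m.parameters())
    print(count(conv))   # 448
    print(count(fc))     # 44,251,200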
4.8. (3 points) Select one: Which of the following is not true about auto-encoders?
⃝ Encoders use a non-linear activation function to compute an encoding
of the input, which the decoder uses to compute a reconstruction of the
input.
⃝ The output of an autoencoder is exactly the same as its input.
⃝ Autoencoders are an unsupervised learning technique.
⃝ Autoencoders can be trained by using the same backpropagation algo-
rithm used for a 2-layer neural network, where the loss is propagated
back from the reconstruction error to update both the encoder and de-
coder weights.
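For reference, a minimal autoencoder sketch (arbitrary dimensions): a non-linear encoder, a decoder that reconstructs the input, no labels, and standard backpropagation of the reconstruction error through both halves.

    import torch
    import torch.nn as nn

    encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())   # non-linear encoding of the input
    decoder = nn.Linear(32, 784)                              # reconstruction from the code
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

    x = torch.rand(64, 784)                    # unlabeled data: unsupervised
    x_hat = decoder(encoder(x))
    loss = ((x_hat - x) ** 2).mean()           # reconstruction error; output approximates, not equals, x
    loss.backward()                            # ordinary backprop through encoder and decoder
    opt.step()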
0 1 0
1 −4 1
0 1 0
⃝ 1
⃝ 2
⃝ -1
⃝ 0
7.5. (1 point) Select all that apply: Why must the noise scaling factor α_t in Denoising Diffusion Probabilistic Models (DDPMs) follow a schedule?
□ To ensure smooth addition of noise, preventing excessive corruption of data early on.
□ To allow the model to gradually learn how to denoise at different noise levels.
□ To ensure the reverse process can recover the original data distribution effectively.
□ To control the amount of noise added during each step, ensuring balanced corruption across all timesteps.
□ None of the above.
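For intuition, a sketch of one common scheduling choice (a linear β schedule; the endpoints are typical values, not required by the question): the per-step noise β_t is small early on, and the cumulative signal fraction ᾱ_t = ∏ α_t decays gradually, so corruption is spread over the timesteps rather than happening all at once.

    import torch

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)        # per-step noise (a common linear schedule)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)     # fraction of the original signal remaining

    print(alpha_bar[0], alpha_bar[T // 2], alpha_bar[-1])   # ~0.9999, then decaying toward ~0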
7.6. (1 point) Select One: Suppose you are training a neural network p(x) to approximate a known but hard-to-define probability distribution q(x). We choose as our objective function KL(q ∥ p) = E_q[log(q(x)/p(x))]. Which of the following describes how to select the optimal p(x) to minimize this objective function?
⃝ Select p(x) to be as large as possible in order to minimize q(x)/p(x).
⃝ Select p(x) to be as small as possible in order to maximize q(x)/p(x).
⃝ Select p(x) to be as close as possible to q(x) in order to make q(x)/p(x) = 1.
⃝ Select p(x) to be the uniform distribution so this term is independent of the parameters of the network and can be ignored.
7.7. (1 point) Select One: Let function f(x, N) be the function which takes an image
x and adds Gaussian noise to the image N times such that the output is the Nth
step of forward diffusion in a DDPM model. What is the time complexity of an
optimal implementation of this function? Assume h and w are the image height
and width.
⃝ O(h*w)
⃝ O(log(N)*h*w)
⃝ O(N*h*w)
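For reference, a hedged sketch of the relevant property: the forward process has the closed form x_N = sqrt(ᾱ_N) x_0 + sqrt(1 − ᾱ_N) ε with ε ~ N(0, I), so an optimal implementation can jump straight to step N with a single Gaussian sample, independent of N (assuming the ᾱ schedule is precomputed).

    import torch

    T = 1000
    alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, T), dim=0)   # precomputed schedule

    def forward_diffuse(x0, n):
        # Jump directly to step n of the forward process: cost O(h*w), independent of n.
        eps = torch.randn_like(x0)
        return alpha_bar[n - 1].sqrt() * x0 + (1.0 - alpha_bar[n - 1]).sqrt() * eps

    x0 = torch.rand(3, 64, 64)                 # toy (channels, h, w) image
    print(forward_diffuse(x0, 500).shape)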
10.5. (2 points) Select the correct answer: Which of the following best describes the primary goal of chain-of-thought prompting?
⃝ To improve the model’s ability to generate coherent and engaging stories
⃝ To enhance the model’s performance on tasks that require multi-step rea-
soning and problem-solving
⃝ To reduce the computational resources required for training large language
models
⃝ To increase the model’s capacity to understand and respond to emotional
cues in natural language
10.6. (2 points) Select all that are true: In the context of Reinforcement Learning
with Human Feedback (RLHF), what steps are involved in the process?
□ Training a reward model based on human rankings of model-generated responses.
□ Using reinforcement learning to train the model with rewards as "ground truth" for desirable outputs.
□ Collecting prompts and generating multiple responses for each to be ranked by humans.
□ Applying unsupervised learning techniques to automatically generate training data without human intervention.
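For reference, a minimal sketch of the reward-model step (the scorer network and embedding sizes are stand-ins, not a specific implementation): given a human-preferred and a rejected response to the same prompt, the reward model is trained with a pairwise ranking loss so the preferred response receives the higher score.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    reward_model = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 1))  # stand-in scorer

    chosen, rejected = torch.randn(4, 768), torch.randn(4, 768)     # toy response embeddings
    r_chosen, r_rejected = reward_model(chosen), reward_model(rejected)

    # Pairwise (Bradley-Terry style) loss: push the preferred response's reward above the rejected one's.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    loss.backward()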
10.7. (2 points) Select all that are true: Which of the following statements accurately
describe aspects of Instruction Fine-Tuning?
□ It aims to reduce the perplexity of a large training corpus by predicting the most likely next words in a sequence.
□ It involves fine-tuning Language Models (LMs) on a dataset specifically created to align the model’s output with human expectations for given tasks.
□ It uses sources of prompts and responses to build a "chat agent" training dataset.
□ It incorporates multiple names such as chat fine-tuning, alignment, and behavioral fine-tuning.
11.5. (1 point) True or False: In the LDM framework, the latent representations from
the VAE can be directly fed to the diffusion UNet.
⃝ True
⃝ False
12.5. (2 points) Select the correct answer: How would attention re-weighting affect
the generation of an edited image in diffusion models?
⃝ It adjusts the emphasis on specific features in the image
⃝ It alters the style of the image without affecting the content
⃝ It changes the random seed influencing the image generation
⃝ It increases the resolution of the generated image
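As a toy illustration of attention re-weighting (shapes and scale factor are made up; this is not the paper's exact code): scaling the cross-attention weights assigned to one prompt token changes how strongly that token's feature is expressed at every spatial location of the generated image.

    import torch

    n_pixels, n_tokens = 16 * 16, 6
    attn = torch.softmax(torch.randn(n_pixels, n_tokens), dim=-1)   # cross-attention: pixels -> prompt tokens

    token_idx, scale = 3, 2.5            # e.g. put more emphasis on one word of the prompt
    reweighted = attn.clone()
    reweighted[:, token_idx] *= scale    # that token now contributes more to every pixel's features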
12.6. (3 points) Select the correct answer: The trade-off in Prompt-to-Prompt image
editing between fidelity to the edited prompt and the source image is controlled by:
⃝ Adjusting the image contrast settings
⃝ The number of injection timesteps
⃝ Varying the amount of Gaussian noise added
⃝ The depth of the U-Net used
12.7. (4 points) Select the correct answer: The process of adjusting attention weights
in image editing via text prompts is analogous to which of the following?
⃝ Fine-tuning the hyperparameters of a neural network
⃝ Adjusting the focus in a photographic image
⃝ Changing the storyline in a movie script
⃝ Resizing an image while maintaining aspect ratio
13.4. (2 points) Short answer: Why is the Mixture-of-Experts (MoE) model difficult to train?
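For intuition, a toy mixture-of-experts routing sketch (sizes and the auxiliary loss form are illustrative, in the spirit of the load-balancing losses used in switch-style MoEs): hard top-1 routing is a discrete choice, so only the selected expert receives gradient for each token, and without an auxiliary balancing term a few experts tend to absorb most tokens while the rest are rarely updated.

    import torch
    import torch.nn as nn

    n_experts, d = 4, 16
    router = nn.Linear(d, n_experts)
    experts = nn.ModuleList([nn.Linear(d, d) for _ in range(n_experts)])

    x = torch.randn(32, d)
    probs = torch.softmax(router(x), dim=-1)
    top_p, top_idx = probs.max(dim=-1)                       # hard top-1 routing: a discrete choice
    out = torch.stack([experts[int(i)](x[j]) for j, i in enumerate(top_idx)])
    out = top_p.unsqueeze(-1) * out                          # only the chosen experts get gradient

    # Auxiliary load-balancing term: penalize routing most tokens to a few experts.
    load = torch.bincount(top_idx, minlength=n_experts).float() / len(top_idx)
    balance_loss = n_experts * (load * probs.mean(dim=0)).sum()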