CS388N Practice Questions Answers
Question
Suppose you are working with a binary perceptron classifier and have the fol-
lowing dataset. The feature vectors are derived using a bag-of-words vocabulary
[happy, sad, okay]:
• Example 1: x1 = ”happy happy okay”, y1 = +
• Example 2: x2 = ”sad sad happy”, y2 = −
• Example 3: x3 = ”okay okay happy”, y3 = +
• Example 4: x4 = ”sad sad okay”, y4 = −
(a) Write down the feature vector for each of these examples using the pro-
vided vocabulary. Use the order [happy, sad, okay] when listing your fea-
ture vectors. Provide your answer as a comma-separated list of vectors,
like [x11 , x12 , x13 ], [x21 , x22 , x23 ], . . . .
(b) Run one epoch of the perceptron algorithm on this data, in order. Initialize
the weight vector w = [0, 0, 0] and use the decision rule w⊤ f (x) ≥ 0 (where
a score of 0 is classified as positive). Report the final weight vector at the
end of the epoch in the format [w1 , w2 , w3 ].
(c) Now assume you’re using subword tokenization with the subword vocab-
ulary {h, ap, py, sa, d, o, kay}.
(i) Tokenize each word using this subword vocabulary (apply greedy
tokenization).
(ii) Write down the new bag-of-words vocabulary (replacing the original).
(iii) Provide the feature vectors for each example using this subword vo-
cabulary.
(d) If you encountered the word ”sadokay” with no space, what would its
segmentation be under the subword vocabulary {h, ap, py, sa, d, o, kay}?
(e) For handling noisy data with possible word concatenations or typos, which
approach would likely be more effective for improving classification accu-
racy and robustness: (1) subword tokenization, or (2) repairing typos using
the word with the lowest edit distance from the vocabulary? Provide a
one-sentence justification for your choice.
Answer
(a) Feature vectors using the bag-of-words vocabulary [happy,
sad, okay]
We can construct feature vectors by counting the occurrences of each word in
the sentence. The order of the vocabulary is [happy, sad, okay].
• Example 1: ”happy happy okay”
Vector: [2, 0, 1]
• Example 2: ”sad sad happy”
Vector: [1, 2, 0]
• Example 3: ”okay okay happy”
Vector: [1, 0, 2]
• Example 4: ”sad sad okay”
Vector: [0, 2, 1]
The feature vectors are:
[2, 0, 1], [1, 2, 0], [1, 0, 2], [0, 2, 1]
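Part (b) runs one perceptron epoch over these vectors; a minimal sketch of that pass, assuming the standard mistake-driven update (w ← w + f(x) on a missed positive, w ← w − f(x) on a missed negative) and treating a score of exactly 0 as positive, as the question states:

    # One epoch of the binary perceptron on the bag-of-words vectors above.
    examples = [([2, 0, 1], +1), ([1, 2, 0], -1), ([1, 0, 2], +1), ([0, 2, 1], -1)]
    w = [0, 0, 0]

    for x, y in examples:
        score = sum(wi * xi for wi, xi in zip(w, x))
        pred = +1 if score >= 0 else -1          # a score of 0 counts as positive
        if pred != y:                            # mistake-driven update
            w = [wi + y * xi for wi, xi in zip(w, x)]

    print(w)  # expected final weights after the epoch: [0, -2, 2]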
(c) Subword tokenization and feature vectors
The subword vocabulary is {h, ap, py, sa, d, o, kay}.
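A minimal sketch of the tokenization in parts (c) and (d), assuming "greedy" means longest-prefix match at each position:

    # Greedy longest-prefix-match tokenization over the given subword vocabulary.
    VOCAB = {"h", "ap", "py", "sa", "d", "o", "kay"}

    def greedy_tokenize(word: str) -> list[str]:
        tokens, i = [], 0
        while i < len(word):
            # take the longest vocabulary entry that matches at position i
            match = max((s for s in VOCAB if word.startswith(s, i)), key=len, default=None)
            if match is None:
                raise ValueError(f"cannot tokenize {word!r} at position {i}")
            tokens.append(match)
            i += len(match)
        return tokens

    for w in ["happy", "sad", "okay", "sadokay"]:
        print(w, greedy_tokenize(w))
    # happy -> ['h', 'ap', 'py'], sad -> ['sa', 'd'], okay -> ['o', 'kay'],
    # sadokay -> ['sa', 'd', 'o', 'kay']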
Question: Logistic Regression and Gradient Descent
You are training a logistic regression model to classify sentiment as positive or
negative based on a set of features. The training dataset consists of the following
feature vectors:
• x1 = [1, 0.5], y1 = +
• x2 = [0.3, 0.8], y2 = −
• x3 = [0.6, 0.4], y3 = +
(a) (4 points) Write down the logistic regression formula for predicting the
probability that a given input belongs to the positive class. Use the feature
vector x and weight vector w = [w1 , w2 ].
(b) (6 points) Derive the gradient update rule for logistic regression using one
example from the dataset (e.g., x1 ) and apply stochastic gradient descent
for one iteration. Assume the current weight vector w = [0, 0] and learning
rate η = 0.1. Show your calculations.
(c) (4 points) Explain the impact of different step sizes on the convergence of
gradient descent for this logistic regression model. What would happen
with a step size that is too large or too small?
Answer
(a) Logistic Regression Formula for Predicting Probability
The logistic regression model predicts the probability that a given input x be-
longs to the positive class using the following formula:
P(y = +1 | x) = σ(w⊤x) = 1 / (1 + e^(−w⊤x))
Where:
• w = [w1 , w2 ] is the weight vector
• x = [x1 , x2 ] is the feature vector
• σ(z) = 1 / (1 + e^(−z)) is the sigmoid function, which outputs a probability between 0 and 1.
For a specific feature vector x = [x1 , x2 ] and weight vector w = [w1 , w2 ], the
probability that the input belongs to the positive class is:
P(y = +1 | x) = 1 / (1 + e^(−(w1 x1 + w2 x2)))
(b) Gradient Update Rule for Logistic Regression and SGD
We will now derive the gradient update rule for logistic regression using the first
example x1 = [1, 0.5], y1 = +1, and apply stochastic gradient descent (SGD)
for one iteration.
Step 1: Compute the prediction
First, calculate the predicted probability for example 1 using the current
weight vector w = [0, 0].
P(y = +1 | x1) = 1 / (1 + e^(−(w1·1 + w2·0.5))) = 1 / (1 + e^(−(0·1 + 0·0.5))) = 1 / (1 + e^0) = 1/2 = 0.5
So, the predicted probability for x1 is 0.5.
Step 2: Compute the gradient
The gradient of the loss function with respect to the weights w1 and w2 for
logistic regression is given by:
∂L/∂wi = (P(y = +1 | x) − y) · xi
Where xi is the i-th feature of the input example x1 .
For example 1, where y1 = +1, the gradients for w1 and w2 are:
∂L/∂w1 = (0.5 − 1) · 1 = −0.5
∂L/∂w2 = (0.5 − 1) · 0.5 = −0.25
Step 3: Update the weights using SGD
The weight update rule in SGD is:
wi ← wi − η · ∂L/∂wi
Where η = 0.1 is the learning rate.
For w1 and w2, the updates are:
w1 = 0 − 0.1 · (−0.5) = 0.05,  w2 = 0 − 0.1 · (−0.25) = 0.025
So the updated weight vector is w = [0.05, 0.025].
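A minimal sketch of this single SGD step on x1, following the sigmoid, gradient, and update formulas derived above:

    import math

    # One stochastic gradient descent step of logistic regression on x1 = [1, 0.5], y1 = +1.
    x, y = [1.0, 0.5], 1.0
    w, eta = [0.0, 0.0], 0.1

    p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))  # P(y=+1|x) = 0.5
    grad = [(p - y) * xi for xi in x]                                  # [-0.5, -0.25]
    w = [wi - eta * gi for wi, gi in zip(w, grad)]                     # gradient step

    print(p, grad, w)  # 0.5, [-0.5, -0.25], [0.05, 0.025]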
(c) Impact of Different Step Sizes on Gradient Descent Convergence
If the step size is too large:
• A large step size can cause the gradient updates to overshoot the opti-
mal point, leading to oscillations or divergence, where the weights do not
converge to the minimum loss. This can make the model unstable and
prevent it from learning effectively.
If the step size is too small:
• A small step size will make the convergence very slow. The model will
make tiny updates to the weights, meaning it will take many iterations
to reach the minimum loss. While it ensures that the model converges, it
might be inefficient and time-consuming.
Ideally, the step size should be tuned to balance the speed of convergence
with stability, ensuring the model reaches the optimal weights in a reasonable
time without overshooting.
Question: Word Embeddings and Bias
You are training word embeddings using the skip-gram model. Consider the
following three words: ”doctor,” ”nurse,” and ”hospital.”
Word embeddings:
vdoctor = [0.9, 0.4]
vnurse = [0.7, 0.3]
vhospital = [0.6, 0.2]
(a) (4 points) Define the objective function for training word embeddings
using the skip-gram model. What is the role of the context window?
(b) (6 points) Calculate the cosine similarity between ”doctor” and ”nurse,”
and between ”doctor” and ”hospital.” Based on these similarities, which
pair of words is more similar in the embedding space?
(c) (5 points) Suppose you discover a gender bias in the word embeddings. For
example, ”doctor” is more similar to male-associated words, and ”nurse” is
more similar to female-associated words. How could you mitigate this bias
in the word embeddings? Provide two potential approaches and briefly
describe their pros and cons.
Answer
(a) Objective Function for the Skip-gram Model
The objective function for the skip-gram model is designed to maximize the
probability of the context words given a target word. More formally, for a given
target word wt and its context words wt−c, . . . , wt+c, the objective is to maximize:
Σ_{t=1..T} Σ_{−c ≤ j ≤ c, j ≠ 0} log P(wt+j | wt)
Where:
P(wcontext | wtarget) = exp(vcontext · vtarget) / Σ_w exp(vw · vtarget)
Where vtarget and vcontext are the word vectors for the target and context
words.
Role of the Context Window: The context window defines how many
words around the target word are considered when learning word embeddings.
A larger window size allows the model to capture broader semantic relationships
(longer-range dependencies) between words, while a smaller window focuses on
more immediate, local relationships between words.
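Part (b) asks for the cosine similarities between these embeddings; a minimal sketch of that computation:

    import math

    # Cosine similarity between the 2-d word embeddings given above.
    v = {"doctor": [0.9, 0.4], "nurse": [0.7, 0.3], "hospital": [0.6, 0.2]}

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    print(cosine(v["doctor"], v["nurse"]))     # ~0.9999
    print(cosine(v["doctor"], v["hospital"]))  # ~0.9954
    # "doctor" and "nurse" are the (slightly) more similar pair in this space.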
(c) Mitigating Gender Bias in Word Embeddings
Gender bias in word embeddings can lead to ”doctor” being more similar to
male-associated words and ”nurse” more similar to female-associated words.
Here are two approaches to mitigate this bias:
Approach 1: Debiasing through Neutralizing and Equalizing
This method, introduced by Bolukbasi et al., involves:
• Neutralizing: Identify gender-biased dimensions in the embedding space
and remove them from words that should be gender-neutral (e.g., ”doc-
tor,” ”nurse”). This is done by projecting the word vector onto a subspace
orthogonal to the gender direction.
• Equalizing: Ensure that gendered word pairs like (”he,” ”she”) are
equidistant from neutral words like ”doctor” and ”nurse.”
Pros:
Question: Self-Attention and Transformers
Consider a Transformer model with the following input sequence: [”I”, ”love”,
”NLP”]. The word embeddings for each token are as follows:
eI = [1, 0]
elove = [0, 1]
eNLP = [1, 1]
Answer
Let’s work through each part of the question on Self-Attention and Transformers
step by step.
• For each word, compute the attention score between its query and the
keys of all other words in the sequence using a dot product.
• The attention scores determine how much focus should be given to other
words when computing the word’s new representation.
• The final output for each word is a weighted sum of the value vectors,
with weights given by the attention scores.
eI = [1, 0]
elove = [0, 1]
eNLP = [1, 1]
The self-attention scores between two words are calculated using the dot
product of their embeddings. For simplicity, we will skip scaling and softmax.
Step 1: Attention score between ”I” and ”I”
Step 9: Attention score between ”NLP” and ”NLP”
Score(eNLP , eNLP ) = [1, 1] · [1, 1] = 1 · 1 + 1 · 1 = 2
Summary of Attention Scores: The attention scores between the tokens
are:
(rows: query token, columns: key token)
         I    love   NLP
I        1      0      1
love     0      1      1
NLP      1      1      2
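A minimal sketch that reproduces this score matrix (raw dot products, with scaling and softmax skipped as stated above):

    # Raw dot-product self-attention scores for the embeddings e_I, e_love, e_NLP.
    tokens = ["I", "love", "NLP"]
    E = {"I": [1, 0], "love": [0, 1], "NLP": [1, 1]}

    for a in tokens:
        row = [sum(x * y for x, y in zip(E[a], E[b])) for b in tokens]
        print(a, row)
    # I [1, 0, 1]
    # love [0, 1, 1]
    # NLP [1, 1, 2]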
Question: Hidden Markov Models and POS Tagging
Suppose you are using a Hidden Markov Model (HMM) to perform part-of-
speech (POS) tagging. You have the following sentences and tags:
Sentence: ”The cat sleeps”
Tags: ”Det Noun Verb”
You are given the following probabilities:
Transition probabilities:
(a) (5 points) Use the Viterbi algorithm to calculate the most probable se-
quence of tags for the sentence ”The cat sleeps.” Show each step of your
calculations, including the initialization, recursion, and backtracking steps.
(b) (5 points) What are some challenges in applying HMMs to POS tagging in
real-world scenarios? Explain at least two challenges and how they might
be addressed.
Answer
(a) Viterbi Algorithm for POS Tagging
We are tasked with finding the most probable sequence of POS tags for the
sentence ”The cat sleeps” using the Viterbi algorithm. The possible tags are
”Det”, ”Noun”, and ”Verb”.
Given Probabilities:
Transition probabilities:
P (sleeps | Verb) = 0.7
Step 1: Initialization
We start by initializing the first time step t = 1, which corresponds to the
first word ”The”.
V1 (Noun) = 0, V1 (Verb) = 0
Step 2: Recursion
We now calculate the probabilities for the remaining words in the sentence:
For t = 2, the word is ”cat”:
We need to calculate the most probable transition to each possible tag (Noun,
Verb) from the previous tags:
Transition from ”Det” to ”Noun” and emitting ”cat”:
(b) Challenges in Applying HMMs to POS Tagging in Real-World Scenarios
There are several challenges in using HMMs for POS tagging in real-world sce-
narios:
1. Limited Context Window:
Challenge: HMMs assume that the current tag only depends on the pre-
vious tag (Markov assumption), which means they have a limited ability to
capture long-range dependencies between words in a sentence.
Solution: This challenge can be addressed by using more advanced models
like Conditional Random Fields (CRFs) or Recurrent Neural Networks (RNNs)
that can take a larger context or even the entire sentence into account when
predicting tags.
2. Data Sparsity:
Challenge: HMMs rely on estimating transition and emission probabili-
ties from the training data. In real-world data, some word-tag pairs or tag
transitions may rarely (or never) occur in the training data, leading to sparse
estimates or zero probabilities.
Solution: This can be addressed using techniques such as smoothing to
assign small non-zero probabilities to unseen events. Additionally, modern
methods such as neural networks and pre-trained embeddings (e.g., BERT)
can handle rare or unseen words more effectively by leveraging distributed rep-
resentations.
These challenges highlight the limitations of HMMs and motivate the need
for more powerful models in real-world POS tagging tasks.
Question: Multiclass Perceptron
You have the following training points for a three-class classification problem;
points are listed as (x1 , x2 , y), where y represents one of the three classes:
{1, 2, 3}:
(a) (6 points) Apply one pass of the multiclass perceptron algorithm on this
data, starting from weight vectors initialized at 0 for each class. Assume
the decision rule is to assign a point to the class with the highest dot
product between the weight vector and the feature vector. After processing
all the points, report the final weight vectors for each class. Show each
step of the update process.
(b) (4 points) Describe a scenario in which multiclass perceptron might not
converge on a dataset. Why would this happen, and how could you modify
the algorithm or dataset to address this issue?
Answer
(a) One Pass of the Multiclass Perceptron Algorithm
The multiclass perceptron algorithm uses a weight vector for each class. We
initialize the weight vectors for classes 1, 2, and 3 as:
Predicted class:
Compute the dot products:
Since all dot products are equal, we assume ties are broken arbitrarily, so let’s
predict class 1 (which is correct).
Update: No update is needed because the predicted class is correct.
Step 2: Process (0, 1, 2)
Current weight vectors:
w1 = [0, 0], w2 = [0, 0], w3 = [0, 0]
Predicted class:
Compute the dot products:
w1 · [0, 1] = 0, w2 · [0, 1] = 0, w3 · [0, 1] = 0
Again, ties are broken arbitrarily, so let’s predict class 1.
Update:
The correct class is 2, but the model predicted 1, so we update the weights:
w1 = [0, 0] − [0, 1] = [0, −1]
w2 = [0, 0] + [0, 1] = [0, 1]
Step 3: Process (1, 1, 3)
Current weight vectors:
w1 = [0, −1], w2 = [0, 1], w3 = [0, 0]
Predicted class:
Compute the dot products:
w1 · [1, 1] = 0 · 1 + (−1) · 1 = −1
w2 · [1, 1] = 0 · 1 + 1 · 1 = 1
w3 · [1, 1] = 0 · 1 + 0 · 1 = 0
The model predicts class 2 because it has the highest dot product (1).
Update:
The correct class is 3, but the model predicted 2, so we update the weights:
w2 = [0, 1] − [1, 1] = [−1, 0]
w3 = [0, 0] + [1, 1] = [1, 1]
Step 4: Process (0, 0, 1)
Current weight vectors:
w1 = [0, −1], w2 = [−1, 0], w3 = [1, 1]
Predicted class:
Compute the dot products:
w1 · [0, 0] = 0, w2 · [0, 0] = 0, w3 · [0, 0] = 0
The model predicts class 1 (arbitrary tie-breaking), which is correct.
Update: No update is needed because the predicted class is correct.
Final Weight Vectors:
w1 = [0, −1], w2 = [−1, 0], w3 = [1, 1]
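A minimal sketch of this pass; the first training point is a placeholder (1, 0) with label 1 (any point behaves the same at step 1, since all scores are zero under the initial weights and it triggers no update), and ties are broken toward the lowest-numbered class:

    # One pass of the multiclass perceptron over the points processed above.
    points = [([1, 0], 1), ([0, 1], 2), ([1, 1], 3), ([0, 0], 1)]  # first point is a placeholder
    w = {c: [0, 0] for c in (1, 2, 3)}

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    for x, y in points:
        pred = min(w, key=lambda c: (-dot(w[c], x), c))        # ties -> lowest class id
        if pred != y:
            w[y] = [wi + xi for wi, xi in zip(w[y], x)]        # boost the correct class
            w[pred] = [wi - xi for wi, xi in zip(w[pred], x)]  # demote the predicted class

    print(w)  # expected: {1: [0, -1], 2: [-1, 0], 3: [1, 1]}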
(b) When Might the Multiclass Perceptron Not Converge?
The multiclass perceptron might not converge on a dataset in the following
scenario:
Non-linearly separable data: If the dataset is not linearly separable, no set of linear weight vectors can classify every training point correctly, so the perceptron keeps making mistakes and updating its weights indefinitely rather than converging. This can be addressed by adding richer features (or applying a kernel) so that the classes become separable, or by modifying the algorithm, for example by averaging the weights over iterations or capping the number of epochs.
Question: Word Embeddings and Skip-Gram
Consider the following word embedding vectors:
(a) (3 points) Using the skip-gram model, compute the probability of the
context word ”dog” given the word ”cat.” Assume a softmax over the
context words ”dog” and ”animal.” Use the dot product for similarity and
leave your answer in terms of exponentials (you do not need to compute
the actual values).
(b) (3 points) Suppose you now extend the vocabulary with two new words:
”elephant” and ”mouse.” Briefly describe how increasing the vocabulary
affects the computation of the softmax and propose a more efficient alter-
native for handling larger vocabularies.
Answer
(a) Probability of the Context Word ”dog” Given the Word
”cat” Using the Skip-Gram Model
In the skip-gram model, the probability of a context word wcontext given a target
word wtarget is calculated using a softmax function over the dot product of their
embeddings.
For this question, you are asked to compute the probability P (wcontext =
dog | wtarget = cat), using a softmax over two context words: ”dog” and ”ani-
mal.”
The formula for the softmax probability is:
P(dog | cat) = exp(vcat · vdog) / (exp(vcat · vdog) + exp(vcat · vanimal))
Where:
Step 2: Plug the dot products into the softmax formula
P(dog | cat) = exp(0.36) / (exp(0.36) + exp(0.47))
This expression gives the probability of the context word ”dog” given the
target word ”cat” in terms of exponentials.
P(wcontext | w1, w2) = exp((vw1 + vw2) · vwcontext) / Σ_{w′ ∈ V} exp((vw1 + vw2) · vw′)
vcat dog = vcat + vdog = [0.5, 0.3] + [0.6, 0.2] = [1.1, 0.5]
Advantages:
• The model is able to represent bigrams without needing separate embed-
dings for every possible word pair, which can reduce memory requirements.
• The model can generalize better since bigrams are built from individual
word embeddings, and it can predict new bigrams not seen during training.
Disadvantages:
• The sum of the embeddings might not always perfectly capture the mean-
ing of the bigram, especially for phrases where the meaning changes dras-
tically when words are combined (e.g., idiomatic expressions).
• This representation may lose some nuanced meaning that would be cap-
tured by a more complex model for bigram relationships.
Question: Hidden Markov Models (HMM)
You are building an HMM to model the sequence of weather states (Sunny,
Rainy) based on observations of outdoor activities (Walk, Shop, Clean). You
are given the following probabilities:
Transition probabilities:
(a) (4 points) Use the forward algorithm to calculate the probability of ob-
serving the sequence [Walk, Shop] if the initial state is Sunny. Show each
step of the forward pass.
(b) (4 points) Use the Viterbi algorithm to find the most probable sequence
of weather states for the observation sequence [Walk, Shop]. Show your
calculations, including initialization, recursion, and backtracking.
(c) (4 points) Discuss two limitations of the HMM when applied to real-world
tasks like speech recognition or POS tagging. How do more advanced
models, such as conditional random fields (CRFs), address these limita-
tions?
Answer
(a) Forward Algorithm for the Sequence [Walk, Shop]
The forward algorithm is used to compute the probability of observing a se-
quence of observations. Here, we are given the initial state as Sunny and asked
to calculate the probability of observing the sequence [Walk, Shop].
Step 1: Define the Problem
States (weather): Sunny (S), Rainy (R)
Observations (activities): Walk (W), Shop (S)
Initial state: Sunny
Step 2: Given Probabilities
Transition probabilities:
P (Rainy | Rainy) = 1 − 0.4 = 0.6
Emission probabilities:
For Rainy:
f1 (Rainy) = P (Rainy)·P (Walk | Rainy) = 0 (no initial probability given for Rainy)
Recursion (t = 2, Shop):
For Sunny at t = 2:
f2 (Sunny) = [f1 (Sunny)·P (Sunny | Sunny)+f1 (Rainy)·P (Sunny | Rainy)]·P (Shop | Sunny)
f2 (Rainy) = [f1 (Sunny)·P (Rainy | Sunny)+f1 (Rainy)·P (Rainy | Rainy)]·P (Shop | Rainy)
For Rainy:
For Sunny:
v2 (Sunny) = max[v1 (Sunny)·P (Sunny | Sunny), v1 (Rainy)·P (Sunny | Rainy)]·P (Shop | Sunny)
v2 (Rainy) = max[v1 (Sunny)·P (Rainy | Sunny), v1 (Rainy)·P (Rainy | Rainy)]·P (Shop | Rainy)
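A minimal sketch of the forward and Viterbi recursions above; the transition and emission values used here are illustrative placeholders rather than the question's tables, so only the structure (not the numbers) carries over:

    # Forward and Viterbi passes for a 2-state HMM on the observations [Walk, Shop].
    # NOTE: these probability tables are illustrative placeholders, not the question's
    # values; only the deterministic start in Sunny is taken from the answer above.
    states = ["Sunny", "Rainy"]
    start = {"Sunny": 1.0, "Rainy": 0.0}
    trans = {"Sunny": {"Sunny": 0.7, "Rainy": 0.3}, "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}
    emit = {"Sunny": {"Walk": 0.6, "Shop": 0.3}, "Rainy": {"Walk": 0.1, "Shop": 0.4}}
    obs = ["Walk", "Shop"]

    # Forward recursion: f_t(s) sums over all previous states.
    f = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for o in obs[1:]:
        f.append({s: sum(f[-1][p] * trans[p][s] for p in states) * emit[s][o] for s in states})
    print("P(observations) =", sum(f[-1].values()))

    # Viterbi recursion: v_t(s) maximizes over previous states and keeps backpointers.
    v = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        layer, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[-1][p] * trans[p][s])
            ptr[s] = prev
            layer[s] = v[-1][prev] * trans[prev][s] * emit[s][o]
        v.append(layer)
        back.append(ptr)

    # Backtracking from the best final state.
    path = [max(states, key=lambda s: v[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    print("best state sequence:", list(reversed(path)))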
Question: Training Neural Networks
You are training a feedforward neural network for binary classification. The
network has the following architecture:
• Input layer with two neurons
W2 = [0.5, 0.7]
(a) (5 points) Compute the output of the neural network for the given in-
put. Show your work step by step, including the application of the ReLU
activation in the hidden layer and the sigmoid activation in the output
layer.
(b) (4 points) Assume the network is trained using gradient descent. Derive
the gradient update for the weights in the output layer based on the pre-
dicted output and the true label y = 1. Use the binary cross-entropy loss
function.
(c) (3 points) Explain how improper weight initialization can impact the train-
ing process of a neural network. Why is it important to use techniques
like Xavier or He initialization?
Answer
(a) Compute the Output of the Neural Network
The architecture has two layers:
Step 1: Inputs and Weights
Input vector:
x = [1, 0.5]
Weights for the first layer:
W1 = [[0.2, 0.3], [0.4, 0.6]]
This means the first hidden neuron has weights [0.2, 0.4] and the second hidden
neuron has weights [0.3, 0.6].
Weights for the output layer:
W2 = [0.5, 0.7]
z1 = [0.2·1 + 0.4·0.5, 0.3·1 + 0.6·0.5] = [0.4, 0.6], and applying the ReLU activation (which leaves non-negative values unchanged) gives a1 = [0.4, 0.6]. The output-layer pre-activation is then z2 = W2 · a1 = 0.5·0.4 + 0.7·0.6 = 0.62.
The sigmoid function is defined as:
σ(z) = 1 / (1 + e^(−z))
Applying this to the output layer pre-activation value z2 = 0.62:
a2 = σ(0.62) = 1 / (1 + e^(−0.62)) ≈ 0.650
So, the final output of the neural network is approximately 0.650.
(b) Derive the Gradient Update for the Output Layer Weights
We are given that the true label is y = 1, and the predicted output from the
network is a2 ≈ 0.650. The loss function used is the binary cross-entropy loss, L = −[y log a2 + (1 − y) log(1 − a2)], which for y = 1 reduces to L = − log(a2) = − log(0.650).
Now, we calculate the gradient of the loss with respect to the weights in the
output layer W2 .
Step 1: Gradient of the Loss with Respect to the Output a2
For y = 1 the loss is L = − log(a2), so the gradient of the loss with respect to the output a2 is:
∂L/∂a2 = −1/a2 = −1/0.650 ≈ −1.538
Step 2: Gradient of the Output with Respect to z2
The gradient of the output a2 with respect to the pre-activation z2 is:
∂a2/∂z2 = a2(1 − a2) = 0.650 · (1 − 0.650) = 0.650 · 0.350 = 0.2275
Step 3: Gradient of the Loss with Respect to z2
Using the chain rule:
∂L/∂z2 = ∂L/∂a2 · ∂a2/∂z2 = −1.538 · 0.2275 ≈ −0.350
(Equivalently, for a sigmoid output trained with cross-entropy loss, ∂L/∂z2 = a2 − y = 0.650 − 1 = −0.350.)
Step 4: Gradient of the Loss with Respect to the Weights W2
The gradient of the loss with respect to the weights W2 is:
∂L/∂W2 = ∂L/∂z2 · a1
Where a1 = [0.4, 0.6]. Therefore:
∂L/∂W2,1 = −0.350 · 0.4 = −0.14
∂L/∂W2,2 = −0.350 · 0.6 = −0.21
Gradient Update Rule
Using the learning rate η, the weights W2 can be updated as:
W2(new) = W2(old) − η · ∂L/∂W2
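A minimal sketch of this forward pass and output-layer gradient, reproducing the numbers above:

    import math

    # Forward pass and output-layer gradient for the small network above.
    x = [1.0, 0.5]
    W1 = [[0.2, 0.4], [0.3, 0.6]]   # row i = weights of hidden neuron i
    W2 = [0.5, 0.7]
    y = 1.0

    z1 = [sum(w * xi for w, xi in zip(row, x)) for row in W1]   # [0.4, 0.6]
    a1 = [max(0.0, z) for z in z1]                              # ReLU
    z2 = sum(w * a for w, a in zip(W2, a1))                     # 0.62
    a2 = 1.0 / (1.0 + math.exp(-z2))                            # sigmoid, ~0.650

    # For sigmoid output with binary cross-entropy, dL/dz2 = a2 - y.
    dz2 = a2 - y                                                # ~ -0.350
    dW2 = [dz2 * a for a in a1]                                 # ~ [-0.14, -0.21]
    print(a2, dW2)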
Question 5: Transformers and Attention Mechanism
Consider a simple Transformer model with two tokens: ”apple” and ”fruit.”
The word embeddings for each token are as follows:
(b) (5 points) Explain how positional encodings are used in the Transformer
model and why they are necessary. What would happen if we removed
positional encodings from the model?
(c) (3 points) Describe the key difference between self-attention and recur-
rent neural networks (RNNs) when handling long sequences. Why is self-
attention preferred in tasks like machine translation?
Answer
(a) Self-Attention Mechanism and Attention Score Calculation
The self-attention mechanism in a Transformer allows each token in the input
sequence to attend to all other tokens, including itself, in order to capture
dependencies between them. The key components of self-attention are:
• Query (Q): Represents the word for which attention is being calculated.
• Key (K): Represents the words being compared against.
Attention(Q, K, V) = softmax(QK⊤ / √dk) V
Where:
• Q, K, and V are matrices for the queries, keys, and values, respectively.
• dk is the dimension of the query/key vectors, used for scaling.
For this specific question, we are given that the query, key, and value matrices
are identity matrices, so we can use the embeddings directly.
Dot-product attention scores (rows = query, columns = key):
          apple   fruit
apple      0.89    0.63
fruit      0.63    0.45
• i is the dimension index.
• d is the dimensionality of the model.
Why Positional Encodings Are Necessary
Without positional encodings, the Transformer would treat ”apple” and
”fruit” the same regardless of their order in the sequence, i.e., ”apple fruit”
would be identical to ”fruit apple.” Positional encodings ensure that the model
can differentiate between sequences with different word orders.
What Would Happen Without Positional Encodings?
• The Transformer would lose the ability to capture the order of the input
tokens.
• This would be problematic for tasks where word order matters, such as
machine translation, where changing the word order can alter the meaning
of a sentence.
Question: Constituency Parsing and PCFGs
You are working with the following Probabilistic Context-Free Grammar (PCFG):
S → NP VP (0.9)
S → VP (0.1)
NP → Det N (0.6)
NP → N (0.4)
VP → V NP (0.7)
VP → V (0.3)
Det → the (0.8)
Det → a (0.2)
N → cat (0.5)
N → dog (0.5)
V → sees (0.7)
V → likes (0.3)
(a) (4 points) Parse the sentence “the cat sees a dog” using this grammar.
Draw the tree structure that represents the parse, and calculate the prob-
ability of this parse using the given rule probabilities. Box your final
probability answer.
(b) (4 points) Show the steps of the CKY algorithm as it applies to parsing
the sentence “a dog sees the cat” using this PCFG. Fill in the CKY table
with the nonterminals that can be generated over each span and identify
the best parse.
(c) (3 points) Consider the sentence “dog sees”. How would binarization of
the grammar (i.e., converting it to Chomsky Normal Form) change the
parse process? Write down the binarized grammar and explain the effect
on parsing.
(d) (3 points) The PCFG model often struggles with ambiguous sentences,
where multiple parses are possible. Give an example of such a sentence
that would lead to two or more parses under this grammar. Explain why
the sentence is ambiguous and how the PCFG might choose between the
competing parses.
Step 1: Parse Tree Structure
The parse tree for the sentence ”the cat sees a dog” is as follows:
S → NP VP (0.9)
NP → Det N (0.6)
VP → V NP (0.7)
Det → the (0.8)
N → cat (0.5)
V → sees (0.7)
NP → Det N (0.6)
Det → a (0.2)
N → dog (0.5)
The resulting parse tree, written in bracketed form, is:
(S (NP (Det the) (N cat)) (VP (V sees) (NP (Det a) (N dog))))
P(parse) = 0.9 × 0.6 × 0.7 × 0.8 × 0.5 × 0.7 × 0.6 × 0.2 × 0.5 ≈ 0.00635
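A quick check of this product (multiplying the nine rule probabilities used in the tree):

    from math import prod

    # Probabilities of the rules used in the parse of "the cat sees a dog".
    rules = [0.9, 0.6, 0.7, 0.8, 0.5, 0.7, 0.6, 0.2, 0.5]
    print(prod(rules))  # ~0.00635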
Question: CYK Parsing Algorithm
You are tasked with parsing the sentence “dogs eat bones” using a simplified
grammar in Chomsky Normal Form (CNF):
• S → NP VP
• VP → V NP
• NP → Det N
• Det → the
• N → dogs | bones
• V → eat
a. (4 points) Fill out the CKY parsing table for this sentence. Include all
possible non-terminal rules that can be applied over each span of the sentence.
b. (5 points) Once the table is filled, trace the backpointers to construct the
most likely parse tree for “dogs eat bones”. Draw the parse tree or provide a
detailed bracketed structure for the tree.
c. (3 points) CKY parsing requires the grammar to be in Chomsky Normal
Form. Explain why CNF is necessary for CKY and how non-CNF rules (e.g.,
S → N P V P V P ) can be transformed into CNF.
d. (3 points) CKY parsing can be inefficient for long sentences. Suggest one
optimization technique to improve the speed of CKY parsing for large corpora
and explain how it works.
Answer
a. Fill out the CKY Parsing Table for the Sentence ”dogs
eat bones”
We are given the following grammar in Chomsky Normal Form (CNF):
• S → NP VP
• VP → V NP
• NP → Det N
• Det → the
• N → dogs | bones
• V → eat
The sentence to parse is ”dogs eat bones”. Each word corresponds to a
terminal in the grammar, and we will fill in the CKY table based on the grammar
rules.
Step 1: Initialize the Table. The CKY table is a triangular matrix where
each cell represents the non-terminal symbols that can generate the span of
words. The sentence has three words, so we will build a 3 × 3 table:
              dogs (1)    eat (2)    bones (3)
dogs (1)      N (dogs)
eat (2)                   V (eat)
bones (3)                            N (bones)
Step 2: Fill in the Table for Larger Spans. Next, we fill in the table by
combining smaller spans into larger spans.
For span (1, 2): ”dogs eat”. No valid non-terminal is produced here.
For span (2, 3): ”eat bones”. Treating the bare noun ”bones” as an NP (this assumes an implicit unary rule NP → N, since the CNF grammar above only has NP → Det N), we combine ”eat” (V) with ”bones” (NP), which generates VP → V NP:
              dogs (1)    eat (2)    bones (3)
dogs (1)      N (dogs)               S (dogs eat bones)
eat (2)                   V (eat)    VP (eat bones)
bones (3)                            N (bones)
For span (1, 3): ”dogs eat bones”. Combining ”dogs” (likewise treated as an NP) with ”eat bones” (VP) generates S → NP VP.
• V → eat: ”eat” is a V.
c. Why Chomsky Normal Form (CNF) is Necessary for
CKY Parsing
CKY parsing requires the grammar to be in Chomsky Normal Form (CNF),
which means that each production rule must have the form A → BC (two
non-terminals) or A → a (one terminal).
Why CNF is Necessary: CKY relies on the assumption that every pro-
duction rule either combines exactly two non-terminals (binary branching) or
generates a terminal. This allows the algorithm to efficiently fill in the parsing
table by recursively combining pairs of non-terminals.
Converting Non-CNF Rules: For a non-CNF rule such as S → NP VP VP, you can introduce a new non-terminal symbol X to break it into binary form:
S → NP X and X → VP VP
This transformation ensures that the rule conforms to CNF and can be used
in the CKY algorithm.
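A minimal sketch of CKY recognition over binary rules, applied to the small CNF grammar from part (a); for simplicity, the bare-noun-as-NP assumption noted above is folded into the lexicon by listing "dogs" and "bones" as NP entries:

    from itertools import product

    # CKY recognition for "dogs eat bones". Bare nouns are also entered as NP in the
    # lexicon, standing in for an NP -> N unary rule (an assumption, as noted above).
    lexicon = {"dogs": {"N", "NP"}, "bones": {"N", "NP"}, "eat": {"V"}, "the": {"Det"}}
    binary = {("NP", "VP"): "S", ("V", "NP"): "VP", ("Det", "N"): "NP"}

    def cky(words):
        n = len(words)
        chart = {(i, i + 1): set(lexicon[w]) for i, w in enumerate(words)}  # width-1 spans
        for width in range(2, n + 1):
            for i in range(n - width + 1):
                j = i + width
                cell = set()
                for k in range(i + 1, j):  # split point
                    for b, c in product(chart.get((i, k), ()), chart.get((k, j), ())):
                        if (b, c) in binary:
                            cell.add(binary[(b, c)])
                chart[(i, j)] = cell
        return chart

    chart = cky(["dogs", "eat", "bones"])
    print(chart[(1, 3)])  # {'VP'}  ("eat bones")
    print(chart[(0, 3)])  # {'S'}   (the whole sentence)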
Question: Multi-Head Attention in Transformers
You are given the following input tokens: ["time", "flies", "quickly"].
The word embeddings for each token are as follows:
Wq(2) = [[0.5, 0.5], [0.5, 0.5]],  Wk(2) = [[0.6, 0.4], [0.4, 0.6]],  Wv(2) = [[0.3, 0.7], [0.7, 0.3]]
(a) (5 points) For each head, compute the attention scores between the tokens
using the dot product of the transformed query and key vectors. Do not
apply softmax for this calculation.
(b) (5 points) Once you have computed the attention scores, apply the atten-
tion weights to the value vectors for each head. Compute the output of
the multi-head attention by concatenating the results from the two heads.
(c) (5 points) Explain how multi-head attention allows the model to focus on
different aspects of the input sequence and why this improves performance
compared to using a single attention head. Provide an example of how
this would help in a task like machine translation.
Answer
a. Compute the Attention Scores for Each Head
We are given three input tokens: "time", "flies", and "quickly" with the
following embeddings:
For Head 1:
Query matrix: Wq(1) = [[1, 0], [0, 1]],  Key matrix: Wk(1) = [[0, 1], [1, 0]]
For "quickly":
Wq(1) · equickly = [0.5, 0.8],  Wk(1) · equickly = [0.8, 0.5]
Transform the query and key vectors, and compute the dot products using
the same method as Head 1.
b. Apply the Attention Weights to the Value Vectors
Once we have computed the attention scores, we apply them to the value vectors
to get the weighted sum for each token.
For Head 1:
Wv(1) = [[0.5, 0.5], [0.5, 0.5]]
For each token, compute the weighted sum of the value vectors based on the
attention scores.
For Head 2:
Wv(2) = [[0.3, 0.7], [0.7, 0.3]]
Similarly, compute the weighted sum for Head 2.
Final Multi-Head Attention Output: To get the final output, we con-
catenate the results from both attention heads.
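A minimal sketch of the two-head computation in parts (a) and (b); the embeddings for "time" and "flies" are hypothetical placeholders (only equickly = [0.5, 0.8] is implied by the Head 1 projections above), so the printed numbers are purely illustrative:

    # Two-head attention sketch. Embeddings for "time" and "flies" are HYPOTHETICAL
    # placeholders; e_quickly = [0.5, 0.8] is implied by the Head 1 projections above.
    E = {"time": [0.4, 0.1], "flies": [0.2, 0.6], "quickly": [0.5, 0.8]}
    heads = [
        {"Wq": [[1, 0], [0, 1]], "Wk": [[0, 1], [1, 0]], "Wv": [[0.5, 0.5], [0.5, 0.5]]},
        {"Wq": [[0.5, 0.5], [0.5, 0.5]], "Wk": [[0.6, 0.4], [0.4, 0.6]], "Wv": [[0.3, 0.7], [0.7, 0.3]]},
    ]

    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    outputs = {t: [] for t in E}
    for h in heads:
        q = {t: matvec(h["Wq"], e) for t, e in E.items()}
        k = {t: matvec(h["Wk"], e) for t, e in E.items()}
        v = {t: matvec(h["Wv"], e) for t, e in E.items()}
        for t in E:
            scores = {u: dot(q[t], k[u]) for u in E}               # raw scores, no softmax (part a)
            out = [sum(scores[u] * v[u][i] for u in E) for i in range(2)]
            outputs[t].extend(out)                                 # concatenate the two heads

    print(outputs["quickly"])  # 4-dimensional concatenated output for "quickly"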
Question: Scaled Dot-Product Attention
The scaled dot-product attention mechanism is used in the Transformer model
to compute attention scores efficiently for long sequences. You are given the
following input sequence: ["apple", "is", "sweet"] with the following word
embeddings:
(a) (4 points) Compute the dot-product attention scores between the tokens
using the transformed query and key vectors. Then, apply the scaling
factor √1d , where d = 2 (the dimension of the embeddings).
(b) (4 points) Apply the softmax function to the scaled attention scores to
compute the final attention weights. Show the detailed calculation for the
softmax step.
(c) (4 points) Use the attention weights to compute the final weighted sum of
the value vectors for each token. Report the resulting vectors.
(d) (3 points) Explain why scaling the dot-product in attention is important,
especially for longer input sequences. What would happen if we did not
apply this scaling factor?
Answer
a. Compute the Dot-Product Attention Scores and Apply
Scaling
We are given the input sequence ["apple", "is", "sweet"] with the following
embeddings:
For ”apple”:
Wq · eapple = [[0.9, 0.2], [0.1, 0.8]] · [1.0, 0.8] = [0.98, 0.84],  Wk · eapple = [[0.7, 0.4], [0.3, 0.6]] · [1.0, 0.8] = [1.0, 0.88]
For ”is”:
Wq · eis = [[0.9, 0.2], [0.1, 0.8]] · [0.6, 0.9] = [0.69, 0.78],  Wk · eis = [[0.7, 0.4], [0.3, 0.6]] · [0.6, 0.9] = [0.75, 0.78]
For ”sweet”:
Wq · esweet = [[0.9, 0.2], [0.1, 0.8]] · [0.7, 1.0] = [0.73, 0.86],  Wk · esweet = [[0.7, 0.4], [0.3, 0.6]] · [0.7, 1.0] = [0.79, 0.82]
Scaled attention score for ”apple” and ”apple” = 1.7192 × 0.707 ≈ 1.215
Scaled attention score for ”apple” and ”is” = 1.3902 × 0.707 ≈ 0.983
Scaled attention score for ”apple” and ”sweet” = 1.463 × 0.707 ≈ 1.034
Step 1: Compute the Exponentials
For the token ”apple”, we compute the exponentials of the scaled scores:
exp(1.215) ≈ 3.37,  exp(0.983) ≈ 2.67,  exp(1.034) ≈ 2.81,  so the denominator is 3.37 + 2.67 + 2.81 ≈ 8.85.
Softmax for ”apple” and ”apple” = 3.37 / 8.85 ≈ 0.381
Softmax for ”apple” and ”is” = 2.67 / 8.85 ≈ 0.302
Softmax for ”apple” and ”sweet” = 2.81 / 8.85 ≈ 0.318
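A minimal sketch of the scaled-score-and-softmax computation for the ”apple” row, starting from the transformed query/key vectors listed above:

    import math

    # Scaled dot-product attention weights for the query "apple", using the transformed
    # query/key vectors reported above (d = 2, so the scale factor is 1/sqrt(2)).
    q_apple = [0.98, 0.84]
    keys = {"apple": [1.0, 0.88], "is": [0.75, 0.78], "sweet": [0.79, 0.82]}
    scale = 1.0 / math.sqrt(2)

    scores = {t: scale * sum(a * b for a, b in zip(q_apple, k)) for t, k in keys.items()}
    denom = sum(math.exp(s) for s in scores.values())
    weights = {t: math.exp(s) / denom for t, s in scores.items()}
    print(scores)   # ~ {'apple': 1.216, 'is': 0.983, 'sweet': 1.034}
    print(weights)  # ~ {'apple': 0.381, 'is': 0.302, 'sweet': 0.318}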
Question: Self-Attention with Masking in Decoders
You are working with a Transformer decoder that generates a sequence of words
one at a time. The model is currently generating the word "translation" after
having already generated "this is a". The word embeddings for the previous
tokens are:
ethis = [0.2, 0.5], eis = [0.6, 0.4], ea = [0.7, 0.3], etranslation = [0.9, 0.8]
(a) (4 points) Define the self-attention mechanism with masking as used in the
decoder. Explain why masking is necessary in the decoder’s self-attention
mechanism.
(b) (4 points) Compute the self-attention scores between the tokens with a
mask that prevents the model from attending to future tokens. Use the
dot-product between the embeddings of the current token ("translation")
and the previous tokens.
(c) (4 points) Apply the mask and compute the final attention scores for each
token. Show how the masking affects the attention distribution.
(d) (3 points) Explain how masking in the decoder self-attention helps prevent
information leakage during autoregressive generation. Why is this crucial
for tasks like machine translation?
Answer
a. Define the Self-Attention Mechanism with Masking and
Explain Why Masking Is Necessary
In a Transformer decoder, the self-attention mechanism works by allowing each
word (or token) in the sequence to attend to other words in the sequence. The
goal is to compute a weighted sum of the embeddings of other tokens, where
the weights are the attention scores (based on dot-products). The model can
use the context from previously generated words to inform the generation of the
next word.
Masked Self-Attention in Decoders: In the decoder, self-attention is
applied with a mask that prevents the model from attending to ”future” tokens
that haven’t been generated yet. For example, when generating a token at time
t, the model should only attend to tokens at positions 1 to t, and not tokens at
t + 1, t + 2, etc.
Why Masking Is Necessary:
• Preventing Future Information Access: In autoregressive models like the
Transformer decoder, each word is generated one at a time. Masking en-
sures that the model does not ”cheat” by looking at future words that it
has not yet generated. For example, while generating the word "translation",
the model should only attend to "this", "is", and "a" but not future
tokens.
• Ensuring Correct Sequence Generation: Masking enforces the causal struc-
ture of language generation by ensuring that the model only has access to
previously generated tokens, making the sequence generation process re-
alistic and valid for tasks like machine translation or text summarization.
ethis = [0.2, 0.5], eis = [0.6, 0.4], ea = [0.7, 0.3], etranslation = [0.9, 0.8]
softmax(zi) = e^(zi) / Σ_j e^(zj)
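Parts (b) and (c) ask for the masked attention scores of ”translation” over the previously generated tokens; a minimal sketch of that computation with the embeddings above (a causal mask simply excludes any positions after the current one):

    import math

    # Masked self-attention scores for the current token "translation" over the prefix.
    E = {"this": [0.2, 0.5], "is": [0.6, 0.4], "a": [0.7, 0.3], "translation": [0.9, 0.8]}
    order = ["this", "is", "a", "translation"]
    current = "translation"

    # Causal mask: the token at position t may attend only to positions <= t.
    visible = order[: order.index(current) + 1]

    scores = {t: sum(a * b for a, b in zip(E[current], E[t])) for t in visible}
    denom = sum(math.exp(s) for s in scores.values())
    weights = {t: math.exp(s) / denom for t, s in scores.items()}
    print(scores)   # ~ {'this': 0.58, 'is': 0.86, 'a': 0.87, 'translation': 1.45}
    print(weights)  # softmax over the visible (unmasked) tokens only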
Question: BERT and Masked Language Modeling
You are using the BERT model for masked language modeling (MLM) on
the sentence: "The dog [MASK] the ball." The model provides the follow-
ing word embeddings for the unmasked tokens:
eThe = [0.5, 0.6], edog = [0.7, 0.4], ethe = [0.3, 0.7], eball = [0.8, 0.5]
cThe = [0.9, 0.1], cdog = [0.6, 0.3], cthe = [0.2, 0.8], cball = [0.7, 0.6]
(a) (5 points) Explain the masked language modeling objective used by BERT.
How does the model use self-attention to predict the masked token?
(b) (5 points) Use the context embeddings to compute the probabilities for
filling in the masked token using the dot product between the masked
position and the other tokens. Provide the calculations for the attention
scores and the predicted word probabilities.
(c) (3 points) Discuss how BERT handles bidirectional context and why this
is important for language understanding tasks. Contrast this with models
that use only left-to-right or right-to-left context.
(d) (3 points) Suppose you have fine-tuned BERT on a downstream task like
sentiment analysis. Explain how the masked language model pretraining
helps improve performance on such tasks.
Answer
a. Masked Language Modeling (MLM) Objective in BERT
The masked language modeling (MLM) objective is a key part of how BERT is
pre-trained. During pretraining, some tokens in the input are randomly replaced
with a special [MASK] token, and the model’s goal is to predict the original tokens
based on the surrounding context. The model learns to predict these masked
tokens by using the embeddings of the unmasked tokens, and it leverages the
self-attention mechanism to gather information from both the left and right
context.
How Self-Attention Helps in MLM: Self-attention allows BERT to cap-
ture relationships between the masked token and all other tokens in the se-
quence. Each token can attend to every other token (in a bidirectional manner),
meaning the model can utilize both preceding and succeeding context to make
predictions.
For the sentence "The dog [MASK] the ball", the model predicts the masked
token by attending to the embeddings of the tokens "The", "dog", "the", and
"ball", using self-attention to combine these context embeddings effectively.
The final layer of BERT uses this attention-enhanced information to predict
the original masked word, making MLM a powerful pretraining strategy.
cThe = [0.9, 0.1], cdog = [0.6, 0.3], cthe = [0.2, 0.8], cball = [0.7, 0.6]
The resulting probabilities indicate the likelihood of each token filling the
masked position.
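A minimal sketch of the part (b) computation, using the context embeddings c given above; the embedding of the [MASK] position itself is a hypothetical placeholder here, so the resulting numbers are illustrative only:

    import math

    # Attention-style scores for the [MASK] position against the context embeddings above.
    # NOTE: e_mask is a HYPOTHETICAL placeholder; the question's actual value is an assumption.
    e_mask = [0.5, 0.5]
    context = {"The": [0.9, 0.1], "dog": [0.6, 0.3], "the": [0.2, 0.8], "ball": [0.7, 0.6]}

    scores = {t: sum(a * b for a, b in zip(e_mask, c)) for t, c in context.items()}
    denom = sum(math.exp(s) for s in scores.values())
    probs = {t: math.exp(s) / denom for t, s in scores.items()}
    print(scores)  # dot products of the mask embedding with each context embedding
    print(probs)   # softmax over those scores: the distribution over candidate fillers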
• More Complete Understanding: By attending to both preceding and fol-
lowing tokens, BERT can better capture the full meaning of a word in its
context. For example, the word ”bank” could mean a financial institution
or a riverbank depending on the surrounding words.
• Unidirectional Models: Models like GPT process the sequence from left
to right. When predicting a word, they only have access to the preceding
context. This limits the model’s ability to fully understand the word’s
meaning, as it cannot look at the following words.
• BERT’s Advantage: BERT’s ability to look at both past and future to-
kens in the sequence leads to a more holistic understanding of language,
improving performance across various NLP tasks.