
CS388N Midterm Practice Questions

Question
Suppose you are working with a binary perceptron classifier and have the fol-
lowing dataset. The feature vectors are derived using a bag-of-words vocabulary
[happy, sad, okay]:
• Example 1: x1 = ”happy happy okay”, y1 = +
• Example 2: x2 = ”sad sad happy”, y2 = −
• Example 3: x3 = ”okay okay happy”, y3 = +
• Example 4: x4 = ”sad sad okay”, y4 = −
(a) Write down the feature vector for each of these examples using the pro-
vided vocabulary. Use the order [happy, sad, okay] when listing your fea-
ture vectors. Provide your answer as a comma-separated list of vectors,
like [x11 , x12 , x13 ], [x21 , x22 , x23 ], . . . .
(b) Run one epoch of the perceptron algorithm on this data, in order. Initialize
the weight vector w = [0, 0, 0] and use the decision rule w⊤ f (x) ≥ 0 (where
a score of 0 is classified as positive). Report the final weight vector at the
end of the epoch in the format [w1 , w2 , w3 ].
(c) Now assume you’re using subword tokenization with the subword vocab-
ulary {h, ap, py, sa, d, o, kay}.
(i) Tokenize each word using this subword vocabulary (apply greedy
tokenization).
(ii) Write down the new bag-of-words vocabulary (replacing the original).
(iii) Provide the feature vectors for each example using this subword vo-
cabulary.
(d) If you encountered the word ”sadokay” with no space, what would its
segmentation be under the subword vocabulary {h, ap, py, sa, d, o, kay}?
(e) For handling noisy data with possible word concatenations or typos, which
approach would likely be more effective for improving classification accu-
racy and robustness: (1) subword tokenization, or (2) repairing typos using
the word with the lowest edit distance from the vocabulary? Provide a
one-sentence justification for your choice.

Answer
(a) Feature vectors using the bag-of-words vocabulary [happy,
sad, okay]
We can construct feature vectors by counting the occurrences of each word in
the sentence. The order of the vocabulary is [happy, sad, okay].
• Example 1: ”happy happy okay”
Vector: [2, 0, 1]
• Example 2: ”sad sad happy”
Vector: [1, 2, 0]
• Example 3: ”okay okay happy”
Vector: [1, 0, 2]
• Example 4: ”sad sad okay”
Vector: [0, 2, 1]
The feature vectors are:
[2, 0, 1], [1, 2, 0], [1, 0, 2], [0, 2, 1]

(b) Run one epoch of the perceptron algorithm


The perceptron algorithm updates the weight vector based on the classification
error, using the rule:
If y(w⊤ f (x)) ≤ 0, update the weight as w = w + y · f (x)
We initialize w = [0, 0, 0].
• Example 1: x1 = [2, 0, 1], y1 = +1
Decision rule: w⊤ x1 = 0, which is positive (considered +1).
No update is made (correct classification).
• Example 2: x2 = [1, 2, 0], y2 = −1
Decision rule: w⊤ x2 = 0, classified as +1, which is wrong (should be −1).
Update: w = w + (−1) · [1, 2, 0] = [0, 0, 0] − [1, 2, 0] = [−1, −2, 0]
• Example 3: x3 = [1, 0, 2], y3 = +1
Decision rule: w⊤ x3 = −1, classified as −1, which is wrong (should be +1).
Update: w = [−1, −2, 0] + [1, 0, 2] = [0, −2, 2]
• Example 4: x4 = [0, 2, 1], y4 = −1
Decision rule: w⊤ x4 = −2, classified as −1 (correct classification).
No update is made.
Final weight vector after one epoch:
w = [0, −2, 2]
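
For reference, a minimal Python sketch (an addition, not part of the original answer) that re-runs this epoch with the same decision rule (a score of 0 counts as positive); the variable names are illustrative.

# Minimal re-run of the one-epoch perceptron trace above.
# Decision rule: predict +1 if w . f(x) >= 0, else -1.
examples = [
    ([2, 0, 1], +1),  # "happy happy okay"
    ([1, 2, 0], -1),  # "sad sad happy"
    ([1, 0, 2], +1),  # "okay okay happy"
    ([0, 2, 1], -1),  # "sad sad okay"
]

w = [0, 0, 0]
for f, y in examples:
    score = sum(wi * xi for wi, xi in zip(w, f))
    pred = 1 if score >= 0 else -1
    if pred != y:                       # mistake-driven update
        w = [wi + y * xi for wi, xi in zip(w, f)]
    print(f, y, "->", w)
# Final weights: [0, -2, 2], matching the hand computation.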

(c) Subword tokenization and feature vectors
The subword vocabulary is {h, ap, py, sa, d, o, kay}.

(i) Tokenization using greedy approach:


– ”happy” → [h, ap, py]
– ”sad” → [sa, d]
– ”okay” → [o, kay]

(ii) New bag-of-words vocabulary: [h, ap, py, sa, d, o, kay]


(iii) Feature vectors using the subword vocabulary:
– Example 1: ”happy happy okay”
Tokens: [h, ap, py, h, ap, py, o, kay]
Vector: [2, 2, 2, 0, 0, 1, 1]
– Example 2: ”sad sad happy”
Tokens: [sa, d, sa, d, h, ap, py]
Vector: [1, 1, 1, 2, 2, 0, 0]
– Example 3: ”okay okay happy”
Tokens: [o, kay, o, kay, h, ap, py]
Vector: [1, 1, 1, 0, 0, 2, 2]
– Example 4: ”sad sad okay”
Tokens: [sa, d, sa, d, o, kay]
Vector: [0, 0, 0, 2, 2, 1, 1]
The new feature vectors using the subword vocabulary are:

[2, 2, 2, 0, 0, 1, 1], [1, 1, 1, 2, 2, 0, 0], [1, 1, 1, 0, 0, 2, 2], [0, 0, 0, 2, 2, 1, 1]

(d) Subword segmentation of ”sadokay”


Using the subword vocabulary {h, ap, py, sa, d, o, kay}, we would segment
”sadokay” as:
[sa, d, o, kay]
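
A short sketch of greedy longest-match segmentation over this subword vocabulary (added for illustration; the helper name greedy_tokenize and the longest-prefix-first strategy are assumptions consistent with part (c)):

# Greedy longest-match subword segmentation over the vocabulary from part (c).
VOCAB = {"h", "ap", "py", "sa", "d", "o", "kay"}

def greedy_tokenize(word, vocab=VOCAB):
    tokens, i = [], 0
    while i < len(word):
        # Try the longest subword that matches starting at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot segment {word!r} at position {i}")
    return tokens

print(greedy_tokenize("happy"))    # ['h', 'ap', 'py']
print(greedy_tokenize("sadokay"))  # ['sa', 'd', 'o', 'kay']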

(e) Handling noisy data: Subword tokenization or edit distance?
Subword tokenization is generally more effective for handling noisy data (like
word concatenations or typos) because it allows the model to capture meaningful
subword units, which can still convey semantic information even when words
are misspelled or concatenated. Repairing typos using the lowest edit distance
may not generalize well to unseen or non-dictionary words.

Question: Logistic Regression and Gradient De-
scent
You are training a logistic regression model to classify sentiment as positive or
negative based on a set of features. The training dataset consists of the following
feature vectors:

• x1 = [1, 0.5], y1 = +
• x2 = [0.3, 0.8], y2 = −
• x3 = [0.6, 0.4], y3 = +

(a) (4 points) Write down the logistic regression formula for predicting the
probability that a given input belongs to the positive class. Use the feature
vector x and weight vector w = [w1 , w2 ].
(b) (6 points) Derive the gradient update rule for logistic regression using one
example from the dataset (e.g., x1 ) and apply stochastic gradient descent
for one iteration. Assume the current weight vector w = [0, 0] and learning
rate η = 0.1. Show your calculations.
(c) (4 points) Explain the impact of different step sizes on the convergence of
gradient descent for this logistic regression model. What would happen
with a step size that is too large or too small?

Answer
(a) Logistic Regression Formula for Predicting Probability
The logistic regression model predicts the probability that a given input x be-
longs to the positive class using the following formula:
P (y = +1 | x) = σ(w⊤ x) = 1 / (1 + e^(−w⊤ x))
Where:
• w = [w1 , w2 ] is the weight vector
• x = [x1 , x2 ] is the feature vector
• σ(z) = 1 / (1 + e^(−z)) is the sigmoid function, which outputs a probability between 0 and 1.
For a specific feature vector x = [x1 , x2 ] and weight vector w = [w1 , w2 ], the
probability that the input belongs to the positive class is:
P (y = +1 | x) = 1 / (1 + e^(−(w1 x1 + w2 x2)))

(b) Gradient Update Rule for Logistic Regression and SGD
We will now derive the gradient update rule for logistic regression using the first
example x1 = [1, 0.5], y1 = +1, and apply stochastic gradient descent (SGD)
for one iteration.
Step 1: Compute the prediction
First, calculate the predicted probability for example 1 using the current
weight vector w = [0, 0].

P (y = +1 | x1) = 1 / (1 + e^(−(w1 x1 + w2 x2))) = 1 / (1 + e^(−(0·1 + 0·0.5))) = 1 / (1 + e^0) = 1/2 = 0.5
So, the predicted probability for x1 is 0.5.
Step 2: Compute the gradient
The gradient of the loss function with respect to the weights w1 and w2 for
logistic regression is given by:
∂L/∂wi = (P (y = +1 | x) − y) · xi
Where xi is the i-th feature of the input example x1 .
For example 1, where y1 = +1, the gradients for w1 and w2 are:
∂L/∂w1 = (0.5 − 1) · 1 = −0.5
∂L/∂w2 = (0.5 − 1) · 0.5 = −0.25
Step 3: Update the weights using SGD
The weight update rule in SGD is:
wi = wi − η · ∂L/∂wi
Where η = 0.1 is the learning rate.
For w1 and w2 , the updates are:

w1 = 0 − 0.1 · (−0.5) = 0 + 0.05 = 0.05


w2 = 0 − 0.1 · (−0.25) = 0 + 0.025 = 0.025
So after one iteration, the updated weight vector is:

w = [0.05, 0.025]
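
As a quick numerical check (an addition, not part of the original answer), the following Python sketch reproduces this single SGD step:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [0.0, 0.0]          # current weights
x, y = [1.0, 0.5], 1.0  # example x1 with positive label
eta = 0.1               # learning rate

p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))   # predicted P(y=+1|x) = 0.5
grad = [(p - y) * xi for xi in x]                   # [-0.5, -0.25]
w = [wi - eta * gi for wi, gi in zip(w, grad)]      # [0.05, 0.025]
print(grad, w)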

(c) Impact of Different Step Sizes on Gradient Descent Con-
vergence
If the step size is too large:
• A large step size can cause the gradient updates to overshoot the opti-
mal point, leading to oscillations or divergence, where the weights do not
converge to the minimum loss. This can make the model unstable and
prevent it from learning effectively.
If the step size is too small:
• A small step size will make the convergence very slow. The model will
make tiny updates to the weights, meaning it will take many iterations
to reach the minimum loss. While it ensures that the model converges, it
might be inefficient and time-consuming.
Ideally, the step size should be tuned to balance the speed of convergence
with stability, ensuring the model reaches the optimal weights in a reasonable
time without overshooting.

Question: Word Embeddings and Bias
You are training word embeddings using the skip-gram model. Consider the
following three words: ”doctor,” ”nurse,” and ”hospital.”
Word embeddings:
vdoctor = [0.9, 0.4]
vnurse = [0.7, 0.3]
vhospital = [0.6, 0.2]

(a) (4 points) Define the objective function for training word embeddings
using the skip-gram model. What is the role of the context window?
(b) (6 points) Calculate the cosine similarity between ”doctor” and ”nurse,”
and between ”doctor” and ”hospital.” Based on these similarities, which
pair of words is more similar in the embedding space?

(c) (5 points) Suppose you discover a gender bias in the word embeddings. For
example, ”doctor” is more similar to male-associated words, and ”nurse” is
more similar to female-associated words. How could you mitigate this bias
in the word embeddings? Provide two potential approaches and briefly
describe their pros and cons.

Answer
(a) Objective Function for the Skip-gram Model
The objective function for the skip-gram model is designed to maximize the
probability of the context words given a target word. More formally, for a given
target word wt and its context words wt−c , . . . , wt+c , the objective is to maximize:

Σ_{t=1}^{T} Σ_{−c ≤ j ≤ c, j ≠ 0} log P (wt+j | wt)

Where:

• wt is the target word at position t


• wt+j are the context words within a window of size c around the target
word
• P (wt+j | wt ) is the probability of the context word given the target word,
which is usually modeled using a softmax function:

P (wcontext | wtarget ) = exp(vcontext · vtarget ) / Σ_{w ∈ V} exp(vw · vtarget )

Where vtarget and vcontext are the word vectors for the target and context
words.
Role of the Context Window: The context window defines how many
words around the target word are considered when learning word embeddings.
A larger window size allows the model to capture broader semantic relationships
(longer-range dependencies) between words, while a smaller window focuses on
more immediate, local relationships between words.

(b) Cosine Similarity Calculations


The cosine similarity between two word vectors v1 and v2 is given by the formula:
cosine similarity(v1 , v2 ) = (v1 · v2 ) / (∥v1 ∥ ∥v2 ∥)
Where:
• v1 · v2 is the dot product of the two vectors,
• ∥v1 ∥ and ∥v2 ∥ are the magnitudes (Euclidean norms) of the vectors.
We have the following word embeddings:
vdoctor = [0.9, 0.4], vnurse = [0.7, 0.3], vhospital = [0.6, 0.2]
Step 1: Cosine similarity between ”doctor” and ”nurse”
Dot product:
0.9 · 0.7 + 0.4 · 0.3 = 0.63 + 0.12 = 0.75
Magnitude of vdoctor :
∥vdoctor ∥ = √(0.9² + 0.4²) = √(0.81 + 0.16) = √0.97 ≈ 0.985
Magnitude of vnurse :
∥vnurse ∥ = √(0.7² + 0.3²) = √(0.49 + 0.09) = √0.58 ≈ 0.761
Cosine similarity:
cosine similarity(vdoctor , vnurse ) = 0.75 / (0.985 · 0.761) ≈ 0.75 / 0.750 ≈ 1.000
Step 2: Cosine similarity between ”doctor” and ”hospital”
Dot product:
0.9 · 0.6 + 0.4 · 0.2 = 0.54 + 0.08 = 0.62
Magnitude of vhospital :
∥vhospital ∥ = √(0.6² + 0.2²) = √(0.36 + 0.04) = √0.4 ≈ 0.632
Cosine similarity:
cosine similarity(vdoctor , vhospital ) = 0.62 / (0.985 · 0.632) ≈ 0.62 / 0.623 ≈ 0.996
Conclusion: Both pairs have very high cosine similarity in this two-dimensional embedding space, but ”doctor” and ”nurse” (≈ 1.000) are slightly more similar than ”doctor” and ”hospital” (≈ 0.996), so ”doctor”–”nurse” is the more similar pair.
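
A short verification sketch (added for illustration) that computes both similarities at full precision:

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

doctor, nurse, hospital = [0.9, 0.4], [0.7, 0.3], [0.6, 0.2]
print(round(cosine(doctor, nurse), 4))     # ≈ 0.9999
print(round(cosine(doctor, hospital), 4))  # ≈ 0.9953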

(c) Mitigating Gender Bias in Word Embeddings
Gender bias in word embeddings can lead to ”doctor” being more similar to
male-associated words and ”nurse” more similar to female-associated words.
Here are two approaches to mitigate this bias:
Approach 1: Debiasing through Neutralizing and Equalizing
This method, introduced by Bolukbasi et al., involves:
• Neutralizing: Identify gender-biased dimensions in the embedding space
and remove them from words that should be gender-neutral (e.g., ”doc-
tor,” ”nurse”). This is done by projecting the word vector onto a subspace
orthogonal to the gender direction.
• Equalizing: Ensure that gendered word pairs like (”he,” ”she”) are
equidistant from neutral words like ”doctor” and ”nurse.”
Pros:

• Effective at directly addressing gender biases in specific contexts.


• Ensures that gender-neutral words remain neutral in the embedding space.
Cons:

• Requires careful identification of the gender bias subspace, which might not capture all biases.
• Only addresses gender bias, not other potential biases (e.g., race, age).
Approach 2: Post-processing Techniques with Fine-tuning
Another approach is to fine-tune pre-trained embeddings on a bias-neutral
corpus. By exposing the model to diverse, unbiased text, it can learn more
neutral associations between words.
Pros:
• Can help reduce bias across a variety of dimensions (not limited to gender).

• Embeddings remain useful for other tasks after fine-tuning.


Cons:
• Requires a large, high-quality unbiased corpus, which can be difficult to
find or curate.

• Might not completely eliminate biases present in the pre-trained model.

Question: Self-Attention and Transformers
Consider a Transformer model with the following input sequence: [”I”, ”love”,
”NLP”]. The word embeddings for each token are as follows:

eI = [1, 0]
elove = [0, 1]
eNLP = [1, 1]

(a) (4 points) Define the self-attention mechanism in the Transformer model. What are the key components (queries, keys, values), and how do they interact?
(b) (5 points) For this input sequence, calculate the self-attention scores be-
tween the tokens using the dot product between their embeddings. Assume
no scaling or softmax is applied for simplicity. Report the attention scores
for all token pairs.
(c) (6 points) Explain how multi-head attention improves the performance of
the Transformer model compared to using a single attention head. Give
an example of how this would help in a sentence-level task like machine
translation.

Answer
Let’s work through each part of the question on Self-Attention and Transformers
step by step.

(a) Self-Attention Mechanism in the Transformer Model


The self-attention mechanism allows the Transformer model to compute a rep-
resentation of each word in the input sequence by attending to all other words
in the sequence. The key components in self-attention are:
Queries (Q): Represent the word we are focusing on when computing at-
tention scores. Each word in the input sequence has a query vector derived from
its embedding.
Keys (K): Represent how each word in the input sequence can be ”queried.”
Each word also has a key vector derived from its embedding.
Values (V): These vectors represent the information contained in each
word’s embedding. Each word has a value vector that is used to construct
the final output representation.
The self-attention mechanism works as follows:

• For each word, compute the attention score between its query and the
keys of all other words in the sequence using a dot product.

• The attention scores determine how much focus should be given to other
words when computing the word’s new representation.
• The final output for each word is a weighted sum of the value vectors,
with weights given by the attention scores.

(b) Calculating Self-Attention Scores Using Dot Product


We are given the following word embeddings for the input sequence [”I”, ”love”,
”NLP”]:

eI = [1, 0]
elove = [0, 1]
eNLP = [1, 1]
The self-attention scores between two words are calculated using the dot
product of their embeddings. For simplicity, we will skip scaling and softmax.
Step 1: Attention score between ”I” and ”I”

Score(eI , eI ) = [1, 0] · [1, 0] = 1 · 1 + 0 · 0 = 1

Step 2: Attention score between ”I” and ”love”

Score(eI , elove ) = [1, 0] · [0, 1] = 1 · 0 + 0 · 1 = 0

Step 3: Attention score between ”I” and ”NLP”

Score(eI , eNLP ) = [1, 0] · [1, 1] = 1 · 1 + 0 · 1 = 1

Step 4: Attention score between ”love” and ”I”

Score(elove , eI ) = [0, 1] · [1, 0] = 0 · 1 + 1 · 0 = 0

Step 5: Attention score between ”love” and ”love”

Score(elove , elove ) = [0, 1] · [0, 1] = 0 · 0 + 1 · 1 = 1

Step 6: Attention score between ”love” and ”NLP”

Score(elove , eNLP ) = [0, 1] · [1, 1] = 0 · 1 + 1 · 1 = 1

Step 7: Attention score between ”NLP” and ”I”

Score(eNLP , eI ) = [1, 1] · [1, 0] = 1 · 1 + 1 · 0 = 1

Step 8: Attention score between ”NLP” and ”love”

Score(eNLP , elove ) = [1, 1] · [0, 1] = 1 · 0 + 1 · 1 = 1

Step 9: Attention score between ”NLP” and ”NLP”
Score(eNLP , eNLP ) = [1, 1] · [1, 1] = 1 · 1 + 1 · 1 = 2
Summary of Attention Scores: The attention scores between the tokens
are:
I love NLP
I 1 0 1
love 0 1 1
NLP 1 1 2
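
A minimal sketch (added for illustration) that reproduces this unscaled score matrix:

# Unscaled dot-product attention scores for ["I", "love", "NLP"].
embeddings = {"I": [1, 0], "love": [0, 1], "NLP": [1, 1]}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

for qi, eq in embeddings.items():
    row = {ki: dot(eq, ek) for ki, ek in embeddings.items()}
    print(qi, row)
# I    {'I': 1, 'love': 0, 'NLP': 1}
# love {'I': 0, 'love': 1, 'NLP': 1}
# NLP  {'I': 1, 'love': 1, 'NLP': 2}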

(c) Multi-Head Attention in Transformers


Multi-head attention improves the performance of the Transformer model by
allowing it to focus on different aspects of the input simultaneously. Instead
of using a single attention head, which computes attention using a single set
of query, key, and value projections, multi-head attention splits the input into
multiple sets of queries, keys, and values, and computes attention independently
for each head. The results from each head are then concatenated and combined.
Advantages of Multi-Head Attention:
• Focus on Different Parts of the Input: Each attention head can fo-
cus on different relationships or patterns in the data. For example, one
head might focus on syntactic dependencies (like subject-verb agreement),
while another might focus on semantic relationships (like noun-verb inter-
actions).
• Capture Multiple Representations: Multi-head attention allows the
model to capture multiple different representations of the same word, en-
hancing the model’s ability to understand word meaning in different con-
texts.
Example in Machine Translation: In a sentence-level task like machine
translation, multi-head attention can help by allowing the model to:
• Focus one attention head on the subject of the sentence to ensure that it
is correctly translated.
• Focus another head on the verb tense, ensuring that the correct verb tense
is used in the translation.
• Another head might focus on maintaining word order or translating id-
iomatic expressions.
For example, in translating the sentence ”The dog chased the cat” into
French, one attention head might focus on the syntactic structure (matching
”The dog” to ”Le chien”), while another head might ensure the correct tense of
”chased” is translated to ”a poursuivi.”
By capturing these different aspects, multi-head attention ensures that trans-
lations are more accurate and nuanced, compared to using a single attention
head that might overlook important details.

Question: Hidden Markov Models and POS Tag-
ging
Suppose you are using a Hidden Markov Model (HMM) to perform part-of-
speech (POS) tagging. You have the following sentences and tags:
Sentence: ”The cat sleeps”
Tags: ”Det Noun Verb”
You are given the following probabilities:
Transition probabilities:

P (Noun | Det) = 0.7

P (Verb | Noun) = 0.8


Emission probabilities:

P (The | Det) = 0.9

P (cat | Noun) = 0.6


P (sleeps | Verb) = 0.7

(a) (5 points) Use the Viterbi algorithm to calculate the most probable se-
quence of tags for the sentence ”The cat sleeps.” Show each step of your
calculations, including the initialization, recursion, and backtracking steps.
(b) (5 points) What are some challenges in applying HMMs to POS tagging in
real-world scenarios? Explain at least two challenges and how they might
be addressed.

Answer
(a) Viterbi Algorithm for POS Tagging
We are tasked with finding the most probable sequence of POS tags for the
sentence ”The cat sleeps” using the Viterbi algorithm. The possible tags are
”Det”, ”Noun”, and ”Verb”.
Given Probabilities:
Transition probabilities:

P (Noun | Det) = 0.7

P (Verb | Noun) = 0.8


Emission probabilities:

P (The | Det) = 0.9

P (cat | Noun) = 0.6

P (sleeps | Verb) = 0.7
Step 1: Initialization
We start by initializing the first time step t = 1, which corresponds to the
first word ”The”.

V1 (Det) = P (The | Det) = 0.9


For ”Noun” and ”Verb”, we don’t have direct transitions from the start of
the sentence, so we assume initial probabilities are 0 for these tags:

V1 (Noun) = 0, V1 (Verb) = 0

Step 2: Recursion
We now calculate the probabilities for the remaining words in the sentence:
For t = 2, the word is ”cat”:
We need to calculate the most probable transition to each possible tag (Noun,
Verb) from the previous tags:
Transition from ”Det” to ”Noun” and emitting ”cat”:

V2 (Noun) = V1 (Det) × P (Noun | Det) × P (cat | Noun)

V2 (Noun) = 0.9 × 0.7 × 0.6 = 0.378


Transition from ”Det” to ”Verb” isn’t allowed directly based on the provided
transition probabilities, so:
V2 (Verb) = 0
For t = 3, the word is ”sleeps”:
We now consider the transitions from ”Noun” (the only non-zero tag at
t = 2):
Transition from ”Noun” to ”Verb” and emitting ”sleeps”:

V3 (Verb) = V2 (Noun) × P (Verb | Noun) × P (sleeps | Verb)

V3 (Verb) = 0.378 × 0.8 × 0.7 = 0.21168


Since ”Det” and ”Noun” do not directly transition to ”sleeps” under the
given model, we don’t calculate their values.
Step 3: Backtracking
Now, we backtrack to find the most probable sequence of tags:
At t = 3, the highest probability is for ”Verb”, so the third word ”sleeps” is
tagged as ”Verb”.
At t = 2, the only tag with a non-zero probability is ”Noun”, so ”cat” is
tagged as ”Noun”.
At t = 1, the tag ”Det” has the highest probability, so ”The” is tagged as
”Det”.
Final tag sequence:

”The cat sleeps” → ”Det Noun Verb”
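
For reference, a small Viterbi sketch (added, not part of the original answer) under the same assumptions as the worked solution: the sentence starts in ”Det” with probability 1, and any transition or emission not listed is treated as 0.

# Viterbi over the tiny HMM from the answer above.
trans = {("Det", "Noun"): 0.7, ("Noun", "Verb"): 0.8}
emit = {("Det", "The"): 0.9, ("Noun", "cat"): 0.6, ("Verb", "sleeps"): 0.7}
states = ["Det", "Noun", "Verb"]
words = ["The", "cat", "sleeps"]

V = [{s: (emit.get((s, words[0]), 0.0) if s == "Det" else 0.0, None) for s in states}]
for t in range(1, len(words)):
    V.append({})
    for s in states:
        best = max(
            ((V[t - 1][p][0] * trans.get((p, s), 0.0) * emit.get((s, words[t]), 0.0), p)
             for p in states),
            key=lambda x: x[0],
        )
        V[t][s] = best

# Backtrack from the best final state.
tag = max(states, key=lambda s: V[-1][s][0])
path = [tag]
for t in range(len(words) - 1, 0, -1):
    tag = V[t][tag][1]
    path.append(tag)
print(list(zip(words, reversed(path))), V[-1]["Verb"][0])  # best tags, score ≈ 0.21168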

(b) Challenges in Applying HMMs to POS Tagging in Real-
World Scenarios
There are several challenges in using HMMs for POS tagging in real-world sce-
narios:
1. Limited Context Window:
Challenge: HMMs assume that the current tag only depends on the pre-
vious tag (Markov assumption), which means they have a limited ability to
capture long-range dependencies between words in a sentence.
Solution: This challenge can be addressed by using more advanced models
like Conditional Random Fields (CRFs) or Recurrent Neural Networks (RNNs)
that can take a larger context or even the entire sentence into account when
predicting tags.
2. Data Sparsity:
Challenge: HMMs rely on estimating transition and emission probabili-
ties from the training data. In real-world data, some word-tag pairs or tag
transitions may rarely (or never) occur in the training data, leading to sparse
estimates or zero probabilities.
Solution: This can be addressed using techniques such as smoothing to
assign small non-zero probabilities to unseen events. Additionally, modern
methods such as neural networks and pre-trained embeddings (e.g., BERT)
can handle rare or unseen words more effectively by leveraging distributed rep-
resentations.
These challenges highlight the limitations of HMMs and motivate the need
for more powerful models in real-world POS tagging tasks.

Question: Multiclass Perceptron
You have the following training points for a three-class classification problem;
points are listed as (x1 , x2 , y), where y represents one of the three classes:
{1, 2, 3}:

(1, 0, 1) (0, 1, 2) (1, 1, 3) (0, 0, 1)

(a) (6 points) Apply one pass of the multiclass perceptron algorithm on this
data, starting from weight vectors initialized at 0 for each class. Assume
the decision rule is to assign a point to the class with the highest dot
product between the weight vector and the feature vector. After processing
all the points, report the final weight vectors for each class. Show each
step of the update process.
(b) (4 points) Describe a scenario in which multiclass perceptron might not
converge on a dataset. Why would this happen, and how could you modify
the algorithm or dataset to address this issue?

Answer
(a) One Pass of the Multiclass Perceptron Algorithm
The multiclass perceptron algorithm uses a weight vector for each class. We
initialize the weight vectors for classes 1, 2, and 3 as:

w1 = [0, 0], w2 = [0, 0], w3 = [0, 0]


We will go through each point and update the weight vectors based on the
perceptron rule: if the true class doesn’t match the predicted class, update the
weights.
Training Points:

(1, 0, 1) (0, 1, 2) (1, 1, 3) (0, 0, 1)

Step 1: Process (1, 0, 1)


Current weight vectors:

w1 = [0, 0], w2 = [0, 0], w3 = [0, 0]

Predicted class:
Compute the dot products:

w1 · [1, 0] = 0, w2 · [1, 0] = 0, w3 · [1, 0] = 0

Since all dot products are equal, we assume ties are broken arbitrarily, so let’s
predict class 1 (which is correct).
Update: No update is needed because the predicted class is correct.

Step 2: Process (0, 1, 2)
Current weight vectors:
w1 = [0, 0], w2 = [0, 0], w3 = [0, 0]
Predicted class:
Compute the dot products:
w1 · [0, 1] = 0, w2 · [0, 1] = 0, w3 · [0, 1] = 0
Again, ties are broken arbitrarily, so let’s predict class 1.
Update:
The correct class is 2, but the model predicted 1, so we update the weights:
w1 = [0, 0] − [0, 1] = [0, −1]
w2 = [0, 0] + [0, 1] = [0, 1]
Step 3: Process (1, 1, 3)
Current weight vectors:
w1 = [0, −1], w2 = [0, 1], w3 = [0, 0]
Predicted class:
Compute the dot products:
w1 · [1, 1] = 0 · 1 + (−1) · 1 = −1
w2 · [1, 1] = 0 · 1 + 1 · 1 = 1
w3 · [1, 1] = 0 · 1 + 0 · 1 = 0
The model predicts class 2 because it has the highest dot product (1).
Update:
The correct class is 3, but the model predicted 2, so we update the weights:
w2 = [0, 1] − [1, 1] = [−1, 0]
w3 = [0, 0] + [1, 1] = [1, 1]
Step 4: Process (0, 0, 1)
Current weight vectors:
w1 = [0, −1], w2 = [−1, 0], w3 = [1, 1]
Predicted class:
Compute the dot products:
w1 · [0, 0] = 0, w2 · [0, 0] = 0, w3 · [0, 0] = 0
The model predicts class 1 (arbitrary tie-breaking), which is correct.
Update: No update is needed because the predicted class is correct.
Final Weight Vectors:
w1 = [0, −1], w2 = [−1, 0], w3 = [1, 1]
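
A compact sketch (added for illustration) that re-runs this single pass, breaking ties toward the lowest class index as assumed above:

# One pass of the multiclass perceptron; ties go to the lowest class index.
data = [([1, 0], 1), ([0, 1], 2), ([1, 1], 3), ([0, 0], 1)]
w = {c: [0, 0] for c in (1, 2, 3)}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

for x, y in data:
    pred = max(sorted(w), key=lambda c: dot(w[c], x))
    if pred != y:
        w[y] = [wi + xi for wi, xi in zip(w[y], x)]
        w[pred] = [wi - xi for wi, xi in zip(w[pred], x)]
    print(x, y, pred, w)
# Final: w1 = [0, -1], w2 = [-1, 0], w3 = [1, 1]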

(b) When Might the Multiclass Perceptron Not Converge?
The multiclass perceptron might not converge on a dataset in the following
scenario:
Non-linearly separable data: If the classes are not linearly separable, no set of weight vectors can score every training point highest under its true class, so the perceptron keeps making mistakes and updating indefinitely and the weights never settle. This can be addressed by capping the number of epochs and using the averaged (or voted) perceptron, by adding richer features (or a kernel) so the classes become separable, or by switching to a loss-based classifier such as multiclass logistic regression or an SVM, which handle non-separable data gracefully.

Question: Word Embeddings and Skip-Gram
Consider the following word embedding vectors:

vcat = [0.5, 0.3], vdog = [0.6, 0.2], vanimal = [0.7, 0.4]

(a) (3 points) Using the skip-gram model, compute the probability of the
context word ”dog” given the word ”cat.” Assume a softmax over the
context words ”dog” and ”animal.” Use the dot product for similarity and
leave your answer in terms of exponentials (you do not need to compute
the actual values).
(b) (3 points) Suppose you now extend the vocabulary with two new words:
”elephant” and ”mouse.” Briefly describe how increasing the vocabulary
affects the computation of the softmax and propose a more efficient alter-
native for handling larger vocabularies.

(c) (6 points) Instead of learning independent word embeddings, you decide to model each bigram as a sum of the embeddings of the individual words.
Write out the modified skip-gram objective using this approach. Explain
how this would change the representation of bigrams like ”cat dog.”

Answer
(a) Probability of the Context Word ”dog” Given the Word
”cat” Using the Skip-Gram Model
In the skip-gram model, the probability of a context word wcontext given a target
word wtarget is calculated using a softmax function over the dot product of their
embeddings.
For this question, you are asked to compute the probability P (wcontext =
dog | wtarget = cat), using a softmax over two context words: ”dog” and ”ani-
mal.”
The formula for the softmax probability is:

P (dog | cat) = exp(vcat · vdog ) / (exp(vcat · vdog ) + exp(vcat · vanimal ))
Where:

vcat = [0.5, 0.3], vdog = [0.6, 0.2], vanimal = [0.7, 0.4]

Step 1: Compute the dot products

vcat · vdog = 0.5 · 0.6 + 0.3 · 0.2 = 0.3 + 0.06 = 0.36

vcat · vanimal = 0.5 · 0.7 + 0.3 · 0.4 = 0.35 + 0.12 = 0.47

Step 2: Plug the dot products into the softmax formula
P (dog | cat) = exp(0.36) / (exp(0.36) + exp(0.47))
This expression gives the probability of the context word ”dog” given the
target word ”cat” in terms of exponentials.
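
Although the question only asks for the answer in terms of exponentials, a two-line check (added here for illustration) gives the numeric value:

import math

p_dog = math.exp(0.36) / (math.exp(0.36) + math.exp(0.47))
print(round(p_dog, 4))  # ≈ 0.4725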

(b) Effect of Increasing Vocabulary on Softmax Computation and Efficient Alternatives
Impact of Increasing Vocabulary on Softmax:
When the vocabulary size increases, the computation of the softmax becomes
more expensive. The softmax requires calculating the dot product between the
target word and every possible context word in the vocabulary and then nor-
malizing by summing over all those exponentials. If you add ”elephant” and
”mouse” to the vocabulary, the denominator in the softmax will include addi-
tional terms, making the computation slower and more costly as the vocabulary
grows.
Efficient Alternative: Negative Sampling
A more efficient alternative to the full softmax in the skip-gram model is
negative sampling. Instead of computing the softmax over the entire vocab-
ulary, negative sampling randomly selects a small number of ”negative” context
words (words that do not appear in the context of the target word). The model
then focuses on distinguishing the true context words from these sampled neg-
ative examples.
Pros of Negative Sampling:
• It significantly reduces computation because the model only evaluates a
few sampled words instead of the entire vocabulary.
• Negative sampling works well in practice and is widely used in word2vec
and other word embedding models.

(c) Modified Skip-Gram Objective Using Bigram Representations
In this part, we modify the skip-gram model by representing each bigram (pair
of words) as the sum of the embeddings of the individual words.
Modified Skip-Gram Objective:
Let’s assume you are predicting a context word wcontext given a bigram
(w1 , w2 ). Instead of using separate embeddings for bigrams, you represent the
bigram as the sum of the embeddings of the individual words:

vbigram = vw1 + vw2


The probability of predicting the context word wcontext given the bigram
(w1 , w2 ) is:

P (wcontext | w1 , w2 ) = exp((vw1 + vw2 ) · vwcontext ) / Σ_{w′ ∈ V} exp((vw1 + vw2 ) · vw′ )

Where V is the vocabulary.


How This Changes the Representation of Bigrams:
In the original skip-gram model, each word is represented by its individual
embedding, and the model tries to predict context words based on those embed-
dings. With the new approach, you are combining the embeddings of two words
to form a compositional representation for the bigram. This method changes
how the model represents relationships between words:
For bigrams like ”cat dog”, the new representation would be:

vcat dog = vcat + vdog = [0.5, 0.3] + [0.6, 0.2] = [1.1, 0.5]

Advantages:
• The model is able to represent bigrams without needing separate embed-
dings for every possible word pair, which can reduce memory requirements.
• The model can generalize better since bigrams are built from individual
word embeddings, and it can predict new bigrams not seen during training.
Disadvantages:
• The sum of the embeddings might not always perfectly capture the mean-
ing of the bigram, especially for phrases where the meaning changes dras-
tically when words are combined (e.g., idiomatic expressions).

• This representation may lose some nuanced meaning that would be cap-
tured by a more complex model for bigram relationships.

Question: Hidden Markov Models (HMM)
You are building an HMM to model the sequence of weather states (Sunny,
Rainy) based on observations of outdoor activities (Walk, Shop, Clean). You
are given the following probabilities:
Transition probabilities:

P (Rainy | Sunny) = 0.3

P (Sunny | Rainy) = 0.4


Emission probabilities:

P (Walk | Sunny) = 0.6

P (Shop | Rainy) = 0.5

(a) (4 points) Use the forward algorithm to calculate the probability of ob-
serving the sequence [Walk, Shop] if the initial state is Sunny. Show each
step of the forward pass.
(b) (4 points) Use the Viterbi algorithm to find the most probable sequence
of weather states for the observation sequence [Walk, Shop]. Show your
calculations, including initialization, recursion, and backtracking.
(c) (4 points) Discuss two limitations of the HMM when applied to real-world
tasks like speech recognition or POS tagging. How do more advanced
models, such as conditional random fields (CRFs), address these limita-
tions?

Answer
(a) Forward Algorithm for the Sequence [Walk, Shop]
The forward algorithm is used to compute the probability of observing a se-
quence of observations. Here, we are given the initial state as Sunny and asked
to calculate the probability of observing the sequence [Walk, Shop].
Step 1: Define the Problem
States (weather): Sunny (S), Rainy (R)
Observations (activities): Walk (W), Shop (S)
Initial state: Sunny
Step 2: Given Probabilities
Transition probabilities:

P (Rainy | Sunny) = 0.3

P (Sunny | Rainy) = 0.4


P (Sunny | Sunny) = 1 − 0.3 = 0.7

P (Rainy | Rainy) = 1 − 0.4 = 0.6
Emission probabilities:

P (Walk | Sunny) = 0.6, P (Shop | Rainy) = 0.5

P (Shop | Sunny) = 1 − 0.6 = 0.4, P (Walk | Rainy) = 1 − 0.5 = 0.5


Step 3: Forward Algorithm Calculation
The forward algorithm computes the probability of the observation sequence
[Walk, Shop]. Let ft (X) represent the probability of being in state X at time
t, having observed the first t observations.
Initialization (t = 1, Walk):

f1 (Sunny) = P (Sunny) · P (Walk | Sunny) = 1 · 0.6 = 0.6

For Rainy:

f1 (Rainy) = P (Rainy)·P (Walk | Rainy) = 0 (no initial probability given for Rainy)

Recursion (t = 2, Shop):
For Sunny at t = 2:

f2 (Sunny) = [f1 (Sunny)·P (Sunny | Sunny)+f1 (Rainy)·P (Sunny | Rainy)]·P (Shop | Sunny)

f2 (Sunny) = [0.6 · 0.7 + 0 · 0.4] · 0.4 = 0.6 · 0.7 · 0.4 = 0.168


For Rainy at t = 2:

f2 (Rainy) = [f1 (Sunny)·P (Rainy | Sunny)+f1 (Rainy)·P (Rainy | Rainy)]·P (Shop | Rainy)

f2 (Rainy) = [0.6 · 0.3 + 0 · 0.6] · 0.5 = 0.6 · 0.3 · 0.5 = 0.09


Final Probability:

P (Walk, Shop) = f2 (Sunny) + f2 (Rainy) = 0.168 + 0.09 = 0.258
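
A small forward-algorithm sketch (added for illustration) under the same assumptions as above: the chain starts in Sunny with probability 1, and the unlisted probabilities are the complements of the given ones.

# Forward algorithm for the two-state weather HMM.
trans = {"Sunny": {"Sunny": 0.7, "Rainy": 0.3},
         "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}
emit = {"Sunny": {"Walk": 0.6, "Shop": 0.4},
        "Rainy": {"Walk": 0.5, "Shop": 0.5}}
start = {"Sunny": 1.0, "Rainy": 0.0}
obs = ["Walk", "Shop"]

f = {s: start[s] * emit[s][obs[0]] for s in trans}          # t = 1
for o in obs[1:]:                                           # t = 2, ...
    f = {s: sum(f[p] * trans[p][s] for p in trans) * emit[s][o] for s in trans}

print(f, sum(f.values()))  # {'Sunny': ≈0.168, 'Rainy': ≈0.09}, total ≈ 0.258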

(b) Viterbi Algorithm for [Walk, Shop]


The Viterbi algorithm finds the most probable sequence of hidden states (weather
states) for the given observation sequence [Walk, Shop].
Step 1: Initialization (t = 1, Walk)
For Sunny:

v1 (Sunny) = P (Sunny) · P (Walk | Sunny) = 1 · 0.6 = 0.6

For Rainy:

v1 (Rainy) = 0 (no initial probability for Rainy)

Step 2: Recursion (t = 2, Shop)

For Sunny:

v2 (Sunny) = max[v1 (Sunny)·P (Sunny | Sunny), v1 (Rainy)·P (Sunny | Rainy)]·P (Shop | Sunny)

v2 (Sunny) = max[0.6 · 0.7, 0 · 0.4] · 0.4 = 0.6 · 0.7 · 0.4 = 0.168


For Rainy:

v2 (Rainy) = max[v1 (Sunny)·P (Rainy | Sunny), v1 (Rainy)·P (Rainy | Rainy)]·P (Shop | Rainy)

v2 (Rainy) = max[0.6 · 0.3, 0 · 0.6] · 0.5 = 0.6 · 0.3 · 0.5 = 0.09


Step 3: Backtracking
At t = 2, the most probable state is Sunny since v2 (Sunny) = 0.168 is larger
than v2 (Rainy) = 0.09.
Therefore, the most probable sequence is Sunny → Sunny.

(c) Limitations of HMMs and Advantages of CRFs


1. Independence Assumptions in HMMs
Limitation: In HMMs, each state depends only on the previous state (first-
order Markov assumption) and each observation depends only on the current
state (emission independence). This can be too restrictive in real-world tasks
like speech recognition or POS tagging, where long-range dependencies between
states or observations are important.
CRF Solution: Conditional Random Fields (CRFs) relax this assumption
by allowing dependencies between all the states given the observations. CRFs
model the entire sequence globally, enabling the model to capture richer depen-
dencies between states and observations.
2. Difficulty in Handling Complex Feature Sets
Limitation: HMMs can only use emission probabilities based on the current
state and observation. This makes it difficult to incorporate complex features
such as context or linguistic patterns in POS tagging or acoustic features in
speech recognition.
CRF Solution: CRFs allow for the incorporation of arbitrary features of
the input without making strong independence assumptions. They can include
features that depend on the entire sequence, enabling better modeling of com-
plex relationships in the data.
Conclusion: CRFs provide a more flexible framework for sequence model-
ing, making them well-suited for tasks like POS tagging and speech recognition.

Question: Training Neural Networks
You are training a feedforward neural network for binary classification. The
network has the following architecture:
• Input layer with two neurons

• One hidden layer with two neurons and ReLU activation


• Output layer with one neuron and sigmoid activation
You are given the following weights and inputs:
Input:
x = [1, 0.5]
Weights for the first layer:

W1 = [0.2 0.3; 0.4 0.6]

Weights for the output layer:

W2 = [0.5, 0.7]

(a) (5 points) Compute the output of the neural network for the given in-
put. Show your work step by step, including the application of the ReLU
activation in the hidden layer and the sigmoid activation in the output
layer.

(b) (4 points) Assume the network is trained using gradient descent. Derive
the gradient update for the weights in the output layer based on the pre-
dicted output and the true label y = 1. Use the binary cross-entropy loss
function.
(c) (3 points) Explain how improper weight initialization can impact the train-
ing process of a neural network. Why is it important to use techniques
like Xavier or He initialization?

Answer
(a) Compute the Output of the Neural Network
The architecture has two layers:

• The input layer with 2 neurons.


• The hidden layer with 2 neurons using the ReLU activation.
• The output layer with 1 neuron using the sigmoid activation.

Step 1: Inputs and Weights
Input vector:
x = [1, 0.5]
Weights for the first layer:

W1 = [0.2 0.3; 0.4 0.6]

This means the first hidden neuron has weights [0.2, 0.4] and the second hidden
neuron has weights [0.3, 0.6].
Weights for the output layer:

W2 = [0.5, 0.7]

Step 2: Calculate the Input to the Hidden Layer

The pre-activation input to the hidden layer is computed from the input vector x and the weight matrix W1. Since each column of W1 holds the incoming weights of one hidden neuron, this is z1 = W1⊤ · x, i.e., z1 = [0.2 0.4; 0.3 0.6] · [1, 0.5]:

For the first hidden neuron:

z1,1 = (0.2 · 1) + (0.4 · 0.5) = 0.2 + 0.2 = 0.4

For the second hidden neuron:

z1,2 = (0.3 · 1) + (0.6 · 0.5) = 0.3 + 0.3 = 0.6

So, the pre-activation values for the hidden layer are:

z1 = [0.4, 0.6]

Step 3: Apply the ReLU Activation


ReLU activation is defined as ReLU(z) = max(0, z). Applying ReLU to each
neuron in the hidden layer:

a1 = ReLU([0.4, 0.6]) = [max(0, 0.4), max(0, 0.6)] = [0.4, 0.6]


Step 4: Calculate the Input to the Output Layer
The input to the output layer is the dot product of the hidden layer activa-
tions a1 with the output layer weights W2 :

z2 = W2 · a1 = [0.5, 0.7] · [0.4, 0.6]


Breaking this down:

z2 = (0.5 · 0.4) + (0.7 · 0.6) = 0.2 + 0.42 = 0.62

Step 5: Apply the Sigmoid Activation

The sigmoid function is defined as:

σ(z) = 1 / (1 + e^(−z))
Applying this to the output layer pre-activation value z2 = 0.62:
a2 = σ(0.62) = 1 / (1 + e^(−0.62)) ≈ 0.650
So, the final output of the neural network is approximately 0.650.
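
A minimal sketch (added, not part of the original answer) that reproduces this forward pass, treating each column of W1 as one hidden neuron's incoming weights as in the calculation above:

import math

x = [1.0, 0.5]
W1_cols = [[0.2, 0.4], [0.3, 0.6]]   # incoming weights of hidden neurons 1 and 2
W2 = [0.5, 0.7]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

z1 = [dot(col, x) for col in W1_cols]        # [0.4, 0.6]
a1 = [max(0.0, z) for z in z1]               # ReLU -> [0.4, 0.6]
z2 = dot(W2, a1)                             # 0.62
a2 = 1.0 / (1.0 + math.exp(-z2))             # sigmoid -> ≈ 0.650
print(z1, a1, z2, round(a2, 3))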

(b) Derive the Gradient Update for the Output Layer Weights
We are given that the true label is y = 1, and the predicted output from the
network is a2 . The loss function used is the binary cross-entropy loss:

L = −[y log(a2 ) + (1 − y) log(1 − a2 )]


Substituting y = 1 and a2 = 0.650, the loss simplifies to:

L = − log(0.650)

Now, we calculate the gradient of the loss with respect to the weights in the
output layer W2 .
Step 1: Gradient of the Loss with Respect to the Output a2
For the binary cross-entropy loss with y = 1, the gradient of the loss with respect to the output a2 is:

∂L/∂a2 = −y/a2 + (1 − y)/(1 − a2) = −1/0.650 ≈ −1.538

Step 2: Gradient of the Output with Respect to z2
The gradient of the sigmoid output a2 with respect to the pre-activation z2 is:

∂a2/∂z2 = a2 (1 − a2) = 0.650 · (1 − 0.650) = 0.650 · 0.350 = 0.2275

Step 3: Gradient of the Loss with Respect to z2
Using the chain rule:

∂L/∂z2 = ∂L/∂a2 · ∂a2/∂z2 = −1.538 · 0.2275 ≈ −0.350

(Equivalently, for sigmoid with cross-entropy the two factors combine into the shortcut ∂L/∂z2 = a2 − y = 0.650 − 1 = −0.350.)

Step 4: Gradient of the Loss with Respect to the Weights W2
The gradient of the loss with respect to the weights W2 is:

∂L/∂W2 = ∂L/∂z2 · a1

Where a1 = [0.4, 0.6]. Therefore:

∂L/∂W2,1 = −0.350 · 0.4 = −0.14

∂L/∂W2,2 = −0.350 · 0.6 = −0.21
Gradient Update Rule
Using the learning rate η, the weights W2 can be updated as:
W2_new = W2_old − η · ∂L/∂W2
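
To confirm the corrected gradient values (≈ [−0.14, −0.21]), here is an added finite-difference check; it is an illustration, not part of the original answer.

import math

x, y = [1.0, 0.5], 1.0
W1_cols = [[0.2, 0.4], [0.3, 0.6]]

def forward(W2):
    a1 = [max(0.0, sum(w * xi for w, xi in zip(col, x))) for col in W1_cols]
    z2 = sum(w * a for w, a in zip(W2, a1))
    a2 = 1.0 / (1.0 + math.exp(-z2))
    return a1, a2

def loss(W2):
    _, a2 = forward(W2)
    return -(y * math.log(a2) + (1 - y) * math.log(1 - a2))

W2 = [0.5, 0.7]
a1, a2 = forward(W2)
analytic = [(a2 - y) * a for a in a1]        # ≈ [-0.14, -0.21]

eps = 1e-6
numeric = []
for i in range(2):
    Wp, Wm = list(W2), list(W2)
    Wp[i] += eps
    Wm[i] -= eps
    numeric.append((loss(Wp) - loss(Wm)) / (2 * eps))

print(analytic, numeric)   # the two estimates should agree closely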

(c) Impact of Improper Weight Initialization


Improper weight initialization can have several negative effects on the training
process of a neural network:

• Vanishing and Exploding Gradients: If the weights are initialized with values that are too small or too large, it can lead to the vanishing
gradient or exploding gradient problem. In the case of vanishing gradients,
the gradient values become too small during backpropagation, causing the
network to stop learning. In the case of exploding gradients, the gradient
values become too large, leading to unstable updates and diverging losses.
• Slow Convergence: If the weights are initialized improperly (e.g., all
weights set to the same value), the network may converge very slowly
or not at all. Poor initialization can cause neurons to learn redundant
representations and make it difficult for the network to break symmetry
and learn meaningful patterns.

Importance of Xavier or He Initialization

• Xavier Initialization: This technique is designed for networks with sigmoid or tanh activations. It initializes weights by drawing values from a distribution with a variance of 1/n, where n is the number of neurons in the previous layer. This keeps the signal in a reasonable range as it propagates forward and backward, preventing the gradients from vanishing or exploding.
• He Initialization: This technique is suited for networks with ReLU activations. It initializes weights by drawing values from a distribution with a variance of 2/n, ensuring that the gradients neither vanish nor explode when using ReLU.

By initializing weights properly, these techniques help networks converge faster and avoid common problems like vanishing or exploding gradients.

Question 5: Transformers and Attention Mecha-
nism
Consider a simple Transformer model with two tokens: ”apple” and ”fruit.”
The word embeddings for each token are as follows:

eapple = [0.5, 0.8]


efruit = [0.3, 0.6]

(a) (5 points) Define the self-attention mechanism in the Transformer model. Compute the attention scores between ”apple” and ”fruit” using the dot product of their embeddings. For simplicity, assume the query, key, and value matrices are identity matrices, so you can use the embeddings directly for the calculation.

(b) (5 points) Explain how positional encodings are used in the Transformer
model and why they are necessary. What would happen if we removed
positional encodings from the model?
(c) (3 points) Describe the key difference between self-attention and recur-
rent neural networks (RNNs) when handling long sequences. Why is self-
attention preferred in tasks like machine translation?

Answer
(a) Self-Attention Mechanism and Attention Score Calcu-
lation
The self-attention mechanism in a Transformer allows each token in the input
sequence to attend to all other tokens, including itself, in order to capture
dependencies between them. The key components of self-attention are:

• Query (Q): Represents the word for which attention is being calculated.
• Key (K): Represents the words being compared against.

• Value (V): Contains the information being aggregated after computing the attention scores.

The self-attention formula is:

Attention(Q, K, V ) = softmax(QK⊤ / √dk) · V
Where:
• Q, K, and V are matrices for the queries, keys, and values, respectively.

• dk is the dimension of the query/key vectors, used for scaling.
For this specific question, we are given that the query, key, and value matrices
are identity matrices, so we can use the embeddings directly.

eapple = [0.5, 0.8], efruit = [0.3, 0.6]


Step 1: Compute the Dot Product
To compute the attention scores, we simply take the dot product between
the embeddings of ”apple” and ”fruit.”

• Attention score between ”apple” and ”apple”:

eapple · eapple = (0.5 · 0.5) + (0.8 · 0.8) = 0.25 + 0.64 = 0.89

• Attention score between ”apple” and ”fruit”:

eapple · efruit = (0.5 · 0.3) + (0.8 · 0.6) = 0.15 + 0.48 = 0.63

• Attention score between ”fruit” and ”fruit”:

efruit · efruit = (0.3 · 0.3) + (0.6 · 0.6) = 0.09 + 0.36 = 0.45

So, the attention scores are:

apple fruit
apple 0.89 0.63
fruit 0.63 0.45

(b) Positional Encodings in the Transformer Model


Transformers do not inherently process input sequences in order because the
self-attention mechanism treats tokens independently of their position in the
sequence. Positional encodings are used to inject information about the position
of each token in the sequence so that the model can capture the order of words.
These encodings are added to the input embeddings.
How Positional Encodings Work
Positional encodings are typically added to the word embeddings. They are
often designed using sine and cosine functions of different frequencies to encode
different positions uniquely. The formula for positional encoding is:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i + 1) = cos(pos / 10000^(2i/d))
Where:
• pos is the position of the word in the sequence.

• i is the dimension index.
• d is the dimensionality of the model.
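
A short sketch (added for illustration) of the sinusoidal encoding defined above; the choice d = 4 is only an example.

import math

def positional_encoding(pos, d):
    # Sinusoidal positional encoding for a single position, as defined above.
    pe = []
    for i in range(d // 2):
        angle = pos / (10000 ** (2 * i / d))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

# Encodings for the first three positions of a hypothetical d = 4 model.
for pos in range(3):
    print(pos, [round(v, 3) for v in positional_encoding(pos, 4)])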
Why Positional Encodings Are Necessary
Without positional encodings, the Transformer would treat ”apple” and
”fruit” the same regardless of their order in the sequence, i.e., ”apple fruit”
would be identical to ”fruit apple.” Positional encodings ensure that the model
can differentiate between sequences with different word orders.
What Would Happen Without Positional Encodings?
• The Transformer would lose the ability to capture the order of the input
tokens.
• This would be problematic for tasks where word order matters, such as
machine translation, where changing the word order can alter the meaning
of a sentence.

(c) Key Difference Between Self-Attention and RNNs


Handling Long Sequences:

• Self-Attention (Transformers): Self-attention enables each token to directly attend to all other tokens in the sequence, regardless of their
distance. This allows Transformers to capture long-range dependencies
more effectively, as the attention mechanism does not rely on processing
tokens sequentially. Every token can interact with every other token in
parallel.
• Recurrent Neural Networks (RNNs): RNNs process tokens in a se-
quential manner, passing information from one token to the next. As the
sequence grows longer, RNNs struggle to retain information from earlier
tokens due to the vanishing/exploding gradient problem. This makes it
harder for RNNs to model long-range dependencies.

Why Self-Attention Is Preferred for Tasks Like Machine Translation:
• Efficiency: Self-attention allows for parallel processing of the input se-
quence, making Transformers much more efficient than RNNs, which pro-
cess tokens sequentially.
• Capturing Long-Range Dependencies: Self-attention can capture re-
lationships between distant tokens in a sequence without being affected
by the length of the sequence, making it ideal for tasks like machine trans-
lation, where long-range dependencies between words are crucial.
In summary, self-attention is preferred over RNNs for tasks involving long
sequences because it can process sequences in parallel and handle long-range
dependencies more effectively.

Question: Constituency Parsing and PCFGs
You are working with the following Probabilistic Context-Free Grammar (PCFG):

S → NP VP (0.9)
S → VP (0.1)
NP → Det N (0.6)
NP → N (0.4)
VP → V NP (0.7)
VP → V (0.3)
Det → the (0.8)
Det → a (0.2)
N → cat (0.5)
N → dog (0.5)
V → sees (0.7)
V → likes (0.3)
(a) (4 points) Parse the sentence “the cat sees a dog” using this grammar.
Draw the tree structure that represents the parse, and calculate the prob-
ability of this parse using the given rule probabilities. Box your final
probability answer.
(b) (4 points) Show the steps of the CKY algorithm as it applies to parsing
the sentence “a dog sees the cat” using this PCFG. Fill in the CKY table
with the nonterminals that can be generated over each span and identify
the best parse.
(c) (3 points) Consider the sentence “dog sees”. How would binarization of
the grammar (i.e., converting it to Chomsky Normal Form) change the
parse process? Write down the binarized grammar and explain the effect
on parsing.
(d) (3 points) The PCFG model often struggles with ambiguous sentences,
where multiple parses are possible. Give an example of such a sentence
that would lead to two or more parses under this grammar. Explain why
the sentence is ambiguous and how the PCFG might choose between the
competing parses.

Answer a: Parsing the Sentence ”the cat sees a dog” Using the Grammar
We are given the following PCFG and asked to parse the sentence ”the cat sees
a dog.” The parse tree and probability of the parse are derived step by step.

Step 1: Parse Tree Structure
The parse tree for the sentence ”the cat sees a dog” is as follows:

S → NP VP (0.9)
NP → Det N (0.6)
VP → V NP (0.7)
Det → the (0.8)
N → cat (0.5)
V → sees (0.7)
NP → Det N (0.6)
Det → a (0.2)
N → dog (0.5)
The resulting parse tree, written in bracketed form, is:

(S (NP (Det the) (N cat)) (VP (V sees) (NP (Det a) (N dog))))

Step 2: Probability of the Parse


The probability of the parse is calculated by multiplying the probabilities of all
the rules used in the parse:

P (parse) = P (S → NP VP) × P (NP → Det N) × P (VP → V NP)

×P (Det → the) × P (N → cat) × P (V → sees) × P (NP → Det N)


×P (Det → a) × P (N → dog)
Substituting the values from the grammar:

P (parse) = 0.9 × 0.6 × 0.7 × 0.8 × 0.5 × 0.7 × 0.6 × 0.2 × 0.5 ≈ 0.00635
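
A one-line check of the product (added for verification):

from math import prod

print(prod([0.9, 0.6, 0.7, 0.8, 0.5, 0.7, 0.6, 0.2, 0.5]))  # ≈ 0.00635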

Question: CYK Parsing Algorithm
You are tasked with parsing the sentence “dogs eat bones” using a simplified
grammar in Chomsky Normal Form (CNF):

• S → NP VP
• VP → V NP
• NP → Det N

• Det → the
• N → dogs | bones
• V → eat

a. (4 points) Fill out the CKY parsing table for this sentence. Include all
possible non-terminal rules that can be applied over each span of the sentence.
b. (5 points) Once the table is filled, trace the backpointers to construct the
most likely parse tree for “dogs eat bones”. Draw the parse tree or provide a
detailed bracketed structure for the tree.
c. (3 points) CKY parsing requires the grammar to be in Chomsky Normal
Form. Explain why CNF is necessary for CKY and how non-CNF rules (e.g.,
S → NP VP VP) can be transformed into CNF.
d. (3 points) CKY parsing can be inefficient for long sentences. Suggest one
optimization technique to improve the speed of CKY parsing for large corpora
and explain how it works.

Answer
a. Fill out the CKY Parsing Table for the Sentence ”dogs
eat bones”
We are given the following grammar in Chomsky Normal Form (CNF):

• S → NP VP
• VP → V NP
• NP → Det N

• Det → the
• N → dogs | bones
• V → eat

The sentence to parse is ”dogs eat bones”. Each word corresponds to a
terminal in the grammar, and we will fill in the CKY table based on the grammar
rules.
Step 1: Initialize the Table. The CKY table is a triangular chart in which cell [i, j] holds the non-terminals that can generate the span from word i through word j. The sentence has three words. Note that the given grammar has no unary rule NP → N, so strictly ”dogs” and ”bones” are only N; the parse below assumes bare nouns can also serve as NPs (i.e., NP → dogs | bones), which the original table uses implicitly.

Diagonal cells (single-word spans):
[1, 1] ”dogs”: N, NP
[2, 2] ”eat”: V
[3, 3] ”bones”: N, NP

Step 2: Fill in the Table for Larger Spans. Next, we fill in the table by combining smaller spans into larger spans.
For span (1, 2): ”dogs eat”. No rule combines NP (or N) with V, so the cell stays empty.
For span (2, 3): ”eat bones”. We combine ”eat” (V) with ”bones” (NP), and the rule VP → V NP yields a VP.
For span (1, 3): ”dogs eat bones”. Combining ”dogs” (NP) with ”eat bones” (VP) via S → NP VP yields S, so the full sentence is recognized as an S.

b. Trace the Backpointers to Construct the Parse Tree


To construct the parse tree, we trace back the non-terminal productions from
the CKY table:

• S → NP VP: the entire span ”dogs eat bones” is an S.
• NP → N (the unary rule assumed in part a): ”dogs” is an NP.
• VP → V NP: ”eat bones” is a VP.
• NP → N (the unary rule assumed in part a): ”bones” is an NP.
• V → eat: ”eat” is a V.

The parse tree, in bracketed form, is:

(S (NP (N dogs)) (VP (V eat) (NP (N bones))))
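
For reference, a small CKY recognizer sketch (added, not part of the original answer). It includes the lexical rules NP → dogs | bones assumed in part (a) so that the sentence is parseable; everything else follows the given CNF grammar.

from itertools import product

# Toy CKY recognizer over the grammar above.
lexical = {"the": {"Det"}, "dogs": {"N", "NP"}, "bones": {"N", "NP"}, "eat": {"V"}}
binary = {("NP", "VP"): "S", ("V", "NP"): "VP", ("Det", "N"): "NP"}

def cky(words):
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexical[w])
    for span in range(2, n + 1):                 # span length
        for i in range(n - span + 1):            # span start
            j = i + span
            for k in range(i + 1, j):            # split point
                for B, C in product(chart[i][k], chart[k][j]):
                    if (B, C) in binary:
                        chart[i][j].add(binary[(B, C)])
    return chart

chart = cky(["dogs", "eat", "bones"])
print(chart[0][3])   # {'S'}: the whole sentence is recognized as an S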

c. Why Chomsky Normal Form (CNF) is Necessary for
CKY Parsing
CKY parsing requires the grammar to be in Chomsky Normal Form (CNF),
which means that each production rule must have the form A → BC (two
non-terminals) or A → a (one terminal).
Why CNF is Necessary: CKY relies on the assumption that every pro-
duction rule either combines exactly two non-terminals (binary branching) or
generates a terminal. This allows the algorithm to efficiently fill in the parsing
table by recursively combining pairs of non-terminals.
Converting Non-CNF Rules: For non-CNF rules, such as S → NP VP VP, you can introduce a new non-terminal symbol X to break the rule into binary form:

S → NP X and X → VP VP
This transformation ensures that the rule conforms to CNF and can be used
in the CKY algorithm.

d. Optimization Technique for CKY Parsing


One optimization technique to improve the speed of CKY parsing for large
corpora is Chart Pruning.
How It Works: Chart pruning involves pruning away low-probability parses
early in the parsing process. During the CKY algorithm, if certain non-terminal
combinations or subtrees are highly improbable (based on probabilities in a
PCFG or another model), they are removed from the chart, preventing further
exploration of those parses.
Benefits: By pruning unlikely parses, chart pruning reduces the number
of entries in the CKY table and the number of computations needed to fill it,
resulting in faster parsing for long sentences or large corpora.

Question: Multi-Head Attention in Transformers
You are given the following input tokens: ["time", "flies", "quickly"].
The word embeddings for each token are as follows:

etime = [0.4, 0.6], eflies = [0.7, 0.3], equickly = [0.5, 0.8]


You are using a Transformer model with two attention heads. Each attention
head has its own set of query, key, and value matrices:
For head 1:

Wq(1) = [1 0; 0 1],  Wk(1) = [0 1; 1 0],  Wv(1) = [0.5 0.5; 0.5 0.5]

For head 2:

Wq(2) = [0.5 0.5; 0.5 0.5],  Wk(2) = [0.6 0.4; 0.4 0.6],  Wv(2) = [0.3 0.7; 0.7 0.3]

(a) (5 points) For each head, compute the attention scores between the tokens
using the dot product of the transformed query and key vectors. Do not
apply softmax for this calculation.
(b) (5 points) Once you have computed the attention scores, apply the atten-
tion weights to the value vectors for each head. Compute the output of
the multi-head attention by concatenating the results from the two heads.

(c) (5 points) Explain how multi-head attention allows the model to focus on
different aspects of the input sequence and why this improves performance
compared to using a single attention head. Provide an example of how
this would help in a task like machine translation.

Answer
a. Compute the Attention Scores for Each Head
We are given three input tokens: "time", "flies", and "quickly" with the
following embeddings:

etime = [0.4, 0.6], eflies = [0.7, 0.3], equickly = [0.5, 0.8]


Each attention head has its own query, key, and value matrices. To compute
the attention scores, we need to:

1. Transform each input using the query and key matrices.


2. Compute the dot product between the transformed query and key vectors
for each pair of tokens.

For Head 1:
Query and key matrices:

W_q^{(1)} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad W_k^{(1)} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}

Step 1: Transform the Query and Key Vectors
Since W_q^{(1)} is the identity and W_k^{(1)} swaps the two coordinates:

For "time": W_q^{(1)} e_time = [0.4, 0.6], W_k^{(1)} e_time = [0.6, 0.4]

For "flies": W_q^{(1)} e_flies = [0.7, 0.3], W_k^{(1)} e_flies = [0.3, 0.7]

For "quickly": W_q^{(1)} e_quickly = [0.5, 0.8], W_k^{(1)} e_quickly = [0.8, 0.5]

Step 2: Compute the Dot Product (Attention Scores)


The attention score between two tokens is computed as the dot product of
their query and key vectors.

Attention score between ”time” and ”time” = 0.4·0.6+0.6·0.4 = 0.24+0.24 = 0.48

Attention score between ”time” and ”flies” = 0.4·0.3+0.6·0.7 = 0.12+0.42 = 0.54


Attention score between ”time” and ”quickly” = 0.4·0.8+0.6·0.5 = 0.32+0.30 = 0.62
Carrying out the same computation for the remaining query tokens gives the
full Head 1 score matrix (rows are queries, columns are keys, ordered time,
flies, quickly):

\begin{pmatrix} 0.48 & 0.54 & 0.62 \\ 0.54 & 0.42 & 0.71 \\ 0.62 & 0.71 & 0.80 \end{pmatrix}

For Head 2:
Query and key matrices:

W_q^{(2)} = \begin{pmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{pmatrix}, \quad W_k^{(2)} = \begin{pmatrix} 0.6 & 0.4 \\ 0.4 & 0.6 \end{pmatrix}

Transforming the embeddings in the same way gives queries q_time = [0.5, 0.5],
q_flies = [0.5, 0.5], q_quickly = [0.65, 0.65] and keys k_time = [0.48, 0.52],
k_flies = [0.54, 0.46], k_quickly = [0.62, 0.68], so the Head 2 score matrix
(same ordering) is:

\begin{pmatrix} 0.50 & 0.50 & 0.65 \\ 0.50 & 0.50 & 0.65 \\ 0.65 & 0.65 & 0.845 \end{pmatrix}

b. Apply the Attention Weights to the Value Vectors
Once we have the attention scores, we normalize each query's scores with a
softmax to obtain attention weights, and then take the weighted sum of the
value vectors for each token.
For Head 1:

W_v^{(1)} = \begin{pmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{pmatrix}

For each token, compute its value vector W_v^{(1)} e and then the weighted sum
of the value vectors under that token's attention weights.
For Head 2:

W_v^{(2)} = \begin{pmatrix} 0.3 & 0.7 \\ 0.7 & 0.3 \end{pmatrix}

Similarly, compute the weighted sums for Head 2.
Final Multi-Head Attention Output: To get the final output, we concatenate
the per-token results from the two heads, giving a 4-dimensional vector per
token; the numeric sketch below carries out the full computation.
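The following NumPy sketch carries out parts (a) and (b) numerically. It assumes the column-vector convention q_i = W_q e_i used above and softmax-normalizes each row of raw scores before weighting the values (a step the question leaves implicit); it is offered as an illustration, not as the official solution.

```python
import numpy as np

# Token embeddings as rows: time, flies, quickly
E = np.array([[0.4, 0.6], [0.7, 0.3], [0.5, 0.8]])

heads = {
    1: (np.array([[1.0, 0.0], [0.0, 1.0]]),      # W_q^(1)
        np.array([[0.0, 1.0], [1.0, 0.0]]),      # W_k^(1)
        np.array([[0.5, 0.5], [0.5, 0.5]])),     # W_v^(1)
    2: (np.array([[0.5, 0.5], [0.5, 0.5]]),      # W_q^(2)
        np.array([[0.6, 0.4], [0.4, 0.6]]),      # W_k^(2)
        np.array([[0.3, 0.7], [0.7, 0.3]])),     # W_v^(2)
}

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

outputs = []
for h, (Wq, Wk, Wv) in heads.items():
    Q, K, V = E @ Wq.T, E @ Wk.T, E @ Wv.T    # rows are q_i, k_i, v_i
    scores = Q @ K.T                          # part (a): raw dot-product scores
    weights = softmax(scores)                 # normalize each query's scores
    outputs.append(weights @ V)               # part (b): weighted sums of values
    print(f"Head {h} scores:\n{scores.round(3)}")

multi_head = np.concatenate(outputs, axis=-1) # concatenate the two heads
print("Multi-head output (one 4-d row per token):\n", multi_head.round(3))
```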

c. How Multi-Head Attention Helps Focus on Different Aspects
Multi-head attention allows the Transformer to attend to different parts of the
input sequence simultaneously using different sets of query, key, and value ma-
trices. Each head can learn different relationships between tokens by using
distinct transformations, enabling the model to capture more nuanced aspects
of the input sequence.
Example in Machine Translation:
In a machine translation task, one attention head might focus on word align-
ment (mapping words from one language to another), while another head might
focus on syntactic dependencies (e.g., subject-verb agreement). By using mul-
tiple heads, the model can:
• Capture various relationships between tokens in parallel.
• Provide richer contextual information for each token, improving transla-
tion accuracy.

This improves performance compared to using a single head, which might only
capture one type of relationship between tokens, limiting the model's ability
to generalize across different linguistic patterns.

Question: Scaled Dot-Product Attention
The scaled dot-product attention mechanism is used in the Transformer model
to compute attention scores efficiently for long sequences. You are given the
following input sequence: ["apple", "is", "sweet"] with the following word
embeddings:

eapple = [1.0, 0.8], eis = [0.6, 0.9], esweet = [0.7, 1.0]


The query, key, and value matrices are initialized as follows:

W_q = \begin{pmatrix} 0.9 & 0.2 \\ 0.1 & 0.8 \end{pmatrix}, \quad W_k = \begin{pmatrix} 0.7 & 0.4 \\ 0.3 & 0.6 \end{pmatrix}, \quad W_v = \begin{pmatrix} 0.5 & 0.5 \\ 0.6 & 0.4 \end{pmatrix}

(a) (4 points) Compute the dot-product attention scores between the tokens
using the transformed query and key vectors. Then, apply the scaling
factor \frac{1}{\sqrt{d}}, where d = 2 (the dimension of the embeddings).

(b) (4 points) Apply the softmax function to the scaled attention scores to
compute the final attention weights. Show the detailed calculation for the
softmax step.
(c) (4 points) Use the attention weights to compute the final weighted sum of
the value vectors for each token. Report the resulting vectors.
(d) (3 points) Explain why scaling the dot-product in attention is important,
especially for longer input sequences. What would happen if we did not
apply this scaling factor?

Answer
a. Compute the Dot-Product Attention Scores and Apply
Scaling
We are given the input sequence ["apple", "is", "sweet"] with the following
embeddings:

eapple = [1.0, 0.8], eis = [0.6, 0.9], esweet = [0.7, 1.0]


The query, key, and value matrices (as given in the question) are:

W_q = \begin{pmatrix} 0.9 & 0.2 \\ 0.1 & 0.8 \end{pmatrix}, \quad W_k = \begin{pmatrix} 0.7 & 0.4 \\ 0.3 & 0.6 \end{pmatrix}, \quad W_v = \begin{pmatrix} 0.5 & 0.5 \\ 0.6 & 0.4 \end{pmatrix}
Step 1: Transform the Query and Key Vectors
Using the column-vector convention q = W_q e and k = W_k e:

For "apple":

W_q e_apple = [0.9·1.0 + 0.2·0.8, 0.1·1.0 + 0.8·0.8] = [1.06, 0.74], \quad W_k e_apple = [0.7·1.0 + 0.4·0.8, 0.3·1.0 + 0.6·0.8] = [1.02, 0.78]

For "is":

W_q e_is = [0.72, 0.78], \quad W_k e_is = [0.78, 0.72]

For "sweet":

W_q e_sweet = [0.83, 0.87], \quad W_k e_sweet = [0.89, 0.81]

Step 2: Compute the Dot Products

The attention score between two tokens is the dot product of the query vector
of one with the key vector of the other. Taking "apple" as the query:

Score("apple", "apple") = 1.06·1.02 + 0.74·0.78 = 1.0812 + 0.5772 = 1.658

Score("apple", "is") = 1.06·0.78 + 0.74·0.72 = 0.8268 + 0.5328 = 1.360

Score("apple", "sweet") = 1.06·0.89 + 0.74·0.81 = 0.9434 + 0.5994 = 1.543

Step 3: Apply the Scaling Factor

The scaling factor is \frac{1}{\sqrt{d}} with d = 2, i.e. \frac{1}{\sqrt{2}} ≈ 0.707.

Scaled score("apple", "apple") = 1.658 × 0.707 ≈ 1.173

Scaled score("apple", "is") = 1.360 × 0.707 ≈ 0.961

Scaled score("apple", "sweet") = 1.543 × 0.707 ≈ 1.091

The scaled scores for the other query tokens ("is" and "sweet") are computed in
exactly the same way.

b. Apply the Softmax Function to the Scaled Scores

To compute the final attention weights, we apply the softmax function to the
scaled scores. The softmax function is defined as:

softmax(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}

Step 1: Compute the Exponentials
For the query token "apple", the exponentials of the scaled scores are:

e^{1.173} ≈ 3.232, \quad e^{0.961} ≈ 2.614, \quad e^{1.091} ≈ 2.977

Step 2: Compute the Softmax Values
The sum of the exponentials is:

3.232 + 2.614 + 2.977 = 8.823

Now, compute the softmax values:

Softmax("apple", "apple") = 3.232 / 8.823 ≈ 0.366
Softmax("apple", "is") = 2.614 / 8.823 ≈ 0.296
Softmax("apple", "sweet") = 2.977 / 8.823 ≈ 0.337

c. Compute the Weighted Sum of the Value Vectors

Once we have the attention weights, we use them to compute the weighted sum
of the value vectors for each token.
The value vector for each token is v = W_v e:

v_apple = [0.5·1.0 + 0.5·0.8, 0.6·1.0 + 0.4·0.8] = [0.90, 0.92], \quad v_is = [0.75, 0.72], \quad v_sweet = [0.85, 0.82]

For the query token "apple", the output is the attention-weighted sum of the
value vectors:

output_apple = 0.366·v_apple + 0.296·v_is + 0.337·v_sweet ≈ [0.84, 0.83]

The output vectors for "is" and "sweet" are obtained in the same way from their
own attention weights; the code sketch below checks these numbers.
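As a check on the arithmetic above, here is a short NumPy sketch (my own verification aid, under the same column-vector convention) that recomputes parts (a) through (c):

```python
import numpy as np

E = np.array([[1.0, 0.8], [0.6, 0.9], [0.7, 1.0]])   # apple, is, sweet
Wq = np.array([[0.9, 0.2], [0.1, 0.8]])
Wk = np.array([[0.7, 0.4], [0.3, 0.6]])
Wv = np.array([[0.5, 0.5], [0.6, 0.4]])              # as given in the question

Q, K, V = E @ Wq.T, E @ Wk.T, E @ Wv.T               # rows are q_i, k_i, v_i
scores = Q @ K.T / np.sqrt(E.shape[1])               # scaled dot products
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
outputs = weights @ V                                # attention-weighted values

print("Scaled scores:\n", scores.round(3))           # row 0: 1.173, 0.961, 1.091
print("Attention weights:\n", weights.round(3))      # row 0: 0.366, 0.296, 0.337
print("Outputs:\n", outputs.round(2))                # row 0 ≈ [0.84, 0.83]
```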

d. Importance of Scaling the Dot-Product


Scaling the dot product by \frac{1}{\sqrt{d}} is important because as the dimension d
increases, the magnitude of the dot products grows as well. Without scaling,
the dot-product values can become very large, causing the softmax to saturate
and producing very small gradients during backpropagation.
What Would Happen Without Scaling?
Without scaling, the large dot-product values would cause the softmax to
assign probabilities that are either very close to 0 or 1, making it difficult for
the model to differentiate between tokens effectively. This would lead to poor
learning, especially for longer sequences.
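A tiny, generic illustration of this effect (random vectors, not the numbers from this question): as the dimension grows, the softmax over unscaled scores collapses toward a one-hot distribution, while the scaled version stays comparatively smooth.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
for d in (2, 64, 512):
    q = rng.normal(size=d)                # one query vector
    K = rng.normal(size=(3, d))           # keys for three tokens
    raw = K @ q                           # unscaled dot products grow with d
    scaled = raw / np.sqrt(d)             # scaled scores keep similar magnitude
    print(d, softmax(raw).round(3), softmax(scaled).round(3))
```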

Question: Self-Attention with Masking in Decoders
You are working with a Transformer decoder that generates a sequence of words
one at a time. The model is currently generating the word "translation" after
having already generated "this is a". The word embeddings for the previous
tokens are:

ethis = [0.2, 0.5], eis = [0.6, 0.4], ea = [0.7, 0.3], etranslation = [0.9, 0.8]
(a) (4 points) Define the self-attention mechanism with masking as used in the
decoder. Explain why masking is necessary in the decoder’s self-attention
mechanism.
(b) (4 points) Compute the self-attention scores between the tokens with a
mask that prevents the model from attending to future tokens. Use the
dot-product between the embeddings of the current token ("translation")
and the previous tokens.
(c) (4 points) Apply the mask and compute the final attention scores for each
token. Show how the masking affects the attention distribution.
(d) (3 points) Explain how masking in the decoder self-attention helps prevent
information leakage during autoregressive generation. Why is this crucial
for tasks like machine translation?

Answer
a. Define the Self-Attention Mechanism with Masking and
Explain Why Masking Is Necessary
In a Transformer decoder, the self-attention mechanism works by allowing each
word (or token) in the sequence to attend to other words in the sequence. The
goal is to compute a weighted sum of the embeddings of other tokens, where
the weights are the attention scores (based on dot-products). The model can
use the context from previously generated words to inform the generation of the
next word.
Masked Self-Attention in Decoders: In the decoder, self-attention is
applied with a mask that prevents the model from attending to ”future” tokens
that haven’t been generated yet. For example, when generating a token at time
t, the model should only attend to tokens at positions 1 to t, and not tokens at
t + 1, t + 2, etc.
Why Masking Is Necessary:
• Preventing Future Information Access: In autoregressive models like the
Transformer decoder, each word is generated one at a time. Masking en-
sures that the model does not ”cheat” by looking at future words that it

has not yet generated. For example, while generating the word "translation",
the model should only attend to "this", "is", and "a" but not future
tokens.
• Ensuring Correct Sequence Generation: Masking enforces the causal structure
of language generation by ensuring that the model only has access to previously
generated tokens, making the sequence generation process realistic and valid
for tasks like machine translation or text summarization. A minimal sketch of
how such a mask is built follows this list.
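As referenced above, here is a minimal sketch of how such a mask is typically constructed: an upper-triangular "look-ahead" mask of −∞ values added to the score matrix before the softmax (an illustration, not code from this course).

```python
import numpy as np

T = 4  # sequence length, e.g. "this", "is", "a", "translation"
# -inf above the diagonal: position i may attend to positions j <= i only.
causal_mask = np.where(np.triu(np.ones((T, T)), k=1) == 1, -np.inf, 0.0)
print(causal_mask)
# Adding this mask to the T x T score matrix before the softmax zeroes out the
# attention paid to later, not-yet-generated positions.
```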

b. Compute the Self-Attention Scores with a Mask


We are generating the word "translation" after having already generated
"this is a". The embeddings for the tokens are:

ethis = [0.2, 0.5], eis = [0.6, 0.4], ea = [0.7, 0.3], etranslation = [0.9, 0.8]

Step 1: Compute the Dot-Product Attention Scores


The attention score is calculated as the dot product between the query vector
(for "translation") and the key vectors (for "this", "is", "a").

Attention score between ”translation” and ”this” = (0.9·0.2)+(0.8·0.5) = 0.18+0.4 = 0.58

Attention score between ”translation” and ”is” = (0.9·0.6)+(0.8·0.4) = 0.54+0.32 = 0.86


Attention score between ”translation” and ”a” = (0.9·0.7)+(0.8·0.3) = 0.63+0.24 = 0.87
Attention score between ”translation” and itself = (0.9·0.9)+(0.8·0.8) = 0.81+0.64 = 1.45
At this point, the attention scores are:

Scores: "this" = 0.58, "is" = 0.86, "a" = 0.87, "translation" = 1.45

c. Apply the Mask and Compute the Final Attention Scores

In the decoder self-attention mechanism, the model must not attend to positions
that have not been generated yet. Following the setup in this question, where
the query attends only to the previously generated tokens, the model attends to
"this", "is", and "a", and the score for the position of "translation" itself
is masked out. (A standard causal mask would also let a token attend to its own
position; here we mask it to match the question.)
Applying the Mask: The attention score for "translation" is set to −∞:

Scores: "this" = 0.58, "is" = 0.86, "a" = 0.87, "translation" = −∞
Step 1: Apply Softmax to Compute the Final Attention Weights
The softmax function is defined as:

softmax(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}

We compute the exponentials (the masked −∞ entry contributes e^{−∞} = 0):

e^{0.58} ≈ 1.786, \quad e^{0.86} ≈ 2.363, \quad e^{0.87} ≈ 2.387

The sum of the exponentials is:

1.786 + 2.363 + 2.387 = 6.536

Now, compute the softmax values:

Softmax("this") = 1.786 / 6.536 ≈ 0.273
Softmax("is") = 2.363 / 6.536 ≈ 0.362
Softmax("a") = 2.387 / 6.536 ≈ 0.365

The attention weight for "translation" is 0 due to the mask.
Thus, the final attention weights are:

Weights: "this" = 0.273, "is" = 0.362, "a" = 0.365, "translation" = 0
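For illustration, a short NumPy sketch reproducing this calculation, using the embeddings directly as queries and keys (as the question does) and masking the current token's own position to match the setup above:

```python
import numpy as np

# Embeddings for "this", "is", "a", "translation"
E = np.array([[0.2, 0.5], [0.6, 0.4], [0.7, 0.3], [0.9, 0.8]])

scores = E @ E.T                       # raw dot-product scores
mask = np.zeros_like(scores)
mask[3, 3] = -np.inf                   # block "translation" from attending to
                                       # itself (a standard causal mask would
                                       # leave the diagonal open)

row = (scores + mask)[3]               # scores for the current token
weights = np.exp(row) / np.exp(row).sum()   # exp(-inf) = 0, so the mask drops out
print(weights.round(3))                # ≈ [0.273, 0.362, 0.365, 0.0]
```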

d. How Masking Prevents Information Leakage and Why It's Crucial
Preventing Information Leakage: Masking ensures that the model only
attends to previously generated tokens and not future tokens that it hasn’t yet
generated. This prevents information leakage, where the model could ”cheat” by
looking at future tokens before predicting the current token. Without masking,
the model would not follow the autoregressive generation assumption, leading
to unreliable predictions.
Why This Is Crucial for Machine Translation: In tasks like machine
translation, each word is generated one at a time in the target language. Mask-
ing ensures that the model generates each word based on the context of the
previous words only, preserving the correct flow of information and preventing
access to future tokens. This leads to coherent and accurate translations, as the
model adheres to the natural order of the target language sequence.

Question: BERT and Masked Language Modeling
You are using the BERT model for masked language modeling (MLM) on
the sentence: "The dog [MASK] the ball." The model provides the follow-
ing word embeddings for the unmasked tokens:

eThe = [0.5, 0.6], edog = [0.7, 0.4], ethe = [0.3, 0.7], eball = [0.8, 0.5]

The context embeddings from the self-attention layer are:

cThe = [0.9, 0.1], cdog = [0.6, 0.3], cthe = [0.2, 0.8], cball = [0.7, 0.6]

(a) (5 points) Explain the masked language modeling objective used by BERT.
How does the model use self-attention to predict the masked token?
(b) (5 points) Use the context embeddings to compute the probabilities for
filling in the masked token using the dot product between the masked
position and the other tokens. Provide the calculations for the attention
scores and the predicted word probabilities.
(c) (3 points) Discuss how BERT handles bidirectional context and why this
is important for language understanding tasks. Contrast this with models
that use only left-to-right or right-to-left context.
(d) (3 points) Suppose you have fine-tuned BERT on a downstream task like
sentiment analysis. Explain how the masked language model pretraining
helps improve performance on such tasks.

Answer
a. Masked Language Modeling (MLM) Objective in BERT
The masked language modeling (MLM) objective is a key part of how BERT is
pre-trained. During pretraining, some tokens in the input are randomly replaced
with a special [MASK] token, and the model’s goal is to predict the original tokens
based on the surrounding context. The model learns to predict these masked
tokens by using the embeddings of the unmasked tokens, and it leverages the
self-attention mechanism to gather information from both the left and right
context.
How Self-Attention Helps in MLM: Self-attention allows BERT to cap-
ture relationships between the masked token and all other tokens in the se-
quence. Each token can attend to every other token (in a bidirectional manner),
meaning the model can utilize both preceding and succeeding context to make
predictions.

For the sentence "The dog [MASK] the ball", the model predicts the masked
token by attending to the embeddings of the tokens "The", "dog", "the", and
"ball", using self-attention to combine these context embeddings effectively.
The final layer of BERT uses this attention-enhanced information to predict
the original masked word, making MLM a powerful pretraining strategy.
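As a toy illustration of this objective (not BERT's actual preprocessing, which masks roughly 15% of tokens and sometimes substitutes random or unchanged tokens instead of [MASK]), one can hide a single token and treat the original as the prediction target:

```python
import random

random.seed(0)
tokens = ["The", "dog", "chased", "the", "ball"]
i = random.randrange(len(tokens))              # pick a position to mask
target = tokens[i]                             # the label the model must recover
masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
print(masked, "-> predict:", target)
```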

b. Compute the Probabilities for Filling in the Masked Token
We will use the context embeddings from the self-attention layer to compute the
probabilities for the masked token. The context embeddings for the unmasked
tokens are given as:

cThe = [0.9, 0.1], cdog = [0.6, 0.3], cthe = [0.2, 0.8], cball = [0.7, 0.6]

Step 1: Compute the Dot Products (Attention Scores)

The dot product between the representation of the masked position and the
context embeddings of the other tokens gives the attention scores. Since the
exact masked-position vector is not given, we write it as m = [x, y] and
express the scores symbolically:

Score([MASK], "The") = 0.9x + 0.1y
Score([MASK], "dog") = 0.6x + 0.3y
Score([MASK], "the") = 0.2x + 0.8y
Score([MASK], "ball") = 0.7x + 0.6y
Step 2: Apply Softmax to Compute Final Probabilities
After computing the attention scores, we apply the softmax function to
normalize the scores and obtain the predicted probabilities for each token:
P(w_i | [MASK]) = \frac{e^{score_i}}{\sum_j e^{score_j}}

The resulting probabilities indicate the likelihood of each token filling the
masked position.
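A small sketch of this calculation; the masked-position vector m = [x, y] is a made-up placeholder (the question does not provide it), and in the real model the prediction comes from a vocabulary-sized output layer over the masked position's final hidden state rather than from these four dot products:

```python
import numpy as np

# Context embeddings for "The", "dog", "the", "ball" from the question.
context = np.array([[0.9, 0.1], [0.6, 0.3], [0.2, 0.8], [0.7, 0.6]])
tokens = ["The", "dog", "the", "ball"]

m = np.array([0.5, 0.5])                 # hypothetical masked-position vector [x, y]

scores = context @ m                     # 0.9x + 0.1y, 0.6x + 0.3y, ...
probs = np.exp(scores) / np.exp(scores).sum()
for tok, p in zip(tokens, probs):
    print(f"P({tok} | [MASK]) = {p:.3f}")
```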

c. Bidirectional Context in BERT


BERT uses bidirectional context, meaning it can attend to tokens on both the
left and right sides of the masked token during pretraining. This is a significant
improvement over models that only consider left-to-right or right-to-left context.
Importance of Bidirectional Context:

• More Complete Understanding: By attending to both preceding and fol-
lowing tokens, BERT can better capture the full meaning of a word in its
context. For example, the word ”bank” could mean a financial institution
or a riverbank depending on the surrounding words.

• Improved Performance: Bidirectional context allows BERT to outperform
unidirectional models in tasks requiring deep language understanding, such
as question answering and sentiment analysis.
Contrast with Unidirectional Models:

• Unidirectional Models: Models like GPT process the sequence from left
to right. When predicting a word, they only have access to the preceding
context. This limits the model’s ability to fully understand the word’s
meaning, as it cannot look at the following words.
• BERT’s Advantage: BERT’s ability to look at both past and future to-
kens in the sequence leads to a more holistic understanding of language,
improving performance across various NLP tasks.

d. Fine-Tuning BERT for Downstream Tasks (e.g., Sentiment Analysis)
When fine-tuning BERT on a downstream task like sentiment analysis, the pre-
training on masked language modeling (MLM) plays a crucial role in improving
performance.
How MLM Pretraining Helps:
• Contextual Understanding: MLM teaches BERT to understand the rela-
tionships between words by forcing the model to predict masked words
using surrounding context. This helps BERT learn a deep understanding
of word meanings and contextual dependencies.
• Transfer Learning: After pretraining, BERT can transfer its general lan-
guage understanding to specific tasks like sentiment analysis. During fine-
tuning, BERT adapts to the particular task while leveraging its rich, pre-
learned contextual knowledge, resulting in better performance with less
training data.
• Handling Ambiguity: For sentiment analysis, capturing the tone and mean-
ing of phrases (like detecting sarcasm or emotion) is critical. BERT’s abil-
ity to use both left and right context from MLM allows it to excel in these
subtleties.

In summary, MLM pretraining enables BERT to capture a rich understanding of
language, making it highly effective when fine-tuned for specific tasks like
sentiment analysis.

