A3 Handout
In this assignment, you will build a neural dependency parser using PyTorch. For a review of the fundamentals of PyTorch, please check out the PyTorch review session on Canvas. In Part 1, you will learn about two
general neural network techniques (Adam Optimization and Dropout). In Part 2, you will implement and
train a dependency parser using the techniques from Part 1, before analyzing a few erroneous dependency
parses.
Please tag the questions correctly on Gradescope; the TAs will take points off if you don't tag questions.
1. Machine Learning & Neural Networks (8 points)
(a) (4 points) Adam Optimizer
Recall the standard Stochastic Gradient Descent update rule:

θ_{t+1} ← θ_t − α ∇_θ J_minibatch(θ_t)

where t + 1 is the current timestep, θ is a vector containing all of the model parameters (θ_t is the model parameter vector at time step t, and θ_{t+1} is the model parameter vector at time step t + 1), J is the loss function, ∇_θ J_minibatch(θ) is the gradient of the loss function with respect to the parameters on a minibatch of data, and α is the learning rate. Adam Optimization1 uses a more sophisticated update rule with two additional steps.2
i. (2 points) First, Adam uses a trick called momentum by keeping track of m, a rolling average of the gradients:

m_{t+1} ← β1 m_t + (1 − β1) ∇_θ J_minibatch(θ_t)
θ_{t+1} ← θ_t − α m_{t+1}

where β1 is a hyperparameter between 0 and 1 (often set to 0.9). Briefly explain in 2–4 sentences (you don't need to prove mathematically, just give an intuition) how using m stops the updates from varying as much and why this low variance may be helpful to learning, overall.
ii. (2 points) Adam extends the idea of momentum with the trick of adaptive learning rates by keeping track of v, a rolling average of the magnitudes of the gradients:

m_{t+1} ← β1 m_t + (1 − β1) ∇_θ J_minibatch(θ_t)
v_{t+1} ← β2 v_t + (1 − β2) (∇_θ J_minibatch(θ_t) ⊙ ∇_θ J_minibatch(θ_t))
θ_{t+1} ← θ_t − α m_{t+1} / √v_{t+1}

where ⊙ and / denote elementwise multiplication and division (so z ⊙ z is elementwise squaring) and β2 is a hyperparameter between 0 and 1 (often set to 0.99). Since Adam divides the update by √v, which of the model parameters will get larger updates? Why might this help with learning?
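To make the two tricks concrete, here is a minimal PyTorch sketch of this simplified update (illustrative only: it omits bias correction and other details of the full Adam algorithm, and the small eps term is added purely for numerical stability; it is not part of the update rules above):

```python
import torch

def simplified_adam_step(theta, grad, m, v, alpha=1e-3, beta1=0.9, beta2=0.99, eps=1e-8):
    # Rolling average of gradients (momentum).
    m = beta1 * m + (1 - beta1) * grad
    # Rolling average of elementwise-squared gradients (adaptive learning rates).
    v = beta2 * v + (1 - beta2) * (grad * grad)
    # Parameters whose accumulated gradient magnitudes are small get relatively larger updates.
    theta = theta - alpha * m / (torch.sqrt(v) + eps)
    return theta, m, v
```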
(b) (4 points) Dropout3 is a regularization technique. During training, dropout randomly sets units in the hidden layer h to zero with probability p_drop (dropping different units each minibatch), and then multiplies h by a constant γ. We can write this as:

h_drop = γ d ⊙ h
1 Kingma and Ba, 2015, https://arxiv.org/pdf/1412.6980.pdf
2 The actual Adam update uses a few additional tricks that are less important, but we won’t worry about them here. If you
want to learn more about it, you can take a look at: http://cs231n.github.io/neural-networks-3/#sgd
3 Srivastava et al., 2014, https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
where d ∈ {0, 1}^{D_h} (D_h is the size of h) is a mask vector where each entry is 0 with probability p_drop and 1 with probability (1 − p_drop). γ is chosen such that the expected value of h_drop is h:

E_{p_drop}[h_drop]_i = h_i
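As an illustrative sketch only (the constant γ is left as an argument here rather than derived, and the function name is not part of the assignment code), the masking operation can be written as:

```python
import torch

def apply_dropout(h, p_drop, gamma):
    # Each entry of the mask d is 0 with probability p_drop and 1 otherwise.
    d = (torch.rand_like(h) >= p_drop).float()
    # Scale by gamma so that the expected value of h_drop matches h.
    return gamma * d * h
```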
2. Neural Transition-Based Dependency Parsing (46 points)
Before you begin, please follow the README to install all the needed dependencies for the assignment. We will be using PyTorch 1.13.1 from https://pytorch.org/get-started/locally/ with the CUDA option set to None, and the tqdm package, which produces progress bar visualizations throughout your training process. The official PyTorch website is a great resource that includes tutorials for understanding PyTorch's Tensor library and neural networks.
A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between head words, and words which modify those heads. There are multiple types of dependency parsers, including transition-based parsers, graph-based parsers, and feature-based parsers. Your implementation will be a transition-based parser, which incrementally builds up a parse one step at a time. At every step it maintains a partial parse, which is represented as follows:
• A stack of words that are currently being processed.
• A buffer of words yet to be processed.
• A list of dependencies predicted by the parser.
Initially, the stack only contains ROOT, the dependencies list is empty, and the buffer contains all words
of the sentence in order. At each step, the parser applies a transition to the partial parse until its buffer
is empty and the stack size is 1. The following transitions can be applied:
• SHIFT: removes the first word from the buffer and pushes it onto the stack.
• LEFT-ARC: marks the second (second most recently added) item on the stack as a dependent of
the first item and removes the second item from the stack, adding a first word → second word
dependency to the dependency list.
• RIGHT-ARC: marks the first (most recently added) item on the stack as a dependent of the second
item and removes the first item from the stack, adding a second word → first word dependency to
the dependency list.
On each step, your parser will decide among the three transitions using a neural network classifier.
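For intuition, here is a rough sketch of how the three transitions manipulate a parser state (this is not the required PartialParse implementation; the plain lists stack, buffer, and dependencies and the transition codes "S", "LA", "RA" are illustrative assumptions):

```python
def apply_transition(stack, buffer, dependencies, transition):
    if transition == "S":      # SHIFT
        stack.append(buffer.pop(0))                  # first buffer word moves onto the stack
    elif transition == "LA":   # LEFT-ARC
        dependent = stack.pop(-2)                    # second (second most recently added) item
        dependencies.append((stack[-1], dependent))  # head (first item) -> dependent
    elif transition == "RA":   # RIGHT-ARC
        dependent = stack.pop()                      # first (most recently added) item
        dependencies.append((stack[-1], dependent))  # head (second item) -> dependent
```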
(a) (4 points) Go through the sequence of transitions needed for parsing the sentence "I attended lectures in the NLP class". The dependency tree for the sentence is shown below. At each step, give the configuration of the stack and buffer, as well as what transition was applied at this step and what new dependency was added (if any). The first three steps are provided below as an example.
(b) (2 points) A sentence containing n words will be parsed in how many steps (in terms of n)? Briefly
explain in 1–2 sentences why.
(c) (6 points) Implement the __init__ and parse_step functions in the PartialParse class in parser_transitions.py. This implements the transition mechanics your parser will use. You can run basic (non-exhaustive) tests by running python parser_transitions.py part_c.
(d) (8 points) Our network will predict which transition should be applied next to a partial parse. We
could use it to parse a single sentence by applying predicted transitions until the parse is complete.
However, neural networks run much more efficiently when making predictions about batches of data
at a time (i.e., predicting the next transition for many different partial parses simultaneously). We
can parse sentences in minibatches with the following algorithm.
Initialize partial_parses as a list of PartialParses, one for each sentence in sentences
Initialize unfinished_parses as a shallow copy of partial_parses
while unfinished_parses is not empty do
    Take the first batch_size parses in unfinished_parses as a minibatch
    Use the model to predict the next transition for each partial parse in the minibatch
    Perform a parse step on each partial parse in the minibatch with its predicted transition
    Remove the completed (empty buffer and stack of size 1) parses from unfinished_parses
end while
Return: The dependencies for each (now completed) parse in partial_parses.
Implement this algorithm in the minibatch_parse function in parser_transitions.py. You can run basic (non-exhaustive) tests by running python parser_transitions.py part_d.
Note: You will need minibatch_parse to be correctly implemented to evaluate the model you will build in part (e). However, you do not need it to train the model, so you should be able to complete most of part (e) even if minibatch_parse is not implemented yet.
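For reference, a rough Python rendering of the pseudocode above might look like the following sketch (it assumes the model exposes a predict method returning one transition per partial parse, and that a PartialParse tracks stack, buffer, and dependencies; the exact interfaces are defined in the starter code):

```python
def minibatch_parse_sketch(sentences, model, batch_size):
    partial_parses = [PartialParse(sentence) for sentence in sentences]
    unfinished_parses = partial_parses[:]                 # shallow copy
    while unfinished_parses:
        minibatch = unfinished_parses[:batch_size]
        transitions = model.predict(minibatch)            # one predicted transition per parse
        for parse, transition in zip(minibatch, transitions):
            parse.parse_step(transition)
        # A parse is complete when its buffer is empty and its stack holds only ROOT.
        unfinished_parses = [p for p in unfinished_parses
                             if len(p.buffer) > 0 or len(p.stack) > 1]
    return [p.dependencies for p in partial_parses]
```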
(e) (12 points) We are now going to train a neural network to predict, given the state of the stack,
buffer, and dependencies, which transition should be applied next.
First, the model extracts a feature vector representing the current state. We will be using the feature
set presented in the original neural dependency parsing paper: A Fast and Accurate Dependency
Parser using Neural Networks.4 The function extracting these features has been implemented for
you in utils/parser_utils.py. This feature vector consists of a list of tokens (e.g., the last
word in the stack, first word in the buffer, dependent of the second-to-last word in the stack if there
is one, etc.). They can be represented as a list of integers w = [w_1, w_2, . . . , w_m] where m is the number of features and each 0 ≤ w_i < |V| is the index of a token in the vocabulary (|V| is the
vocabulary size). Then our network looks up an embedding for each word and concatenates them into a single input vector:

x = [E_{w_1}, E_{w_2}, . . . , E_{w_m}] ∈ R^{dm}

where E ∈ R^{|V|×d} is an embedding matrix with each row E_w as the vector for a particular word w.
4 Chen and Manning, 2014, https://nlp.stanford.edu/pubs/emnlp2014-depparser.pdf
We then compute our prediction as:
h = ReLU(xW + b1 )
l = hU + b2
ŷ = softmax(l)
where h is referred to as the hidden layer, l is referred to as the logits, ŷ is referred to as the predictions, and ReLU(z) = max(z, 0). We will train the model to minimize cross-entropy loss:
J(θ) = CE(y, ŷ) = − Σ_{i=1}^{3} y_i log ŷ_i
To compute the loss for the training set, we average this J(θ) across all training examples.
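To tie the equations together, here is a minimal sketch of the forward computation using explicit matrix multiplies (illustrative only: the tensor names and shapes are assumptions, and your actual implementation must follow the skeleton and naming requirements in parser_model.py):

```python
import torch
import torch.nn.functional as F

def forward_sketch(w, E, W, b1, U, b2):
    # w: (batch, m) integer feature indices; E: (|V|, d) embedding matrix.
    batch_size = w.shape[0]
    x = E[w].view(batch_size, -1)   # look up and concatenate embeddings: (batch, m * d)
    h = F.relu(x @ W + b1)          # hidden layer
    logits = h @ U + b2             # unnormalized scores over the three transitions
    return logits                   # softmax / cross-entropy is applied on top of these
```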
We will use the UAS score as our evaluation metric. UAS refers to Unlabeled Attachment Score, which is computed as the ratio between the number of correctly predicted dependencies and the total number of dependencies, ignoring the relation labels (which our model doesn't predict).
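As a toy illustration of the metric (not the grading code), UAS over one sentence could be computed from predicted and gold head indices like this:

```python
def uas(predicted_heads, gold_heads):
    # Fraction of words whose predicted head matches the gold head, ignoring relation labels.
    correct = sum(p == g for p, g in zip(predicted_heads, gold_heads))
    return correct / len(gold_heads)
```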
In parser_model.py you will find skeleton code to implement this simple neural network using PyTorch. Complete the __init__, embedding_lookup and forward functions to implement the model. Then complete the train_for_epoch and train functions within the run.py file. Finally, execute python run.py to train your model and compute predictions on test data from Penn Treebank (annotated with Universal Dependencies).
Note:
• For this assignment, you are asked to implement the Linear layer and the Embedding layer yourself. Please DO NOT use the torch.nn.Linear or torch.nn.Embedding modules in your code; otherwise you will receive deductions for this problem.
• Please follow the naming requirements in our TODOs if there are any; e.g., if there are explicit requirements about variable names, you must follow them in order to receive full credit. You are free to declare other variable names if not explicitly required.
Hints:
• Each of the variables you are asked to declare (self.embed_to_hidden_weight, self.embed_to_hidden_bias, self.hidden_to_logits_weight, self.hidden_to_logits_bias) corresponds to one of the variables above (W, b1, U, b2).
• It may help to work backwards in the algorithm (start from ŷ) and keep track of the matrix/vector sizes.
• Once you have implemented embedding_lookup (e) or forward (f) you can call python parser_model.py with the flag -e or -f or both to run sanity checks with each function. These sanity checks are fairly basic and passing them doesn't mean your code is bug-free.
• When debugging, you can add a debug flag: python run.py -d. This will cause the code to run over a small subset of the data, so that training the model won't take as long. Make sure to remove the -d flag to run the full model once you are done debugging.
• When running in debug mode, you should be able to get a loss smaller than 0.2 and a UAS larger than 65 on the dev set (although in rare cases your results may be lower; there is some randomness when training).
• It should take about 1 hour to train the model on the entire training dataset, i.e., when debug mode is disabled.
• When debug mode is disabled, you should be able to get a loss smaller than 0.08 on the train set
and an Unlabeled Attachment Score larger than 87 on the dev set. For comparison, the model
in the original neural dependency parsing paper gets 92.5 UAS. If you want, you can tweak
the hyperparameters for your model (hidden layer size, hyperparameters for Adam, number of
epochs, etc.) to improve the performance (but you are not required to do so).
Deliverables:
• Working implementation of the transition mechanics that the neural dependency parser uses in
parser_transitions.py.
• Working implementation of minibatch dependency parsing in parser_transitions.py.
• Working implementation of the neural dependency parser in parser_model.py. (We'll look
at and run this code for grading).
• Working implementation of the functions for training in run.py. (We’ll look at and run this
code for grading).
• Report the best UAS your model achieves on the dev set and the UAS it achieves
on the test set in your writeup.
(f) (12 points) We’d like to look at example dependency parses and understand where parsers like ours
might be wrong. For example, in this sentence:
[Dependency parse figure with edges labeled root, punct, nmod, nsubj, dobj, and case; in it, the phrase into Afghanistan is attached to troops.]
the dependency of the phrase into Afghanistan is wrong, because the phrase should modify sent (as
in sent into Afghanistan) not troops (because troops into Afghanistan doesn’t make sense, unless
there are somehow weirdly some troops that stan Afghanistan). Here is the correct parse:
[Corrected dependency parse figure with the same edge labels (root, punct, nmod, nsubj, dobj, case), in which into Afghanistan is attached to sent.]
• Coordination Attachment Error: In the sentence Would you like brown rice or garlic naan?,
the phrases brown rice and garlic naan are both conjuncts and the word or is the coordinating
conjunction. The second conjunct (here garlic naan) should be attached to the first conjunct
(here brown rice). A Coordination Attachment Error is when the second conjunct is attached
to the wrong head word (in this example, the correct head word is rice). Other coordinating
conjunctions include and, but and so.
In this question there are four sentences with dependency parses obtained from a parser. Each sentence
has one error type, and there is one example of each of the four types above. For each sentence,
state the type of error, the incorrect dependency, and the correct dependency. While each sentence
should have a unique error type, there may be multiple possible correct dependencies for some of
the sentences. To demonstrate: for the example above, you would write:
• Error type: Prepositional Phrase Attachment Error
• Incorrect dependency: troops → Afghanistan
• Correct dependency: sent → Afghanistan
Note: There are lots of details and conventions for dependency annotation. If you want to learn more about them, you can look at the UD website7: http://universaldependencies.org or the short introductory slides at: http://people.cs.georgetown.edu/nschneid/p/UD-for-English.pdf. Note that you do not need to know all these details in order to do this question. In each of these cases, we are asking about the attachment of phrases and it should be sufficient to see if they are modifying the correct head. In particular, you do not need to look at the labels on the dependency edges; it suffices to just look at the edges themselves.
i. [Dependency parse figure for the sentence: "The university blocked the acquisition , citing concerns about the risks involved ." POS tags: DET NOUN VERB DET NOUN PUNCT VERB NOUN ADP DET NOUN VERB PUNCT. Visible edge labels include root, punct, nmod, obj, advcl, case, det, nsubj, and acl.]
ii. [Dependency parse figure for the sentence: "Many managers and traders had already left their offices early Friday afternoon ." POS tags: ADJ NOUN CCONJ NOUN AUX ADV VERB PRON NOUN ADV PROPN NOUN PUNCT. Visible edge labels include root, punct, nsubj, obl:tmod, and advmod.]
iii.
7 But note that in the assignment we are actually using UDv1, see: http://universaldependencies.org/docsv1/
[Dependency parse figure for the sentence: "Investment Canada declined to comment on the reasons for the goverment decision ." POS tags: NOUN PROPN VERB PART VERB ADP DET NOUN ADP DET NOUN NOUN PUNCT. Visible edge labels include root, punct, nmod, obl, and case.]
iv. [Dependency parse figure for the sentence: "People benefit from a separate move that affects three US car plants and one in Quebec" POS tags: NOUN VERB ADP DET ADJ NOUN PRON VERB NUM PROPN NOUN NOUN CCONJ NUM ADP PROPN. Visible edge labels include conj, obl, obj, root, case, and nummod.]
(g) (2 points) Recall that in part (e), the parser uses features which include words and their part-of-speech (POS) tags. Explain the benefit of using part-of-speech tags as features in the parser.
Submission Instructions
You will submit this assignment on Gradescope as two submissions, one for "Assignment 3 [coding]" and another for "Assignment 3 [written]":