CS 224N Default Final Project: Question Answering on SQuAD 2.0
Contents
1 Overview
  1.1 The SQuAD Challenge
  1.2 This project
2 Getting Started
  2.1 Code overview
  2.2 Setup
3 The SQuAD Data
4 Training the Baseline
5 More SQuAD Models and Techniques
6 Alternative Goals
7 Submitting to the Leaderboard
8 Grading Criteria
9 Honor Code
10 FAQs
  10.1 How are out-of-vocabulary words handled?
  10.2 How are padding and truncation handled?
  10.3 Which parts of the code can I change?
1 Overview
In the default final project, you will explore deep learning techniques for question answering on
the Stanford Question Answering Dataset (SQuAD) [1]. The project is designed to enable you to
dive right into deep learning experiments without spending too much time getting set up. You will
have the chance to implement current state-of-the-art techniques and experiment with your own
novel designs. This year’s project will use the updated version of SQuAD, named SQuAD 2.0 [2],
which extends the original dataset with unanswerable questions.
In fact, in the official dev and test sets, every answerable SQuAD question has three answers
provided – each answer from a different crowd worker. The answers don’t always completely agree,
which is partly why ‘human performance’ on the SQuAD leaderboard is not 100%. Performance
is measured via two metrics: Exact Match (EM) score and F1 score.
• Exact Match is a binary measure (i.e. true/false) of whether the system output matches
the ground truth answer exactly. For example, if your system answered a question with
‘Einstein’ but the ground truth answer was ‘Albert Einstein’, then you would get an EM
score of 0 for that example. This is a fairly strict metric!
• F1 is a less strict metric – it is the harmonic mean of precision and recall1. In the ‘Einstein’ example, the system would have 100% precision (every token of its answer appears in the ground truth answer) and 50% recall (it only included one of the two words in the ground truth answer), thus an F1 score of 2×precision×recall/(precision+recall) = 2×100×50/(100+50) = 66.67%. (A code sketch of both metrics follows this list.)
1 Read more about F1 here: https://en.wikipedia.org/wiki/F1_score
• When a question has no answer, both the F1 and EM score are 1 if the model predicts
no-answer, and 0 otherwise.
• For questions that do have answers, when evaluating on the dev or test sets, we take the
maximum F1 and EM scores across the three human-provided answers for that question.
This makes evaluation more forgiving – for example, if one of the human annotators did
answer ‘Einstein’, then your system will get 100% EM and 100% F1 for that example.
Finally, the EM and F1 scores are averaged across the entire evaluation dataset to get the final
reported scores.
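To make the two metrics concrete, here is a rough Python sketch of how EM and F1 could be computed for a single example. This is not the official evaluation script, which additionally normalizes answers (lower-casing, stripping punctuation and articles); it is only meant to illustrate the arithmetic above.

from collections import Counter

def exact_match(prediction, gold):
    """Binary EM: 1 if the answer strings match exactly, else 0."""
    return int(prediction.strip() == gold.strip())

def f1(prediction, gold):
    """Token-level F1 between a predicted answer and one gold answer."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)   # overlapping token counts
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# 'Einstein' vs. gold 'Albert Einstein': EM = 0, F1 = 2*1.0*0.5/1.5 ≈ 0.667
print(exact_match("Einstein", "Albert Einstein"), f1("Einstein", "Albert Einstein"))

# With multiple gold answers (dev/test), take the max over the provided answers.
golds = ["Albert Einstein", "Einstein"]
print(max(f1("Einstein", g) for g in golds))  # 1.0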
2 Getting Started
For this project, you will need a machine with GPUs to train your models efficiently. For this, you
have access to Azure, similarly to Assignments 4 and 5 – remember you can refer to the Azure
Guide and Practical Guide to VMs linked on the class webpage. As before, remember that Azure
credit is charged for every minute that your VM is on, so it’s important that your VM is only
turned on when you are actually training your models.
We advise that you develop your code on your local machine (or one of the Stanford
machines, like rice), using PyTorch without GPUs, and move to your Azure VM only once you’ve
debugged your code and you’re ready to train. We advise that you use GitHub to manage your
codebase and sync it between the two machines (and between team members) – the Practical Guide
to VMs has more information on this.
When you work through this Getting Started section for the first time, do so on your local
machine. You will then repeat the process on your Azure VM. Once you are on an appropriate
machine, clone the project GitHub repository with the following command.
This repository contains the starter code and the version of SQuAD that we will be using. We
encourage you to git clone our repository, rather than simply downloading it, so that you can
easily integrate any bug fixes that we make to the code. In fact, you should periodically check
whether there are any new fixes that you need to download. To do so, navigate to the squad
directory and run the git pull command.
Note: If you use GitHub to manage your code, you must keep your repository private.
2.1 Code overview
The repository contains the following key files:
• models.py: The starter model, and any others you might add.
• setup.py: Downloads pretrained GloVe vectors and preprocesses the data.
• train.py: Top-level entrypoint for training the model.
• test.py: Top-level entrypoint for testing the model and generating submissions for the
leaderboard.
• util.py: Utility functions and classes.
In addition, you will notice two directories:
• data/: Contains our custom SQuAD dataset: both the unprocessed JSON files and (after running setup.py) all preprocessed files.
• save/: Location for saving all checkpoints and logs. For example, if you train the baseline
with python train.py -n baseline, then the logs, checkpoints, and TensorBoard events
will be saved in save/train/baseline-01. The suffix number will increment if you train
another model with the same name.
2.2 Setup
Once you are on an appropriate machine and have cloned the project repository, it’s time to run
the setup commands.
• Make sure you have Anaconda or Miniconda installed.
3 The SQuAD Data
3.1 Data splits
The official SQuAD 2.0 dataset has three splits: train, dev and test. The train and dev sets are
publicly available and the test set is entirely secret. To compete on the official SQuAD leaderboards,
researchers submit their models, and the SQuAD team runs the models on the secret test set.
For simplicity and scalability, we are instead running our class leaderboard ‘Kaggle-style’, i.e., we release the test set's (context, question) pairs to students, and they submit their model-produced
answers in a CSV file. We then compare these CSV files to the true test set answers and report
scores in a leaderboard. Clearly, we cannot release the official test set’s (context, question) pairs
because they are secret. Therefore in this project, we will be using custom dev and test sets, which
are obtained by splitting the official dev set in half.
Given that the official SQuAD dev set contains our test set, you must make sure not to use
the official SQuAD dev set in any way. You may only use our training set and our dev set to
train, tune and evaluate your models. If you use the official SQuAD dev set to train, to
tune or evaluate your models, or to modify your CSV solutions in any way, you are
committing an honor code violation. To detect cheating of this kind, we have produced a small number of new SQuAD 2.0 examples whose answers are not publicly available, and added them to our test set – your relative performance on these examples, compared to the rest of our test
set, would reveal any cheating. If you always use the provided GitHub repository and setup.py
script to set up your SQuAD dataset, and don’t use the official SQuAD dev set at all, you will be
safe.
To summarize, we have the following splits:
• train (129,941 examples): All taken from the official SQuAD 2.0 training set.
• dev (6,078 examples): Roughly half of the official dev set, randomly selected.
• test (5,915 examples): The remaining examples from the official dev set, plus hand-labeled examples.
From now on we will refer to these splits as ‘the train set’, ‘the dev set’ and ‘the test set’, and
always refer to the official splits as ‘the official train set’, ‘the official dev set’, and ‘the official test
set’.
You will use the train set to train your model and the dev set to tune hyperparameters and
measure progress locally. Finally, you will submit your test set solutions to a class leaderboard,
which will calculate and display your scores on the test set – see Section 7 for more information.
3.2 Terminology
The SQuAD dataset contains many (context, question, answer) triples2 – see an example in Section
1.1. Each context (sometimes called a passage, paragraph or document in other papers) is an excerpt
from Wikipedia. The question (sometimes called a query in other papers) is the question to be
answered based on the context. The answer is a span (i.e. excerpt of text) from the context.
2 As described in Section 1.1, the dev and test sets actually have three human-provided answers for each question.
But the training set only has one answer per question.
4 Training the Baseline
As a starting point, we have provided you with the complete code for a baseline model, which uses
deep learning techniques you learned in class. In this section we will describe the baseline model
and show you how to train it.
First, a highway network [4] transforms each hidden vector h_i:

g = σ(W_g h_i + b_g) ∈ R^H
t = ReLU(W_t h_i + b_t) ∈ R^H
h'_i = g ◦ t + (1 − g) ◦ h_i ∈ R^H,

where ◦ denotes elementwise multiplication. A bidirectional LSTM [5] encoder is then applied (reusing h_i for its inputs and h'_i for its outputs):

h'_{i,fwd} = LSTM(h'_{i−1}, h_i) ∈ R^H
h'_{i,rev} = LSTM(h'_{i+1}, h_i) ∈ R^H
h'_i = [h'_{i,fwd}; h'_{i,rev}] ∈ R^{2H}.

Note in particular that h'_i is of dimension 2H, as it is the concatenation of the forward and backward hidden states at timestep i.
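As a rough PyTorch sketch of these two steps (the class names and the single-layer configuration are illustrative assumptions; the starter code's layers.py is the reference implementation):

import torch
import torch.nn as nn

class HighwayEncoder(nn.Module):
    """One highway layer: h'_i = g * t + (1 - g) * h_i."""
    def __init__(self, hidden_size):
        super().__init__()
        self.gate = nn.Linear(hidden_size, hidden_size)
        self.transform = nn.Linear(hidden_size, hidden_size)

    def forward(self, h):
        g = torch.sigmoid(self.gate(h))      # (batch, seq_len, H)
        t = torch.relu(self.transform(h))    # (batch, seq_len, H)
        return g * t + (1 - g) * h

class RNNEncoder(nn.Module):
    """Bidirectional LSTM; output dim is 2H (forward and backward states concatenated)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.rnn = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)

    def forward(self, x):
        out, _ = self.rnn(x)                 # (batch, seq_len, 2H)
        return out

H = 100
enc = nn.Sequential(HighwayEncoder(H), RNNEncoder(H, H))
print(enc(torch.randn(2, 30, H)).shape)      # torch.Size([2, 30, 200])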
3 A word index is an integer that tells you which row (or column) of the embedding matrix contains the word's embedding vector.
Attention Layer (layers.BiDAFAttention)
The core part of the BiDAF model is the bidirectional attention flow layer, which we will describe
here. The main idea is that attention should flow both ways – from the context to the question
and from the question to the context.
Assume we have context hidden states c_1, . . . , c_N ∈ R^{2H} and question hidden states q_1, . . . , q_M ∈ R^{2H}. We compute the similarity matrix S ∈ R^{N×M}, which contains a similarity score S_{ij} for each pair (c_i, q_j) of context and question hidden states:

S_{ij} = w_sim^T [c_i; q_j; c_i ◦ q_j] ∈ R.

Here, c_i ◦ q_j is an elementwise product and w_sim ∈ R^{6H} is a learnable weight vector. In the starter code,
the get_similarity_matrix method of the layers.BiDAFAttention class is a memory-efficient
implementation of this operation. We encourage you to walk through the implementation of
get_similarity_matrix and convince yourself that it indeed computes the similarity matrix as
described above.
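The memory-saving trick is to split w_sim into three 2H-dimensional pieces, w_sim = [w_c; w_q; w_cq], so that S_{ij} = w_c^T c_i + w_q^T q_j + w_cq^T (c_i ◦ q_j) and the (N, M, 6H) concatenation never needs to be materialized. A sketch of the idea (details may differ from the starter code's exact implementation):

import torch

def similarity_matrix(c, q, w_c, w_q, w_cq):
    """Compute S in R^{N x M} without building the (N, M, 6H) concatenation.

    c: (batch, N, 2H) context states; q: (batch, M, 2H) question states.
    w_c, w_q, w_cq: (2H,) slices of the weight vector w_sim = [w_c; w_q; w_cq]."""
    s0 = torch.matmul(c, w_c).unsqueeze(2)            # (batch, N, 1): w_c^T c_i
    s1 = torch.matmul(q, w_q).unsqueeze(1)            # (batch, 1, M): w_q^T q_j
    s2 = torch.matmul(c * w_cq, q.transpose(1, 2))    # (batch, N, M): w_cq^T (c_i * q_j)
    return s0 + s1 + s2                               # broadcasts to (batch, N, M)

batch, N, M, H2 = 2, 50, 20, 200
c, q = torch.randn(batch, N, H2), torch.randn(batch, M, H2)
w_c, w_q, w_cq = torch.randn(H2), torch.randn(H2), torch.randn(H2)
print(similarity_matrix(c, q, w_c, w_q, w_cq).shape)  # torch.Size([2, 50, 20])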
Since the similarity matrix S contains information relating the question and the context, we can normalize it across either its rows or its columns in order to attend to the question or to the context, respectively.
First, we perform Context-to-Question (C2Q) Attention. We take the row-wise softmax of S to obtain attention distributions S̄, which we use to take weighted sums of the question hidden states q_j, yielding C2Q attention outputs a_i. In equations, this is:

S̄_{i,:} = softmax(S_{i,:}) ∈ R^M   ∀i ∈ {1, . . . , N}
a_i = Σ_{j=1}^{M} S̄_{i,j} q_j ∈ R^{2H}   ∀i ∈ {1, . . . , N}.

Next, we perform Question-to-Context (Q2C) Attention. We take the column-wise softmax of S to obtain attention distributions S̃, combine them with S̄ to form S' = S̄ S̃^T ∈ R^{N×N}, and use S' to take weighted sums of the context hidden states c_j, yielding Q2C attention outputs b_i:

S̃_{:,j} = softmax(S_{:,j}) ∈ R^N   ∀j ∈ {1, . . . , M}
b_i = Σ_{j=1}^{N} S'_{i,j} c_j ∈ R^{2H}   ∀i ∈ {1, . . . , N}.
Lastly, for each context location i ∈ {1, . . . , N} we obtain the output g_i of the bidirectional attention flow layer by combining the context hidden state c_i, the C2Q attention output a_i, and the Q2C attention output b_i:

g_i = [c_i; a_i; c_i ◦ a_i; c_i ◦ b_i] ∈ R^{8H}   ∀i ∈ {1, . . . , N},

where ◦ represents elementwise multiplication.
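As a rough illustration, here is a minimal PyTorch sketch of the full attention computation, assuming the similarity matrix S has already been computed and ignoring masking for brevity (the starter code applies the context and question masks inside the softmaxes):

import torch
import torch.nn.functional as F

def bidaf_attention(c, q, S):
    """c: (batch, N, 2H) context states, q: (batch, M, 2H) question states,
    S: (batch, N, M) similarity matrix. Returns g: (batch, N, 8H)."""
    S_bar = F.softmax(S, dim=2)                            # row-wise softmax (C2Q)
    S_tilde = F.softmax(S, dim=1)                          # column-wise softmax (used for Q2C)
    a = torch.bmm(S_bar, q)                                # (batch, N, 2H): C2Q outputs a_i
    S_prime = torch.bmm(S_bar, S_tilde.transpose(1, 2))    # (batch, N, N)
    b = torch.bmm(S_prime, c)                              # (batch, N, 2H): Q2C outputs b_i
    return torch.cat([c, a, c * a, c * b], dim=2)          # g_i = [c_i; a_i; c_i * a_i; c_i * b_i]

c, q = torch.randn(2, 50, 200), torch.randn(2, 20, 200)    # batch=2, N=50, M=20, 2H=200
g = bidaf_attention(c, q, torch.randn(2, 50, 20))
print(g.shape)                                             # torch.Size([2, 50, 800])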
Output Layer (layers.BiDAFOutput)
The output layer is tasked with producing a vector of probabilities corresponding to each position in the context: p_start, p_end ∈ R^N. As the notation suggests, p_start(i) is the predicted probability that the answer span starts at position i, and similarly p_end(i) is the predicted probability that the answer span ends at position i. (See the ‘Predicting no-answer’ section below for details on no-answer predictions.)
Concretely, the output layer takes as input the attention layer outputs g_1, . . . , g_N ∈ R^{8H} and the modeling layer outputs m_1, . . . , m_N ∈ R^{2H}. The output layer applies a bidirectional LSTM to the modeling layer outputs, producing a vector m'_i for each m_i given by

m'_{i,fwd} = LSTM(m'_{i−1}, m_i) ∈ R^H
m'_{i,rev} = LSTM(m'_{i+1}, m_i) ∈ R^H
m'_i = [m'_{i,fwd}; m'_{i,rev}] ∈ R^{2H}.

Now let G ∈ R^{8H×N} be the matrix with columns g_1, . . . , g_N, and let M, M' ∈ R^{2H×N} similarly be the matrices with columns m_1, . . . , m_N and m'_1, . . . , m'_N, respectively. To finally produce p_start and p_end, the output layer computes

p_start = softmax(W_start [G; M]) ∈ R^N
p_end = softmax(W_end [G; M']) ∈ R^N,

where W_start, W_end ∈ R^{1×10H} are learnable parameters. In the code, notice that the softmax
operation uses the context mask, and we compute all probabilities in log-space for numerical
stability and because the F.nll_loss function expects log-probabilities.
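To make the shapes concrete, here is a minimal sketch of this computation in PyTorch. The tensor layout, the mask handling, and the bias-free linear layers are illustrative assumptions; layers.BiDAFOutput is the reference implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_log_softmax(logits, mask):
    """Log-softmax over unmasked positions; masked positions get a very negative logit."""
    return F.log_softmax(logits.masked_fill(mask == 0, -1e30), dim=-1)

batch, N, H = 2, 50, 100
G = torch.randn(batch, N, 8 * H)          # attention layer outputs g_1..g_N
M_mat = torch.randn(batch, N, 2 * H)      # modeling layer outputs m_1..m_N
M_prime = torch.randn(batch, N, 2 * H)    # outputs m'_1..m'_N of the extra BiLSTM
c_mask = torch.ones(batch, N)             # 1 for real context tokens, 0 for padding

W_start = nn.Linear(10 * H, 1, bias=False)   # plays the role of W_start in R^{1 x 10H}
W_end = nn.Linear(10 * H, 1, bias=False)     # plays the role of W_end in R^{1 x 10H}

log_p_start = masked_log_softmax(W_start(torch.cat([G, M_mat], dim=2)).squeeze(-1), c_mask)
log_p_end = masked_log_softmax(W_end(torch.cat([G, M_prime], dim=2)).squeeze(-1), c_mask)
print(log_p_start.shape, log_p_end.shape)    # torch.Size([2, 50]) each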
Training Details
Loss Function
Our loss function is the sum of the negative log-likelihood (cross-entropy) losses for the start and end locations. That is, if the gold start and end locations are i ∈ {1, . . . , N} and j ∈ {1, . . . , N} respectively, then the loss for a single example is:

loss = − log p_start(i) − log p_end(j).
During training, we average across the batch and use the Adadelta optimizer [6] to minimize the
loss.
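In PyTorch this amounts to two calls to F.nll_loss on the log-probabilities, as in this small sketch (tensor names are illustrative):

import torch
import torch.nn.functional as F

# log_p_start, log_p_end: (batch, N) log-probabilities from the output layer.
log_p_start = F.log_softmax(torch.randn(4, 50), dim=-1)
log_p_end = F.log_softmax(torch.randn(4, 50), dim=-1)
y_start = torch.randint(0, 50, (4,))    # gold start indices i
y_end = torch.randint(0, 50, (4,))      # gold end indices j

# loss = -log p_start(i) - log p_end(j), averaged over the batch by F.nll_loss.
loss = F.nll_loss(log_p_start, y_start) + F.nll_loss(log_p_end, y_end)
print(loss.item())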
Inference Details
Discretized Predictions
At test time, we discretize the soft predictions of the model to get start and end indices. We choose
the pair (i, j) of indices that maximizes p_start(i) · p_end(j) subject to i ≤ j and j − i + 1 ≤ L_max, where L_max is a hyperparameter that sets the maximum length of a predicted answer. We set L_max to 15 by default. The code can be found in the discretize function in util.py.
Predicting no-answer
To allow our model to make no-answer predictions, we adopt an approach that was originally introduced in Section 5 of [7]. In particular, we prepend an OOV (out-of-vocabulary) token to the beginning of each context. The model outputs p_start and p_end soft predictions as usual, so no adaptation is needed within the model. When discretizing a prediction, if p_start(0) · p_end(0) is greater than the score of every candidate answer span, the model predicts no-answer. Otherwise the model predicts the highest-probability span as usual. We keep the same NLL loss function.
Intuitively, this approach allows the model to produce a per-example confidence score that the question is unanswerable. If the model is highly confident that there is no answer, we predict no-answer; otherwise we predict the most likely span as usual.
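The following brute-force sketch illustrates both the span discretization and the no-answer decision (the starter code's util.discretize is vectorized and handles batches; variable names here are illustrative):

import torch

def predict_span(p_start, p_end, max_len=15):
    """Return (start, end) context indices, or None for a no-answer prediction.

    Index 0 is the prepended OOV token, so p_start[0] * p_end[0] acts as the
    no-answer score that competes with the best real answer span."""
    no_answer_score = (p_start[0] * p_end[0]).item()
    best_score, best_pair = -1.0, None
    N = p_start.size(0)
    for i in range(1, N):                                  # skip the OOV position
        for j in range(i, min(i + max_len, N)):            # enforce j - i + 1 <= max_len
            score = (p_start[i] * p_end[j]).item()
            if score > best_score:
                best_score, best_pair = score, (i, j)
    return None if no_answer_score > best_score else best_pair

p_start = torch.softmax(torch.randn(51), dim=0)            # 1 OOV token + 50 context tokens
p_end = torch.softmax(torch.randn(51), dim=0)
print(predict_span(p_start, p_end))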
Exponential Moving Average of Parameters
As recommended in the BiDAF paper, we use an exponentially weighted moving average of the
model parameters during evaluation (with decay rate 0.999). Intuitively, this is similar to using
an ensemble of multiple checkpoints sampled from one training run. The details can be found in
the util.EMA class, and you will notice calls to ema.assign and ema.resume in train.py. It is
worth experimenting with removing the exponential moving average or changing the decay rate
when you train your own models.
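A minimal sketch of the idea behind util.EMA (the starter code's version differs in details, such as how the original parameters are restored after evaluation):

import copy
import torch
import torch.nn as nn

class SimpleEMA:
    """Maintain an exponential moving average of model parameters."""
    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = {name: p.detach().clone()
                       for name, p in model.named_parameters() if p.requires_grad}

    @torch.no_grad()
    def update(self, model):
        # shadow <- decay * shadow + (1 - decay) * current parameters
        for name, p in model.named_parameters():
            if p.requires_grad:
                self.shadow[name].mul_(self.decay).add_(p.detach(), alpha=1 - self.decay)

    @torch.no_grad()
    def copy_to(self, model):
        # Load the averaged weights into a model (typically a copy) for evaluation.
        for name, p in model.named_parameters():
            if p.requires_grad:
                p.copy_(self.shadow[name])

model = nn.Linear(10, 2)
ema = SimpleEMA(model, decay=0.999)
ema.update(model)                        # call once after every optimizer step
eval_model = copy.deepcopy(model)
ema.copy_to(eval_model)                  # evaluate with the averaged weights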
After some initialization, you should see the model begin to log information like the following:
You should see the loss – shown as NLL for negative log-likelihood – begin to drop. On a single
Azure NV6 instance, you should expect training to take about 22 minutes per epoch. Note that
the starter code will automatically use more than one GPU if your machine has more available.
You should also see that there is a new directory under save/train/baseline-01. This is
where you can find all data relating to this experiment. In particular, you will (eventually) see:
• log.txt: A record of all information logged during training. This includes a complete print-
out of the arguments at the very top, which can be useful when trying to reproduce results.
• events.out.tfevents.*: These files contain information (like the loss over time), which our
code has logged so it can be visualized by TensorBoard.
• step_N.pth.tar: Checkpoint files that contain the weights of the model at checkpoints which achieved the highest validation metrics. The number N corresponds to how many training iterations had been completed when the model was saved. By default a checkpoint
is saved every 50,000 iterations, but you can save checkpoints more frequently by changing
the eval_steps flag.
• best.pth.tar: The best checkpoint throughout training. The metric used to determine
which checkpoint is ‘best’ is defined by the metric_name flag. Typically you will load this
checkpoint for use by test.py, which you can do by setting the load_path flag.
If you are training on your local machine, now open http://localhost:5678/ in your browser. If
you are training on a remote machine (e.g. Azure), then run the following command on your local
machine:
ssh -N -f -L localhost:1234:localhost:5678 <user>@<remote>
where <user>@<remote> is the address that you ssh to for your remote machine. Then on your
local machine, open http://localhost:1234/ in your browser.
You should see TensorBoard load with plots of the loss, AvNA, EM, and F1 for both train
and dev sets. EM and F1 are the official SQuAD evaluation metrics, and AvNA is a useful metric
we added for debugging purposes. AvNA stands for Answer vs. No Answer: it measures your model's classification accuracy when only considering whether it predicted an answer (any span at all) or no-answer.
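For reference, AvNA can be computed along these lines (a rough sketch that uses the empty string for no-answer and assumes a single gold answer per example; the starter code's evaluation utilities are the reference):

def avna(predictions, golds):
    """Answer-vs-No-Answer accuracy: did the model correctly decide whether to answer?

    predictions, golds: dicts mapping example id -> answer string ('' means no-answer)."""
    correct = sum(bool(predictions[k]) == bool(golds[k]) for k in golds)
    return 100.0 * correct / len(golds)

print(avna({'a': 'Einstein', 'b': ''}, {'a': 'Albert Einstein', 'b': ''}))  # 100.0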
The dev plots may take some time to appear because they are logged less frequently than the
train plots. However, you should see the training set loss decreasing from the very start. After training the baseline model for 25 epochs, the TensorBoard view shows the following:
• The dev AvNA reaches about 65, the dev F1 reaches about 58, and the dev EM score reaches
around 55.
• Although the dev NLL improves throughout the training period, the dev EM and F1 scores
initially get worse at the start of training, before then improving. We elaborate on this point
below.
Regarding the last bullet point, this does not necessarily indicate a bug; rather, it can be explained by the fact that we directly optimize the NLL loss, not F1 or EM: early in training, the NLL is
quickly reduced by always predicting no-answer. Since roughly half of the SQuAD examples are
no-answer, a model predicting all no-answer will get close to 50% AvNA. In addition, the SQuAD
2.0 metrics define F1 and EM for no-answer examples to be 1 if the model predicts no answer and
0 otherwise. If we assume the model gets 0 F1 and EM on answerable examples, this results in a
mean F1/EM score of roughly 50% very early in training.
We advise you to reproduce this experiment, i.e., train the baseline and obtain results similar
to those we report above. This will give you something to compare your improved models against.
In particular, TensorBoard will plot your new experiments overlaid with your baseline experiment
– this will enable you to see how your improved models train over time, compared to the baseline.
Viewing these examples can be extremely helpful to debug your model, understand its strengths
and weaknesses, and as a starting point for your analysis in your final report.
5 More SQuAD Models and Techniques
From here, the project is open-ended! As explained in Section 1.2, in this section we provide
you with an overview of models, model families, and other common techniques that are used in
high-performing SQuAD systems. Your job is to read about some of these models and techniques,
understand them, choose some to implement, carefully train them, and analyze their performance
– ultimately building the best SQuAD system you can. Implementation is an open-ended task:
there are multiple valid implementations of a single model, and sometimes a paper won’t even
provide all the details – requiring you to make some decisions by yourself. To learn more about
project expectations and grading, see Section 8.
2. PCE approaches are likely to outperform the best non-PCE models by a large margin.
Though PCE models perform well, simply loading BERT is not in itself a creative endeavor within the
context of this project. We want to ensure that teams who choose not to use PCE can still
be competitive on the leaderboard. We hope that hosting a non-PCE division will encourage
many teams to focus on more creative ways to boost performance.
Note: If you do want to use PCE, keep in mind that simply loading BERT and getting a high
score will not earn you a high grade on the project. It’s up to you to think creatively about how
to improve upon a standard implementation.
5.1.1 ELMo
Original paper: Deep Contextualized Word Representations [10]
In traditional word embeddings such as Word2Vec [11], GloVe [12], and FastText [13], each word in
the vocabulary is mapped to a fixed vector, regardless of its context. The core idea of ELMo is to
address this weakness by using the context in which a word appears to construct the embedding.
In practice, ELMo implements this idea by training a two-layer bidirectional LSTM for language
modeling on a large-scale corpus. This pretrained bi-LSTM can then be used as the embedding
layer for any model (e.g. a SQuAD model) which takes text as input. The remaining layers of the
new model are trained for the new task (e.g. SQuAD).
When it was released in November 2017, ELMo improved the state of the art on six different
NLP benchmarks. This established the utility of pretrained contextual embeddings, but many
architectural advances have happened since November 2017. As such, ELMo is unlikely to compete
with more recent PCE methods such as BERT.
5.1.2 BERT
Original paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [8]
We briefly saw in class that Transformers [14] have in some cases supplanted RNNs as the dominant
model family in deep learning for NLP. Therefore one might expect that ELMo’s pretrained RNNs
would be outperformed by pretrained Transformers. Indeed, BERT showed exactly that: by
training deep Transformers on a carefully designed bidirectional language modeling task, BERT
achieves state-of-the-art results on a wide variety of NLP benchmarks including SQuAD 2.0. To
enable other researchers to use BERT as an embedding module, the BERT authors released an
open-source TensorFlow implementation and pretrained weights.4
5.1.3 ALBERT
Original paper: ALBERT: A Lite BERT for Self-Supervised Learning of Language Representa-
tions [9]
More recently, ALBERT established new state-of-the-art results on various benchmarks including
SQuAD while having fewer parameters compared to BERT-large. Their key idea is to allocate
the model’s capacity more efficiently and to share parameters across the layers. Like BERT, the
authors released an open-source TensorFlow implementation and pretrained weights.5
PyTorch implementation of BERT and ALBERT. To use these weights with PyTorch, you
need an op-for-op reimplementation, such as this one:
https://github.com/huggingface/transformers.
You are welcome to adapt one of these implementations (or another one) for this project. We
highly recommend that you read the BERT paper, as they give a thorough description of how to
adapt BERT for various applications, including question answering.
5.2.2 Self-attention
Appears in: R-Net: Machine Reading Comprehension with Self-Matching Networks6
Self-attention is a phrase that can have slightly different meanings depending on the setting. In an RNN-based language model, self-attention often means that the hidden state h_t attends to all the previous hidden states h_1, . . . , h_{t−1}. In a setting where you are encoding a text of length n, self-attention might mean that h_t attends to all the hidden states h_1, . . . , h_n (even including itself). Transformers are built on a kind of self-attention. The main idea of self-attention is that the query vectors are drawn from the same set as the value vectors.
R-Net is a simple but high-performing SQuAD model that has both a Context-to-Question
attention layer (similar to our baseline), and a self-attention layer (which they call Self-Matching
Attention). Both layers are simple applications of additive attention (as described in lectures).
The R-Net paper is one of the easier ones to understand.
4 https://github.com/google-research/bert
5 https://github.com/google-research/ALBERT
6 https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf
5.2.3 Transformers
Appears in: QANet: Combining Local Convolution with Global Self-Attention for Reading Com-
prehension [15]
QANet adapts ideas from the Transformer [14] and applies them to question answering, doing
away with RNNs and replacing them entirely with self-attention and convolution. The main
component of the QANet model is called an Encoder Block. The Encoder Block draws inspiration
from the Transformer: The two modules are similar in their use of positional encoding, residual
connections, layer normalization, self-attention sublayers, and feed-forward sublayers. However,
an Encoder Block differs from the Transformer in its use of stacked convolutional sublayers, which
use depthwise-separable convolution to capture local dependencies in the input sequence. Prior to
BERT, QANet had state-of-the-art performance for SQuAD 1.1. We will have a whole lecture to
learn about Transformers in week 7.
5.2.4 Transformer-XL
Original paper: Transformer-XL: Language Modeling with Longer-Term Dependency [16]
5.2.5 Reformer
Original paper: Reformer: The Efficient Transformer [17]
Going further, the more recent Reformer model is designed to handle context windows of up to 1 million words while using only 16GB of memory. It combines two techniques: locality-sensitive hashing to reduce the complexity of attention over long sequences, and reversible residual layers to reduce memory requirements. One idea is to retrieve relevant text from the Internet, append it to the context (the original excerpt from Wikipedia), and process the result with Reformer.
Although deep learning models can learn end-to-end without manual feature engineering, it turns out that using the right input features can still boost performance significantly. For example, the DrQA model [18] significantly boosts performance on SQuAD by including some simple but useful input features: a word in the SQuAD context passage is represented not only by its word vector, but is also tagged with features representing its frequency, part-of-speech tag, named entity type, and so on. If you implement a model like this, reflect on the tradeoff between feature engineering and end-to-end learning, and comment on it in your report.
5.4 Other improvements
There are many other things besides architecture changes that you can do to improve your per-
formance. The suggestions in this section are mostly quick to implement, but it will take time to
run the necessary experiments and draw the necessary comparisons. Remember that we will be
grading your experimental thoroughness, so do not neglect the hyperparameter search!
• Regularization. The baseline code uses dropout. Experiment with different values of
dropout and different types of regularization.
• Sharing weights. The baseline code uses the same RNN encoder weights for both the
context and the question. This can be useful to enrich both the context and the question
representations. Are there other parts of the model that could share weights? Are there
conditions under which it’s better to not share weights?
• Word vectors. By default, the baseline model uses 300-dimensional pre-trained GloVe
embeddings to represent words, and these embeddings are held constant during training.
You can experiment with other sizes or types of word embeddings, or try retraining or fine-
tuning the embeddings.
• Combining forward and backward states. In the baseline, we concatenate the forward
and backward hidden states from the bidirectional RNN. You could try adding, averaging or
max pooling them instead.
• Types of RNN. Our baseline uses a bidirectional LSTM. You could try a GRU instead –
it might be faster.
• Model size and number of layers. With any model, you can try increasing the model
size, usually at the cost of slower runtime.
• Optimization algorithms. The baseline uses the Adadelta optimizer. PyTorch supports
many other optimization algorithms. You might also experiment with learning rate annealing.
You should also try varying the learning rate.
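As one concrete example of the optimization suggestions above, here is a sketch of swapping the optimizer and adding learning rate annealing in PyTorch (the stand-in model, optimizer choices, and hyperparameters are illustrative, not recommendations):

import torch
import torch.nn as nn

model = nn.LSTM(300, 100, bidirectional=True)    # stand-in for your SQuAD model

# Baseline-style optimizer:
optimizer = torch.optim.Adadelta(model.parameters(), lr=0.5)

# One alternative: Adam with exponential learning rate annealing.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(3):
    # ... run training steps: loss.backward(); optimizer.step(); optimizer.zero_grad() ...
    scheduler.step()                              # anneal the learning rate once per epoch
    print(epoch, scheduler.get_last_lr())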
6 Alternative Goals
For most students, the goal of this assignment is to build a system that performs well at SQuAD
2.0. However, listed here are some alternative or additional goals you could pursue instead of or in
addition to the standard task. We would love to see some students attack these very interesting and
important research problems. Doing these would feel a bit like doing something halfway between
a default and a custom project.
If you would like to work on these problems, you will still need to submit to the leaderboard.
However, make it clear in your writeup that you are pursuing these alternative goals, and include
the relevant performance metrics. We will grade your project holistically, taking into account your
creativity, success and effort in pursuing the alternative goals.
• Robust SQuAD: Most recent models achieve human-level performance when trained and
tested on the same dataset. However, what happens if they are tested on out-of-domain
(OOD) data? Can you come up with a way to make a SQuAD model more robust against
OOD data? One way to help generalization is to train on multiple datasets simultaneously.
To evaluate your model, you will make use of test sets in other domains to demonstrate the improved robustness of your model. A good starting point is NewsQA [19], where questions and answers are extracted from CNN articles; see if you can improve OOD performance while keeping any performance decrease on SQuAD minimal.
• Interpretable SQuAD: How can we understand the properties of contextual word repre-
sentations? Gaining insights into neural representations in NLP may help us to understand
why our models succeed and fail (on benchmarks like SQuAD). One way to understand prop-
erties of representations is by probing. Probing methods train supervised models to predict
linguistic properties from representations of language [20]. Your evaluation will consist of
comparing the performance of word representations resulting from the models you train on
SQuAD according to one or more probing tasks, and analyzing whether they correlate with
SQuAD performance.
• Adversarial-proof SQuAD: It was recently shown that state-of-the-art SQuAD models
can be fooled by the placement of distracting sentences in the context paragraph [21]. This
exposes a worrying weakness in our SQuAD systems. Can you build a SQuAD model that
is more robust against adversarial examples? To evaluate your model, you will consider not
only F1 and EM score on the standard SQuAD test set, but also F1 and EM scores on the
publicly-released test set of adversarial examples.8 Though SQuAD 2.0 attempts to address
this problem by redefining the task to include unanswerable questions (thus more heavily
penalizing guessing), this is still an open research problem.
• Fast SQuAD: Due to their recurrent nature, RNNs are slow, and they do not scale well
when processing longer pieces of text. Can you find faster, perhaps non-recurrent deep
learning solutions for SQuAD? As an example, QANet [15] (described above) has a large
speed advantage over BiDAF and other RNN-based approaches. If you want to focus on
building a fast SQuAD model, you should review the non-recurrent and quasi-recurrent
model families we saw in lectures. To evaluate your model, you will consider not only F1 and
EM score, but also how fast (big-O computational complexity, and seconds per example in
practice) you can generate answers.
• Semi-supervised SQuAD: The need for labeled data is arguably the greatest challenge
facing Deep Learning today. Finding ways to get more from less data is an important research
direction. How well can you perform on the SQuAD test set using only 30% of the training
data? 20%? 10%? How might you use large amounts of unlabeled data to approach this task
in a semi-supervised way? To evaluate your model, you will consider not only F1 and EM
score after training on the full training set, but also F1 and EM score after training on only
a fraction of the training set.
• Low-storage SQuAD: Deep Learning models typically have millions of parameters, which
requires a large amount of storage space – this can be a problem, especially for mobile
applications. Model compression is an active area of Deep Learning research [22], with many
8 https://worksheets.codalab.org/worksheets/0xc86d3ebe69a3427d91f9aaa63f7d1e7d/
existing techniques for compression. Can you design a SQuAD model that requires less
storage space? To evaluate your model, you will consider not only your F1 and EM score,
but also the storage size of your model (in number of parameters, or in megabytes).
7 Submitting to the Leaderboard
7.1 Overview
We are hosting four leaderboards on Gradescope, where you can compare your performance against
that of your classmates. F1 score is the performance metric we will use to rank submissions,
although both EM and F1 scores will be displayed. The leaderboards can be found at the following
links:
1. Pretrained Contextual Embedding (PCE) Division.
(a) Dev: https://www.gradescope.com/courses/77592/assignments/347457/leaderboard
(b) Test: https://www.gradescope.com/courses/77592/assignments/347465/leaderboard
2. Non-PCE Division.
(a) Dev: https://www.gradescope.com/courses/77592/assignments/347537/leaderboard
(b) Test: https://www.gradescope.com/courses/77592/assignments/347538/leaderboard
Recall that you should choose a division based on whether you use PCE (BERT, ALBERT, etc.)
or not. Within your division, you may submit to the dev leaderboard as many times as you like,
but you will only be allowed 3 successful submissions to the test leaderboard. For your
final report, we will ask you to choose a single test leaderboard submission to consider for your
final performance. Therefore you must make at least one submission to the test leaderboard, but
be careful not to use up your test submissions before you have finished developing your best model.
Submitting to the leaderboard is similar to submitting any other assignment on Gradescope,
except that your submission is a CSV file of answers on the dev/test set. You may use the starter code’s test.py script to generate a submission file in the correct format, or see lines 128-135 for example code that generates a submission file. At a high level, the submission file should look like the following:
Id,Predicted
001fefa37a13cdd53fd82f617,Governor Vaudreuil
00415cf9abb539fbb7989beba,May 1754
00a4cc38bd041e9a4c4e545ff,
...
fffcaebf1e674a54ecb3c39df,1755
The header is required, and each subsequent row must contain two columns: the first column is a
25-digit hexadecimal ID for the question/answer example (IDs defined in {dev,test}-v2.0.json),
and the second column is your predicted answer (or the empty string for no answer). The rows
can be in any order. For the test leaderboard, you must submit a prediction for every example; for the dev leaderboard, you must submit predictions for at least 95% of the examples (the 5% slack allows for the default preprocessing in setup.py, which throws away some long examples in the dev set).
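A minimal sketch of writing a file in this format, assuming you already have a dict mapping example IDs to predicted answer strings (test.py produces essentially this output for you):

import csv

def write_submission(predictions, path):
    """predictions: dict mapping 25-digit hex example id -> answer string ('' for no-answer)."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Id', 'Predicted'])
        for example_id, answer in predictions.items():
            writer.writerow([example_id, answer])

write_submission({'001fefa37a13cdd53fd82f617': 'Governor Vaudreuil',
                  '00a4cc38bd041e9a4c4e545ff': ''},
                 'submission.csv')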
8 Grading Criteria
The final project will be graded holistically. This means we will look at many factors when
determining your grade: the creativity, complexity and technical correctness of your approach,
your thoroughness in exploring and comparing various approaches, the strength of your results,
the effort you applied, and the quality of your write-up, evaluation, and error analysis. Generally,
implementing more complicated models represents more effort, and implementing more unusual
models (e.g. ones that we have not mentioned in this handout) represents more creativity. You are
not required to pursue original ideas, but the best projects in this class will go beyond the ideas
described in this handout, and may in fact become published work themselves!
There is no pre-defined F1 or EM score to ensure a good grade. Though we have run some
preliminary tests to get some ballpark scores, it is impossible to say in advance what distribution of
scores will be reasonably achievable for students in the provided timeframe. As in previous years,
we will have to grade performance relative to the leaderboard as a whole (though, comparing only
within the PCE and non-PCE divisions, as described in Section 5.1).
For similar reasons, there is no pre-defined rule for which of the models in Section 5 (or else-
where) would ensure a good grade. Implementing a small number of things with good results and
thorough experimentation/analysis is better than implementing a large number of things that don’t
work, or barely work. In addition, the quality of your writeup and experimentation is important:
we expect you to convincingly show that your techniques are effective and describe why they work
(or the cases when they don’t work).
In the analysis section of your report, we want to see you go beyond the simple F1 and EM
results of your model. Try breaking down the scores – for example, how does your model perform on questions that start with ‘who’? Questions that start with ‘when’? Questions that start with ‘why’? What other categories are there? Can you categorize the types of errors made by your model?
As with all final projects, larger teams are expected to do correspondingly larger projects. We
will expect more complex things implemented, more thorough experimentation, and better results
from teams with more people.
9 Honor Code
Any honor code guidelines that apply for the final project in general also apply for the default final
project. Here are some guidelines that are specifically relevant to the default final project:
1. You are allowed to use whatever existing code, libraries, or data you wish, with one exception
for the non-PCE division (see Rule 2). However, you must clearly cite your sources and
indicate which parts of the project are not your work. If you use or borrow code from any
external libraries, describe how you use the external code, and provide a link to the source
in your writeup. You will be graded based on what you add on top of others’ work.
2. In the non-PCE division, you may not use a pre-existing implementation for the SQuAD
challenge as your starting point unless you wrote that implementation yourself. If you believe
you have a good reason to use a pre-existing SQuAD implementation as your starting point
(for example, you have a specific cutting-edge research idea that would build on the state-
of-the-art), make a Piazza post to get permission.
(a) Note: This rule does not apply to the PCE division. You may use whatever code you
like in the PCE division, provided you acknowledge it explicitly in your code and paper.
3. As described in Section 3.1, it is an honor code violation to use the official SQuAD dev set
in any way.
4. You are free to discuss ideas and implementation details with other teams (in fact, we en-
courage it!). However, under no circumstances may you look at another CS224n team’s code,
or incorporate their code into your project.
5. Do not share your code publicly (e.g., in a public GitHub repo) until after the class has
finished.
10 FAQs
10.1 How are out-of-vocabulary words handled?
Our baseline represents input words using pre-trained fixed GloVe embeddings. Out-of-vocabulary
words (i.e. those that don’t have a GloVe embedding) are represented by the UNK token. This is
a special token in the vocabulary that has its own fixed word vector (set to a small random value).
This is something you could improve.
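Concretely, the mapping looks something like the following sketch (the token names and indices here are illustrative assumptions; the actual vocabulary layout is defined by setup.py):

import torch

def words_to_indices(words, word2idx, unk_idx=1):
    """Map tokens to vocabulary indices, falling back to the UNK index for OOV words."""
    return torch.tensor([word2idx.get(w, unk_idx) for w in words])

# Hypothetical tiny vocabulary; index 1 is the UNK token.
word2idx = {'--PAD--': 0, '--UNK--': 1, 'einstein': 2, 'physics': 3}
print(words_to_indices(['einstein', 'studied', 'physics'], word2idx))  # tensor([2, 1, 3])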
References
[1] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. CoRR, abs/1606.05250, 2016.
[2] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable
questions for SQuAD. arXiv preprint arXiv:1806.03822, 2018.
[3] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional
attention flow for machine comprehension. arXiv preprint arXiv:1611.01603, 2016.
[4] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv
preprint arXiv:1505.00387, 2015.
[5] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
[6] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint
arXiv:1212.5701, 2012.
[7] Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. Zero-shot relation extraction
via reading comprehension. arXiv preprint arXiv:1706.04115, 2017.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,
2018.
[9] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and
Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In
International Conference on Learning Representations, 2020.
[10] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Ken-
ton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint
arXiv:1802.05365, 2018.
[11] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed repre-
sentations of words and phrases and their compositionality. In Advances in neural information
processing systems, pages 3111–3119, 2013.
[12] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for
word representation. In Proceedings of the 2014 conference on empirical methods in natural
language processing (EMNLP), pages 1532–1543, 2014.
[13] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin.
Advances in pre-training distributed word representations. In Proceedings of the International
Conference on Language Resources and Evaluation (LREC 2018), 2018.
[14] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Infor-
mation Processing Systems, pages 5998–6008, 2017.
[15] Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi,
and Quoc V Le. QANet: Combining local convolution with global self-attention for reading
comprehension. arXiv preprint arXiv:1804.09541, 2018.
[16] Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and
Ruslan Salakhutdinov. Transformer-XL: Language modeling with longer-term dependency.
2018.
[17] Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In
International Conference on Learning Representations (ICLR), 2020.
[18] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading wikipedia to answer
open-domain questions. arXiv preprint arXiv:1704.00051, 2017.
[19] Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bach-
man, and Kaheer Suleman. NewsQA: A machine comprehension dataset. ACL 2017, page 191,
2017.
[20] John Hewitt and Percy Liang. Designing and interpreting probes with control tasks. In
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and
the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),
2019.
[21] Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension sys-
tems. arXiv preprint arXiv:1707.07328, 2017.
[22] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and
acceleration for deep neural networks. arXiv preprint arXiv:1710.09282, 2017.