Final Project Practical Tips
February 2019
Contents
1 Introduction 1
2 Choosing a Project Topic 2
2.1 Expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.2 Finding existing research . . . . . . . . . . . . . . . . . . . . . . 3
3 Project Advice 5
3.1 Define your goals . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.2 Processing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.3 Data hygiene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.4 Build strong baselines . . . . . . . . . . . . . . . . . . . . . . . . 7
3.5 Training and debugging neural models . . . . . . . . . . . . . . . 7
3.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1 Introduction
These notes complement the information provided in Lecture 9 of CS224n 2019,
Practical Tips for Final Projects. In particular this document contains lots of
links to useful resources. This document does not contain a detailed specifica-
tion for the project deliverables (proposal, milestone, poster and report) – those
specifications will be released separately.
2 Choosing a Project Topic
Project Suitability You can choose any topic related to Deep Learning for
NLP. That means your project should make substantive use of deep learning
and substantive use of human language data.
2.1 Expectations
Your project should aim to provide some kind of scientific knowledge gain,
similar to typical NLP research papers (see Section 2.2 on where to find them).
A typical case is that your project will show that your proposed method provides
good results on an NLP task you’re dealing with.
Given that you only have a few weeks to work on your project, it is not
necessary that your method beats the state-of-the-art performance, or works
better than previous methods. But it should at least show performance broadly
expected of the kinds of methods you’re using.
In any case, your paper should try to provide reasoning explaining the be-
haviour of your model. You will need to provide some qualitative analysis,
which will be useful for supporting or testing your explanations. This will be
particularly important when your method is not working as well as expected.
Ultimately, your project will be graded holistically, taking into account many
criteria: originality, performance of your methods, complexity of the techniques
you used, thoroughness of your evaluation, amount of work put into the project,
analysis quality, writeup quality, etc.
1 https://docs.google.com/document/d/1Ytncuq6tpiSGHsJBkdzskMf0nw4_x2AJ1rZ7RvpOv5E/edit?usp=sharing
2.2 Finding existing research
Generally, it’s much easier to define your project (see Section 3.1) if there is
existing published research using the same or similar task, dataset, approaches,
and/or evaluation metrics. Identifying existing relevant research (and even ex-
isting code) will ultimately save you time, as it will provide a blueprint of how
you might sensibly approach the project. There are many ways to find relevant
research papers:
• You could browse recent publications at any of the top venues where
NLP and/or Deep Learning research is published: ACL, EMNLP, TACL,
NAACL, EACL, NIPS, ICLR, ICML (not an exhaustive list!)
• In particular, publications at many NLP venues are indexed at
http://www.aclweb.org/anthology/
• Try a keyword search at:
– http://arxiv.org/
– http://scholar.google.com
– http://dl.acm.org/
– http://aclasb.dfki.de/
• Look at publications from the Stanford NLP group
https://nlp.stanford.edu/pubs/
• Datahub has lots of datasets, though not all of them are Machine Learning
focused.
https://datahub.io/collections
• Microsoft Research has a collection of datasets (look under the ‘Dataset
directory’ tab):
http://research.microsoft.com/en-US/projects/data-science-initiative/datasets.aspx
• A script to search arXiv papers for a keyword, and extract important
information such as performance metrics on a task.
https://huyenchip.com/2018/10/04/sotawhat.html
• A collection of links to more collections of links to datasets!
http://kevinchai.net/datasets
• A collection of papers with code on many NLP tasks.
https://paperswithcode.com/sota
• Datasets for machine translation.
http://statmt.org
• Syntactic corpora for many languages.
https://universaldependencies.org
3 Project Advice
3.1 Define your goals
At the very beginning of your project, it’s important to clearly define your
goals in your mind and make sure everyone in your team understands them. In
particular:
• Clearly define the task. What’s the input and what’s the output? Can
you give an example? If the task can’t be framed as input and output,
what exactly are you trying to achieve?
• What dataset(s) will you use? Is that dataset already organized into the
input and output sections described above? If not, what’s your plan to
obtain the data in the format that you need?
• What is your evaluation metric (or metrics)? This needs to be a well-
defined, numerical metric (e.g. ROUGE score), not a vague idea (e.g.
‘summary quality’). See section 3.6 for more detail on how to evaluate
your methods.
• What does success look like for your project? For your chosen evaluation
metrics, what numbers represent expected performance, based on previous
research? If you’re doing an analysis or theoretical project, define your
hypothesis and figure out how your experiments will confirm or negate
your hypothesis.
3.2 Processing data
• spaCy, a Python package that can do preprocessing but also includes
neural models (e.g. Language Models):
https://spacy.io/
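Purely as an illustration of the kind of preprocessing such a package provides, here is a minimal spaCy sketch. The example sentence is made up, and en_core_web_sm is just one of several English models (it must be downloaded first with python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The final project is due in March. Start your experiments early!")

# Sentence segmentation
for sent in doc.sents:
    print(sent.text)

# Tokenization, lemmas and part-of-speech tags
for token in doc:
    print(token.text, token.lemma_, token.pos_)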
3.3 Data hygiene
At the beginning of your project, split your data set into training data (most of
your data), development data (also known as validation data) and test data.
A typical train/dev/test split might be 90/5/5 percent (assigned randomly).
Many NLP datasets come with predefined splits, and if you want to compare
against existing work on the same dataset, you should use the same split as used
in that work. Here is how you should use these data splits in your project:
1. Training data: Use this (and only this data!) to optimize the parameters
of your neural model.
2. Development data: This has two main uses. The first is to compare the
performance of your different models (or versions of the same model) by
computing the evaluation metric on the development data. This enables
you to choose the best hyperparameters and/or architectural choices that
should be evaluated on the test data. The second important usage of
development data is to decide when to stop training your model. Two
simple and common methods for deciding when to stop training are:
(a) Every epoch (or every N training iterations, where N is predefined),
record the performance of the current model on the development set
and store the current model as a checkpoint. If development performance
is worse than at the previous evaluation (alternatively, if it fails to
beat the best performance M times in a row, where M is predefined),
stop training and keep the best checkpoint. A minimal sketch of this
patience-based stopping rule is given after this list.
(b) Train for E epochs (where E is some predefined number) and, after
each epoch, record performance of the current model on the develop-
ment set and store the current model as a checkpoint. Once the E
epochs are finished, stop training and keep the best checkpoint.
3. Test data: At the end of your project, evaluate your best trained model(s)
on the test data to compute your final performance metric. To be scien-
tifically honest, you should only use the training and development data to
select which models to evaluate on the test set.
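The sketch below shows one way to implement the random 90/5/5 split and stopping method (a). It assumes a PyTorch-style model (with state_dict / load_state_dict) and hypothetical train_one_epoch and evaluate helpers that you would write for your own task and metric (higher scores assumed better); treat it as an outline, not a definitive recipe.

import copy
import random

def train_dev_test_split(examples, dev_frac=0.05, test_frac=0.05, seed=42):
    # Randomly split a list of examples into train/dev/test (default 90/5/5).
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_dev = int(len(examples) * dev_frac)
    n_test = int(len(examples) * test_frac)
    dev = examples[:n_dev]
    test = examples[n_dev:n_dev + n_test]
    train = examples[n_dev + n_test:]
    return train, dev, test

def train_with_early_stopping(model, train_data, dev_data,
                              train_one_epoch, evaluate,
                              patience=3, max_epochs=50):
    # Stopping method (a): stop once dev performance fails to improve
    # `patience` times in a row; always keep the best checkpoint.
    best_score = float("-inf")
    best_state = None
    bad_evals = 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)    # one pass over the training data
        score = evaluate(model, dev_data)     # evaluation metric on the dev set
        if score > best_score:
            best_score = score
            best_state = copy.deepcopy(model.state_dict())  # checkpoint the best model
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break                         # stop training
    model.load_state_dict(best_state)         # restore the best checkpoint
    return model, best_score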
The reason we use data splits is to avoid overfitting. If you simply selected
the model that does best on your training set, then you wouldn’t know how
well your model would perform on new samples of data – you’d be overfitting
to the training set. In NLP, powerful neural models are particularly prone to
overfitting to their training data, so this is especially important.
Similarly, if you look at the test set before you’ve chosen your final archi-
tecture and hyperparameters, that might impact your decisions and lead your
project to overfit to the test data. Thus, in the interest of science, it is extremely
important that you don’t touch the test set until the very end of your project.
This will ensure that the quantitative performance that you report will be an
honest unbiased estimate of how your method will do on new samples of data.
It’s even possible to overfit to the development set. If we train many differ-
ent model variants, and only keep the hyperparameters and architectures that
perform best on the development set, then we may be overfitting to the devel-
opment set. To fix this, you can use two separate development sets (one called
the tuning set and the other the development set). The tuning
set is used for optimizing hyperparameters, the development set for measuring
overall progress. If you optimize your hyperparameters a lot and do many it-
erations, you may even want to create multiple distinct development sets (dev,
dev2, ...) to avoid overfitting.
3.5 Training and debugging neural models
• Neural models typically have a number of hyperparameters; use the
development set to tune these parameters (see Section 3.3). Though you
probably won’t have time for a very exhaustive hyperparameter search,
try to identify the most sensitive/important hyperparameters and tune
those.
• Due to their power, neural networks overfit easily. Regularization (dropout,
weight decay) and stopping criteria based on the development set (e.g.
early stopping, see Section 3.3) are extremely important for ensuring your
model will do well on new unseen data samples.
• A more complicated neural model can ‘fail silently’: it may get decent
performance by relying on its simpler parts, while the really interesting
components (e.g. a cool attention mechanism) fail due to a bug. Use
ablation experiments (see Section 3.4) to diagnose whether different parts
of the model are adding value.
• During training, randomize the order of training samples, or at least
remove obvious ordering (e.g., don’t train your Seq2Seq system on data
by going through the corpus in the order of sentence length – instead,
make sure the lengths of the sentences in subsequent minibatches are
uncorrelated). SGD relies on the assumption that the samples come in
random order. A small training-setup sketch illustrating this (and the
regularization options mentioned above) follows this list.
• There are many online resources containing practical advice for building
neural network models. Note, most of this type of advice is based more
on personal experience than rigorous theory, and these ideas evolve over
time – so take it with a grain of salt! After all, your project could open
new perspectives by disproving some commonly-held belief.
– A Twitter thread on most common neural net mistakes (June 2018):
https://twitter.com/karpathy/status/1013244313327681536
– Deep Learning for NLP Best Practices blog post (July 2017):
http://ruder.io/deep-learning-nlp-best-practices/
– Practical Advice for Building Deep Neural Networks (October 2017):
https://pcc.cs.byu.edu/2017/10/02/practical-advice-for-building-deep-neural-networks/
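As referenced in the bullet on randomizing training order, here is a minimal PyTorch-style sketch combining shuffled minibatches, dropout, and weight decay in one training loop. The toy data, model sizes, and hyperparameter values are arbitrary placeholders, not recommendations:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: 1000 random 50-dimensional examples with binary labels
train_data = TensorDataset(torch.randn(1000, 50), torch.randint(0, 2, (1000,)))

model = nn.Sequential(
    nn.Linear(50, 128),
    nn.ReLU(),
    nn.Dropout(p=0.3),          # regularization: dropout
    nn.Linear(128, 2),
)

# weight_decay adds L2 regularization to the parameter updates
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
loss_fn = nn.CrossEntropyLoss()

# shuffle=True re-randomizes the order of training samples every epoch,
# so minibatches are not correlated with any ordering in the corpus
loader = DataLoader(train_data, batch_size=32, shuffle=True)

for epoch in range(5):
    model.train()
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()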
3.6 Evaluation
In your project, carrying out meaningful evaluation is as important as designing
and building your neural models. Meaningful evaluation means that you should
carefully compare the performance of your methods using appropriate evaluation
metrics.
Choosing evaluation metrics You must have at least one evaluation metric
(which should be a numerical metric that can be automatically computed) to
measure the performance of your methods. If there is existing published work
on the same dataset and/or task, you should use the same metric(s) as that
work (though you can evaluate on additional metrics if you think it’s useful).
Human evaluation Human evaluation is often necessary in research areas
that lack good, comprehensive automatic evaluation metrics (e.g. some natural
language generation tasks). If you want to use human judgment as an evaluation
metric, you are welcome to do so (though you may find it difficult to find the time
and/or funding to collect many human evaluations). Collecting a small number
of human judgments could be a valuable addition to your project, but you must
have at least one automatic evaluation metric – even if it is an imperfect metric.
What to compare You should use your evaluation metric(s) to (a) com-
pare your model against previous work, (b) compare your model against your
baselines (see Section 3.4), and (c) compare different versions of your model.
When comparing against previous work, make sure to get the details right – for
example, did the previous work compute the BLEU metric in a case-sensitive
or case-insensitive way? If you calculate your evaluation metric differently from
previous work, the numbers are not comparable!
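To make the BLEU example concrete, here is a small sketch using NLTK’s corpus_bleu on made-up sentences (sacreBLEU is another common choice). Lowercasing before scoring changes the number, so cased and uncased scores are not comparable:

from nltk.translate.bleu_score import corpus_bleu

# One hypothesis with one reference, both as token lists (toy example)
references = [[["The", "cat", "sat", "on", "the", "mat", "."]]]
hypotheses = [["the", "cat", "sat", "on", "the", "mat", "."]]

# Case-sensitive: "The" and "the" do not match
bleu_cased = corpus_bleu(references, hypotheses)

# Case-insensitive: lowercase everything before scoring
references_lc = [[[tok.lower() for tok in ref] for ref in refs] for refs in references]
hypotheses_lc = [[tok.lower() for tok in hyp] for hyp in hypotheses]
bleu_uncased = corpus_bleu(references_lc, hypotheses_lc)

print(bleu_cased, bleu_uncased)  # the two numbers differ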
• If your model uses attention, you can create a plot or visualization of the
attention distribution to see what the model attended to on particular
examples.
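A minimal matplotlib sketch of such a visualization is given below. The attention matrix and token lists are made up for illustration; in a real project you would record the weights from your model’s attention layer:

import numpy as np
import matplotlib.pyplot as plt

source_tokens = ["the", "cat", "sat", "down", "</s>"]
target_tokens = ["le", "chat", "s'est", "assis", "</s>"]

# attention[i, j] = weight on source token j when generating target token i
# (random rows summing to 1, standing in for real model output)
attention = np.random.dirichlet(np.ones(len(source_tokens)), size=len(target_tokens))

fig, ax = plt.subplots()
im = ax.imshow(attention, cmap="viridis")
ax.set_xticks(range(len(source_tokens)))
ax.set_xticklabels(source_tokens, rotation=45)
ax.set_yticks(range(len(target_tokens)))
ax.set_yticklabels(target_tokens)
ax.set_xlabel("source tokens (attended to)")
ax.set_ylabel("target tokens (generated)")
fig.colorbar(im)
plt.tight_layout()
plt.savefig("attention_example.png")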
If your method is successful, qualitative evaluation is important to understand
the reason behind the numbers, and identify areas for improvement. If your
method is unsuccessful, qualitative evaluation is even more important to under-
stand what went wrong.