
Practical Tips for Final Projects Notes

February 2019

Contents

1 Introduction

2 Choosing a Project Topic
   2.1 Expectations
   2.2 Finding existing research
   2.3 Finding datasets and tasks
   2.4 Obtaining access to datasets via Stanford
   2.5 Collecting your own data

3 Project Advice
   3.1 Define your goals
   3.2 Processing data
   3.3 Data hygiene
   3.4 Build strong baselines
   3.5 Training and debugging neural models
   3.6 Evaluation

1 Introduction
These notes complement the information provided in Lecture 9 of CS224n 2019, Practical Tips for Final Projects. In particular, this document contains many links to useful resources. It does not contain a detailed specification of the project deliverables (proposal, milestone, poster and report) – those specifications will be released separately.

Section 2 contains information on how to choose and formulate a project topic. It is mostly of use to Custom Project teams, though some Default Project teams may find parts of it useful too.

Section 3 contains information useful for all teams.

This document may be updated a few times during Week 5.

2 Choosing a Project Topic
Project Suitability You can choose any topic related to Deep Learning for
NLP. That means your project should make substantive use of deep learning
and substantive use of human language data.

Project types Here is a (non-exhaustive!) list of possible project types:

1. Applying an existing neural model to a new task
2. Implementing a complex neural architecture
3. Proposing a new neural model (or a new variation of an existing model)
4. Proposing a new training, optimization, or evaluation scheme
5. Experimental and/or theoretical analysis of neural models

Project ideas from Stanford researchers We have collected a list of project ideas from members of the Stanford AI Lab.[1] These are a great opportunity to work on an interesting research problem with an external mentor. If you want to pursue one of these, get started early!

2.1 Expectations
Your project should aim to provide some kind of scientific knowledge gain,
similar to typical NLP research papers (see Section 2.2 on where to find them).
A typical case is that your project will show that your proposed method provides
good results on an NLP task you’re dealing with.
Given that you only have a few weeks to work on your project, it is not
necessary that your method beats the state-of-the-art performance, or works
better than previous methods. But it should at least show performance broadly
expected of the kinds of methods you’re using.
In any case, your paper should try to provide reasoning explaining the be-
haviour of your model. You will need to provide some qualitative analysis,
which will be useful for supporting or testing your explanations. This will be
particularly important when your method is not working as well as expected.
Ultimately, your project will be graded holistically, taking into account many
criteria: originality, performance of your methods, complexity of the techniques
you used, thoroughness of your evaluation, amount of work put into the project,
analysis quality, writeup quality, etc.
[1] https://docs.google.com/document/d/1Ytncuq6tpiSGHsJBkdzskMf0nw4_x2AJ1rZ7RvpOv5E/edit?usp=sharing

2.2 Finding existing research
Generally, it’s much easier to define your project (see Section 3.1) if there is
existing published research using the same or similar task, dataset, approaches,
and/or evaluation metrics. Identifying existing relevant research (and even ex-
isting code) will ultimately save you time, as it will provide a blueprint of how
you might sensibly approach the project. There are many ways to find relevant
research papers:
• You could browse recent publications at any of the top venues where NLP and/or Deep Learning research is published: ACL, EMNLP, TACL, NAACL, EACL, NIPS, ICLR, ICML (not exhaustive!).
• In particular, publications at many NLP venues are indexed at
http://www.aclweb.org/anthology/
• Try a keyword search at:
– http://arxiv.org/
– http://scholar.google.com
– http://dl.acm.org/
– http://aclasb.dfki.de/
• Look at publications from the Stanford NLP group:
https://nlp.stanford.edu/pubs/

2.3 Finding datasets and tasks


There are lots of publicly-available datasets on the web. Here are some useful resources for finding datasets:
• A repository tracking progress in NLP, including listings of the datasets and the current state-of-the-art performance for the most common NLP tasks:
https://nlpprogress.com/
• A small list of well-known standard datasets for common NLP tasks:
https://machinelearningmastery.com/datasets-natural-language-processing/
• An alphabetical list of free or public domain text datasets:
https://github.com/niderhoff/nlp-datasets
• Wikipedia has a list of machine learning text datasets, tabulated with useful information such as dataset size:
https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research#Text_data
• Kaggle has many datasets, though some of them are too small for Deep Learning. Try searching for 'nlp':
https://www.kaggle.com/datasets
• Datahub has lots of datasets, though not all of them are Machine Learning focused:
https://datahub.io/collections
• Microsoft Research has a collection of datasets (look under the 'Dataset directory' tab):
http://research.microsoft.com/en-US/projects/data-science-initiative/datasets.aspx
• A script that searches arXiv papers for a keyword and extracts important information such as performance metrics on a task:
https://huyenchip.com/2018/10/04/sotawhat.html
• A collection of links to more collections of links to datasets!
http://kevinchai.net/datasets
• A collection of papers with code for many NLP tasks:
https://paperswithcode.com/sota
• Datasets for machine translation:
http://statmt.org
• Syntactic corpora for many languages:
https://universaldependencies.org

2.4 Obtaining access to datasets via Stanford


Stanford has purchased access to a lot of natural language data:
• The Linguistic Data Consortium (LDC) is a very large repository of data from many languages, including lots of annotated data. Stanford has purchased many of the datasets available there:
http://www.ldc.upenn.edu/
• A list of datasets that Stanford has access to (not exhaustive):
https://linguistics.stanford.edu/resources/corpora/corpus-inventory
To obtain access to data, please follow the instructions at the link below (make sure to CC the CS224n staff when you email):
https://linguistics.stanford.edu/resources/corpora/accessing-corpora

2.5 Collecting your own data


It is possible to collect your own data for your project. However, data collec-
tion is often a time-consuming and messy process that is more difficult than it
appears. Given the limited timeframe, we generally don’t recommend collecting
your own data. If you really do want to collect your own data, make sure to
budget the data collection time into your project. Remember, your project must
have a substantial Deep Learning component, so if you spend all your time on
data collection and none on building neural models, we can’t give you a good
grade.

3 Project Advice
3.1 Define your goals
At the very beginning of your project, it’s important to clearly define your
goals in your mind and make sure everyone in your team understands them. In
particular:
• Clearly define the task. What’s the input and what’s the output? Can
you give an example? If the task can’t be framed as input and output,
what exactly are you trying to achieve?

• What dataset(s) will you use? Is that dataset already organized into the
input and output sections described above? If not, what’s your plan to
obtain the data in the format that you need?
• What is your evaluation metric (or metrics)? This needs to be a well-
defined, numerical metric (e.g. ROUGE score), not a vague idea (e.g.
‘summary quality’). See section 3.6 for more detail on how to evaluate
your methods.
• What does success look like for your project? For your chosen evaluation
metrics, what numbers represent expected performance, based on previous
research? If you’re doing an analysis or theoretical project, define your
hypothesis and figure out how your experiments will confirm or negate
your hypothesis.

3.2 Processing data


You may need to do (additional) processing of your data (e.g. tokenization, tagging, or parsing). Here are some tools that may be useful:
• StanfordNLP (new!): a Python library providing tokenization, tagging, parsing, and other capabilities. Covers 53 languages:
https://stanfordnlp.github.io/stanfordnlp/
• Other software from the Stanford NLP group:
http://nlp.stanford.edu/software/index.shtml
• NLTK, a lightweight Natural Language Toolkit package in Python:
http://nltk.org/
• spaCy, another Python package that can do preprocessing, but also includes neural models (e.g. Language Models):
https://spacy.io/
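As a concrete illustration, here is a minimal preprocessing sketch using NLTK and spaCy (both listed above). The example sentence and the en_core_web_sm model name are illustrative assumptions, not part of these notes.

    # Minimal tokenization/tagging sketch (assumes: pip install nltk spacy,
    # plus: python -m spacy download en_core_web_sm).
    import nltk
    import spacy
    from nltk.tokenize import word_tokenize

    nltk.download("punkt")  # tokenizer models used by word_tokenize

    text = "Stanford is in California."  # illustrative example sentence

    # NLTK: plain word tokenization.
    print(word_tokenize(text))  # ['Stanford', 'is', 'in', 'California', '.']

    # spaCy: tokenization plus part-of-speech tags in one pass.
    nlp = spacy.load("en_core_web_sm")
    print([(token.text, token.pos_) for token in nlp(text)])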

3.3 Data hygiene
At the beginning of your project, split your data set into training data (most of your data), development data (also known as validation data) and test data. A typical train/dev/test split might be 90/5/5 percent, assigned randomly; a minimal splitting sketch appears after the list below. Many NLP datasets come with predefined splits, and if you want to compare against existing work on the same dataset, you should use the same split as that work. Here is how you should use these data splits in your project:
1. Training data: Use this (and only this data!) to optimize the parameters
of your neural model.

2. Development data: This has two main uses. The first is to compare the
performance of your different models (or versions of the same model) by
computing the evaluation metric on the development data. This enables
you to choose the best hyperparameters and/or architectural choices that
should be evaluated on the test data. The second important usage of
development data is to decide when to stop training your model. Two
simple and common methods for deciding when to stop training are:
(a) Every epoch (or every N training iterations, where N is predefined), record the performance of the current model on the development set and store the current model as a checkpoint. If development performance is worse than at the previous evaluation (alternatively, if it fails to beat the best performance M times in a row, where M is predefined), stop training and keep the best checkpoint. A sketch of this patience-based criterion appears at the end of this section.
(b) Train for E epochs (where E is some predefined number) and, after
each epoch, record performance of the current model on the develop-
ment set and store the current model as a checkpoint. Once the E
epochs are finished, stop training and keep the best checkpoint.
3. Test data: At the end of your project, evaluate your best trained model(s)
on the test data to compute your final performance metric. To be scien-
tifically honest, you should only use the training and development data to
select which models to evaluate on the test set.
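As mentioned above, here is a minimal sketch of a random 90/5/5 split. The file name data.txt (one example per line), the output file names, and the seed are illustrative assumptions.

    # Minimal random 90/5/5 split sketch (assumes data.txt holds one example per line).
    import random

    random.seed(224)  # fix the seed so the split is reproducible

    with open("data.txt", encoding="utf-8") as f:
        examples = [line.rstrip("\n") for line in f]

    random.shuffle(examples)
    n_train = int(0.90 * len(examples))
    n_dev = int(0.05 * len(examples))

    splits = {
        "train": examples[:n_train],
        "dev": examples[n_train:n_train + n_dev],
        "test": examples[n_train + n_dev:],  # remaining ~5%
    }
    for name, split in splits.items():
        with open(name + ".txt", "w", encoding="utf-8") as out:
            out.write("\n".join(split) + "\n")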

The reason we use data splits is to avoid overfitting. If you simply selected
the model that does best on your training set, then you wouldn’t know how
well your model would perform on new samples of data – you’d be overfitting
to the training set. In NLP, powerful neural models are particularly prone to
overfitting to their training data, so this is especially important.
Similarly, if you look at the test set before you’ve chosen your final archi-
tecture and hyperparameters, that might impact your decisions and lead your
project to overfit to the test data. Thus, in the interest of science, it is extremely
important that you don’t touch the test set until the very end of your project.
This will ensure that the quantitative performance that you report will be an
honest unbiased estimate of how your method will do on new samples of data.

It’s even possible to overfit to the development set. If we train many differ-
ent model variants, and only keep the hyperparameters and architectures that
perform best on the development set, then we may be overfitting to the devel-
opment set. To fix this, you can even use two separate development sets (one of
them called the tuning set, the other one the development set). The tuning
set is used for optimizing hyperparameters, the development set for measuring
overall progress. If you optimize your hyperparameters a lot and do many it-
erations, you may even want to create multiple distinct development sets (dev,
dev2, ...) to avoid overfitting.
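To make stopping criterion (a) above concrete, here is a minimal patience-based early-stopping sketch. It assumes a PyTorch-style model exposing state_dict()/load_state_dict(); train_one_epoch and evaluate_on_dev are hypothetical placeholders for your own training loop and development-set evaluation (higher score = better), and patience plays the role of M.

    # Patience-based early stopping sketch (criterion (a) above).
    # train_one_epoch and evaluate_on_dev are hypothetical placeholders.
    import copy

    def train_with_early_stopping(model, train_data, dev_data,
                                  max_epochs=50, patience=3):
        best_score = float("-inf")
        best_state = copy.deepcopy(model.state_dict())  # best checkpoint so far
        bad_evals = 0

        for epoch in range(max_epochs):
            train_one_epoch(model, train_data)         # one pass over the training data
            score = evaluate_on_dev(model, dev_data)   # dev-set evaluation metric

            if score > best_score:
                best_score = score
                best_state = copy.deepcopy(model.state_dict())
                bad_evals = 0
            else:
                bad_evals += 1
                if bad_evals >= patience:  # failed to beat the best M times in a row
                    break

        model.load_state_dict(best_state)  # keep the best checkpoint
        return model, best_score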

3.4 Build strong baselines


A baseline is a simpler method to compare your more complex neural system
against. Baselines are important so that we can understand the performance of
our systems in context.
For example, suppose you’re building a multilayer LSTM-based network with
attention to do binary sentiment analysis (classification of sentences as positive
or negative). The simplest baseline is the guessing baseline, which would achieve
50% accuracy (assuming the dataset is 50% positive and 50% negative). A more
complex baseline would be a simple non-neural Machine Learning algorithm,
such as a Naive Bayes classifier. You could also have simple neural baselines
– for example, encoding the sentence using an average of word embeddings.
Lastly, you should compare against simpler versions of your full model, such as
a vanilla RNN version, a single-layer version, or a version without attention.
These last few options are also called ablation experiments. An ablation experiment removes some part of the full model and measures the performance; this is useful for quantifying how much different parts of the network help performance. Ablation experiments are an excellent way to analyze your model.
Building strong baselines is very important. Too often, researchers and
practitioners fall into the trap of making baselines that are too weak, or failing
to define any baselines at all. In this case, we cannot tell whether our complex
neural systems are adding any value at all. Sometimes, strong baselines perform
much better than you expected, and this is important to know.
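As an illustration of a simple non-neural baseline, here is a minimal Naive Bayes bag-of-words sketch. It assumes scikit-learn is installed; the toy texts and labels are placeholders for your own train/dev data and are not part of these notes.

    # Naive Bayes bag-of-words baseline sketch (assumes scikit-learn).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score

    # Toy placeholder data; replace with your own train/dev split (1 = positive, 0 = negative).
    train_texts = ["a great movie", "really enjoyable film", "absolutely terrible", "boring and bad"]
    train_labels = [1, 1, 0, 0]
    dev_texts = ["great and enjoyable", "terrible and boring"]
    dev_labels = [1, 0]

    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_texts)  # bag-of-words counts
    X_dev = vectorizer.transform(dev_texts)

    baseline = MultinomialNB().fit(X_train, train_labels)
    dev_pred = baseline.predict(X_dev)
    print("Naive Bayes baseline dev accuracy:", accuracy_score(dev_labels, dev_pred))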

3.5 Training and debugging neural models


Unfortunately, neural networks are notoriously difficult to debug. However, here
are some tips:
• To debug a neural model, train on a small toy dataset (e.g., a small fraction of the training data, or a hand-created toy dataset) to sanity-check and diagnose bugs. For example, if your model is unable to overfit (i.e. achieve near-zero training loss) on this toy dataset, then you probably have a bug in your implementation. A minimal sketch of this check appears after this list.
• Hyperparameters (e.g. learning rate, number of layers, dropout rate, etc.) often impact results significantly. Use performance on the development set to tune these parameters (see Section 3.3). Though you probably won't have time for a very exhaustive hyperparameter search, try to identify the most sensitive/important hyperparameters and tune those.
• Due to their power, neural networks overfit easily. Regularization (dropout,
weight decay) and stopping criteria based on the development set (e.g.
early stopping, see Section 3.3) are extremely important for ensuring your
model will do well on new unseen data samples.
• A more complicated neural model can ‘fail silently’: It may get decent
performance by relying on the simple parts, when the really interesting
components (e.g. a cool attention mechanism) fail due to a bug. Use
ablation experiments (see Section 3.4) to diagnose whether different parts
of the model are adding value.
• During training, randomize the order of training samples, or at least re-
move obvious ordering (e.g., don’t train your Seq2Seq system on data by
going through the corpus in the order of sentence length – instead, make
sure the lengths of the sentences in subsequent minibatches are uncorre-
lated). SGD relies on the assumption that the samples come in random
order.
• There are many online resources containing practical advice for building neural network models. Note that most of this advice is based more on personal experience than on rigorous theory, and these ideas evolve over time – so take it with a grain of salt! After all, your project could open new perspectives by disproving some commonly-held belief.
– A Twitter thread on the most common neural net mistakes (June 2018):
https://twitter.com/karpathy/status/1013244313327681536
– Deep Learning for NLP Best Practices blog post (July 2017):
http://ruder.io/deep-learning-nlp-best-practices/
– Practical Advice for Building Deep Neural Networks (October 2017):
https://pcc.cs.byu.edu/2017/10/02/practical-advice-for-building-deep-neural-networks/
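As mentioned in the first bullet above, here is a minimal PyTorch sketch of the 'overfit a tiny dataset' sanity check; the toy data, model size, and number of steps are illustrative assumptions.

    # Sanity check sketch: a correctly implemented model should drive the
    # training loss to ~0 on a tiny toy dataset (assumes PyTorch).
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Toy data: 8 random feature vectors with binary labels.
    x = torch.randn(8, 16)
    y = torch.randint(0, 2, (8,))

    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(500):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    print("final toy-data loss:", loss.item())
    # If this loss is not close to zero, suspect a bug (e.g. in the loss,
    # the optimizer setup, or the data pipeline) before scaling up.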

3.6 Evaluation
In your project, carrying out meaningful evaluation is as important as designing
and building your neural models. Meaningful evaluation means that you should
carefully compare the performance of your methods using appropriate evaluation
metrics.

Choosing evaluation metrics You must have at least one evaluation metric
(which should be a numerical metric that can be automatically computed) to
measure the performance of your methods. If there is existing published work
on the same dataset and/or task, you should use the same metric(s) as that
work (though you can evaluate on additional metrics if you think it’s useful).

Human evaluation Human evaluation is often necessary in research areas
that lack good, comprehensive automatic evaluation metrics (e.g. some natural
language generation tasks). If you want to use human judgment as an evaluation
metric, you are welcome to do so (though you may find it difficult to find the time
and/or funding to collect many human evaluations). Collecting a small number
of human judgments could be a valuable addition to your project, but you must
have at least one automatic evaluation metric – even if it is an imperfect metric.

What to compare You should use your evaluation metric(s) to (a) com-
pare your model against previous work, (b) compare your model against your
baselines (see Section 3.4), and (c) compare different versions of your model.
When comparing against previous work, make sure to get the details right – for
example, did the previous work compute the BLEU metric in a case-sensitive
or case-insensitive way? If you calculate your evaluation metric differently to
previous work, the numbers are not comparable!
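For instance, here is a minimal sketch of computing corpus-level BLEU with NLTK (one of the toolkits listed in Section 3.2). The toy sentences are illustrative, and details such as whether both sides are lowercased are exactly the kind of thing that must match the previous work you compare against.

    # Corpus-level BLEU sketch with NLTK (illustrative toy sentences).
    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    # One list of reference translations per hypothesis (each reference is a token list).
    references = [[["the", "cat", "sat", "on", "the", "mat"]],
                  [["there", "is", "a", "dog", "in", "the", "garden"]]]
    hypotheses = [["the", "cat", "sat", "on", "the", "mat"],
                  ["a", "dog", "is", "in", "the", "garden"]]

    # Smoothing avoids zero scores when some n-gram orders have no matches.
    bleu = corpus_bleu(references, hypotheses,
                       smoothing_function=SmoothingFunction().method1)
    print("BLEU:", bleu)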

Qualitative evaluation So far this section has discussed quantitative evaluation – numerical performance measures. Qualitative evaluation, or analysis, seeks to understand your system (how it works, when it succeeds and when it fails) by measuring or inspecting key characteristics or outputs of your model. You will be expected to include some qualitative evaluation in your final report. Here are some types of qualitative evaluation:

• A simple kind of qualitative evaluation is to include some examples (e.g. input and model output) in your report. However, don't just provide random examples without comment – find interesting examples that support your paper's overall arguments, and comment on them.[2]
• Error analysis (as seen in some of the assignments) is another important type of qualitative evaluation. Try to identify categories of errors.
• Break down the performance metric by some criterion. For example, if you think a translation model is especially bad at translating long sentences, show that by plotting the BLEU score as a function of source sentence length (see the sketch after this list).
• Compare the performance of two systems beyond the single evaluation
metric number. For example, what examples does your model get right
that the baseline gets wrong, and vice versa? Can these examples be
characterized by some quality? If so, substantiate that claim by measuring
or plotting the quality.
• If your model uses attention, you can create a plot or visualization of the attention distribution to see what the model attended to on particular examples.

[2] It can be useful to provide a true random sample of model outputs, especially for e.g. natural language generation papers. This gives readers a true, non-cherry-picked qualitative overview of your model's output. If you wish to do this, make it clear that the selection is random, and comment on it.
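As referenced in the 'break down the performance metric' bullet above, here is a minimal matplotlib sketch of plotting a metric as a function of source sentence length; the per-example lengths and scores are illustrative placeholders for values computed from your own system.

    # Sketch: plot an evaluation metric as a function of source sentence length
    # (assumes matplotlib; lengths and scores are illustrative placeholders).
    from collections import defaultdict
    import matplotlib.pyplot as plt

    # Hypothetical per-example results: (source length in tokens, sentence-level score).
    results = [(5, 0.42), (8, 0.39), (12, 0.35), (17, 0.31), (23, 0.26), (31, 0.20)]

    buckets = defaultdict(list)
    for length, score in results:
        buckets[length // 10 * 10].append(score)  # bucket lengths into bins of 10 tokens

    xs = sorted(buckets)
    ys = [sum(buckets[x]) / len(buckets[x]) for x in xs]

    plt.plot(xs, ys, marker="o")
    plt.xlabel("Source sentence length (tokens, bucketed)")
    plt.ylabel("Average sentence-level score")
    plt.title("Metric vs. source sentence length")
    plt.savefig("metric_by_length.png")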
If your method is successful, qualitative evaluation is important to understand
the reason behind the numbers, and identify areas for improvement. If your
method is unsuccessful, qualitative evaluation is even more important to under-
stand what went wrong.

