cs224n 2022 Lecture08 Final Project
Christopher Manning
Lecture 8: Final Projects; Practical Tips
Lecture Plan
Lecture 8: Finish last time – final Projects – practical tips!
1. Attention [25 mins]
2. Final bit of neural machine translation [10 mins]
– Mini Break –
3. Final project types and details; assessment revisited [15 mins]
4. Finding research topics; a couple of examples [20 mins]
5. Finding data [10 mins]
6. Care with datasets and in model development [10 mins]
1. Why attention? Sequence-to-sequence: the bottleneck problem
[Figure: an encoder RNN reads the source sentence "il a m’ entarté"; a decoder RNN, started with <START>, generates the target sentence (output) "he hit me with a pie <END>".]
• The encoding of the source sentence needs to capture all information about the source sentence.
• Information bottleneck!
Attention
• Attention provides a solution to the bottleneck problem.
• Core idea: on each step of the decoder, use a direct connection to the encoder to focus
on a particular part of the source sequence
• First, we will show via diagram (no equations), then we will show with equations
Sequence-to-sequence with attention
[Figure, built up over several slides: an encoder RNN reads the source sentence (input) "il a m’ entarté"; a decoder RNN starts from <START> and produces one target word per step.]
• Attention scores: take the dot product of the decoder hidden state with each encoder hidden state.
• Attention distribution: take softmax of the attention scores.
• Attention output: use the attention distribution to take a weighted sum of the encoder hidden states.
• Concatenate the attention output with the decoder hidden state, then use it to compute ŷ_1 ("he").
• Repeat on each later decoder step to produce "hit", "me", "with", "a", "pie".
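Attention: in equations
• We have encoder hidden states $h_1, \dots, h_N \in \mathbb{R}^d$
• On timestep $t$, we have decoder hidden state $s_t \in \mathbb{R}^d$
• We get the attention scores $e^t$ for this step: $e^t = [s_t^\top h_1, \dots, s_t^\top h_N] \in \mathbb{R}^N$
• We take softmax to get the attention distribution $\alpha^t$ for this step (this is a probability distribution and
sums to 1): $\alpha^t = \mathrm{softmax}(e^t) \in \mathbb{R}^N$
• We use $\alpha^t$ to take a weighted sum of the encoder hidden states to get the
attention output $a_t$: $a_t = \sum_{i=1}^{N} \alpha^t_i h_i \in \mathbb{R}^d$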
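The three steps above (scores, softmax, weighted sum) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names and the toy vectors are assumptions made for the example, and a real model would use batched tensor operations in PyTorch.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [x / total for x in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def dot_product_attention(encoder_states, decoder_state):
    """Return (attention distribution, attention output) for one decoder step."""
    scores = [dot(decoder_state, h) for h in encoder_states]  # attention scores
    alpha = softmax(scores)                                   # attention distribution
    dim = len(encoder_states[0])
    # Attention output: weighted sum of the encoder hidden states.
    output = [sum(a * h[j] for a, h in zip(alpha, encoder_states))
              for j in range(dim)]
    return alpha, output

# Toy example (made-up numbers): 4 encoder states, hidden size 3.
encoder_states = [[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0]]
decoder_state = [0.0, 2.0, 0.0]

alpha, attention_output = dot_product_attention(encoder_states, decoder_state)
print(alpha)             # most of the weight falls on states 2 and 4
print(attention_output)  # a vector of the same size as one encoder state
```

The attention output has the same size as a single encoder state no matter how long the source sentence is, which is what lets the decoder bypass the bottleneck.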
Attention is great!
• Attention significantly improves NMT performance
• It’s very useful to allow decoder to focus on certain parts of the source
• Attention provides more “human-like” model of the MT process
• You can look back at the source sentence while translating, rather than needing to remember it all
• Attention solves the bottleneck problem
• Attention allows decoder to look directly at source; bypass bottleneck
• Attention helps with the vanishing gradient problem
• Provides shortcut to faraway states
• Attention provides some interpretability
• By inspecting attention distribution, we see what the decoder was focusing on
• We get (soft) alignment for free!
• This is cool because we never explicitly trained an alignment system
• The network just learned alignment by itself
[Figure: attention alignment between the source words (il, a, m’, entarté) and the target words (he, hit, me, with, a, pie)]
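There are several attention variants
• We have some values $h_1, \dots, h_N \in \mathbb{R}^{d_1}$ and a query $s \in \mathbb{R}^{d_2}$
• Attention always involves computing attention scores from the query and the values, taking softmax to get
an attention distribution $\alpha$, and using $\alpha$ to take a weighted sum of the values,
thus obtaining the attention output $a$ (sometimes called the context vector)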
Attention variants
You’ll think about the relative advantages/disadvantages of these in Assignment 4!
• Reduced-rank multiplicative attention: $e_i = s^\top U^\top V h_i = (U s)^\top (V h_i)$
• For low-rank matrices $U \in \mathbb{R}^{k \times d_2}$, $V \in \mathbb{R}^{k \times d_1}$, $k \ll d_1, d_2$
• Remember this when we look at Transformers next week!
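A small numeric check can make the reduced-rank identity concrete. The sketch below uses made-up matrices (and a toy k that is not actually much smaller than the other dimensions) to confirm that projecting the query and the value into a k-dimensional space and taking a dot product there gives the same score as building the full matrix first.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [dot(row, v) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    Bt = transpose(B)
    return [[dot(row, col) for col in Bt] for row in A]

# Toy sizes (assumptions for illustration): k = 2, d_2 = 2 (query), d_1 = 3 (values).
U = [[1.0, 2.0],
     [0.0, 1.0]]          # k x d_2
V = [[0.5, -1.0, 3.0],
     [1.0, 0.0, -1.0]]    # k x d_1
s = [2.0, 1.0]            # query
h = [4.0, 1.0, 1.0]       # one value vector

# Full form: s^T (U^T V) h, which materialises a d_2 x d_1 matrix.
W = matmul(transpose(U), V)
score_full = dot(s, matvec(W, h))

# Factored form: (U s)^T (V h), two projections into k dimensions and a dot product.
score_low_rank = dot(matvec(U, s), matvec(V, h))

print(score_full, score_low_rank)
```

With real sizes, the factored form stores k(d_1 + d_2) parameters instead of d_1 d_2, which is the point of choosing k small.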
Attention is a general Deep Learning technique
• More general definition of attention:
• Given a set of vector values, and a vector query, attention is a technique to compute
a weighted sum of the values, dependent on the query.
Intuition:
• The weighted sum is a selective summary of the information contained in the values,
where the query determines which values to focus on.
• Attention is a way to obtain a fixed-size representation of an arbitrary set of
representations (the values), dependent on some other representation (the query).
Upshot:
• Attention has become the powerful, flexible, general way to do pointer and memory
manipulation in all deep learning models. A new idea from after 2010! From NMT!
2. So, is Machine Translation solved?
• Nope!
• Many difficulties remain:
• Out-of-vocabulary words
• Domain mismatch between train and test data
• Maintaining context over longer text
• Low-resource language pairs
• Failures to accurately capture sentence meaning
• Pronoun (or zero pronoun) resolution errors
• Morphological agreement errors
So is Machine Translation solved?
• Nope!
• NMT picks up biases in training data
Source: https://hackernoon.com/bias-sexist-or-this-is-the-way-it-should-be-ce1f7c8c683c
So is Machine Translation solved?
Source: https://blog.google/products/translate/reducing-gender-bias-google-translate/
So is Machine Translation solved?
• Nope!
• Uninterpretable systems can do strange things
• (But, AFAICS, this problem has been fixed in Google Translate by 2021.)
Cherokee
• Cherokee originally lived in western North Carolina and eastern Tennessee
• Most speakers now in Oklahoma, following the Trail of Tears; some in NC
• Writing system invented by Segwoya (often written Sequoyah) around
1820 – someone who grew up illiterate
• Very effective: In the following decades Cherokee literacy was higher
than for white people in the southeastern United States
• https://www.cherokee.org
NMT research continues
NMT is an important use case for NLP Deep Learning
• NMT research pioneered many of the recent innovations of NLP Deep Learning
• But, overall, in the last few years more of the excitement has moved to question
answering, semantics, inference, natural language generation, ….
3. Course work and grading policy
• 5 x 1-week Assignments: 6% + 4 x 12%: 54%
• Final Default or Custom Course Project (1–3 people): 43%
• Project proposal: 5%; milestone: 5%; summary paragraph + image: 3%; report: 30%
• Participation: 3%
• Guest speaker lectures, Ed, our course evals, karma – see website!
• Late day policy
• 6 free late days; then 1% of total off per day; max 3 late days per assignment
• Collaboration policy: Read the website and the Honor Code!
• For projects: It’s okay to use existing code/resources, but you must document it, and you will be
graded on your value-add
• If multi-person: Include a brief statement on the work of each team-mate
• In almost all cases, each team member gets the same score, but we reserve the right to
differentiate in egregious cases
The Final Project
• For FP, you either
• Do the default project, which is SQuAD question answering (2 sub-variants)
• Open-ended but an easier start; a good choice for most
• Propose a custom final project, which we must approve
• You will receive feedback from a mentor (TA/prof/postdoc/PhD)
The Default Final Project
• There are two handouts on the web about it now!
• Two variant question answering (QA) tasks
1. Building a textual question answering architecture for SQuAD from scratch
• Stanford Question Answering Dataset: https://rajpurkar.github.io/SQuAD-explorer/
• Provided starter code in PyTorch. 🙂 Attempting SQuAD 2.0 (has unanswerable Qs).
2. Building a Robust QA system which works on different QA datasets/domains
• You train on SQuAD, NewsQA and Natural Questions; test sets are DuoRC, Race and ZSRE by RC
• Starting point is large pre-trained LM (DistilBERT); you work mainly on robustness methods
• We will discuss question answering later in the course (week 6). Example:
T: [Bill] Aiken, adopted by Mexican movie actress Lupe Mayorga, grew up in the neighboring town of
Madera and his song chronicled the hardships faced by the migrant farm workers he saw as a child.
Q: In what town did Bill Aiken grow up?
A: Madera [But Google’s BERT says <No Answer>!]
Why Choose The Default Final Project?
• If you:
• Have limited experience with research, don’t have any clear idea of what you want
to do, or want guidance and a goal, … and a leaderboard, even
• Then:
• Do the default final project!
• Many people should do it! (Past statistics: about half of people do DFP.)
• Considerations:
• The two default final project variants give you lots of guidance, scaffolding, and clear
goalposts to aim at
• The path to success is not to do something that looks kinda weak compared to what
you could have done with the DFP.
Why Choose The Custom Final Project?
• If you:
• Have some research project that you’re excited about (and are possibly already
working on), which substantively involves human language and neural networks
• You want to try to do something different on your own
• You’re just interested in something other than question answering (that involves
human language material and deep learning)
• You want to see more of the process of defining a research goal, finding data and
tools, and working out something you could do that is interesting, and how to
evaluate it
• Then:
• Do the custom final project!
Gamesmanship
• The default final projects are a more guided option, but it’s not that they’re a less work
option
• The default final projects are also open-ended projects where you can explore different
approaches, but to a given problem. Strong default final projects do this.
• There are great default final projects and great custom final projects … and there are
weak default final projects and weak custom final projects. It’s not that either option is
the easy way to get a good grade
• We give Best Project Awards for both default and custom final projects
Project Proposal – from every team 5%
1. Find a relevant (key) research paper for your topic
• For DFP, we provide some suggestions, but you might look elsewhere for interesting QA/reading
comprehension work
2. Write a summary of that research paper and what you took away from it as key ideas
that you hope to use
3. Write what you plan to work on and how you can innovate in your final project work
• Suggest a good milestone to have achieved as a halfway point
4. Describe as needed, especially for Custom projects:
• A project plan, relevant existing literature, the kind(s) of models you will use/explore; the data you
will use (and how it is obtained), and how you will evaluate success
Project milestone: you are expected to have implemented some system and to have some initial
experimental results to show by the milestone date (except for certain unusual kinds of projects)
Project writeup
• Writeup quality is very important to your grade!!!
• Look at recent years’ prize winners for examples
[Figure: the parts of a project writeup, including Data, Experiments, Results, and Analysis & Conclusion]
4. Finding Research Topics
Two basic starting points, for all of science:
• [Nails] Start with a (domain) problem of interest and try to find good/better ways to
address it than are currently known/used
• [Hammers] Start with a technical method/approach of interest, and work out good
ways to extend it, improve it, understand it, or find new ways to apply it
Project types
How to find an interesting place to start?
• Look at ACL anthology for NLP papers:
• https://aclanthology.org/
• Also look at the online proceedings of major ML conferences:
• NeurIPS https://papers.nips.cc, ICML, ICLR https://openreview.net/group?id=ICLR.cc
• Look at past cs224n projects
• See the class website
• Look at online preprint servers, especially:
• https://arxiv.org
• Look at sites that track the state of the art:
• https://paperswithcode.com/sota
• https://nlpprogress.com/
Finding a topic
• Turing award winner and Stanford CS emeritus professor Ed Feigenbaum says to follow
the advice of his advisor, AI pioneer, and Turing and Nobel prize winner Herb Simon:
• “If you see a research area where many people are working, go somewhere else.”
Old Deep Learning (NLP), new Deep Learning NLP
• In the early days of the Deep Learning revival (2010-2018), most of the work was in
defining and exploring better deep learning architectures
• Typical paper:
• I can improve a summarization system by not only using standard attention, but also
allowing copying attention, where you use additional attention calculations and an
additional probabilistic gate to simply copy a word from the input to the output
• That’s what a lot of good CS 224N projects did too
• Now, most work downloads a big pre-trained model (which fixes the architecture)
• Action is in fine-tuning, or domain adaptation followed by fine-tuning, etc., etc.
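The copying-attention idea in the "typical paper" example can be sketched as a mixture of two distributions. The slide gives no equations, so the formulation below is an assumption: it follows the common pointer-generator style, where a gate p_gen chooses between generating from the vocabulary and copying a source token in proportion to its attention weight. The function name and all numbers are made up for illustration.

```python
def copy_augmented_distribution(p_vocab, attention, source_tokens, p_gen):
    """Mix generating from the vocabulary with copying from the source.

    p_vocab:       dict mapping word -> probability from the decoder softmax
    attention:     attention distribution over source positions
    source_tokens: the source words, aligned with `attention`
    p_gen:         gate in [0, 1]; probability of generating rather than copying
    """
    # Generation part: scale the vocabulary distribution by the gate.
    final = {word: p_gen * p for word, p in p_vocab.items()}
    # Copy part: each source token receives (1 - p_gen) times its attention weight.
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final

# Toy example: "Madera" is not in the output vocabulary but can still be copied.
p_vocab = {"the": 0.5, "town": 0.3, "<unk>": 0.2}
source_tokens = ["Madera", "is", "a", "town"]
attention = [0.7, 0.1, 0.1, 0.1]

final = copy_augmented_distribution(p_vocab, attention, source_tokens, p_gen=0.6)
print(final)
```

Because both inputs are probability distributions and the gate weights sum to 1, the mixed result is again a probability distribution.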
2022 NLP … recommended for all your practical projects 🙂
pip install transformers  # by Hugging Face 🤗
# Not quite runnable code, but it gives the general idea:
# train_dataset and test_dataset stand in for your own tokenized datasets.
from transformers import (AutoTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
training_args = TrainingArguments(output_dir='./results')  # assumed output path
fine_tuner = Trainer(model=model, args=training_args,
                     train_dataset=train_dataset, eval_dataset=test_dataset)
fine_tuner.train()               # fine-tune on train_dataset
results = fine_tuner.evaluate()  # metrics on eval_dataset
Exciting areas 2022
A lot of what is exciting now is problems that work within or around this world
• Evaluating and improving models for something other than accuracy
• Robustness to domain shift
• Evaluating the robustness of models in general (someone could hack on this new
project as their final project!): https://robustnessgym.com
• Doing empirical work looking at what large pre-trained models have learned
• Working out how to get knowledge and good task performance from large models for
particular tasks without much data (transfer learning, etc.)
• Looking at the bias, trustworthiness, and explainability of large models
• Working on how to augment the data for models to improve performance
• Looking at low resource languages or problems
• Improving performance on the tail of rare stuff, addressing bias
Exciting areas 2022
• Scaling models up and down
• Building big models is BIG: GPT-2 and GPT-3 … but just not possible for a cs224n
project – do also be realistic about the scale of compute you can do!
• Building small, performant models is also BIG. This could be a great project
• Model pruning, e.g.:
https://papers.nips.cc/paper/2020/file/eae15aabaa768ae4a5993a8a4f4fa6e4-Paper.pdf
• Model quantization, e.g.: https://arxiv.org/pdf/2004.07320.pdf
• How well can you do QA in 6GB or 500MB? https://efficientqa.github.io
• Looking to achieve more advanced functionalities
• E.g., compositionality, systematic generalization, fast learning (e.g., meta-learning)
on smaller problems and amounts of data, and more quickly
• BabyAI: https://arxiv.org/abs/2007.12770
• gSCAN: https://arxiv.org/abs/2003.05161
5. Finding data
• Some people collect their own data for a project – we like that!
• You may have a project that uses “unsupervised” data
• You can annotate a small amount of data
• You can find a website that effectively provides annotations, such as likes, stars,
ratings, responses, etc.
• Lets you learn about real-world challenges of applying ML/NLP!
• But be careful about scoping things so that this doesn’t take most of your time!!!
• Most people make use of an existing, curated dataset built by previous researchers
• You get a fast start and there is obvious prior work and baselines
Linguistic Data Consortium
• https://catalog.ldc.upenn.edu/
• Stanford licenses data; you can get access by signing up at:
https://linguistics.stanford.edu/resources/resources-corpora
• Treebanks, named entities, coreference data, lots of clean newswire text, lots of
speech with transcription, parallel MT data, etc.
• Look at their catalog
• Don’t use for non-Stanford purposes!
Machine translation
• http://statmt.org
• Look in particular at the various WMT shared tasks
Dependency parsing: Universal Dependencies
• https://universaldependencies.org
🤗 Huggingface Datasets
• https://huggingface.co/datasets
Paperswithcode Datasets
• https://www.paperswithcode.com/datasets?mod=texts&page=1
Many, many more
• There are now many other datasets available online for all sorts of purposes
• Look at Kaggle
• Look at research papers to see what data they use
• Look at lists of datasets
• https://machinelearningmastery.com/datasets-natural-language-processing/
• https://github.com/niderhoff/nlp-datasets
• Lots of particular things:
• https://gluebenchmark.com/tasks
• https://nlp.stanford.edu/sentiment/
• https://research.fb.com/downloads/babi/ (Facebook bAbI-related)
• Ask on Ed or talk to course staff
6. Care with datasets and in model development
• Many publicly available datasets are released with a train/dev/test structure.
• We're all on the honor system to do test-set runs only when development is
complete.
• Splits like this presuppose a fairly large dataset.
• If there is no dev set or you want a separate tune set, then you create one by splitting
the training data
• We have to weigh the usefulness of it being a certain size against the reduction in
train-set size.
• Cross-validation (q.v.) is a technique for maximizing data when you don’t have much
• Having a fixed test set ensures that all systems are assessed against the same gold data.
This is generally good, but it is problematic when the test set turns out to have unusual
properties that distort progress on the task.
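As a rough illustration of the two options above, the sketch below carves a tune set out of the training data and, alternatively, generates k-fold cross-validation splits. The helper names, the 20% fraction, and the toy data are assumptions for the example; in practice a library routine (for instance from scikit-learn) would usually be used instead.

```python
import random

def split_train_tune(examples, tune_fraction=0.1, seed=0):
    """Shuffle with a fixed seed, then hold out a fraction of the training data."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_tune = max(1, int(len(shuffled) * tune_fraction))
    return shuffled[n_tune:], shuffled[:n_tune]  # (train, tune)

def k_fold_indices(n_examples, k):
    """Yield (train_indices, heldout_indices) for each of k folds."""
    indices = list(range(n_examples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        heldout = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, heldout
        start += size

examples = [f"example-{i}" for i in range(20)]

# Option 1: a single held-out tune set; a larger tune set means less training data.
train, tune = split_train_tune(examples, tune_fraction=0.2)
print(len(train), len(tune))

# Option 2: cross-validation; every example is held out exactly once.
folds = list(k_fold_indices(len(examples), k=5))
for train_idx, heldout_idx in folds:
    print(len(train_idx), heldout_idx)
```

Fixing the shuffle seed keeps the split identical across runs, so results from different experiments stay comparable.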
Training models and pots of data
• When training, models overfit to what you are training on
• The model correctly describes what happened to occur in the particular data you trained
on, but the patterns are not general enough to be likely to apply to new
data
• The way to monitor and avoid problematic overfitting is using independent validation
and test sets …
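One common way to act on validation-set monitoring is early stopping: keep the checkpoint with the best dev loss and stop once the dev loss has stopped improving. The slide does not prescribe this method, so treat the sketch below as one possible approach; the function name, the patience value, and the loss curves are made up for illustration.

```python
def early_stopping_epoch(dev_losses, patience=2):
    """Return the epoch whose checkpoint we would keep.

    Training stops once the dev loss has failed to improve on its best
    value for `patience` consecutive epochs.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(dev_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Made-up curves: training loss keeps falling, dev loss turns upward after epoch 3.
train_losses = [2.0, 1.5, 1.1, 0.8, 0.6, 0.4, 0.3]
dev_losses = [2.1, 1.7, 1.4, 1.3, 1.35, 1.5, 1.7]

best = early_stopping_epoch(dev_losses, patience=2)
print(best, dev_losses[best])
```

The widening gap between the training and dev curves after the kept epoch is the overfitting signal the independent set exists to reveal.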
Training models and pots of data
• You build (estimate/train) a model on a training set.
• Often, you then set further hyperparameters on another, independent set of data, the
tuning set
• The tuning set is the training set for the hyperparameters!
• You measure progress as you go on a dev set (development test set or validation set)
• If you do that a lot you overfit to the dev set so it can be good to have a second dev
set, the dev2 set
• Only at the end, you evaluate and present final numbers on a test set
• Use the final test set extremely few times … ideally only once
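The workflow above can be sketched end to end with a toy model. Everything in this example is an assumption chosen to keep it small: synthetic 1-D data, a k-nearest-neighbour classifier whose "training" is just storing the training set, and k as the only hyperparameter.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D features)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, data, k):
    """Fraction of examples in `data` that the k-NN model labels correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

rng = random.Random(0)

def sample(n):
    """Synthetic data: class 1 is centred at +1.0, class 0 at -1.0, with overlap."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        data.append((rng.gauss(1.0 if y else -1.0, 1.0), y))
    return data

# Four completely separate pots of data.
train, tune, dev, test = sample(200), sample(50), sample(50), sample(50)

# Set the hyperparameter on the tune set only (odd k avoids tied votes).
candidates = [1, 3, 5, 9, 15]
best_k = max(candidates, key=lambda k: accuracy(train, tune, k))

dev_acc = accuracy(train, dev, best_k)    # consult while developing
test_acc = accuracy(train, test, best_k)  # run once, at the very end
print(best_k, dev_acc, test_acc)
```

Only the last line touches the test set, which mirrors the rule of using it as few times as possible.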
Training models and pots of data
• The train, tune, dev, and test sets need to be completely distinct
• It is invalid to give results testing on material you have trained on
• You will get a falsely good performance.
• We almost always overfit on train
• You need an independent tuning set
• The hyperparameters won’t be set right if tune is same as train
• If you keep running on the same evaluation set, you begin to overfit to that evaluation
set
• Effectively you are “training” on the evaluation set … you are learning things that do and don’t work
on that particular eval set and using the info
• To get a valid measure of system performance you need another independent test set
that you have not trained on … hence dev2 and final test
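A quick sanity check for the "completely distinct" requirement is to look for examples shared between splits before running any experiments. The helper below is a minimal sketch with a hypothetical name and toy data; it only catches exact duplicates, so near-duplicates would need a fuzzier comparison.

```python
def find_overlaps(splits):
    """Return {(name_a, name_b): shared examples} for every overlapping pair of splits."""
    names = list(splits)
    overlaps = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = set(splits[a]) & set(splits[b])
            if shared:
                overlaps[(a, b)] = shared
    return overlaps

# Toy splits; "good luck" has leaked from train into dev.
splits = {
    "train": ["he hit me with a pie", "il a m' entarté", "good luck"],
    "tune": ["neural networks want to learn"],
    "dev": ["details matter", "good luck"],
    "test": ["work incrementally"],
}

overlaps = find_overlaps(splits)
print(overlaps)
```

An empty result is what you want to see; any entry means scores on that pair of splits are not independent.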
Getting your neural network to train
• Start with a positive attitude!
• Neural networks want to learn!
• If the network isn’t learning, you’re doing something to prevent it from learning successfully
Experimental strategy
• Work incrementally!
• Start with a very simple model and get it to work!
• It’s hard to fix a complex but broken model
• Add bells and whistles one-by-one and get the model working with each of them (or
abandon them)
Details matter!
Good luck with your projects!