cs224n 2022 Lecture08 Final Project

The document discusses natural language processing with deep learning and attention mechanisms. It provides an overview of lecture 8 which will cover final projects, practical tips, finding research topics and data. It then focuses on explaining attention mechanisms, how they help solve the bottleneck problem in sequence-to-sequence models by allowing the decoder to focus on relevant parts of the input sequence, and provides diagrams and equations to illustrate how attention is incorporated into the encoder-decoder architecture.

Uploaded by

Luiz Felipe Niedermaier Custódio

Natural Language Processing

with Deep Learning

CS224N/Ling284

Christopher Manning
Lecture 8: Final Projects; Practical Tips
Lecture Plan
Lecture 8: Finish last time – final Projects – practical tips!
1. Attention [25 mins]
2. Final bit of neural machine translation [10 mins]
– Mini Break –
3. Final project types and details; assessment revisited [15 mins]
4. Finding research topics; a couple of examples [20 mins]
5. Finding data [10 mins]
6. Care with datasets and in model development [10 mins]

1. Why attention? Sequence-to-sequence: the bottleneck problem
[Figure: an Encoder RNN reads the source sentence (input) "il a m’ entarté" and passes a single encoding of the source sentence to a Decoder RNN, which takes "<START> he hit me with a pie" as input and produces the target sentence (output) "he hit me with a pie <END>".]
• Problems with this architecture?
• The encoding of the source sentence needs to capture all information about the source sentence.
• Information bottleneck!

Attention
• Attention provides a solution to the bottleneck problem.

• Core idea: on each step of the decoder, use direct connection to the encoder to focus
on a particular part of the source sequence

• First, we will show via diagram (no equations), then we will show with equations

Sequence-to-sequence with attention
[Figure sequence: the Encoder RNN reads the source sentence (input) "il a m’ entarté". On each Decoder RNN step, a dot product between the decoder hidden state and each encoder hidden state gives the attention scores; softmax over the scores gives the attention distribution; the attention distribution weights the encoder hidden states to give the attention output, which is combined with the decoder hidden state to predict the next word ŷ. Successive panels show the decoder inputs "<START> he hit me with a" and the predicted words up to "pie".]

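Attention: in equations
• We have encoder hidden states $h_1, \dots, h_N \in \mathbb{R}^h$
• On timestep $t$, we have decoder hidden state $s_t \in \mathbb{R}^h$
• We get the attention scores $e^t$ for this step:
$e^t = [s_t^\top h_1, \dots, s_t^\top h_N] \in \mathbb{R}^N$
• We take softmax to get the attention distribution $\alpha^t$ for this step (this is a probability distribution and sums to 1):
$\alpha^t = \mathrm{softmax}(e^t) \in \mathbb{R}^N$
• We use $\alpha^t$ to take a weighted sum of the encoder hidden states to get the attention output $a_t$:
$a_t = \sum_{i=1}^{N} \alpha_i^t h_i \in \mathbb{R}^h$
• Finally we concatenate the attention output $a_t$ with the decoder hidden state $s_t$, giving $[a_t; s_t] \in \mathbb{R}^{2h}$, and proceed as in the non-attention seq2seq model
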
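The attention steps for one decoder timestep (scores, softmax, weighted sum, concatenation) can be sketched in plain Python. This is only an illustration: the function names and the toy vectors are assumptions, not part of the lecture, and a real model would use batched PyTorch tensors.

```python
import math

def softmax(scores):
    """Turn a list of real scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [x / total for x in exps]

def dot_product_attention(s_t, encoder_states):
    """Return (attention distribution, attention output) for one decoder step."""
    # Attention scores: dot product of the decoder state with each encoder state.
    scores = [sum(a * b for a, b in zip(s_t, h)) for h in encoder_states]
    alpha = softmax(scores)
    # Attention output: weighted sum of the encoder hidden states.
    dim = len(encoder_states[0])
    output = [sum(w * h[j] for w, h in zip(alpha, encoder_states)) for j in range(dim)]
    return alpha, output

# Toy example (made-up numbers): three encoder states of dimension 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_t = [2.0, 0.0]
alpha, a_t = dot_product_attention(s_t, encoder_states)
decoder_input = a_t + s_t  # list concatenation, i.e. [a_t; s_t] of length 2h
```

Here the first and third encoder states get equal, larger weights because they have the larger dot product with `s_t`.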
Attention is great!
• Attention significantly improves NMT performance
• It’s very useful to allow decoder to focus on certain parts of the source
• Attention provides more “human-like” model of the MT process
• You can look back at the source sentence while translating, rather than needing to remember it all
• Attention solves the bottleneck problem
• Attention allows decoder to look directly at source; bypass bottleneck
• Attention helps with the vanishing gradient problem
• Provides shortcut to faraway states
• Attention provides some interpretability
• By inspecting attention distribution, we see what the decoder was focusing on
• We get (soft) alignment for free!
• This is cool because we never explicitly trained an alignment system
• The network just learned alignment by itself
[Figure: attention alignment grid between the source words "il a m’ entarté" and the output words "he hit me with a pie".]

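There are several attention variants
• We have some values $h_1, \dots, h_N \in \mathbb{R}^{d_1}$ and a query $s \in \mathbb{R}^{d_2}$

• Attention always involves:

1. Computing the attention scores $e \in \mathbb{R}^N$ (there are multiple ways to do this)
2. Taking softmax to get attention distribution ⍺: $\alpha = \mathrm{softmax}(e) \in \mathbb{R}^N$
3. Using attention distribution to take weighted sum of values: $a = \sum_{i=1}^{N} \alpha_i h_i \in \mathbb{R}^{d_1}$,
thus obtaining the attention output a (sometimes called the context vector)
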
Attention variants
You’ll think about the relative advantages/disadvantages of these in Assignment 4!

There are several ways you can compute $e \in \mathbb{R}^N$ from $h_1, \dots, h_N \in \mathbb{R}^{d_1}$ and $s \in \mathbb{R}^{d_2}$:

• Basic dot-product attention: $e_i = s^\top h_i \in \mathbb{R}$
• Note: this assumes $d_1 = d_2$. This is the version we saw earlier.

• Multiplicative attention: $e_i = s^\top W h_i \in \mathbb{R}$ [Luong, Pham, and Manning 2015]
• Where $W \in \mathbb{R}^{d_2 \times d_1}$ is a weight matrix. Perhaps better called “bilinear attention”

• Reduced-rank multiplicative attention: $e_i = s^\top U^\top V h_i = (Us)^\top (V h_i)$
• For low rank matrices $U \in \mathbb{R}^{k \times d_2}$, $V \in \mathbb{R}^{k \times d_1}$, $k \ll d_1, d_2$
• Remember this when we look at Transformers next week!

• Additive attention: $e_i = v^\top \tanh(W_1 h_i + W_2 s) \in \mathbb{R}$ [Bahdanau, Cho, and Bengio 2014]
• Where $W_1 \in \mathbb{R}^{d_3 \times d_1}$, $W_2 \in \mathbb{R}^{d_3 \times d_2}$ are weight matrices and $v \in \mathbb{R}^{d_3}$ is a weight vector.
• $d_3$ (the attention dimensionality) is a hyperparameter
• “Additive” is a weird/bad name. It’s really using a feed-forward neural net layer.

More information: “Deep Learning for NLP Best Practices”, Ruder, 2017. http://ruder.io/deep-learning-nlp-best-practices/index.html#attention
“Massive Exploration of Neural Machine Translation Architectures”, Britz et al, 2017, https://arxiv.org/pdf/1703.03906.pdf

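To compare the scoring variants side by side, here is a minimal sketch of each one as a function of a single query s and a single value h. The helper names, the list-of-rows matrix layout, and the toy weights are all assumptions made for illustration; in practice these would be learned PyTorch parameters.

```python
import math

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dot_score(s, h):
    # Basic dot-product attention: requires len(s) == len(h).
    return dot(s, h)

def multiplicative_score(s, h, W):
    # Bilinear form s^T W h, with W of shape d2 x d1.
    return dot(s, matvec(W, h))

def reduced_rank_score(s, h, U, V):
    # (U s)^T (V h), with U of shape k x d2 and V of shape k x d1.
    return dot(matvec(U, s), matvec(V, h))

def additive_score(s, h, W1, W2, v):
    # v^T tanh(W1 h + W2 s): a one-hidden-layer feed-forward scorer.
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W1, h), matvec(W2, s))]
    return dot(v, hidden)

# Toy shapes: d1 = 3 (value), d2 = 2 (query), k = 1, d3 = 2.
h = [1.0, 2.0, 3.0]
s = [1.0, -1.0]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # d2 x d1
U = [[1.0, 1.0]]                           # k x d2
V = [[1.0, 0.0, 1.0]]                      # k x d1
W1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # d3 x d1
W2 = [[1.0, 0.0], [0.0, 1.0]]              # d3 x d2
v = [1.0, 1.0]
```

Note that only the dot-product variant needs the query and values to share a dimension; the other three handle d1 ≠ d2 through their weight matrices.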
Attention is a general Deep Learning technique
• We’ve seen that attention is a great way to improve the sequence-to-sequence model
for Machine Translation.
• However: You can use attention in many architectures
(not just seq2seq) and many tasks (not just MT)

• More general definition of attention:

• Given a set of vector values, and a vector query, attention is a technique to compute
a weighted sum of the values, dependent on the query.

• We sometimes say that the query attends to the values.

• For example, in the seq2seq + attention model, each decoder hidden state (query)
attends to all the encoder hidden states (values).


Intuition:
• The weighted sum is a selective summary of the information contained in the values,
where the query determines which values to focus on.
• Attention is a way to obtain a fixed-size representation of an arbitrary set of
representations (the values), dependent on some other representation (the query).

Upshot:
• Attention has become the powerful, flexible, general way to do pointer and memory
manipulation in all deep learning models. A new idea from after 2010! From NMT!

2. So, is Machine Translation solved?
• Nope!
• Many difficulties remain:
• Out-of-vocabulary words
• Domain mismatch between train and test data
• Maintaining context over longer text
• Low-resource language pairs
• Failures to accurately capture sentence meaning
• Pronoun (or zero pronoun) resolution errors
• Morphological agreement errors

Didn’t specify gender

Source: https://hackernoon.com/bias-sexist-or-this-is-the-way-it-should-be-ce1f7c8c683c

So is Machine Translation solved?

Source: https://blog.google/products/translate/reducing-gender-bias-google-translate/

So is Machine Translation solved?
• Nope!
• Uninterpretable systems can do strange things
• (But, AFAICS, this problem has been fixed in Google Translate by 2021.)

Picture source: https://www.vice.com/en_uk/article/j5npeg/why-is-google-translate-spitting-out-sinister-religious-prophecies

Explanation: https://www.skynettoday.com/briefs/google-nmt-prophecies

Assignment 4: Cherokee-English machine translation!
• Cherokee is an endangered Native American language – about 2000 fluent speakers
• Extremely low resource: About 20k parallel sentences available, most from the Bible
• ᎪᎯᎩᏴ ᏥᎨᏒᎢ ᎦᎵᏉᎩ ᎢᏯᏂᎢ ᎠᏂᏧᏣ. ᏂᎪᎯᎸᎢ ᏗᎦᎳᏫᎢᏍᏗᎢ ᏩᏂᏯᎡᎢ
ᏓᎾᏁᎶᎲᏍᎬᎢ ᏅᏯ ᎪᏢᏔᏅᎢ ᎦᏆᏗ ᎠᏂᏐᏆᎴᎵᏙᎲᎢ ᎠᎴ ᎤᏓᏍᏈᏗ ᎦᎾᏍᏗ ᎠᏅᏗᏍᎨᎢ
ᎠᏅᏂᎲᎢ.
Long ago were seven boys who used to spend all their time down by the townhouse
playing games, rolling a stone wheel along the ground, sliding and striking it with a stick
• Writing system is a syllabary of symbols for each CV unit (85 letters)
• Many thanks to Shiyue Zhang, Benjamin Frey, and Mohit Bansal
from UNC Chapel Hill for the resources for this assignment!

• Cherokee is not available on Google Translate! 😭

Cherokee
• Cherokee originally lived in western North Carolina and eastern Tennessee
• Most speakers now in Oklahoma, following the Trail of Tears; some in NC
• Writing system invented by Segwoya (often written Sequoyah) around
1820 – someone who grew up illiterate
• Very effective: In the following decades Cherokee literacy was higher
than for white people in the southeastern United States

• https://www.cherokee.org

NMT research continues
NMT is an important use case for NLP Deep Learning

• NMT research pioneered many of the recent innovations of NLP Deep Learning

• NMT research continues to thrive

• Researchers have found many, many improvements to the “vanilla” seq2seq NMT
system we’ve just presented

• Much work on getting better results on low resource languages

• But, overall, in the last few years more of the excitement has moved to question
answering, semantics, inference, natural language generation, ….

3. Course work and grading policy
• 5 x 1-week Assignments: 6% + 4 x 12%: 54%
• Final Default or Custom Course Project (1–3 people): 43%
• Project proposal: 5%; milestone: 5%; summary paragraph + image: 3%; report: 30%
• Participation: 3%
• Guest speaker lectures, Ed, our course evals, karma – see website!
• Late day policy
• 6 free late days; then 1% of total off per day; max 3 late days per assignment
• Collaboration policy: Read the website and the Honor Code!
• For projects: It’s okay to use existing code/resources, but you must document it, and you will be
graded on your value-add
• If multi-person: Include a brief statement on the work of each team-mate
• In almost all cases, each team member gets the same score, but we reserve the right to
differentiate in egregious cases
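
The percentages above can be checked with a few lines of arithmetic. The late-day function below encodes one possible reading of the policy (free days are pooled across assignments, then 1 percentage point per extra day); that reading is an assumption, so check the course website for the authoritative rule.

```python
# Component weights in percent, copied from the slide.
assignments = [6, 12, 12, 12, 12]
project = {"proposal": 5, "milestone": 5, "summary": 3, "report": 30}
participation = 3

assignment_total = sum(assignments)       # 6 + 4 * 12
project_total = sum(project.values())
course_total = assignment_total + project_total + participation

FREE_LATE_DAYS = 6
MAX_LATE_DAYS_PER_ASSIGNMENT = 3

def late_penalty(late_days_per_assignment):
    """Percentage points lost, under the assumed reading of the late-day policy."""
    for days in late_days_per_assignment:
        if days > MAX_LATE_DAYS_PER_ASSIGNMENT:
            raise ValueError("more than 3 late days on one assignment")
    used = sum(late_days_per_assignment)
    return max(0, used - FREE_LATE_DAYS) * 1  # 1% of the total per extra day
```

For example, using 3 + 3 + 2 late days across three assignments would exceed the 6 free days by 2 under this reading.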

The Final Project
• For FP, you either
• Do the default project, which is SQuAD question answering (2 sub-variants)
• Open-ended but an easier start; a good choice for most
• Propose a custom final project, which we must approve
• You will receive feedback from a mentor (TA/prof/postdoc/PhD)

• You can work in teams of 1–3. Being in a team is encouraged.

• A larger team project or a project used for multiple classes should be larger and
often involves exploring more models or tasks

• You can use any language/framework for your project

• Though we expect most of you to keep using PyTorch
• And our starter code for the default FP is in PyTorch
Custom Final Project
• I’m very happy to talk to people about final projects, but the slight problem is that
there’s only one of me….
• Look at TA expertise for custom final projects:
• http://web.stanford.edu/class/cs224n/office_hours.html#staff

The Default Final Project
• There are two handouts on the web about it now!
• Two variant question answering (QA) tasks
1. Building a textual question answering architecture for SQuAD from scratch
• Stanford Question Answering Dataset: https://rajpurkar.github.io/SQuAD-explorer/
• Provided starter code in PyTorch. 🙂 Attempting SQuAD 2.0 (has unanswerable Qs).
2. Building a Robust QA system which works on different QA datasets/domains
• You train on SQuAD, NewsQA and Natural Questions; test sets are DuoRC, Race and ZSRE by RC
• Starting point is large pre-trained LM (DistilBERT); you work mainly on robustness methods
• We will discuss question answering later in the course (week 6). Example:
T: [Bill] Aiken, adopted by Mexican movie actress Lupe Mayorga, grew up in the neighboring town of
Madera and his song chronicled the hardships faced by the migrant farm workers he saw as a child.
Q: In what town did Bill Aiken grow up?
A: Madera [But Google’s BERT says <No Answer>!]

Why Choose The Default Final Project?
• If you:
• Have limited experience with research, don’t have any clear idea of what you want
to do, or want guidance and a goal, … and a leaderboard, even
• Then:
• Do the default final project!
• Many people should do it! (Past statistics: about half of people do DFP.)

• Considerations:
• The two default final project variants give you lots of guidance, scaffolding, and clear
goalposts to aim at
• The path to success is not to do something that looks kinda weak compared to what
you could have done with the DFP.
Why Choose The Custom Final Project?
• If you:
• Have some research project that you’re excited about (and are possibly already
working on), which substantively involves human language and neural networks
• You want to try to do something different on your own
• You’re just interested in something other than question answering (that involves
human language material and deep learning)
• You want to see more of the process of defining a research goal, finding data and
tools, and working out something you could do that is interesting, and how to
evaluate it
• Then:
• Do the custom final project!

Gamesmanship
• The default final projects are a more guided option, but it’s not that they’re a less work
option

• The default final projects are also open-ended projects where you can explore different
approaches, but to a given problem. Strong default final projects do this.

• There are great default final projects and great custom final projects … and there are
weak default final projects and weak custom final projects. It’s not that either option is
the easy way to get a good grade

• We give Best Project Awards for both default and custom final projects

Project Proposal – from every team 5%
1. Find a relevant (key) research paper for your topic
• For DFP, we provide some suggestions, but you might look elsewhere for interesting QA/reading
comprehension work
2. Write a summary of that research paper and what you took away from it as key ideas
that you hope to use

3. Write what you plan to work on and how you can innovate in your final project work
• Suggest a good milestone to have achieved as a halfway point
4. Describe as needed, especially for Custom projects:
• A project plan, relevant existing literature, the kind(s) of models you will use/explore; the data you
will use (and how it is obtained), and how you will evaluate success

3–4 pages, due Tue Feb 8, 3:15pm on Gradescope

Project Proposal – from everyone 5%
2. Skill: How to think critically about a research paper
• What were the main novel contributions or points?
• Is what makes it work something general and reusable or a special case?
• Are there flaws or neat details in what they did?
• How does it fit with other papers on similar topics?
• Does it provoke good questions on further or different things to try?
• Grading of research paper review is primarily summative
3. How to do a good job on your project plan
• You need to have an overall sensible idea (!)
• But most project plans that are lacking are lacking in nuts-and-bolts ways:
• Do you have appropriate data or a realistic plan to be able to collect it in a short period of time?
• Do you have a realistic way to evaluate your work?
• Do you have appropriate baselines or proposed ablation studies for comparisons?
• Grading of project proposal is primarily formative

Project Milestone – from everyone 5%
• This is a progress report
• You should be more than halfway done!
• Describe the experiments you have run
• Describe the preliminary results you have obtained
• Describe how you plan to spend the rest of your time

You are expected to have implemented some system and to have some initial
experimental results to show by this date (except for certain unusual kinds of projects)

Due Thu Feb 24, 3:15pm on Gradescope

Project writeup
• Writeup quality is very important to your grade!!!
• Look at recent years’ prize winners for examples

[Figure: typical writeup structure: Abstract, Introduction, Prior related work, Model, Data, Experiments, Results, Analysis & Conclusion]

4. Finding Research Topics
Two basic starting points, for all of science:
• [Nails] Start with a (domain) problem of interest and try to find good/better ways to
address it than are currently known/used
• [Hammers] Start with a technical method/approach of interest, and work out good
ways to extend it, improve it, understand it, or find new ways to apply it

Project types

This is not an exhaustive list, but most projects are one of

1. Find an application/task of interest and explore how to approach/solve it effectively,
often with an existing model
• Could be a task in the wild or some existing Kaggle/bake-off/shared task
2. Implement a complex neural architecture and demonstrate its performance on some
data
3. Come up with a new or variant neural network model or approach and explore its
empirical success
4. Analysis project. Analyze the behavior of a model: how it represents linguistic
knowledge or what kinds of phenomena it can handle or errors that it makes
5. Rare theoretical project: Show some interesting, non-trivial properties of a model
type, data, or a data representation
[Example project slides by Stanley Xie, Ruchir Rastogi and Max Chang; images not recoverable]

How to find an interesting place to start?
• Look at ACL anthology for NLP papers:
• https://aclanthology.org/
• Also look at the online proceedings of major ML conferences:
• NeurIPS https://papers.nips.cc, ICML, ICLR https://openreview.net/group?id=ICLR.cc
• Look at past cs224n projects
• See the class website
• Look at online preprint servers, especially:
• https://arxiv.org

• Even better: look for an interesting problem in the world!

• Hal Varian: How to Build an Economic Model in Your Spare Time
https://people.ischool.berkeley.edu/~hal/Papers/how.pdf

Want to beat the state of the art on something?

Great new sites that try to collate info on the state of the art
• Not always correct, though

https://paperswithcode.com/sota
https://nlpprogress.com/

Specific tasks/topics. Many, e.g.:

https://gluebenchmark.com/leaderboard/
https://www.conll.org/previous-tasks/

Finding a topic
• Turing award winner and Stanford CS emeritus professor Ed Feigenbaum says to follow
the advice of his advisor, AI pioneer, and Turing and Nobel prize winner Herb Simon:

• “If you see a research area where many people are working, go somewhere else.”

• But where to go? Wayne Gretzky:

• “I skate to where the puck is going, not where it has been.”

Old Deep Learning (NLP), new Deep Learning NLP
• In the early days of the Deep Learning revival (2010-2018), most of the work was in
defining and exploring better deep learning architectures
• Typical paper:
• I can improve a summarization system by not only using attention standardly, but
allowing copying attention – where you use additional attention calculations and an
additional probabilistic gate to simply copy a word from the input to the output
• That’s what a lot of good CS 224N projects did too

• In 2019–2022, that approach is dead

• Well, that’s too strong, but it’s difficult and much rarer

• Most work downloads a big pre-trained model (which fixes the architecture)
• Action is in fine-tuning, or domain adaptation followed by fine-tuning, etc., etc.
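
The "copying attention" idea in the typical-paper example above (extra attention plus a probabilistic gate that can copy a source word to the output) can be sketched as a mixture of two distributions. This follows the general style of pointer-generator networks, which is an assumption about the kind of mechanism meant; the function name and toy numbers are made up for illustration.

```python
def copy_mixture(p_gen, vocab_dist, attention, source_tokens):
    """Mix generating from the vocabulary with copying from the source.

    p_gen: probability (the gate) of generating rather than copying.
    vocab_dist: dict word -> probability from the decoder's softmax.
    attention: attention distribution over source positions.
    source_tokens: source words, aligned with `attention`.
    """
    final = {word: p_gen * p for word, p in vocab_dist.items()}
    for weight, token in zip(attention, source_tokens):
        # Copy probability mass flows to the word at each attended position.
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final

# Toy example: "Madera" is not in the output vocabulary,
# but it can still be produced by copying it from the source.
vocab_dist = {"the": 0.5, "town": 0.5}
attention = [0.1, 0.9]
source_tokens = ["town", "Madera"]
final = copy_mixture(0.25, vocab_dist, attention, source_tokens)
```

The result is still a probability distribution, and a low gate value shifts most of the mass to the strongly attended source word.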
2022 NLP … recommended for all your practical projects 🙂
pip install transformers # By Huggingface 🤗
# Not quite runnable code, but it gives the general idea:
# train_dataset and test_dataset are tokenized datasets that you must supply.
from transformers import (AutoTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
training_args = TrainingArguments(output_dir='out')  # assumed output directory
fine_tuner = Trainer(model=model, args=training_args, train_dataset=train_dataset,
                     eval_dataset=test_dataset)
fine_tuner.train()  # fine-tunes the pre-trained model
results = fine_tuner.evaluate()  # metrics on eval_dataset
Exciting areas 2022
A lot of what is exciting now is problems that work within or around this world
• Evaluating and improving models for something other than accuracy
• Robustness to domain shift
• Evaluating the robustness of models in general (someone could hack on this new
project as their final project!): https://robustnessgym.com
• Doing empirical work looking at what large pre-trained models have learned
• Working out how to get knowledge and good task performance from large models for
particular tasks without much data (transfer learning, etc.)
• Looking at the bias, trustworthiness, and explainability of large models
• Working on how to augment the data for models to improve performance
• Looking at low resource languages or problems
• Improving performance on the tail of rare stuff, addressing bias
Exciting areas 2022
• Scaling models up and down
• Building big models is BIG: GPT-2 and GPT-3 … but just not possible for a cs224n
project – do also be realistic about the scale of compute you can do!
• Building small, performant models is also BIG. This could be a great project
• Model pruning, e.g.:
https://papers.nips.cc/paper/2020/file/eae15aabaa768ae4a5993a8a4f4fa6e4-Paper.pdf
• Model quantization, e.g.: https://arxiv.org/pdf/2004.07320.pdf
• How well can you do QA in 6GB or 500MB? https://efficientqa.github.io
• Looking to achieve more advanced functionalities
• E.g., compositionality, systematic generalization, fast learning (e.g., meta-learning)
on smaller problems and amounts of data, and more quickly
• BabyAI: https://arxiv.org/abs/2007.12770
• gSCAN: https://arxiv.org/abs/2003.05161

5. Finding data
• Some people collect their own data for a project – we like that!
• You may have a project that uses “unsupervised” data
• You can annotate a small amount of data
• You can find a website that effectively provides annotations, such as likes, stars,
ratings, responses, etc.
• Lets you learn about real-world challenges of applying ML/NLP!
• But be careful on scoping things so that this doesn’t take most of your time!!!

• Some people have existing data from a research project or company

• Fine to use providing you can provide data samples for submission, report, etc.

• Most people make use of an existing, curated dataset built by previous researchers
• You get a fast start and there is obvious prior work and baselines
Linguistic Data Consortium
• https://catalog.ldc.upenn.edu/
• Stanford licenses data; you can get access by signing up at:
https://linguistics.stanford.edu/resources/resources-corpora
• Treebanks, named entities, coreference data, lots of clean newswire text, lots of
speech with transcription, parallel MT data, etc.
• Look at their catalog
• Don’t use for non-Stanford purposes!

Machine translation
• http://statmt.org
• Look in particular at the various WMT shared tasks

Dependency parsing: Universal Dependencies
• https://universaldependencies.org

🤗 Huggingface Datasets
• https://huggingface.co/datasets

Paperswithcode Datasets
• https://www.paperswithcode.com/datasets?mod=texts&page=1

Many, many more
• There are now many other datasets available online for all sorts of purposes
• Look at Kaggle
• Look at research papers to see what data they use
• Look at lists of datasets
• https://machinelearningmastery.com/datasets-natural-language-processing/
• https://github.com/niderhoff/nlp-datasets
• Lots of particular things:
• https://gluebenchmark.com/tasks
• https://nlp.stanford.edu/sentiment/
• https://research.fb.com/downloads/babi/ (Facebook bAbI-related)
• Ask on Ed or talk to course staff

6. Care with datasets and in model development
• Many publicly available datasets are released with a train/dev/test structure.
• We're all on the honor system to do test-set runs only when development is
complete.
• Splits like this presuppose a fairly large dataset.
• If there is no dev set or you want a separate tune set, then you create one by splitting
the training data
• We have to weigh the usefulness of it being a certain size against the reduction in
train-set size.
• Cross-validation (q.v.) is a technique for maximizing data when you don’t have much
• Having a fixed test set ensures that all systems are assessed against the same gold data.
This is generally good, but it is problematic when the test set turns out to have unusual
properties that distort progress on the task.
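
Creating a tune set by splitting the training data, and cross-validation when data is scarce, might look like the sketch below. The function names, the 10% tune fraction, and the fixed seed are illustrative assumptions, not course requirements.

```python
import random

def split_off_tune(train_examples, tune_fraction=0.1, seed=0):
    """Shuffle the training data and hold out a fraction as a tune set."""
    examples = list(train_examples)
    random.Random(seed).shuffle(examples)
    n_tune = max(1, int(len(examples) * tune_fraction))
    return examples[n_tune:], examples[:n_tune]  # (train, tune)

def k_fold(examples, k=5):
    """Yield (train, held_out) pairs; each example is held out exactly once."""
    examples = list(examples)
    for i in range(k):
        held_out = examples[i::k]
        train = [x for j, x in enumerate(examples) if j % k != i]
        yield train, held_out

data = list(range(20))  # stand-in for 20 labeled examples
train, tune = split_off_tune(data)
folds = list(k_fold(train, k=3))
```

A larger tune fraction gives more reliable hyperparameter choices but leaves less data to train on, which is the trade-off described above.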

Training models and pots of data
• When training, models overfit to what you are training on
• The model correctly describes what happened to occur in particular data you trained
on, but the patterns are not general enough patterns to be likely to apply to new
data
• The way to monitor and avoid problematic overfitting is using independent validation
and test sets …

Training models and pots of data
• You build (estimate/train) a model on a training set.
• Often, you then set further hyperparameters on another, independent set of data, the
tuning set
• The tuning set is the training set for the hyperparameters!
• You measure progress as you go on a dev set (development test set or validation set)
• If you do that a lot you overfit to the dev set so it can be good to have a second dev
set, the dev2 set
• Only at the end, you evaluate and present final numbers on a test set
• Use the final test set extremely few times … ideally only once

Training models and pots of data
• The train, tune, dev, and test sets need to be completely distinct
• It is invalid to give results testing on material you have trained on
• You will get a falsely good performance.
• We almost always overfit on train
• You need an independent tuning set
• The hyperparameters won’t be set right if tune is same as train
• If you keep running on the same evaluation set, you begin to overfit to that evaluation
set
• Effectively you are “training” on the evaluation set … you are learning things that do and don’t work
on that particular eval set and using the info
• To get a valid measure of system performance you need another untrained on,
independent test set … hence dev2 and final test
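
One way to keep the train, tune, dev, dev2, and test sets completely distinct is to partition the data once and verify that no example is shared. This is a minimal sketch: the function name and the proportions are assumptions for illustration only.

```python
import random

def partition(examples, fractions, seed=0):
    """Split examples into disjoint named sets according to `fractions`."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    sets, start = {}, 0
    names = list(fractions)
    for name in names[:-1]:
        size = round(len(items) * fractions[name])
        sets[name] = items[start:start + size]
        start += size
    sets[names[-1]] = items[start:]  # the last set takes whatever remains
    return sets

# Illustrative proportions only; pick sizes that suit your dataset.
fractions = {"train": 0.7, "tune": 0.05, "dev": 0.1, "dev2": 0.05, "test": 0.1}
sets = partition(range(1000), fractions)

# Sanity check: no example may appear in more than one set.
seen = set()
for name, items in sets.items():
    assert seen.isdisjoint(items), f"{name} overlaps an earlier set"
    seen.update(items)
```

Doing the split once with a fixed seed, and saving it, avoids accidentally evaluating on material you trained on in a later run.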

Getting your neural network to train
• Start with a positive attitude!
• Neural networks want to learn!
• If the network isn’t learning, you’re doing something to prevent it from learning successfully

• Realize the grim reality:

• There are lots of things that can cause neural nets to not learn at all or to not learn
very well
• Finding and fixing them (“debugging and tuning”) can often take more time than implementing
your model

• It’s hard to work out what these things are

• But experience, experimental care, and rules of thumb help!

Experimental strategy
• Work incrementally!
• Start with a very simple model and get it to work!
• It’s hard to fix a complex but broken model
• Add bells and whistles one-by-one and get the model working with each of them (or
abandon them)

• Initially run on a tiny amount of data

• You will see bugs much more easily on a tiny dataset … and they train really quickly
• Something like 4–8 examples is good
• Often synthetic data is useful for this
• Make sure you can get 100% on this data (testing on train)
• Otherwise your model is definitely either not powerful enough or it is broken
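
The "get 100% on a tiny dataset" check can be sketched as follows. A perceptron stands in for the neural network purely to keep the example dependency-free (an assumption, not the lecture's model); with a PyTorch model the same check applies to your training loop.

```python
def train_perceptron(examples, epochs=100):
    """Train a perceptron on (features, label) pairs with labels in {0, 1}."""
    dim = len(examples[0][0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in examples:
            pred = 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
            if pred != y:
                mistakes += 1
                step = 1.0 if y == 1 else -1.0
                weights = [w + step * xi for w, xi in zip(weights, x)]
                bias += step
        if mistakes == 0:  # every training example is classified correctly
            break
    return weights, bias

def accuracy(weights, bias, examples):
    correct = sum(
        (1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0) == y
        for x, y in examples
    )
    return correct / len(examples)

# Six synthetic, linearly separable examples: the label is 1 when x0 > x1.
tiny = [([2.0, 0.0], 1), ([3.0, 1.0], 1), ([1.0, -1.0], 1),
        ([0.0, 2.0], 0), ([1.0, 3.0], 0), ([-1.0, 1.0], 0)]
weights, bias = train_perceptron(tiny)
train_accuracy = accuracy(weights, bias, tiny)  # testing on train, on purpose
```

If this kind of check fails on a handful of examples, the bug is far easier to find than on the full dataset.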

Experimental strategy

• Train and run your model on a large dataset

• It should still score close to 100% on the training data after optimization
• Otherwise, you probably want to consider a more powerful model!
• Overfitting to training data is not something to fear when doing deep learning
• These models are usually good at generalizing because of the way distributed representations
share statistical strength regardless of overfitting to training data
• But, still, you now want good generalization performance:
• Regularize your model until it doesn’t overfit on dev data
• Strategies like L2 regularization can be useful
• But normally generous dropout is the secret to success
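
The two regularizers named above can be sketched in plain Python. This is a hedged illustration of the ideas only: the function names and values are assumptions, and in PyTorch you would normally use the built-in dropout layer and the optimizer's weight decay instead.

```python
import random

def l2_penalty(weights, weight_decay):
    """L2 regularization term added to the loss: weight_decay * sum of squared weights."""
    return weight_decay * sum(w * w for w in weights)

def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop and rescale the rest."""
    if not training or p_drop == 0.0:
        return list(activations)  # dropout is switched off at evaluation time
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10
dropped = dropout(hidden, p_drop=0.5, rng=rng)
penalty = l2_penalty([0.5, -1.0, 2.0], weight_decay=0.01)
```

Rescaling the kept units by 1 / (1 - p_drop) keeps the expected activation unchanged, so nothing needs to be adjusted at test time.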

Details matter!

• Look at your data, collect summary statistics

• Look at your model’s outputs, do error analysis
• Tuning hyperparameters, learning rates, getting initialization right, etc. is
often important to the successes of NNets
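
Looking at your data and your model's outputs can start very simply. The sketch below collects summary statistics and lists misclassified examples; the function names, whitespace tokenization, and toy sentiment data are all assumptions for illustration.

```python
from collections import Counter
from statistics import mean, median

def summarize(examples):
    """Summary statistics for a list of (text, label) pairs."""
    lengths = [len(text.split()) for text, _ in examples]  # whitespace tokens
    return {
        "num_examples": len(examples),
        "mean_length": mean(lengths),
        "median_length": median(lengths),
        "max_length": max(lengths),
        "label_counts": Counter(label for _, label in examples),
    }

def error_analysis(examples, predictions):
    """Return the misclassified examples together with the wrong prediction."""
    return [(text, gold, pred)
            for (text, gold), pred in zip(examples, predictions) if pred != gold]

# Made-up toy data for illustration.
examples = [("a great movie", "pos"), ("terrible", "neg"),
            ("not bad at all", "pos"), ("i fell asleep", "neg")]
stats = summarize(examples)
errors = error_analysis(examples, ["pos", "neg", "neg", "neg"])
```

Reading through the returned errors by hand often reveals patterns (here, a negated phrase) that aggregate accuracy hides.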

Good luck with your projects!

No ratings yet
Transformer Architecture
18 pages
Week9 Seq2seq
No ratings yet
Week9 Seq2seq
32 pages
Lecture 5: Self-Attention and Transformers
No ratings yet
Lecture 5: Self-Attention and Transformers
99 pages
Aiayn
No ratings yet
Aiayn
15 pages
L22 - Attention in Deep Learning
No ratings yet
L22 - Attention in Deep Learning
65 pages
Attention: Sharad Jones
No ratings yet
Attention: Sharad Jones
25 pages
Attention Mechanism, Transformers, BERT, and GPT: Tutorial and Survey
No ratings yet
Attention Mechanism, Transformers, BERT, and GPT: Tutorial and Survey
14 pages
Modern Language Models
No ratings yet
Modern Language Models
28 pages
AA2 3.2 Attention 2024
No ratings yet
AA2 3.2 Attention 2024
58 pages
Vinija's Notes - Natural Language Processing - Attention
No ratings yet
Vinija's Notes - Natural Language Processing - Attention
27 pages
Attention Based Models
No ratings yet
Attention Based Models
39 pages
Attention
No ratings yet
Attention
15 pages
Visualizing A Neural Machine Translation Model
No ratings yet
Visualizing A Neural Machine Translation Model
38 pages
Attention Is All You Need - NIPS-2017-attention-is-all-you-need-Paper
No ratings yet
Attention Is All You Need - NIPS-2017-attention-is-all-you-need-Paper
11 pages
Attention
No ratings yet
Attention
12 pages
Original MSG
No ratings yet
Original MSG
9 pages
Online and Linear-Time Attention by Enforcing Monotonic Alignments
No ratings yet
Online and Linear-Time Attention by Enforcing Monotonic Alignments
19 pages
UNIT 2 FULL - Compressed
No ratings yet
UNIT 2 FULL - Compressed
26 pages
Cs224n Self Attention Transformers 2023 Draft
No ratings yet
Cs224n Self Attention Transformers 2023 Draft
18 pages
Sequence-To-Sequence, Attention, Transformer - Machine Learning Lecture
No ratings yet
Sequence-To-Sequence, Attention, Transformer - Machine Learning Lecture
20 pages
05 Attention Slides
No ratings yet
05 Attention Slides
69 pages
Cambridge International AS & A Level: Physics 9702/23
No ratings yet
Cambridge International AS & A Level: Physics 9702/23
12 pages
Transformer Tutorial
No ratings yet
Transformer Tutorial
14 pages
What Is A Transformer
No ratings yet
What Is A Transformer
11 pages
HI5004 Group Assignment Guideline T1.2021
No ratings yet
HI5004 Group Assignment Guideline T1.2021
15 pages
LectureLtR-neural IR 2
No ratings yet
LectureLtR-neural IR 2
52 pages
Unit5 3
No ratings yet
Unit5 3
48 pages
3.1 Language Models and Attention
No ratings yet
3.1 Language Models and Attention
22 pages
Attention Is All You Need
No ratings yet
Attention Is All You Need
19 pages
FLYFokker Leaflet Lavatory Modifications
No ratings yet
FLYFokker Leaflet Lavatory Modifications
2 pages
Attention Attention!
No ratings yet
Attention Attention!
26 pages
Pervasive Attention 2D Convolutional Neural Networks For Sequence-to-Sequence Prediction
No ratings yet
Pervasive Attention 2D Convolutional Neural Networks For Sequence-to-Sequence Prediction
11 pages
9780374533557RGGReading Group Gold
No ratings yet
9780374533557RGGReading Group Gold
5 pages
2025 Article 58786
No ratings yet
2025 Article 58786
12 pages
Attention in Neural Networks
No ratings yet
Attention in Neural Networks
8 pages
The Annotated Transformer: Alexander M. Rush
No ratings yet
The Annotated Transformer: Alexander M. Rush
9 pages
Attention - Attention! - Lil'Log
No ratings yet
Attention - Attention! - Lil'Log
23 pages
Lesson 4: Attention Is All You Need Encoder and Decoder Processes
No ratings yet
Lesson 4: Attention Is All You Need Encoder and Decoder Processes
5 pages
The Role of Academic Libraries in The Digital Transformation of The Universities
No ratings yet
The Role of Academic Libraries in The Digital Transformation of The Universities
5 pages
Attention Mechanism
No ratings yet
Attention Mechanism
21 pages
Horizontal Circular Prac
No ratings yet
Horizontal Circular Prac
3 pages
Project - Charter - Uber Expansion
No ratings yet
Project - Charter - Uber Expansion
3 pages
AMCA Standard 99-0401-86 Classification For Spark Resistant Construction - REA HVAC
No ratings yet
AMCA Standard 99-0401-86 Classification For Spark Resistant Construction - REA HVAC
2 pages
Uponor Technical Information Smatrix
No ratings yet
Uponor Technical Information Smatrix
1 page
TensorFlow构建机器学习项目: Chinese Edition
From Everand
TensorFlow构建机器学习项目: Chinese Edition
Posts & Telecom Press
No ratings yet
A Novice Guide to Arduino Programming
From Everand
A Novice Guide to Arduino Programming
Ezekiel Ochami
4.5/5 (4)

he hit me with a pie <END>

Source sentence (input)

Problems with this architecture?

On this decoder timestep, we’re

Take softmax to turn the scores

The attention output mostly contains

use to compute $\hat{y}_1$ as before

Sometimes we take the

il a m’ entarté <START> he hit

il a m’ entarté <START> he hit me

il a m’ entarté <START> he hit me with

il a m’ entarté <START> he hit me with a

• Finally we concatenate the attention output with the decoder hidden

• Attention always involves: There are

3. Using attention distribution to take weighted sum of values:

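There are several ways you can compute the attention scores $e \in \mathbb{R}^N$ from the encoder hidden states $h_1, \dots, h_N \in \mathbb{R}^{d_1}$ and the decoder hidden state $s \in \mathbb{R}^{d_2}$: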

Basic dot-product attention:

• Multiplicative attention: [Luong, Pham, and Manning 2015]

• Additive attention: [Bahdanau, Cho, and Bengio 2014]
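The three scoring variants listed above can be sketched in NumPy as follows. This is a minimal illustration, not the lecture's own code: the function names, the weight shapes, and the random toy inputs are all assumptions chosen for the example. Each function returns one score per encoder position.

```python
import numpy as np

def dot_product_scores(h, s):
    """Basic dot-product attention: e_i = s . h_i (requires d1 == d2).
    h: (N, d) encoder states, s: (d,) decoder state -> (N,) scores."""
    return h @ s

def multiplicative_scores(h, s, W):
    """Multiplicative attention: e_i = s^T W h_i.
    h: (N, d1), s: (d2,), W: (d2, d1) -> (N,) scores."""
    return h @ (W.T @ s)

def additive_scores(h, s, W1, W2, v):
    """Additive attention: e_i = v^T tanh(W1 h_i + W2 s).
    W1: (d3, d1), W2: (d3, d2), v: (d3,) -> (N,) scores."""
    return np.tanh(h @ W1.T + W2 @ s) @ v

# Toy inputs (assumed sizes): N = 4 source positions, d1 = d2 = 3, d3 = 5.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))
s = rng.normal(size=3)
W = np.eye(3)                      # identity makes multiplicative == dot-product
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(5, 3))
v = rng.normal(size=5)

e_dot = dot_product_scores(h, s)
e_mul = multiplicative_scores(h, s, W)
e_add = additive_scores(h, s, W1, W2, v)
```

With `W` set to the identity, the multiplicative score reduces to the dot-product score, which is one way to see the first variant as a special case of the second.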

• More general definition of attention:

• We sometimes say that the query attends to the values.
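As a rough sketch of a query attending to a set of values, the snippet below scores each value against the query, turns the scores into a distribution with softmax, and returns the weighted sum. It assumes dot-product scoring and that the query and values share a dimension; the names `attend` and `softmax` and the toy inputs are illustrative only.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def attend(query, values):
    """query: (d,), values: (N, d).
    Returns the attention output (d,) and the attention distribution (N,)."""
    scores = values @ query        # one score per value
    weights = softmax(scores)      # attention distribution, sums to 1
    output = weights @ values      # weighted sum of the values
    return output, weights

# Toy example: the query matches the first value far more than the others.
values = 10.0 * np.eye(3)
query = np.array([1.0, 0.0, 0.0])
output, weights = attend(query, values)
```

Here almost all of the attention weight lands on the first value, so the output is close to that value: a selective summary of the values, chosen by the query.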

Further reading: “Has AI surpassed humans at translation? Not even close!”

Didn’t specify gender

Picture source: https://www.vice.com/en_uk/article/j5npeg/why-is-google-translate-spitting-out-sinister-religious-prophecies

• Cherokee is not available on Google Translate! 😭

• NMT research continues to thrive

• Much work on getting better results on low resource languages

• You can work in teams of 1–3. Being in a team is encouraged.

• You can use any language/framework for your project

3–4 pages, due Tue Feb 8, 3:15pm on Gradescope

Due Thu Feb 24, 3:15pm on Gradescope

Abstract Prior related

This is not an exhaustive list, but most projects are one of

• Even better: look for an interesting problem in the world!

Great new sites that try to collate

Specific tasks/topics. Many, e.g.:

• But where to go? Wayne Gretzky:

• “I skate to where the puck is going, not where it has been.”

• In 2019–2022, that approach is dead

• Some people have existing data from a research project or company

• Realize the grim reality:

• It’s hard to work out what these things are

• Initially run on a tiny amount of data

• Train and run your model on a large dataset

• Look at your data, collect summary statistics
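One possible way to collect summary statistics for a labeled text dataset is sketched below. The `(text, label)` pair format, the whitespace tokenization, and the function name `summarize` are assumptions for illustration; a real project would swap in its own data loader and tokenizer.

```python
from collections import Counter
from statistics import mean, median

def summarize(examples):
    """examples: list of (text, label) pairs.
    Returns basic size, length, and label-balance statistics."""
    lengths = [len(text.split()) for text, _ in examples]  # whitespace tokens
    label_counts = Counter(label for _, label in examples)
    return {
        "n_examples": len(examples),
        "mean_len": mean(lengths),
        "median_len": median(lengths),
        "max_len": max(lengths),
        "label_counts": dict(label_counts),
    }

# Hypothetical toy dataset.
toy = [
    ("he hit me with a pie", "en"),
    ("il a m' entarté", "fr"),
    ("a pie", "en"),
]
stats = summarize(toy)
```

Numbers like these can reveal surprises early, such as heavy label imbalance or a few extremely long examples.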
