Natural Language Processing With Deep Learning CS224N/Ling284
Christopher Manning
Lecture 1: Introduction and Word Vectors
Lecture Plan
Lecture 1: Introduction and Word Vectors
1. The course (10 mins)
2. Human language and word meaning (15 mins)
3. Word2vec introduction (15 mins)
4. Word2vec objective function gradients (25 mins)
5. Optimization basics (5 mins)
6. Looking at word vectors (10 mins or less)
Course logistics in brief
• Instructor: Christopher Manning
• Head TA: Matt Lamm
• Coordinator: Amelie Byun
• TAs: Many wonderful people! See website
• Time: TuTh 4:30–5:50, Nvidia Aud (→ video)
• Slides uploaded before each lecture
What do we hope to teach?
1. An understanding of the effective modern methods for deep
learning
• Basics first, then key methods used in NLP: Recurrent
networks, attention, transformers, etc.
2. A big picture understanding of human languages and the
difficulties in understanding and producing them
3. An understanding of and ability to build systems (in PyTorch)
for some of the major problems in NLP:
• Word meaning, dependency parsing, machine translation,
question answering
Course work and grading policy
• 5 x 1-week Assignments: 6% + 4 x 12% = 54%
• HW1 is released today! Due next Tuesday! At 4:30 p.m.
• Please use @stanford.edu email for your Gradescope account
• Final Default or Custom Course Project (1–3 people): 43%
• Project proposal: 5%, milestone: 5%, poster: 3%, report: 30%
• Final poster session attendance expected! (See website.)
• Wed Mar 20, 5pm-10pm (put it in your calendar!)
• Participation: 3%
• (Guest) lecture attendance, Piazza, evals, karma – see website!
• Late day policy
• 6 free late days; afterwards, 1% off course grade per day late
• Assignments not accepted after 3 late days per assignment
• Collaboration policy: Read the website and the Honor Code!
  • Understand allowed ‘collaboration’ and how to document it
High-Level Plan for Problem Sets
• HW1 is hopefully an easy on-ramp – an IPython Notebook
• HW2 is pure Python (numpy) but expects you to do
(multivariate) calculus so you really understand the basics
• HW3 introduces PyTorch
• HW4 and HW5 use PyTorch on a GPU (Microsoft Azure)
• Libraries like PyTorch and TensorFlow are becoming the
standard tools of DL
• For the Final Project (FP), you either
• Do the default project, which is SQuAD question answering
• Open-ended but an easier start; a good choice for many
• Propose a custom final project, which we approve
• You will receive feedback from a mentor (TA/prof/postdoc/PhD)
• Can work in teams of 1–3; can use any language
Lecture Plan
1. The course (10 mins)
2. Human language and word meaning (15 mins)
3. Word2vec introduction (15 mins)
4. Word2vec objective function gradients (25 mins)
5. Optimization basics (5 mins)
6. Looking at word vectors (10 mins or less)
https://xkcd.com/1576/ Randall Munroe, CC BY-NC 2.5
How do we represent the meaning of a word?
A common first solution: use a thesaurus like WordNet, which contains lists of synonym sets and hypernyms (“is-a” relationships). For example, the synonym sets containing “good”:

noun: good
noun: good, goodness
noun: good, goodness
noun: commodity, trade_good, good
adj: good
adj (sat): full, good
adj: good
adj (sat): estimable, good, honorable, respectable
adj (sat): beneficial, good
adj (sat): good
adj (sat): good, just, upright
…
adverb: well, good
adverb: thoroughly, soundly, good

and a hypernym chain, running from a specific noun synset up to the most general one:

[Synset('procyonid.n.01'),
 Synset('carnivore.n.01'),
 Synset('placental.n.01'),
 Synset('mammal.n.01'),
 Synset('vertebrate.n.01'),
 Synset('chordate.n.01'),
 Synset('animal.n.01'),
 Synset('organism.n.01'),
 Synset('living_thing.n.01'),
 Synset('whole.n.02'),
 Synset('object.n.01'),
 Synset('physical_entity.n.01'),
 Synset('entity.n.01')]
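A minimal sketch of how entries like these can be pulled from WordNet with NLTK; the library calls are standard NLTK, but using the “panda” synset to reproduce a hypernym chain like the one above is an assumption.

```python
# A sketch of querying WordNet via NLTK (assumes `pip install nltk`)
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

# Synonym sets (synsets) containing "good", with part of speech and member lemmas
for synset in wn.synsets("good"):
    print(synset.pos(), ":", ", ".join(lemma.name() for lemma in synset.lemmas()))

# Hypernym ("is-a") chain of a noun synset, walked up to the root 'entity'
panda = wn.synset("panda.n.01")   # assumption: any noun synset works here
print(list(panda.closure(lambda s: s.hypernyms())))
```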
Problems with resources like WordNet
• Subjective
• Requires human labor to create and adapt
• Can’t compute accurate word similarity
Representing words as discrete symbols
Words can be represented by one-hot vectors, e.g.:
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
But: these two vectors are orthogonal.
There is no natural notion of similarity for one-hot vectors!
Solution:
• Could try to rely on WordNet’s list of synonyms to get similarity?
• But it is well-known to fail badly: incompleteness, etc.
• Instead: learn to encode similarity in the vectors themselves
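To see the orthogonality problem concretely, a small numpy sketch (the vocabulary size and one-hot positions mirror the vectors above):

```python
import numpy as np

vocab_size = 15
motel = np.zeros(vocab_size)
hotel = np.zeros(vocab_size)
motel[10] = 1.0   # position of the 1 matches the motel vector above
hotel[7] = 1.0    # position of the 1 matches the hotel vector above

# The dot product (and hence cosine similarity) of two distinct one-hot vectors is 0
print(motel @ hotel)                                                    # 0.0
print(motel @ hotel / (np.linalg.norm(motel) * np.linalg.norm(hotel)))  # 0.0
```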
Representing words by their context
Distributional semantics: a word’s meaning is given by the words that frequently appear close by. We build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts, e.g.:

banking = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]

expect = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271, 0.487]
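Dense vectors, by contrast, give a graded notion of similarity via dot products; a small sketch using the 8-dimensional “banking” vector above and a second made-up vector (illustrative numbers only):

```python
import numpy as np

banking = np.array([0.286, 0.792, -0.177, -0.107, 0.109, -0.542, 0.349, 0.271])
crisis = np.array([0.301, 0.705, -0.200, -0.090, 0.150, -0.490, 0.310, 0.300])  # made up

def cosine(u, v):
    """Cosine similarity: 1 = same direction, 0 = orthogonal, -1 = opposite."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(banking, crisis))    # close to 1: the two vectors point the same way
print(cosine(banking, -banking))  # exactly -1: opposite directions
```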
3. Word2vec: Overview
Word2vec (Mikolov et al. 2013) is a framework for learning
word vectors
Idea:
• We have a large corpus of text
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word
c and context (“outside”) words o
• Use the similarity of the word vectors for c and o to calculate
the probability of o given c (or vice versa)
• Keep adjusting the word vectors to maximize this probability
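A sketch of the sliding-window process just described, collecting (center, outside) word pairs from a toy corpus (the sentence and window size are illustrative):

```python
# Enumerate (center, outside) pairs for every position t with a fixed window size m
corpus = "problems turning into banking crises as governments printed money".split()
m = 2

pairs = []
for t, center in enumerate(corpus):
    for j in range(-m, m + 1):
        if j != 0 and 0 <= t + j < len(corpus):
            pairs.append((center, corpus[t + j]))

print(pairs[:6])
# [('problems', 'turning'), ('problems', 'into'), ('turning', 'problems'), ...]
```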
Word2Vec Overview
• Example windows and process for computing P(w_{t+j} | w_t): with w_t as the center word, compute
  P(w_{t−2} | w_t), P(w_{t−1} | w_t), P(w_{t+1} | w_t), P(w_{t+2} | w_t)
• The window then slides so that the next word becomes the center word, and the same probabilities are computed again
Word2vec: objective function
For each position t = 1, …, T, predict context words within a window of fixed size m, given center word w_t:

Likelihood = L(θ) = ∏_{t=1}^{T} ∏_{−m ≤ j ≤ m, j ≠ 0} P(w_{t+j} | w_t; θ)

where θ is all the variables to be optimized.

The objective function J(θ) (sometimes called the cost or loss function) is the average negative log-likelihood:

J(θ) = −(1/T) log L(θ) = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t; θ)

To calculate P(w_{t+j} | w_t), use two vectors per word w: v_w when w is a center word and u_w when w is a context (“outside”) word. Then, for a center word c and an outside word o:

P(o | c) = exp(u_oᵀ v_c) / ∑_{w ∈ V} exp(u_wᵀ v_c)
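A numpy sketch of these formulas on a toy vocabulary (random, untrained vectors; names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["problems", "turning", "into", "banking", "crises"]
idx = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 4
center_vecs = rng.normal(scale=0.1, size=(V, d))   # v_w for every word w
outside_vecs = rng.normal(scale=0.1, size=(V, d))  # u_w for every word w

def p_o_given_c(o, c):
    """P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = outside_vecs @ center_vecs[idx[c]]    # u_w . v_c for all w
    scores -= scores.max()                         # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[idx[o]]

# Loss contributed by one position t (center "into", window size 2);
# J(theta) averages such sums over all positions t = 1..T in the corpus.
outside_words = ["problems", "turning", "banking", "crises"]
loss_at_t = -sum(np.log(p_o_given_c(o, "into")) for o in outside_words)
print(loss_at_t)
```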
Word2Vec Overview with Vectors
• Example windows and process for computing P(w_{t+j} | w_t)
• P(u_problems | v_into) is short for P(problems | into; u_problems, v_into, θ)
Word2vec: prediction function
P(o | c) = exp(u_oᵀ v_c) / ∑_{w ∈ V} exp(u_wᵀ v_c)

① The dot product compares the similarity of o and c: uᵀv = u · v = ∑_{i=1}^{n} u_i v_i. Larger dot product = larger probability.
② Exponentiation makes anything positive.
③ Normalizing over the entire vocabulary gives a probability distribution.
To train the model: Compute all vector gradients!
• Recall: 𝜃 represents all model parameters, in one long vector
• In our case, with d-dimensional vectors and V-many words, θ stacks the center vector v_w and the outside vector u_w of every word into one long vector, so θ ∈ ℝ^{2dV}
Chain Rule
• Simple example:
With y = 5(x³ + 7)⁴, write y = f(u) = 5u⁴ and u = g(x) = x³ + 7; then the chain rule dy/dx = (dy/du)(du/dx) gives

dy/dx = 20(x³ + 7)³ · 3x²
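A quick numerical sanity check of this derivative (the test point x = 2.0 is arbitrary):

```python
f = lambda x: 5 * (x**3 + 7)**4
dfdx = lambda x: 20 * (x**3 + 7)**3 * 3 * x**2   # the chain-rule result above

x, eps = 2.0, 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)  # central finite difference
print(dfdx(x), numeric)                          # 810000.0 vs. ~810000 -- they agree
```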
Interactive Whiteboard Session!
Let’s derive the gradient of log p(o | c) with respect to the center vector v_c:

log p(o | c) = log [ exp(u_oᵀ v_c) / ∑_{w=1}^{V} exp(u_wᵀ v_c) ]
You then also need the gradient for context words (it’s similar;
left for homework). That’s all of the parameters 𝜃 here.
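The derivation yields ∂/∂v_c log p(o | c) = u_o − ∑_w p(w | c) u_w, i.e. the observed outside vector minus the model’s expected outside vector. A sketch that checks this analytic gradient against finite differences (random, illustrative vectors):

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 6, 5
U = rng.normal(size=(V, d))   # outside vectors u_w, one per row
v_c = rng.normal(size=d)      # center vector v_c
o = 2                         # index of the observed outside word

def log_p(o, v_c):
    scores = U @ v_c
    return scores[o] - np.log(np.exp(scores).sum())

# Analytic gradient: u_o minus the expected outside vector under the model
probs = np.exp(U @ v_c)
probs /= probs.sum()
grad_analytic = U[o] - probs @ U

# Central finite differences, one coordinate of v_c at a time
eps = 1e-6
grad_numeric = np.array([
    (log_p(o, v_c + eps * np.eye(d)[i]) - log_p(o, v_c - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])
print(np.max(np.abs(grad_analytic - grad_numeric)))  # tiny, e.g. ~1e-9
```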
Calculating all gradients!
• We went through the gradient for each center vector v in a window
• We also need gradients for outside vectors u
• Derive at home!
• Generally, in each window we will compute updates for all parameters that are being used in that window; for example (see the sketch below):
Note: Our objectives may not be convex like this :(
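A sketch of what “updates for all parameters used in that window” means for the naive-softmax loss: within one window, gradients accumulate for the center vector v_c and for every outside vector u_w (indices and vectors are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
V, d = 6, 5
U = rng.normal(scale=0.1, size=(V, d))    # outside vectors u_w
W = rng.normal(scale=0.1, size=(V, d))    # center vectors v_w
c, outside = 3, [1, 2, 4, 5]              # word indices in one window

grad_U = np.zeros_like(U)
grad_W = np.zeros_like(W)
for o in outside:                         # one -log p(o|c) term per outside word
    scores = U @ W[c]
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    grad_W[c] += -(U[o] - probs @ U)      # center-vector gradient
    grad_U += np.outer(probs, W[c])       # every outside vector gets its "expected" share
    grad_U[o] -= W[c]                     # the observed outside word gets the extra -v_c

# A gradient step would then update exactly the rows touched above.
```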
Gradient Descent
• Update equation (in matrix notation): θ_new = θ_old − α ∇_θ J(θ), where α is the step size (learning rate)
• Algorithm: repeatedly evaluate the gradient of J over the whole corpus and take a small step in the negative-gradient direction (see the sketch below)
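A minimal sketch of the batch gradient-descent loop on a toy objective J(θ) = ½‖θ‖² (the objective, step size, and iteration count are illustrative placeholders for the real word2vec setup):

```python
import numpy as np

def grad_J(theta):
    # Gradient of the toy objective J(theta) = 0.5 * ||theta||^2
    return theta

theta = np.array([3.0, -2.0])
alpha = 0.1                                  # step size (learning rate)
for _ in range(100):
    theta = theta - alpha * grad_J(theta)    # theta_new = theta_old - alpha * grad J(theta)
print(theta)                                 # approaches the minimum at [0, 0]
```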
Stochastic Gradient Descent
• Problem: J(θ) is a function of all windows in the corpus (potentially billions!)
• So ∇_θ J(θ) is very expensive to compute
• You would wait a very long time before making a single update!
• Solution: stochastic gradient descent (SGD) – repeatedly sample windows, and update after each one (see the sketch below)
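A sketch of the stochastic alternative: sample a single window (or small minibatch), compute its gradient, and update immediately (toy per-window losses, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-in data: pretend window i contributes loss 0.5 * (theta - x_i)^2,
# so the full-corpus optimum is the mean of all x_i (about 4.0 here).
x = rng.normal(loc=4.0, size=10_000)

theta, alpha = 0.0, 0.05
for step in range(2_000):
    i = rng.integers(len(x))            # sample ONE "window" instead of the whole corpus
    theta -= alpha * (theta - x[i])     # noisy gradient step from that single sample
print(theta)                            # hovers near 4.0, the full-batch solution
```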
Lecture Plan
1. The course (10 mins)
2. Human language and word meaning (15 mins)
3. Word2vec introduction (15 mins)
4. Word2vec objective function gradients (25 mins)
5. Optimization basics (5 mins)
6. Looking at word vectors (10 mins or less)
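For the final section, a sketch of the kind of word-vector exploration one can do with pretrained vectors; the use of gensim and the “glove-wiki-gigaword-100” download are assumptions, not part of these slides:

```python
# pip install gensim -- the first call downloads pretrained GloVe vectors
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")            # 100-dimensional word vectors
print(wv.most_similar("banking", topn=5))           # nearest neighbours by cosine
print(wv.most_similar(positive=["king", "woman"],   # the classic analogy query:
                      negative=["man"], topn=1))    # king - man + woman ≈ queen
```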