
Natural Language Processing

with Deep Learning


CS224N/Ling284

Christopher Manning
Lecture 1: Introduction and Word Vectors
Lecture Plan
Lecture 1: Introduction and Word Vectors
1. The course (10 mins)
2. Human language and word meaning (15 mins)
3. Word2vec introduction (15 mins)
4. Word2vec objective function gradients (25 mins)
5. Optimization basics (5 mins)
6. Looking at word vectors (10 mins or less)

Key learning today: The (really surprising!) result that word meaning can be represented
rather well by a (high-dimensional) vector of real numbers

2
Course logistics in brief
• Instructor: Christopher Manning
• Head TA: Anna Goldie
• Coordinator: Amelie Byun
• TAs: Many wonderful people! See website
• Time: Tu/Th 3:15–4:45 Pacific time, Zoom U. (→ video)

• We’ve put a lot of other important information on the class webpage. Please read it!
• http://cs224n.stanford.edu/
a.k.a., http://www.stanford.edu/class/cs224n/
• TAs, syllabus, help sessions/office hours, Ed (for all course questions/discussion)
• Office hours start Thursday evening!
• Python/numpy and then PyTorch tutorials: First two Fridays 1:30–2:30 Pacific time on Zoom U.
• Slide PDFs uploaded before each lecture
3
4
What do we hope to teach? (A.k.a. “learning goals”)
1. The foundations of the effective modern methods for deep learning applied to NLP
• Basics first, then key methods used in NLP: Word vectors, feed-forward networks,
recurrent networks, attention, encoder-decoder models, transformers, etc.

2. A big picture understanding of human languages and the difficulties in understanding
and producing them via computers

3. An understanding of and ability to build systems (in PyTorch) for some of the major
problems in NLP:
• Word meaning, dependency parsing, machine translation, question answering

5
Course work and grading policy
• 5 x 1-week Assignments: 6% + 4 x 12%: 54%
• HW1 is released today! Due next Tuesday! At 3:15 p.m.
• Submitted to Gradescope in Canvas (i.e., using @stanford.edu email for your Gradescope account)
• Final Default or Custom Course Project (1–3 people): 43%
• Project proposal: 5%, milestone: 5%, poster or web summary: 3%, report: 30%
• Participation: 3%
• Guest lecture reactions, Ed, course evals, karma – see website!
• Late day policy
• 6 free late days; afterwards, 1% off course grade per day late
• Assignments not accepted more than 3 days late per assignment unless given permission in advance
• Collaboration policy: Please read the website and the Honor Code!
Understand allowed collaboration and how to document it: Don’t take code off the
web; acknowledge working with other students; write your own assignment solutions
6
High-Level Plan for Assignments (to be completed individually!)
• Ass1 is hopefully an easy on-ramp – a Jupyter/IPython Notebook
• Ass2 is pure Python (numpy) but expects you to do (multivariate) calculus so you really
understand the basics
• Ass3 introduces PyTorch, building a feed-forward network for dependency parsing
• Ass4 and Ass5 use PyTorch on a GPU (Microsoft Azure)
• Libraries like PyTorch and Tensorflow are now the standard tools of DL
• For Final Project, more details presented later, but you either:
• Do the default project, which is a question answering system
• Open-ended but an easier start; a good choice for many
• Propose a custom final project, which we approve
• You will receive feedback from a mentor (TA/prof/postdoc/PhD)
• Can work in teams of 1–3; can use any language/packages
7
Lecture Plan
1. The course (10 mins)
2. Human language and word meaning (15 mins)
3. Word2vec introduction (15 mins)
4. Word2vec objective function gradients (25 mins)
5. Optimization basics (5 mins)
6. Looking at word vectors (10 mins or less)

8
https://xkcd.com/1576/ Randall Munroe CC BY NC 2.5
Trained on text data, neural machine translation is quite good!

https://kiswahili.tuko.co.ke/
GPT-3: A first step on the path to foundation models
The SEC said, “Musk, your tweets are a blight.
They really could cost you your job,
if you don't stop all this tweeting at night.”
Then Musk cried, “Why?
The tweets I wrote are not mean,
I don't use all-caps
and I'm sure that my tweets are clean.”
“But your tweets can move markets
and that's why we're sore.
You may be a genius and a billionaire,
but it doesn't give you the right to be a bore!”

S: I broke the window.
Q: What did I break?
S: I gracefully saved the day.
Q: What did I gracefully save?
S: I gave John flowers.
Q: Who did I give flowers to?
S: I gave her a rose and a guitar.
Q: Who did I give a rose and a guitar to?

How many users have signed up since the start of 2020?
SELECT count(id) FROM users
WHERE created_at > ‘2020-01-01’

What is the average number of influencers each user is subscribed to?
SELECT avg(count) FROM ( SELECT user_id, count(*)
FROM subscribers GROUP BY user_id )
AS avg_subscriptions_per_user
How do we represent the meaning of a word?

Definition: meaning (Webster dictionary)


• the idea that is represented by a word, phrase, etc.
• the idea that a person wants to express by using words, signs, etc.
• the idea that is expressed in a work of writing, art, etc.

Commonest linguistic way of thinking of meaning:


signifier (symbol) ⟺ signified (idea or thing)
= denotational semantics

tree ⟺ {🌳, 🌲, 🌴, …}
13
How do we have usable meaning in a computer?
Previously commonest NLP solution: Use, e.g., WordNet, a thesaurus containing lists of
synonym sets and hypernyms (“is a” relationships)
e.g., synonym sets containing “good”:

from nltk.corpus import wordnet as wn
poses = {'n': 'noun', 'v': 'verb', 's': 'adj (s)', 'a': 'adj', 'r': 'adv'}
for synset in wn.synsets("good"):
    print("{}: {}".format(poses[synset.pos()],
          ", ".join([l.name() for l in synset.lemmas()])))

noun: good
noun: good, goodness
noun: good, goodness
noun: commodity, trade_good, good
adj: good
adj (sat): full, good
adj: good
adj (sat): estimable, good, honorable, respectable
adj (sat): beneficial, good
adj (sat): good
adj (sat): good, just, upright
adverb: well, good
adverb: thoroughly, soundly, good

e.g., hypernyms of “panda”:

from nltk.corpus import wordnet as wn
panda = wn.synset("panda.n.01")
hyper = lambda s: s.hypernyms()
list(panda.closure(hyper))

[Synset('procyonid.n.01'),
 Synset('carnivore.n.01'),
 Synset('placental.n.01'),
 Synset('mammal.n.01'),
 Synset('vertebrate.n.01'),
 Synset('chordate.n.01'),
 Synset('animal.n.01'),
 Synset('organism.n.01'),
 Synset('living_thing.n.01'),
 Synset('whole.n.02'),
 Synset('object.n.01'),
 Synset('physical_entity.n.01'),
 Synset('entity.n.01')]

14
Problems with resources like WordNet
• A useful resource but missing nuance:
• e.g., “proficient” is listed as a synonym for “good”
This is only correct in some contexts
• Also, WordNet lists offensive synonyms in some synonym sets without any
coverage of the connotations or appropriateness of words
• Missing new meanings of words:
• e.g., wicked, badass, nifty, wizard, genius, ninja, bombest
• Impossible to keep up-to-date!
• Subjective
• Requires human labor to create and adapt
• Can’t be used to accurately compute word similarity (see following slides)

15
Representing words as discrete symbols
In traditional NLP, we regard words as discrete symbols:
hotel, conference, motel – a localist representation

Such symbols for words can be represented by one-hot vectors
(one-hot means one 1, the rest 0s):

motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

Vector dimension = number of words in vocabulary (e.g., 500,000+)

16
Sec. 9.2.2

Problem with words as discrete symbols


Example: in web search, if a user searches for “Seattle motel”, we would like to match
documents containing “Seattle hotel”

But:
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
These two vectors are orthogonal
There is no natural notion of similarity for one-hot vectors!

Solution:
• Could try to rely on WordNet’s list of synonyms to get similarity?
• But it is well-known to fail badly: incompleteness, etc.
• Instead: learn to encode similarity in the vectors themselves
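
To make the orthogonality point concrete, here is a minimal numpy check (the vocabulary size and word indices are just the toy values from the one-hot vectors above):

import numpy as np

V = 15                                  # toy vocabulary size (real vocabularies: 500,000+)
motel = np.zeros(V); motel[10] = 1.0    # one-hot vector for "motel" (index as in the example above)
hotel = np.zeros(V); hotel[7] = 1.0     # one-hot vector for "hotel"

print(motel @ hotel)   # 0.0 – orthogonal, so no notion of similarity
print(motel @ motel)   # 1.0 – a word only "matches" itself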
17
Representing words by their context
• Distributional semantics: A word’s meaning is given
by the words that frequently appear close-by
• “You shall know a word by the company it keeps” (J. R. Firth 1957: 11)
• One of the most successful ideas of modern statistical NLP!
• When a word w appears in a text, its context is the set of words that appear nearby
(within a fixed-size window).
• We use the many contexts of w to build up a representation of w

…government debt problems turning into banking crises as happened in 2009…

…saying that Europe needs unified banking regulation to replace the hodgepodge…

…India has just given its banking system a shot in the arm…

These context words will represent banking
18


Word vectors
We will build a dense vector for each word, chosen so that it is similar to vectors of words
that appear in similar contexts, measuring similarity as the vector dot (scalar) product

banking  = [ 0.286,  0.792, −0.177, −0.107,  0.109, −0.542,  0.349,  0.271 ]
monetary = [ 0.413,  0.582, −0.007,  0.247,  0.216, −0.718,  0.147,  0.051 ]

Note: word vectors are also called (word) embeddings or (neural) word representations
They are a distributed representation
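
As a quick illustration, a small numpy sketch using the eight example values above (illustrative numbers, not trained embeddings), with the dot product as the similarity measure:

import numpy as np

banking  = np.array([0.286, 0.792, -0.177, -0.107, 0.109, -0.542, 0.349, 0.271])
monetary = np.array([0.413, 0.582, -0.007, 0.247, 0.216, -0.718, 0.147, 0.051])

print(banking @ monetary)             # dot-product similarity (positive here)
cos = banking @ monetary / (np.linalg.norm(banking) * np.linalg.norm(monetary))
print(cos)                            # cosine similarity: the length-normalized version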
19
Word meaning as a neural word vector – visualization

expect = [ 0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271, 0.487 ]

20
3. Word2vec: Overview
Word2vec (Mikolov et al. 2013) is a framework for learning word vectors

Idea:
• We have a large corpus (“body”) of text: a long list of words
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word c and context
(“outside”) words o
• Use the similarity of the word vectors for c and o to calculate the probability of o given
c (or vice versa)
• Keep adjusting the word vectors to maximize this probability

21
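
A rough sketch of that loop in Python (a tiny hypothetical corpus, window size m = 2):

corpus = "problems turning into banking crises as".split()
m = 2   # window size

# Go through each position t: the word there is the center word c,
# and each word within distance m of it is an outside (context) word o.
for t, center in enumerate(corpus):
    for j in range(-m, m + 1):
        if j == 0 or not (0 <= t + j < len(corpus)):
            continue
        outside = corpus[t + j]
        print(f"P({outside} | {center})")   # probability the model should make large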
Word2Vec Overview
Example windows and process for computing P(w_{t+j} | w_t).
Here the center word at position t is “into”, with outside context words in a window of size 2 on each side:

  … problems turning into banking crises as …
  center word at position t: into
  outside context words (window of size 2): problems, turning, banking, crises
  probabilities computed: P(w_{t−2} | w_t), P(w_{t−1} | w_t), P(w_{t+1} | w_t), P(w_{t+2} | w_t)

22
Word2Vec Overview
Example windows and process for computing P(w_{t+j} | w_t).
The window then slides along the text; here the center word at position t is “banking”:

  … problems turning into banking crises as …
  center word at position t: banking
  outside context words (window of size 2): turning, into, crises, as
  probabilities computed: P(w_{t−2} | w_t), P(w_{t−1} | w_t), P(w_{t+1} | w_t), P(w_{t+2} | w_t)

23
Word2vec: objective function
For each position t = 1, …, T, predict context words within a window of fixed size m, given
center word w_t. Data likelihood:

  Likelihood = L(θ) = ∏_{t=1}^{T}  ∏_{−m ≤ j ≤ m, j ≠ 0}  P(w_{t+j} | w_t ; θ)

where θ is all the variables to be optimized.

The objective function J(θ) (sometimes called a cost or loss function) is the (average) negative
log likelihood:

  J(θ) = −(1/T) log L(θ) = −(1/T) ∑_{t=1}^{T}  ∑_{−m ≤ j ≤ m, j ≠ 0}  log P(w_{t+j} | w_t ; θ)

Minimizing objective function ⟺ Maximizing predictive accuracy
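
As a concrete (toy) instance of the formula, a short sketch that evaluates J(θ) given some probability function, here a hypothetical uniform placeholder, since the word2vec form of P(w_{t+j} | w_t ; θ) is only defined on the next slide:

import numpy as np

corpus = "problems turning into banking crises as".split()
m = 2   # window size

def prob(outside, center):
    return 1.0 / 10_000            # placeholder for P(w_{t+j} | w_t ; θ)

T = len(corpus)
J = 0.0
for t in range(T):
    for j in range(-m, m + 1):
        if j == 0 or not (0 <= t + j < T):
            continue
        J -= np.log(prob(corpus[t + j], corpus[t]))
J /= T                             # average negative log likelihood
print(J)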
24
Word2vec: objective function
• We want to minimize the objective function:
  J(θ) = −(1/T) ∑_{t=1}^{T}  ∑_{−m ≤ j ≤ m, j ≠ 0}  log P(w_{t+j} | w_t ; θ)

• Question: How do we calculate P(w_{t+j} | w_t ; θ)?

• Answer: We will use two vectors per word w:
  • v_w when w is a center word
  • u_w when w is a context word
• Then for a center word c and a context word o:

  P(o | c) = exp(u_o^T v_c) / ∑_{w ∈ V} exp(u_w^T v_c)
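
A minimal numpy sketch of this probability, with the outside vectors u_w stacked as rows of U and the center vectors v_w as rows of Vmat (random toy values standing in for learned parameters):

import numpy as np

rng = np.random.default_rng(0)
vocab = "problems turning into banking crises as".split()
d = 8
U    = rng.normal(size=(len(vocab), d))   # u_w for every word w (context role)
Vmat = rng.normal(size=(len(vocab), d))   # v_w for every word w (center role)

def p_o_given_c(o, c):
    scores = U @ Vmat[vocab.index(c)]         # u_w · v_c for every w in the vocabulary
    exp_s = np.exp(scores - scores.max())     # max-shift for numerical stability
    return exp_s[vocab.index(o)] / exp_s.sum()

print(p_o_given_c("crises", "banking"))       # P(o = "crises" | c = "banking")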
25
Word2Vec with Vectors
• Example windows and process for computing P(w_{t+j} | w_t)
• P(u_problems | v_into) is short for P(problems | into ; u_problems, v_into, θ)
• All word vectors θ appear in the denominator

  … problems turning into banking crises as …
  center word at position t: into
  probabilities computed (window of size 2):
    P(u_problems | v_into)   P(u_turning | v_into)   P(u_banking | v_into)   P(u_crises | v_into)

26
Word2vec: prediction function
① The dot product compares the similarity of o and c:
   u^T v = u · v = ∑_{i=1}^{n} u_i v_i
   Larger dot product = larger probability
② Exponentiation makes anything positive
③ Normalizing over the entire vocabulary gives a probability distribution:

  P(o | c) = exp(u_o^T v_c) / ∑_{w ∈ V} exp(u_w^T v_c)

• This is an example of the softmax function ℝ^n → (0,1)^n (an open region):

  softmax(x_i) = exp(x_i) / ∑_{j=1}^{n} exp(x_j) = p_i

• The softmax function maps arbitrary values x_i to a probability distribution p_i
  • “max” because it amplifies the probability of the largest x_i
  • “soft” because it still assigns some probability to smaller x_i
  (but it is sort of a weird name, because it returns a whole distribution!)
• Frequently used in Deep Learning
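
A tiny demonstration of that behaviour (arbitrary example scores):

import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return exp_x / exp_x.sum()

print(softmax(np.array([1.0, 2.0, 5.0])))
# Most of the probability mass goes to the largest score ("max"),
# but the smaller scores still get some ("soft").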

27
To train the model: Optimize value of parameters to minimize loss
To train a model, we gradually adjust parameters to minimize a loss

• Recall: θ represents all the model parameters, in one long vector
• In our case, with d-dimensional vectors and V-many words, we have θ ∈ ℝ^{2dV}
• Remember: every word has two vectors

• We optimize these parameters by walking down the gradient
• We compute all vector gradients!
28
4. Word2vec objective function gradients
(slides 29–32)
5. Optimization: Gradient Descent
• We have a cost function J(θ) we want to minimize
• Gradient Descent is an algorithm to minimize J(θ)
• Idea: for the current value of θ, calculate the gradient of J(θ), then take a small step in the
  direction of the negative gradient. Repeat.

Note: Our objectives may not be convex like this ☹
But life turns out to be okay ☺

33
Gradient Descent
• Update equation (in matrix notation):

    θ_new = θ_old − α ∇_θ J(θ)        (α = step size or learning rate)

• Update equation (for a single parameter):

    θ_j_new = θ_j_old − α ∂J(θ)/∂θ_j

• Algorithm:
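
A self-contained sketch of this update loop, using a toy convex objective rather than the word2vec loss (names and step count are arbitrary):

import numpy as np

def J(theta):                 # toy convex objective, just for illustration
    return np.sum(theta ** 2)

def grad_J(theta):            # its gradient
    return 2 * theta

theta = np.array([3.0, -2.0])
alpha = 0.1                   # step size / learning rate
for _ in range(100):
    theta = theta - alpha * grad_J(theta)   # step in the direction of the negative gradient
print(theta, J(theta))        # theta approaches the minimum at the origin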

34
Stochastic Gradient Descent
• Problem: J(θ) is a function of all windows in the corpus (potentially billions!)
• So ∇_θ J(θ) is very expensive to compute
• You would wait a very long time before making a single update!

• Very bad idea for pretty much all neural nets!


• Solution: Stochastic gradient descent (SGD)
• Repeatedly sample windows, and update after each one
• Algorithm:
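
A self-contained sketch of the stochastic version on a toy problem: each sampled data point plays the role of one window, and we update immediately after each sample (all values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, size=1000)   # stand-in for the corpus windows
theta, alpha = 0.0, 0.01                # one parameter; estimate the mean of the data

for step in range(10_000):
    x = rng.choice(data)                # sample one "window"
    grad = 2 * (theta - x)              # gradient of the per-sample loss (theta - x)^2
    theta = theta - alpha * grad        # update right away, without touching the rest of the data
print(theta)                            # ends up close to 5.0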

35
Lecture Plan
1. The course (10 mins)
2. Human language and word meaning (15 mins)
3. Word2vec introduction (15 mins)
4. Word2vec objective function gradients (25 mins)
5. Optimization basics (5 mins)
6. Looking at word vectors (10 mins or less)
• See Jupyter Notebook

36
37
4. Word2vec derivations of gradient
• Zoom U. Whiteboard – see video or revised slides
• The basic Lego piece: The chain rule
• Useful basic fact:  ∂(x^T a)/∂x = ∂(a^T x)/∂x = a

• If in doubt: write it out with indices

38
Chain Rule
• Chain rule! If y = f(u) and u = g(x), i.e., y = f(g(x)), then:

    dy/dx = (dy/du)(du/dx)

• Simple example: y = 5u⁴ with u = x³ + 7, i.e., y = 5(x³ + 7)⁴:

    dy/dx = (dy/du)(du/dx) = 20u³ · 3x² = 20(x³ + 7)³ · 3x²
39
Interactive Whiteboard Session!

Let’s derive gradient for center word together


For one example window and one example outside word:

  log p(o | c) = log [ exp(u_o^T v_c) / ∑_{w=1}^{V} exp(u_w^T v_c) ]

You then also need the gradient for context words (it’s similar;
left for homework). That’s all of the parameters 𝜃 here.
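
For reference, a sketch of the center-word result, using the naive-softmax definition of p(o | c) above:

∂/∂v_c log p(o | c)
  = ∂/∂v_c [ u_o^T v_c − log ∑_{w=1}^{V} exp(u_w^T v_c) ]
  = u_o − ∑_{w=1}^{V} [ exp(u_w^T v_c) / ∑_{x=1}^{V} exp(u_x^T v_c) ] u_w
  = u_o − ∑_{w=1}^{V} p(w | c) u_w

That is, the observed outside vector minus the expected outside vector under the model.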
40
Calculating all gradients!
• We went through the gradient for each center vector v in a window
• We also need gradients for outside vectors u
• Derive at home!
• Generally, in each window we will compute updates for all parameters that are being
used in that window. For example:

  P(u_turning | v_banking)   P(u_into | v_banking)   P(u_crises | v_banking)   P(u_as | v_banking)

  … problems turning into banking crises as …
  center word at position t: banking
  outside context words in window of size 2: turning, into, crises, as
41
Word2vec: More details
Why two vectors? → Easier optimization. Average both at the end.

Two model variants:


1. Skip-grams (SG)
Predict context (“outside”) words (position independent) given center word
2. Continuous Bag of Words (CBOW)
Predict center word from (bag of) context words
This lecture so far: Skip-gram model

Additional efficiency in training:


1. Negative sampling
So far: Focus on naïve softmax (simpler but more expensive training method)
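
Pulling the pieces together, a rough numpy sketch of one naive-softmax SGD update for a single (center, outside) pair. The vocabulary and hyperparameters are toy values, and only the center-vector update is shown (the outside-vector gradient is the homework part):

import numpy as np

rng = np.random.default_rng(0)
vocab = "problems turning into banking crises as".split()
d, alpha = 8, 0.05
U    = 0.01 * rng.normal(size=(len(vocab), d))   # outside vectors u_w
Vmat = 0.01 * rng.normal(size=(len(vocab), d))   # center vectors v_w

def sgd_step(c, o):
    c_i, o_i = vocab.index(c), vocab.index(o)
    scores = U @ Vmat[c_i]
    p = np.exp(scores - scores.max()); p /= p.sum()   # P(w | c) for every w (full softmax)
    grad_vc = -(U[o_i] - p @ U)     # ∂/∂v_c of −log P(o | c): −(u_o − Σ_w P(w|c) u_w)
    Vmat[c_i] -= alpha * grad_vc    # SGD step for the center vector

sgd_step("banking", "crises")       # one update for one (center, outside) pair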

42
