
CS 585

Natural Language Processing

October 15, 2024


Announcements / Reminders
 Please follow the Week 08 To Do List instructions (if you haven't
already)

 Programming Assignment #02 due on Sunday (10/20/24) at 11:59 PM CST
(extended from the original Sunday 10/13/24 due date)

2
Plan for Today
 Introduction to Neural Networks

3
Introduction to Neural Networks

4
Main Machine Learning Categories
Supervised learning
Supervised learning is one of the most common techniques in machine learning. It is based on known relationship(s) and patterns within data (for example: the relationship between inputs and outputs). Frequently used types: regression and classification.

Unsupervised learning
Unsupervised learning involves finding underlying patterns within data. Typically used in clustering data points (similar customers, etc.).

Reinforcement learning
Reinforcement learning is inspired by behavioral psychology. It is based on rewarding / punishing an algorithm. Rewards and punishments are based on the algorithm's actions within its environment.

5
Choosing Hypothesis / Model
Given a training set of N example input-output
(feature-label) pairs
(x1, y1), (x2, y2), ..., (xN, yN)
where each pair was generated by
y = f(x)
Ideally, we would like our model h(x) (hypothesis)
that approximates the true function f(x) to be:
h(x) = y = f(x) (consistent hypothesis)
6
Feedforward Neural Network
[Figure: feedforward network - features enter the Input layer, pass through two Hidden layers via weighted connections, and produce the output at the Output layer.]

Also called (historically): multi-layer perceptron

7
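A minimal sketch (not from the slides) of the forward pass through such a network in NumPy; the layer sizes, the sigmoid activations, and the random weights are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    # Illustrative sizes: 4 input features, two hidden layers (5 and 3 units), 1 output.
    W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
    W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)
    W3, b3 = rng.normal(size=(1, 3)), np.zeros(1)

    def forward(x):
        h1 = sigmoid(W1 @ x + b1)      # input layer -> hidden layer 1
        h2 = sigmoid(W2 @ h1 + b2)     # hidden layer 1 -> hidden layer 2
        return sigmoid(W3 @ h2 + b3)   # hidden layer 2 -> output layer

    print(forward(np.array([1.0, 0.0, 2.0, 0.5])))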
ANN as a Complex Function
In ANNs hypotheses take form of complex algebraic circuits with
tunable connection strengths (weights).
[Figure: the same network - Input layer, two Hidden layers, and Output layer connected by tunable weights.]

8
Training Neural Networks: Intuition
For every training tuple (x, y) = (feature vector, label)
 Run forward computation to find the estimate ŷ
 Run backward computation to update the weights:
 For every output node
 Compute the loss L between the true y and the estimated ŷ
 For every weight w from the hidden layer to the output layer
 Update the weight
 For every hidden node
 Assess how much blame it deserves for the current answer
 For every weight w from the input layer to the hidden layer
 Update the weight

9
Back-propagation
[Figure: three stages of one training step on a small computation graph z = f(x, y) with weights w1, w2.]
1. Feed forward: feed a labeled sample through the network to compute z.
2. Evaluate loss: Loss = z - z_expected, i.e., how "incorrect" is the result compared to the label?
3. Back-propagation: propagate ∂Loss/∂z, ∂Loss/∂x, ∂Loss/∂y backward and update the weights (use Gradient Descent).
10
Gradients and Learning Rate
 Each weight is updated by the value of the gradient (the slope in our example) weighted by a learning rate η

 A higher learning rate means we move w faster (see the one-line sketch below)

11
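A one-line sketch of that update for a single weight; the gradient value here is a made-up placeholder.

    eta = 0.1               # learning rate
    w = 0.5                 # a single weight
    dloss_dw = 0.8          # gradient of the loss w.r.t. w (placeholder value)
    w = w - eta * dloss_dw  # a higher eta takes a larger step, so w moves faster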
NNs: Derivative of the Loss
[Figure: the derivative of the loss is propagated backward from the Output layer through the Hidden layers and their weights, toward the Input layer features.]

12
Neural Networks in NLP
Let’s consider the NLP modeling we explored
so far:
 Classification
 Language Modeling
Can we apply Neural Networks?

13
Logistic Regression Sentiment Analysis
[Figure: logistic regression as a network - features in the Input layer connect through weights directly to a single Output layer node that produces a BINARY answer.]

14
Logistic Regression Sentiment Analysis
[Figure: the same sentiment classifier with a Hidden layer inserted between the Input layer and the Output layer node that produces the BINARY answer.]

15
Complex Feature Vector Relationships

Adding hidden layers can help capture non-linear relationships between features!

16
Word Embedding: Definition
Word Embedding:
a term used for the representation of words for text analysis,
typically in the form of a real-valued vector that encodes the
meaning of the word such that the words that are closer in
the vector space are expected to be similar in meaning
from Wikipedia

17
Exercise: Word2Vec
https://www.cs.cmu.edu/~dst/WordEmbeddingDemo/index.html

18
Embeddings as Input Features
[Figure: word embeddings (features learned from data) feed the Input layer, followed by a Hidden layer and an Output layer.]

Multiclass output: add more output layer nodes + use softmax (instead of sigmoid)

19
Embeddings as Input Features

20
Embeddings as Input Features

Assumption:
“3-word sentences”

21
Embeddings as Input Features
[Figure: word embeddings (features learned from data) feed the Input layer, followed by a Hidden layer and an Output layer that produces a BINARY answer.]

22
Texts in Different Sizes: Ideas
Some simple solutions:
1. Make the input the length of the longest sample
 if shorter, pad with zero embeddings
 truncate if you get longer reviews at test time
2. Create a single "sentence embedding" (the same
dimensionality as a word) to represent all the words
 take the mean of all the word embeddings
 take the element-wise max of all the word embeddings
(i.e., for each dimension, pick the max value from all words)
Both ideas are sketched in code below.
23
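A sketch of both ideas, assuming hypothetical 3-dimensional word embeddings stored in a dictionary named emb:

    import numpy as np

    emb = {"great": np.array([0.9, 0.1, 0.3]),     # hypothetical 3-d embeddings
           "movie": np.array([0.2, 0.8, 0.5]),
           "boring": np.array([-0.7, 0.4, 0.1])}

    def pad_or_truncate(words, max_len, dim=3):
        vecs = [emb[w] for w in words[:max_len]]          # truncate longer inputs
        vecs += [np.zeros(dim)] * (max_len - len(vecs))   # pad shorter ones with zero embeddings
        return np.concatenate(vecs)                       # fixed-size input vector

    def sentence_embedding(words, how="mean"):
        vecs = np.stack([emb[w] for w in words])
        return vecs.mean(axis=0) if how == "mean" else vecs.max(axis=0)  # element-wise max

    print(pad_or_truncate(["great", "movie"], max_len=4))
    print(sentence_embedding(["great", "boring", "movie"], how="max"))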
Language Models Revisited
Language Modeling: Calculating the probability of the next
word in a sequence given some history.
• N-gram based language models
• other: neural network-based?

Task: predict next word wt


given prior words wt-1, wt-2, wt-3, …
Problem: Now we’re dealing with sequences of arbitrary
length.
Solution: Sliding windows (of fixed length)

24
Neural Language Model

25
Neural LM Better Than N-Gram LM
Training data:
We've seen: I have to make sure that the cat gets fed.
Never seen: dog gets fed

Test data:
I forgot to make sure that the dog gets ___
N-gram LM can't predict "fed"!
Neural LM can use similarity of "cat" and "dog" embeddings
to generalize and predict “fed” after dog

26
Training Neural Networks

27
Training Neural Networks: Intuition
For every training tuple (x, y) = (feature vector, label)
 Run forward computation to find the estimate ŷ
 Run backward computation to update the weights:
 For every output node
 Compute the loss L between the true y and the estimated ŷ
 For every weight w from the hidden layer to the output layer
 Update the weight
 For every hidden node
 Assess how much blame it deserves for the current answer
 For every weight w from the input layer to the hidden layer
 Update the weight

28
Back-propagation
[Figure: three stages of one training step on a small computation graph z = f(x, y) with weights w1, w2.]
1. Feed forward: feed a labeled sample through the network to compute z.
2. Evaluate loss: Loss = z - z_expected, i.e., how "incorrect" is the result compared to the label?
3. Back-propagation: propagate ∂Loss/∂z, ∂Loss/∂x, ∂Loss/∂y backward and update the weights (use Gradient Descent).
29
Gradients and Learning Rate
 Each weight is updated by the value of the gradient (the slope in our example) weighted by a learning rate η

 A higher learning rate means we move w faster

30
NNs: Derivative of the Loss
[Figure: the derivative of the loss is propagated backward from the Output layer through the Hidden layers and their weights, toward the Input layer features.]

31
Convolutional Neural Networks
The name Convolutional Neural Network (CNN) indicates that the
network employs a mathematical operation called convolution.

Convolutional networks are a specialized type of neural network that
uses convolution in place of general matrix multiplication in at
least one of their layers.

CNN is able to successfully capture the spatial dependencies in an


image (data grid) through the application of relevant filters.

CNNs can reduce images (data grids) into a form which is easier to
process without losing features that are critical for getting a good
prediction.

32
Convolutional Neural Networks
[Figure: a typical CNN - convolution and pooling stages followed by flattening into a fully connected classifier. By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374]

33
Convolution: The Idea

3 x 3 Kernel / Filter

Source: https://commons.wikimedia.org/wiki/File:Convolutional_Neural_Network_NeuralNetworkFilter.gif

34
Kernel / Filter: The Idea

3 x 3 Kernel / Filter

Source: https://commons.wikimedia.org/wiki/File:Convolution_arithmetic_-_Padding_strides.gif

35
Convoluting Matrices
Convolution (and Convolutional Neural Networks) can be applied
to any grid-like data (tensors: matrices, vectors, etc.).

kernel        data
0 1 0         0 2 3
1 1 1   conv  2 4 1
0 1 0         0 3 0

"overlay" (element-wise product):
0*0  1*2  0*3
1*2  1*4  1*1
0*0  1*3  0*0

sum = 12

36
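A sketch of the same "overlay and sum" operation in NumPy, using the kernel and data above, plus a simple valid convolution that slides the kernel over a larger grid:

    import numpy as np

    kernel = np.array([[0, 1, 0],
                       [1, 1, 1],
                       [0, 1, 0]])
    data = np.array([[0, 2, 3],
                     [2, 4, 1],
                     [0, 3, 0]])

    print(np.sum(kernel * data))   # element-wise "overlay" then sum -> 12

    def conv2d_valid(image, k):
        kh, kw = k.shape
        oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = np.sum(image[i:i+kh, j:j+kw] * k)  # slide the kernel
        return out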
Selected Image Processing Kernels

Sharpen Mean Blur Gaussian Blur

Laplacian Prewitt (Edge) Prewitt (Edge)

37
Image Processing: Kernels / Filters

38
Applying Kernels / Filters

3 x 3 Kernel / Filter

39
Convolutional NN Kernels
In practice, Convolutional Neural Network kernels can be larger than
3x3 and are learned using back propagation.

Convolution Layer 1 Convolution Layer 2 Convolution Layer 3

40
Convolution Layer 1

Kernel 1

41
Convolution Layer 1

Kernel 2
Kernel 1

42
Convolution Layer 1

Original image
Kernel 3

Kernel 2
Kernel 1

Convolution 1

43
Convolutional Neural Networks

By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374

44
Max Pooling Layer
Convolution 1

Max Pooling

45
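A sketch of 2x2 max pooling with stride 2, a common choice (the feature-map values are made up):

    import numpy as np

    def max_pool2x2(x):
        h, w = x.shape
        # group the map into 2x2 blocks and keep the maximum of each block
        return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    feature_map = np.array([[1, 3, 2, 0],
                            [4, 6, 1, 1],
                            [0, 2, 5, 7],
                            [1, 2, 3, 4]])
    print(max_pool2x2(feature_map))   # [[6, 2], [2, 7]]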
Convolutional Neural Networks

By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374

46
Convolution Layer 2

Original convolution
after pooling Kernel C

Kernel B
Kernel A

Convolution A

47
Convolutional Neural Networks

By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374

48
Flattening
Final output of convolution layers is “flattened” to become a vector of features.

Convert to
vector

Final convolution layer output

Source: https://nikolanews.com/not-just-introduction-to-convolutional-neural-networks-part-1/

49
Recurrent Neural Networks
Recurrent Neural Networks (RNNs) allow cycles in the computational graph
(network). A network node (unit) can take its own output from an earlier step as
input (with delay introduced).
Enables having an internal state / memory: inputs received earlier affect the RNN's
response to the current input.

50
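A minimal sketch of one recurrent layer, where the hidden state h is the internal memory that carries earlier inputs forward; the sizes and the tanh activation are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hidden = 4, 3
    W_xh = rng.normal(size=(d_hidden, d_in))      # input -> hidden weights
    W_hh = rng.normal(size=(d_hidden, d_hidden))  # hidden -> hidden (the recurrent cycle)
    b = np.zeros(d_hidden)

    def rnn(inputs):
        h = np.zeros(d_hidden)                    # internal state / memory
        for x_t in inputs:                        # earlier inputs affect later steps via h
            h = np.tanh(W_xh @ x_t + W_hh @ h + b)
        return h

    print(rnn([rng.normal(size=d_in) for _ in range(5)]))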
Long Short-Term Memory (LSTM)
Long short-term memory (LSTM) is an artificial neural network. Unlike standard
feedforward neural networks, LSTM has feedback connections. Such a recurrent
neural network (RNN) can process not only single data points (such as images), but
also entire sequences of data (such as speech or video). This characteristic makes
LSTM networks ideal for processing and predicting data.

51
Large Language Model (LLM)
A large language model (LLM) is a language model
consisting of a neural network with many
parameters (typically billions of weights or more),
trained on large quantities of unlabeled text using
self-supervised learning.

Source: Wikipedia

52
Generative Pre-trained Transformer 3
What is it?
Generative Pre-trained Transformer 3 (GPT-3) is an
autoregressive language model that uses deep learning
to produce human-like text. It is the third-generation
language prediction model in the GPT-n series (and the
successor to GPT-2) created by OpenAI, a San Francisco-
based artificial intelligence research laboratory.

Size:
175 billion machine learning parameters
~45 GB
Source: Wikipedia

53
Parameters? What Are Those?
[Figure: the weights on the connections between nodes i and j across the Input, Hidden, and Output layers are the network's parameters.]

54
Transformer Architecture

55
GPT-4 Architecture

Source: TheAiEdge.io

56
Self-Attention
In artificial neural networks, attention is a technique that is meant to mimic
cognitive attention. The effect enhances some parts of the input data while
diminishing other parts — the motivation being that the network should devote
more focus to the important parts of the data, even though they may be small.
Learning which part of the data is more important than another depends on the
context, and this is trained by gradient descent.

Source: Park et al. – “SANVis: Visual Analytics for Understanding Self-Attention Networks”

57
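A sketch of single-head scaled dot-product self-attention, the core operation behind this idea in Transformer models; the matrix sizes and random projections are illustrative assumptions.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, W_q, W_k, W_v):
        Q, K, V = X @ W_q, X @ W_k, X @ W_v          # queries, keys, values
        scores = Q @ K.T / np.sqrt(K.shape[-1])      # how much each token attends to the others
        return softmax(scores, axis=-1) @ V          # weighted sum of the values

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))                      # 5 tokens, 8-dimensional embeddings
    W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
    print(self_attention(X, W_q, W_k, W_v).shape)    # (5, 8)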
Generative Pre-trained Transformer 4
What is it?
Generative Pre-trained Transformer 4 (GPT-4) is a
multimodal large language model created by OpenAI. As a
transformer, GPT-4 was pretrained to predict the next
token (using both public data and "data licensed from
third-party providers"), and was then fine-tuned with
reinforcement learning from human and AI feedback for
human alignment and policy compliance.
Size:
1 trillion machine learning parameters

Source: Wikipedia

58
Large Language Models Data Sources

Source: Zhao et al. – “A Survey of Large Language Models” [2023]

59
LLM Data Pre-Processing Pipeline

Source: Zhao et al. – “A Survey of Large Language Models” [2023]

60
ChatGPT
What is it?
ChatGPT is a chatbot developed by OpenAI and released in
November 2022. It is built on top of OpenAI's GPT-3.5 and
GPT-4 families of large language models (LLMs) and has
been fine-tuned (an approach to transfer learning) using
both supervised and reinforcement learning techniques.

Source: Wikipedia

61
Transfer Learning
In transfer learning, experience with one
learning task helps an agent learn better on
another task.

Pre-trained models can be used as a starting


point for developing new models.

62
Encoding Word Relationships:
Vector Representations
Word Embeddings

63
Challenge
 We know word relationships exist
 How can we quantify them in an automated
fashion?
 How do we represent them in a numerical
way?
 How can we use them in computational
models and processes?

64
Vector Semantics: Two Ideas
 Idea 1:
 Let's define the meaning of a word by its
distribution in language use (neighboring
words or grammatical environments)

 Idea 2:
 Let's define the meaning of a word as a
point in space
65
Bag of Words: Strings Representation
Some document:
I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!

Word:       Frequency:
it          6
I           5
the         4
to          3
and         3
seen        2
yet         1
whimsical   1
times       1
...         ...

Bag of words assumption: word/token position does not matter.


66
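A sketch of building such a frequency table with Python's standard library (the tokenization is deliberately crude, and only part of the review text is used):

    from collections import Counter
    import re

    doc = ("I love this movie! It's sweet, but with satirical humor. "
           "I've seen it several times, and I'm always happy to see it again "
           "whenever I have a friend who hasn't seen it yet!")

    tokens = re.findall(r"[a-zA-Z']+", doc.lower())   # word positions are thrown away
    bag = Counter(tokens)
    print(bag.most_common(5))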
Bag of Words: Meaning Ignored!
Some document:
I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!

Word:       Frequency:
it          6
I           5
the         4
to          3
and         3
seen        2
yet         1
whimsical   1
times       1
...         ...

Bag of words assumption: word/token position does not matter.


67
Connotation as a Point in Space
 Words seem to vary along three affective DIMENSIONS:
 valence: the pleasantness of the stimulus
 arousal: the intensity of emotion provoked by the stimulus
 dominance: the degree of control exerted by the stimulus

Dimension    Word         Score    Word        Score
valence      love         1.000    toxic       0.008
             happy        1.000    nightmare   0.005
arousal      elated       0.960    mellow      0.069
             frenzy       0.965    napping     0.046
dominance    powerful     0.991    weak        0.045
             leadership   0.983    empty       0.081

Source: NRC VAD Lexicon (https://saifmohammad.com/WebPages/nrc-vad.html)

68
Vector Semantics
 The idea:
 represent a word as a point in a
multidimensional semantic space that is
derived from the distributions of word
neighbors

69
Point in Space Based on Distribution
 Each word = a vector
 not just "good" or "word45"
 Similar words: “nearby in semantic space"
 We build this space automatically by seeing
which words are nearby in text

70
Vector Semantics: Words as Vectors

Source: Signorelli, Camilo & Arsiwalla, Xerxes. (2019). Moral Dilemmas for Artificial Intelligence: a position paper on an application of
Compositional Quantum Cognition

71
Word Embedding: Definition
Word Embedding:
a term used for the representation of words for text analysis,
typically in the form of a real-valued vector that encodes the
meaning of the word such that the words that are closer in
the vector space are expected to be similar in meaning
from Wikipedia

72
Word Embedding
 Embedding:
 “embedded into a space”
 mapping from one space or structure to
another
 The standard way to represent meaning in
NLP
 Fine-grained model of meaning for
similarity

73
The Why: Sentiment Analysis
 Using words only:
 a feature is a word identity
 for example: "the previous word was ____" for one specific word
 requires the exact same word to be in training and
test

74
The Why: Sentiment Analysis
 Using embeddings:
 a feature is a word vector
 the previous word was vector [35, 22, 17]
 now in the test set we might see a similar
vector [34, 21, 14]
 we can generalize to similar but unseen words

75
Term-Document Matrix
 Each document is represented by a vector
of words

76
Term-Document Matrix
 Vectors are similar for the two comedies
 “As you like it” and “Twelfth Night”

 But comedies are different than the other


two
 more fools and wit and fewer battles
77
Term-Document Matrix
 Vectors are similar for the two comedies
 “As you like it” and “Twelfth Night”

 But comedies are different than the other


two
 more fools and wit and fewer battles
78
Document Vector Visualization

79
Words as Vectors
 battle is "the kind of word that occurs in
Julius Caesar and Henry V"

 fool is "the kind of word that occurs in


comedies, especially Twelfth Night"

80
Word-Word (Term-Context) Matrix
 Two words are similar in meaning if their
context vectors are similar

81
Document Vector Visualization

82
Document Vector Visualization

Note vector
length and
direction
83
Vector Dot / Scalar Product
Given two vectors a and b (N - vector space dimension):
a = [a1, a2, ..., aN] and b = [b1, b2, ..., bN]
their vector dot/scalar product is:
a · b = a1*b1 + a2*b2 + ... + aN*bN
Using matrix representation:
a · b = aᵀ b  (a row vector times a column vector)

84
Vector Dot / Scalar Product
 Vector dot/scalar product is a scalar:

 Vector dot/scalar:
 high values when the two vectors have large
values in the same dimensions
 useful similarity measure

85
Vectors and Dot / Scalar Product

86
Vector Dot / Scalar Product: Problem
 Dot product favors long vectors: higher if a vector is
longer (has higher values in many dimensions)
 Vector length:
|a| = sqrt(a1² + a2² + ... + aN²)
 Frequent words (of, the, you) have long vectors (since
they occur many times with other words).
 dot product overly favors frequent words
87
Alternative: Cosine Similarity
Euclidean distance Cosine similarity

88
Word Similarity | Cosine Similarity

cosine(v, w) = (v · w) / (|v| |w|)
Where: v and w are two different word vectors


89
Word Similarity | Cosine Similarity
 -1: vectors point in opposite directions
 +1: vectors point in same directions
 0: vectors are orthogonal

But since raw frequency values are non-negative, the


cosine for term-term matrix vectors ranges from 0–1
90
Word Similarity
 Two words are similar in meaning if their
context vectors are similar

91
Word Similarity Visualization

92
Word Similarity | Cosine Similarity
             pie    data   computer
cherry       442       8          2
digital        5    1683       1670
information    5    3982       3325

Low similarity:  cosine(cherry, information)
High similarity: cosine(digital, information)

93
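A sketch that reproduces this comparison from the raw counts in the table: cosine(cherry, information) comes out low (about 0.02) while cosine(digital, information) comes out high (about 0.99).

    import numpy as np

    def cosine(v, w):
        return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

    #                        pie   data  computer
    cherry      = np.array([ 442,    8,    2])
    digital     = np.array([   5, 1683, 1670])
    information = np.array([   5, 3982, 3325])

    print(cosine(cherry, information))   # low similarity (~0.017)
    print(cosine(digital, information))  # high similarity (~0.996)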
Cosine Similarity Visualization

94
Words as Vectors: Issues
 We saw how to build vectors to represent
words:
 one-hot encoding:
 binary, count, tf*idf
 Some problems
 Large dimensionality of word vectors
 Lack of meaningful relationships between
words

95
Vector Embeddings: Methods
 tf-idf
 popular in Information Retrieval
 sparse vectors
 word represented by (a simple function of) the
counts of nearby words

 Word2vec
 dense vectors
 representation is created by training a classifier to
predict whether a word is likely to appear nearby

96
Sparse vs. Dense Vectors
 Sparse vectors have a lot of values set to
zero.
[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2]
 Dense vector: most of the values are non-
zero.
 better use of storage
 carries more information
[3, 1, 5, 0, 1, 4, 9, 8, 7, 1, 1, 2, 2, 2, 2]
97
Sparse vs. Dense Vectors
 tf-idf vectors are typically:
 long (length 20,000 to 50,000)
 sparse (most elements are zero)

 What if we could learn vectors that are


 short (length 50-1000)
 dense (most elements are non-zero)

98
Short / Dense Vectors: Benefits
 Why short/dense vectors?
 short vectors may be easier to use as features in
machine learning (fewer weights to tune)
 dense vectors may generalize better than explicit
counts
 dense vectors may do better at capturing synonymy:
 car and automobile are synonyms; but are distinct
dimensions
 a word with car as a neighbor and a word with
automobile as a neighbor should be similar, but aren't
 In practice, they work better

99
Short/Dense Vectors: Methods
 “Neural Language Model”-inspired models
 Word2vec, GloVe

 Singular Value Decomposition (SVD)


 A special case of this is called LSA – Latent Semantic
Analysis
 Alternative to these "static embeddings":
 Contextual Embeddings (ELMo, BERT)
 Compute distinct embeddings for a word in its
context
 Separate embeddings for each token of a word
100
Word2Vec

101
Language Models: Application

we want to predict the "rest" of the query

102
N-gram Language Models
General Maximum Likelihood Estimation (MLE) of
an N-gram:

P(wN | wN-K+1 wN-K+2 ... wN-1) = C(wN-K+1 wN-K+2 ... wN-1 wN) / C(wN-K+1 wN-K+2 ... wN-1)

where:
wi - ith word / token

In MLE, the resulting parameter set maximizes the likelihood of the


training set T given the model M (i.e., P(T | M)).

103
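A sketch of the bigram case of this estimate, P(wi | wi-1) = C(wi-1 wi) / C(wi-1), on a tiny toy corpus (the corpus text is an illustrative assumption):

    from collections import Counter

    corpus = "thou shalt not bear false witness thou shalt not kill".split()

    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p_mle(word, prev):
        return bigrams[(prev, word)] / unigrams[prev]   # C(w_{i-1} w_i) / C(w_{i-1})

    print(p_mle("not", "shalt"))   # 2/2 = 1.0
    print(p_mle("bear", "not"))    # 1/2 = 0.5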
N-gram Language Models
General Maximum Likelihood Estimation (MLE) of
an N-gram:
P(wN | wN-K+1 wN-K+2 ... wN-1) = C(wN-K+1 wN-K+2 ... wN-1 wN) / C(wN-K+1 wN-K+2 ... wN-1)
        <---- thou shalt ---->
wN = "rest" | next query word

where:
wi - ith word / token

In MLE, the resulting parameter set maximizes the likelihood of the


training set T given the model M (i.e., P(T | M)).

104
N-gram Language Models
General Maximum Likelihood Estimation (MLE) of
an N-gram:
P(wN | wN-K+1 wN-K+2 ... wN-1) = C(wN-K+1 wN-K+2 ... wN-1 wN) / C(wN-K+1 wN-K+2 ... wN-1)
        <---- thou shalt ---->
wN = "rest" | next query word

where:
wi - ith word / token

Looks at PAST words to predict the NEXT word (the word with the highest P())!

In MLE, the resulting parameter set maximizes the likelihood of the


training set T given the model M (i.e., P(T | M)).

105
N-gram Language Models: Prediction
Given:
S: wordt-N ... wordt-2 wordt-1 _____

[Figure: the prior words wordt-N, wordt-N+1, ..., wordt-2, wordt-1 are the Input to the Model, whose Output is wordt.]

106
N-gram Language Models: Prediction
Given:
S: wordt-N ... wordt-2 wordt-1 _____

[Figure: the same diagram - the prior words wordt-N, ..., wordt-2, wordt-1 are the Features fed to the Model, and wordt is its Prediction.]

107
N-gram Language Models: Prediction
Given (N = 2):
S: thou shalt _____

[Figure: Input "thou shalt" goes into the Model; the Output is P(S) for every candidate next word, e.g.:]
anchor  P(S) = 0.0001
...
not     P(S) = 0.8
...
zebra   P(S) = 0.0002

108
N-gram Language Models: Prediction
Given (N = 2):
S: thou shalt not _____

[Figure: Input "shalt not" goes into the Model; the Output is P(S) for every candidate next word, e.g.:]
anchor  P(S) = 0.0001
...
bear    P(S) = 0.45
...
zebra   P(S) = 0.0002

109
N-gram Language Models: Prediction
Given (N = 2):
S: thou shalt not bear _____

[Figure: Input "not bear" goes into the Model; the Output is P(S) for every candidate next word, e.g.:]
anchor  P(S) = 0.0001
...
false   P(S) = 0.4
...
zebra   P(S) = 0.0002

110
N-gram Language Models: Prediction
Given (N = 2):
S: thou shalt not bear false _____

[Figure: Input "bear false" goes into the Model; the Output is P(S) for every candidate next word, e.g.:]
anchor  P(S) = 0.0001
...
witness P(S) = 0.75
...
zebra   P(S) = 0.0002

111
Trained Language Models: Prediction
Given input and a model (word embeddings):
S: thou shalt _____

[Figure: Input "thou shalt" -> look up the word embeddings -> calculate predictions -> project to the output vocabulary; the Output is P(S) for every candidate next word, e.g.:]
anchor  P(S) = 0.0001
...
not     P(S) = 0.8
...
zebra   P(S) = 0.0002

112
N-gram Language Models
N-gram language model will handle cases such as:
Tomorrow is ______

but not:
Tomorrow is ______ to be

where:
 context words
 a word to be predicted: target word
113
N-gram Language Models
N-gram language model will handle cases such as:
Tomorrow is ______

but not:
Tomorrow is ______ to be
(here the blank is the target word and the surrounding words are the context words)

114
Predicting the Missing Word
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be

How would you go about it?

115
Word2Vec: Idea

DON’T count - Predict!

116
Word2Vec: Idea
 Instead of counting how often each word w occurs near
"apricot"
 Train a classifier on a binary prediction task:
 Is w likely to show up near "apricot"?
 We don’t actually care about this task
 but we'll take the learned classifier weights as the word
embeddings
 Use self-supervision:
 A word c that occurs near “apricot” in the corpus acts as the
gold "correct answer" for supervised learning
 No need for human labels

117
Available Tools
 Word2vec (Mikolov et al)
https://code.google.com/archive/p/word2vec/

 GloVe (Pennington, Socher, Manning)


http://nlp.stanford.edu/projects/glove/

118
Word2Vec: the Approach
1. Treat the target word t and a neighboring context
word c as positive examples.
2. Randomly sample other words in the lexicon to
get negative examples
3. Use logistic regression to train a classifier to
distinguish those two cases
4. Use the learned classifier weights as the
embeddings

119
Word2Vec: the Approach
Given the set of positive and negative training
instances, and an initial set of embedding vectors

The goal of learning is to adjust those word vectors


such that we:
 maximize the similarity of the target word,
context word pairs (w, cpos) drawn from the
positive data
 minimize the similarity of the (w, cneg) pairs
drawn from the negative data.
120
Predicting the Missing Word
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be

Two approaches possible:


 use the context words to predict the target word
 use the target word to predict context words

121
Sliding Window
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be

Let’s generalize it a bit:


wordt-2 wordt-1 wordt wordt+1 wordt+2

122
Sliding Window
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be

Let’s generalize it even further:


wordt-N ... wordt-2 wordt-1 wordt wordt+1 wordt+2 ... wordt+N

123
Sliding Window
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be

But we don’t need to look at ALL words in:


wordt-N ... wordt-2 wordt-1 wordt wordt+1 wordt+2 ... wordt+N

We can reduce the size of the context:


wordt-N ... wordt-2 wordt-1 wordt wordt+1 wordt+2 ... wordt+N
sliding window +/- 2

124
Predicting the Missing Word
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be

Two approaches possible:


 use the context words to predict the target word:
Continuous Bag of Words model (CBOW)
 use the target word to predict context words:
Skip Gram model

125
CBOW Word2Vec
Given (window size 2):
wordt-N ... wordt-2 wordt-1 _____ wordt+1 wordt+2 ... wordt+N

[Figure: CBOW - the context words wordt-2, wordt-1, wordt+1, wordt+2 are the Input, their vectors are summed in the Projection layer, and the Output is the target word wordt.]

126
Skip Gram Word2Vec
Given (window size 2):
wordt-N ... wordt-2 wordt-1 _____ wordt+1 wordt+2 ... wordt+N

[Figure: Skip Gram - the target word wordt is the Input, it passes through the Projection layer, and the Output is the context words wordt-2, wordt-1, wordt+1, wordt+2.]

127
Skip Gram Word2Vec
Predict context given target word:
thou shalt _not_ bear false witness

[Figure: Skip Gram with target word "not" - "not" is the Input, it passes through the Projection layer, and the Output is the context words thou, shalt, bear, false.]

128
Skip Gram Word2Vec
Predict context given target word:
thou shalt _not_ bear false witness
[Figure: the same Skip Gram network - for target word "not" the Output layer scores each context word, e.g.:]
thou:   P(+|not, thou),  P(-|not, thou)
shalt:  P(+|not, shalt), P(-|not, shalt)
bear:   P(+|not, bear),  P(-|not, bear)
false:  P(+|not, false), P(-|not, false)

129
Word2Vec: the Approach
1. Treat the target word t and a neighboring context
word c as positive examples.
2. Randomly sample other words in the lexicon to
get negative examples
3. Use logistic regression to train a classifier to
distinguish those two cases
4. Use the learned classifier weights as the
embeddings

130
Word2Vec: the Approach
Given the set of positive and negative training
instances, and an initial set of embedding vectors

The goal of learning is to adjust those word vectors


such that we:
 maximize the similarity of the target word,
context word pairs (w, cpos) drawn from the
positive data
 minimize the similarity of the (w, cneg) pairs
drawn from the negative data.
131
Word2Vec: the Approach

132
Word2Vec: the Approach

133
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence: target word

…lemon, a [tablespoon of apricot jam, a] pinch…

context words

134
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4

135
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4

Goal 1: train a classifier that is given a candidate


(word, context word) pair: (apricot, jam), (apricot,
aardvark), etc.

136
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4

Goal 2: assign probabilities to every (word, context


word) pair:
P(+ | w, ci) and
P(- | w, ci) = 1 - P(+ | w, ci)

137
Target and Context Embeddings
[Figure: two embedding matrices, Target and Context, each with one row per vocabulary word (aardvark, ..., not, ..., shalt, ..., thou, ..., zebra / zone); each matrix has Vocabulary size |V| rows and Embedding size d columns.]

138
Intuition: Target & Context Similar
[Figure: the same Target and Context matrices (|V| rows, embedding size d).]

139
Cosine Similarity Visualization

Two vectors are similar if they have


a high dot product | cosine similarity
140
Intuition: Target & Context Similar
[Figure: the target row w (the row for "not") in the Target matrix and the context row c (the row for "shalt") in the Context matrix. Target & Context are similar when c · w is high.]

141
Intuition: Target & Context Similar
[Figure: the same rows w and c, with Similarity(c, w) = c · w.]

142
Intuition: Target & Context Similar
[Figure: the same rows w and c. Similarity(c, w) = c · w is NOT a PROBABILITY though!]

143
Intuition: Target & Context Similar
[Figure: the same rows w and c. To turn Similarity(c, w) = c · w into a probability, use the sigmoid function!]

144
Similarity → Probability

P(+ | w, c) = σ(c · w) = 1 / (1 + exp(-c · w))

P(- | w, c) = 1 - P(+ | w, c) = σ(-c · w) = 1 / (1 + exp(c · w))

145
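A sketch of these two formulas in code; the embedding values are made up.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def p_positive(w, c):
        return sigmoid(np.dot(c, w))     # P(+ | w, c) = sigma(c . w)

    w = np.array([0.2, -0.4, 0.7])       # hypothetical target embedding
    c = np.array([0.1, -0.3, 0.9])       # hypothetical context embedding
    print(p_positive(w, c))              # P(+ | w, c)
    print(1 - p_positive(w, c))          # P(- | w, c) = sigma(-c . w)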
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
P(+ | w, c1) P(+ | w, c2) P(+ | w, c3) P(+ | w, c4)
P(- | w, c1) P(- | w, c2) P(- | w, c3) P(- | w, c4)

OK, but we have lots of possible context words!

146
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
P(+ | w, c1) P(+ | w, c2) P(+ | w, c3) P(+ | w, c4)
P(- | w, c1) P(- | w, c2) P(- | w, c3) P(- | w, c4)

Assuming word independence, calculate:


P(+ | w, c1, c2, c3, c4) = P(+ | w, c1) · P(+ | w, c2) · P(+ | w, c3) · P(+ | w, c4)

147
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
P(+ | w, c1) P(+ | w, c2) P(+ | w, c3) P(+ | w, c4)
P(- | w, c1) P(- | w, c2) P(- | w, c3) P(- | w, c4)

In general:
P(+ | w, c1:L) = ∏ i=1..L P(+ | w, ci)

148
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
P(+ | w, c1) P(+ | w, c2) P(+ | w, c3) P(+ | w, c4)
P(- | w, c1) P(- | w, c2) P(- | w, c3) P(- | w, c4)

In general [with sums instead of products]:


log P(+ | w, c1:L) = Σ i=1..L log P(+ | w, ci)

149
Skip Gram Classifier: Summary
A probabilistic classifier, given
 a test target word w
 its context window of L words c1:L
Estimates probability that w occurs in this window
based on similarity of w (embeddings) to c1:L
(embeddings).

To compute this, we just need embeddings for all


the words.
150
Parameters: Target (W) and Context (C)

151
Word2Vec: the Approach
1. Treat the target word t and a neighboring context
word c as positive examples.
2. Randomly sample other words in the lexicon to
get negative examples
3. Use logistic regression to train a classifier to
distinguish those two cases
4. Use the learned classifier weights as the
embeddings

152
Word2Vec: Training
Assume a +/- 2 (L = 4) word window, given training
sentence:

…lemon, a [tablespoon of apricot jam, a] pinch…

Positive (+) examples:


(apricot, tablespoon),(apricot, of),(apricot, jam),(apricot, a)
Negative (-) examples: K per positive pair (typically double the number of (+) examples):
(apricot, aardvark),(apricot, my),(apricot, where),(apricot, coaxial)
(apricot, seven),(apricot, forever),(apricot, dear),(apricot, if)

153
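A sketch of generating these examples for the target word "apricot"; the small vocabulary and the uniform negative sampling (rather than the weighted unigram sampling word2vec actually uses) are simplifying assumptions.

    import random

    random.seed(0)
    sentence = "lemon a tablespoon of apricot jam a pinch".split()
    vocab = ["aardvark", "my", "where", "coaxial", "seven", "forever", "dear", "if"]

    def training_examples(tokens, i, window=2, k=2):
        target = tokens[i]
        context = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
        positives = [(target, c) for c in context]                   # (+) examples
        negatives = [(target, random.choice(vocab))                  # (-) examples,
                     for _ in range(k * len(positives))]             # k per (+) pair
        return positives, negatives

    pos, neg = training_examples(sentence, sentence.index("apricot"))
    print(pos)   # [('apricot', 'tablespoon'), ('apricot', 'of'), ('apricot', 'jam'), ('apricot', 'a')]
    print(neg)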
Word2Vec: the Approach
Given the set of positive and negative training
instances, and an initial set of embedding vectors

The goal of learning is to adjust those word vectors


such that we:
 maximize the similarity of the target word,
context word pairs (w, cpos) drawn from the
positive data
 minimize the similarity of the (w, cneg) pairs
drawn from the negative data.
154
Loss Function
Loss function for one w with cpos , cneg1 ...cnegk
Maximize the similarity of the target with the actual
context words (+), and minimize the similarity of the target
with the k negative sampled non-neighbor words (-).

155
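A sketch of this loss for one target word w, one positive context c_pos, and k sampled negative contexts, written as L = -[log σ(c_pos · w) + Σi log σ(-c_negi · w)] (the standard skip-gram negative-sampling form; the embedding values below are made up):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def neg_sampling_loss(w, c_pos, c_negs):
        # maximize similarity with the real neighbor, minimize it with the sampled ones
        loss = -np.log(sigmoid(np.dot(c_pos, w)))
        for c_neg in c_negs:
            loss -= np.log(sigmoid(-np.dot(c_neg, w)))
        return loss

    w = np.array([0.3, -0.2, 0.8])                    # hypothetical embeddings
    c_pos = np.array([0.25, -0.1, 0.9])
    c_negs = [np.array([-0.6, 0.5, 0.1]), np.array([0.0, 0.7, -0.4])]
    print(neg_sampling_loss(w, c_pos, c_negs))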
Classifier: Learning Process
 How to learn?
 use stochastic gradient descent

 Adjust the word weights to:


 make the positive pairs more likely
 and the negative pairs less likely,
 ... for the entire training set.

156
Gradient Descent: Single Step

157
Loss Function Derivatives

158
Gradient Descent: Updates
Start with randomly initialized W and C matrices
[Figure: the Target (W) and Context (C) matrices, each of size |V| x d (one row per vocabulary word, aardvark through zebra / zone), filled with random initial values.]

159
Gradient Descent: Updates
... then incrementally do updates using gradient descent steps scaled by the
learning rate η.

160
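A sketch of one such update step. The gradient expressions come from differentiating the negative-sampling loss above; treat them as an assumption if the course notation differs.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgd_step(w, c_pos, c_negs, eta=0.1):
        # gradients of the negative-sampling loss for one (w, c_pos, c_neg1..k) example
        grad_w = (sigmoid(np.dot(c_pos, w)) - 1) * c_pos
        new_c_pos = c_pos - eta * (sigmoid(np.dot(c_pos, w)) - 1) * w
        new_c_negs = []
        for c_neg in c_negs:
            grad_w += sigmoid(np.dot(c_neg, w)) * c_neg
            new_c_negs.append(c_neg - eta * sigmoid(np.dot(c_neg, w)) * w)
        return w - eta * grad_w, new_c_pos, new_c_negs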
Skip Gram Word2Vec: Summary
 Start with |V| random d-dimensional vectors as initial
embeddings
 Train a classifier based on embedding similarity
 Take a corpus and take pairs of words that co-occur as positive
examples
 Take pairs of words that don't co-occur as negative examples
 Train the classifier to distinguish these by slowly adjusting all
the embeddings to improve the classifier performance
 Throw away the classifier code and keep the embeddings.

161
Sliding Window Size
 Small windows (+/- 2) : nearest words are
syntactically similar words in same taxonomy
 Hogwarts nearest neighbors are other fictional
schools
 Sunnydale, Evernight, Blandings

 Large windows (+/- 5) : nearest words are


related words in same semantic field
 Hogwarts nearest neighbors are Harry Potter world:
 Dumbledore, half-blood, Malfoy
162
