
CS 585

Natural Language Processing

October 15, 2024


Announcements / Reminders
 Please follow the Week 08 To Do List instructions (if you haven't
already)

 Programming Assignment #02 due on Sunday (10/20/24) at 11:59 PM CST
(extended from the original Sunday 10/13/24 due date)

2
Plan for Today
 Introduction to Neural Networks

3
Introduction to Neural Networks

4
Main Machine Learning Categories
Supervised learning
Supervised learning is one of the most common techniques in machine learning. It is based on known relationship(s) and patterns within data (for example: the relationship between inputs and outputs). Frequently used types: regression and classification.

Unsupervised learning
Unsupervised learning involves finding underlying patterns within data. Typically used in clustering data points (similar customers, etc.).

Reinforcement learning
Reinforcement learning is inspired by behavioral psychology. It is based on rewarding / punishing an algorithm. Rewards and punishments are based on the algorithm's actions within its environment.

5
Choosing Hypothesis / Model
Given a training set of N example input-output
(feature-label) pairs
(x1, y1), (x2, y2), ..., (xN, yN)
where each pair was generated by
y = f(x)
Ideally, we would like our model h(x) (hypothesis)
that approximates the true function f(x) to be:
h(x) = y = f(x) (consistent hypothesis)
6
Feedforward Neural Network
[Figure: feedforward network - features enter the Input layer, pass through two Hidden layers via weighted connections, and produce the output at the Output layer.]

Also called (historically): multi-layer perceptron

7
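A minimal sketch (not from the slides) of the forward pass through such a network in NumPy; the layer sizes, the sigmoid activations, and the random weights are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    # Illustrative sizes: 4 input features, two hidden layers (5 and 3 units), 1 output.
    W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
    W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)
    W3, b3 = rng.normal(size=(1, 3)), np.zeros(1)

    def forward(x):
        h1 = sigmoid(W1 @ x + b1)      # input layer -> hidden layer 1
        h2 = sigmoid(W2 @ h1 + b2)     # hidden layer 1 -> hidden layer 2
        return sigmoid(W3 @ h2 + b3)   # hidden layer 2 -> output layer

    print(forward(np.array([1.0, 0.0, 2.0, 0.5])))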
ANN as a Complex Function
In ANNs hypotheses take form of complex algebraic circuits with
tunable connection strengths (weights).
[Figure: the same network - Input layer, two Hidden layers, and Output layer connected by tunable weights.]

8
Training Neural Networks: Intuition
For every training tuple (x, y) = (feature vector, label)
 Run forward computation to find the estimate ŷ
 Run backward computation to update the weights:
 For every output node
 Compute the loss L between the true y and the estimated ŷ
 For every weight w from the hidden layer to the output layer
 Update the weight
 For every hidden node
 Assess how much blame it deserves for the current answer
 For every weight w from the input layer to the hidden layer
 Update the weight

9
Back-propagation
[Figure: three stages of one training step on a small computation graph z = f(x, y) with weights w1, w2.]
1. Feed forward: feed a labeled sample through the network to compute z.
2. Evaluate loss: Loss = z - z_expected, i.e., how "incorrect" is the result compared to the label?
3. Back-propagation: propagate ∂Loss/∂z, ∂Loss/∂x, ∂Loss/∂y backward and update the weights (use Gradient Descent).
10
Gradients and Learning Rate
 Each weight is updated by the value of the gradient (the slope in our example) weighted by a learning rate η

 A higher learning rate means we move w faster (see the one-line sketch below)

11
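A one-line sketch of that update for a single weight; the gradient value here is a made-up placeholder.

    eta = 0.1               # learning rate
    w = 0.5                 # a single weight
    dloss_dw = 0.8          # gradient of the loss w.r.t. w (placeholder value)
    w = w - eta * dloss_dw  # a higher eta takes a larger step, so w moves faster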
NNs: Derivative of the Loss
[Figure: the derivative of the loss is propagated backward from the Output layer through the Hidden layers and their weights, toward the Input layer features.]

12
Neural Networks in NLP
Let’s consider the NLP modeling we explored
so far:
 Classification
 Language Modeling
Can we apply Neural Networks?

13
Logistic Regression Sentiment Analysis
[Figure: logistic regression as a network - features in the Input layer connect through weights directly to a single Output layer node that produces a BINARY answer.]

14
Logistic Regression Sentiment Analysis
[Figure: the same sentiment classifier with a Hidden layer inserted between the Input layer and the Output layer node that produces the BINARY answer.]

15
Complex Feature Vector Relationships

Adding hidden layers can help capture non-linear relationships between features!

16
Word Embedding: Definition
Word Embedding:
a term used for the representation of words for text analysis,
typically in the form of a real-valued vector that encodes the
meaning of the word such that the words that are closer in
the vector space are expected to be similar in meaning
from Wikipedia

17
Exercise: Word2Vec
https://www.cs.cmu.edu/~dst/WordEmbeddingDemo/index.html

18
Embeddings as Input Features
[Figure: word embeddings (features learned from data) feed the Input layer, followed by a Hidden layer and an Output layer.]

Multiclass output: add more output layer nodes + use softmax (instead of sigmoid)

19
Embeddings as Input Features

20
Embeddings as Input Features

Assumption:
“3-word sentences”

21
Embeddings as Input Features
[Figure: word embeddings (features learned from data) feed the Input layer, followed by a Hidden layer and an Output layer that produces a BINARY answer.]

22
Texts in Different Sizes: Ideas
Some simple solutions:
1. Make the input the length of the longest sample
 if shorter, pad with zero embeddings
 truncate if you get longer reviews at test time
2. Create a single "sentence embedding" (the same
dimensionality as a word) to represent all the words
 take the mean of all the word embeddings
 take the element-wise max of all the word embeddings
(i.e., for each dimension, pick the max value from all words)
Both ideas are sketched in code below.
23
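A sketch of both ideas, assuming hypothetical 3-dimensional word embeddings stored in a dictionary named emb:

    import numpy as np

    emb = {"great": np.array([0.9, 0.1, 0.3]),     # hypothetical 3-d embeddings
           "movie": np.array([0.2, 0.8, 0.5]),
           "boring": np.array([-0.7, 0.4, 0.1])}

    def pad_or_truncate(words, max_len, dim=3):
        vecs = [emb[w] for w in words[:max_len]]          # truncate longer inputs
        vecs += [np.zeros(dim)] * (max_len - len(vecs))   # pad shorter ones with zero embeddings
        return np.concatenate(vecs)                       # fixed-size input vector

    def sentence_embedding(words, how="mean"):
        vecs = np.stack([emb[w] for w in words])
        return vecs.mean(axis=0) if how == "mean" else vecs.max(axis=0)  # element-wise max

    print(pad_or_truncate(["great", "movie"], max_len=4))
    print(sentence_embedding(["great", "boring", "movie"], how="max"))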
Language Models Revisited
Language Modeling: Calculating the probability of the next
word in a sequence given some history.
• N-gram based language models
• other: neural network-based?

Task: predict next word wt


given prior words wt-1, wt-2, wt-3, …
Problem: Now we’re dealing with sequences of arbitrary
length.
Solution: Sliding windows (of fixed length)

24
Neural Language Model

25
Neural LM Better Than N-Gram LM
Training data:
We've seen: I have to make sure that the cat gets fed.
Never seen: dog gets fed

Test data:
I forgot to make sure that the dog gets ___
N-gram LM can't predict "fed"!
Neural LM can use similarity of "cat" and "dog" embeddings
to generalize and predict “fed” after dog

26
Training Neural Networks

27
Training Neural Networks: Intuition
For every training tuple (x, y) = (feature vector, label)
 Run forward computation to find the estimate ŷ
 Run backward computation to update the weights:
 For every output node
 Compute the loss L between the true y and the estimated ŷ
 For every weight w from the hidden layer to the output layer
 Update the weight
 For every hidden node
 Assess how much blame it deserves for the current answer
 For every weight w from the input layer to the hidden layer
 Update the weight

28
Back-propagation
[Figure: three stages of one training step on a small computation graph z = f(x, y) with weights w1, w2.]
1. Feed forward: feed a labeled sample through the network to compute z.
2. Evaluate loss: Loss = z - z_expected, i.e., how "incorrect" is the result compared to the label?
3. Back-propagation: propagate ∂Loss/∂z, ∂Loss/∂x, ∂Loss/∂y backward and update the weights (use Gradient Descent).
29
Gradients and Learning Rate
 Each weight is updated by the value of the gradient (the slope in our example) weighted by a learning rate η

 A higher learning rate means we move w faster

30
NNs: Derivative of the Loss
[Figure: the derivative of the loss is propagated backward from the Output layer through the Hidden layers and their weights, toward the Input layer features.]

31
Convolutional Neural Networks
The name Convolutional Neural Network (CNN) indicates that the
network employs a mathematical operation called convolution.

Convolutional networks are a specialized type of neural network that
uses convolution in place of general matrix multiplication in at
least one of their layers.

CNN is able to successfully capture the spatial dependencies in an


image (data grid) through the application of relevant filters.

CNNs can reduce images (data grids) into a form which is easier to
process without losing features that are critical for getting a good
prediction.

32
Convolutional Neural Networks
[Figure: a typical CNN - convolution and pooling stages followed by flattening into a fully connected classifier. By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374]

33
Convolution: The Idea

3 x 3 Kernel / Filter

Source: https://commons.wikimedia.org/wiki/File:Convolutional_Neural_Network_NeuralNetworkFilter.gif

34
Kernel / Filter: The Idea

3 x 3 Kernel / Filter

Source: https://commons.wikimedia.org/wiki/File:Convolution_arithmetic_-_Padding_strides.gif

35
Convoluting Matrices
Convolution (and Convolutional Neural Networks) can be applied
to any grid-like data (tensors: matrices, vectors, etc.).

kernel        data
0 1 0         0 2 3
1 1 1   conv  2 4 1
0 1 0         0 3 0

"overlay" (element-wise product):
0*0  1*2  0*3
1*2  1*4  1*1
0*0  1*3  0*0

sum = 12

36
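A sketch of the same "overlay and sum" operation in NumPy, using the kernel and data above, plus a simple valid convolution that slides the kernel over a larger grid:

    import numpy as np

    kernel = np.array([[0, 1, 0],
                       [1, 1, 1],
                       [0, 1, 0]])
    data = np.array([[0, 2, 3],
                     [2, 4, 1],
                     [0, 3, 0]])

    print(np.sum(kernel * data))   # element-wise "overlay" then sum -> 12

    def conv2d_valid(image, k):
        kh, kw = k.shape
        oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = np.sum(image[i:i+kh, j:j+kw] * k)  # slide the kernel
        return out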
Selected Image Processing Kernels

Sharpen Mean Blur Gaussian Blur

Laplacian Prewitt (Edge) Prewitt (Edge)

37
Image Processing: Kernels / Filters

38
Applying Kernels / Filters

3 x 3 Kernel / Filter

39
Convolutional NN Kernels
In practice, Convolutional Neural Network kernels can be larger than
3x3 and are learned using back propagation.

Convolution Layer 1 Convolution Layer 2 Convolution Layer 3

40
Convolution Layer 1

Kernel 1

41
Convolution Layer 1

Kernel 2
Kernel 1

42
Convolution Layer 1

Original image
Kernel 3

Kernel 2
Kernel 1

Convolution 1

43
Convolutional Neural Networks

By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374

44
Max Pooling Layer
Convolution 1

Max Pooling

45
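A sketch of 2x2 max pooling with stride 2, a common choice (the feature-map values are made up):

    import numpy as np

    def max_pool2x2(x):
        h, w = x.shape
        # group the map into 2x2 blocks and keep the maximum of each block
        return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    feature_map = np.array([[1, 3, 2, 0],
                            [4, 6, 1, 1],
                            [0, 2, 5, 7],
                            [1, 2, 3, 4]])
    print(max_pool2x2(feature_map))   # [[6, 2], [2, 7]]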
Convolutional Neural Networks

By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374

46
Convolution Layer 2

Original convolution
after pooling Kernel C

Kernel B
Kernel A

Convolution A

47
Convolutional Neural Networks

By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374

48
Flattening
Final output of convolution layers is “flattened” to become a vector of features.

Convert to
vector

Final convolution layer output

Source: https://nikolanews.com/not-just-introduction-to-convolutional-neural-networks-part-1/

49
Recurrent Neural Networks
Recurrent Neural Networks (RNNs) allow cycles in the computational graph
(network). A network node (unit) can take its own output from an earlier step as
input (with delay introduced).
Enables having an internal state / memory: inputs received earlier affect the RNN's
response to the current input.

50
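A minimal sketch of one recurrent layer, where the hidden state h is the internal memory that carries earlier inputs forward; the sizes and the tanh activation are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hidden = 4, 3
    W_xh = rng.normal(size=(d_hidden, d_in))      # input -> hidden weights
    W_hh = rng.normal(size=(d_hidden, d_hidden))  # hidden -> hidden (the recurrent cycle)
    b = np.zeros(d_hidden)

    def rnn(inputs):
        h = np.zeros(d_hidden)                    # internal state / memory
        for x_t in inputs:                        # earlier inputs affect later steps via h
            h = np.tanh(W_xh @ x_t + W_hh @ h + b)
        return h

    print(rnn([rng.normal(size=d_in) for _ in range(5)]))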
Long Short-Term Memory (LSTM)
Long short-term memory (LSTM) is an artificial neural network. Unlike standard
feedforward neural networks, LSTM has feedback connections. Such a recurrent
neural network (RNN) can process not only single data points (such as images), but
also entire sequences of data (such as speech or video). This characteristic makes
LSTM networks ideal for processing and predicting data.

51
Large Language Model (LLM)
A large language model (LLM) is a language model
consisting of a neural network with many
parameters (typically billions of weights or more),
trained on large quantities of unlabeled text using
self-supervised learning.

Source: Wikipedia

52
Generative Pre-trained Transformer 3
What is it?
Generative Pre-trained Transformer 3 (GPT-3) is an
autoregressive language model that uses deep learning
to produce human-like text. It is the third-generation
language prediction model in the GPT-n series (and the
successor to GPT-2) created by OpenAI, a San Francisco-
based artificial intelligence research laboratory.

Size:
175 billion machine learning parameters
~45 GB
Source: Wikipedia

53
Parameters? What Are Those?
[Figure: the weights on the connections between nodes i and j across the Input, Hidden, and Output layers are the network's parameters.]

54
Transformer Architecture

55
GPT-4 Architecture

Source: TheAiEdge.io

56
Self-Attention
In artificial neural networks, attention is a technique that is meant to mimic
cognitive attention. The effect enhances some parts of the input data while
diminishing other parts — the motivation being that the network should devote
more focus to the important parts of the data, even though they may be small.
Learning which part of the data is more important than another depends on the
context, and this is trained by gradient descent.

Source: Park et al. – “SANVis: Visual Analytics for Understanding Self-Attention Networks”

57
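A sketch of single-head scaled dot-product self-attention, the core operation behind this idea in Transformer models; the matrix sizes and random projections are illustrative assumptions.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, W_q, W_k, W_v):
        Q, K, V = X @ W_q, X @ W_k, X @ W_v          # queries, keys, values
        scores = Q @ K.T / np.sqrt(K.shape[-1])      # how much each token attends to the others
        return softmax(scores, axis=-1) @ V          # weighted sum of the values

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))                      # 5 tokens, 8-dimensional embeddings
    W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
    print(self_attention(X, W_q, W_k, W_v).shape)    # (5, 8)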
Generative Pre-trained Transformer 4
What is it?
Generative Pre-trained Transformer 4 (GPT-4) is a
multimodal large language model created by OpenAI. As a
transformer, GPT-4 was pretrained to predict the next
token (using both public data and "data licensed from
third-party providers"), and was then fine-tuned with
reinforcement learning from human and AI feedback for
human alignment and policy compliance.
Size:
1 trillion machine learning parameters

Source: Wikipedia

58
Large Language Models Data Sources

Source: Zhao et al. – “A Survey of Large Language Models” [2023]

59
LLM Data Pre-Processing Pipeline

Source: Zhao et al. – “A Survey of Large Language Models” [2023]

60
ChatGPT
What is it?
ChatGPT is a chatbot developed by OpenAI and released in
November 2022. It is built on top of OpenAI's GPT-3.5 and
GPT-4 families of large language models (LLMs) and has
been fine-tuned (an approach to transfer learning) using
both supervised and reinforcement learning techniques.

Source: Wikipedia

61
Transfer Learning
In transfer learning, experience with one
learning task helps an agent learn better on
another task.

Pre-trained models can be used as a starting


point for developing new models.

62
Encoding Word Relationships:
Vector Representations
Word Embeddings

63
Challenge
 We know word relationships exist
 How can we quantify them in an automated
fashion?
 How do we represent them in a numerical
way?
 How can we use them in computational
models and processes?

64
Vector Semantics: Two Ideas
 Idea 1:
 Let's define the meaning of a word by its
distribution in language use (neighboring
words or grammatical environments)

 Idea 2:
 Let's define the meaning of a word as a
point in space
65
Bag of Words: Strings Representation
Some document:
I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!

Word:       Frequency:
it          6
I           5
the         4
to          3
and         3
seen        2
yet         1
whimsical   1
times       1
...         ...

Bag of words assumption: word/token position does not matter.


66
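A sketch of building such a frequency table with Python's standard library (the tokenization is deliberately crude, and only part of the review text is used):

    from collections import Counter
    import re

    doc = ("I love this movie! It's sweet, but with satirical humor. "
           "I've seen it several times, and I'm always happy to see it again "
           "whenever I have a friend who hasn't seen it yet!")

    tokens = re.findall(r"[a-zA-Z']+", doc.lower())   # word positions are thrown away
    bag = Counter(tokens)
    print(bag.most_common(5))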
Bag of Words: Meaning Ignored!
Some document:
I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!

Word:       Frequency:
it          6
I           5
the         4
to          3
and         3
seen        2
yet         1
whimsical   1
times       1
...         ...

Bag of words assumption: word/token position does not matter.


67
Connotation as a Point in Space
 Words seem to vary along three affective DIMENSIONS:
 valence: the pleasantness of the stimulus
 arousal: the intensity of emotion provoked by the stimulus
 dominance: the degree of control exerted by the stimulus

Dimension    Word         Score    Word        Score
valence      love         1.000    toxic       0.008
             happy        1.000    nightmare   0.005
arousal      elated       0.960    mellow      0.069
             frenzy       0.965    napping     0.046
dominance    powerful     0.991    weak        0.045
             leadership   0.983    empty       0.081

Source: NRC VAD Lexicon (https://saifmohammad.com/WebPages/nrc-vad.html)

68
Vector Semantics
 The idea:
 represent a word as a point in a
multidimensional semantic space that is
derived from the distributions of word
neighbors

69
Point in Space Based on Distribution
 Each word = a vector
 not just "good" or "word45"
 Similar words: “nearby in semantic space"
 We build this space automatically by seeing
which words are nearby in text

70
Vector Semantics: Words as Vectors

Source: Signorelli, Camilo & Arsiwalla, Xerxes. (2019). Moral Dilemmas for Artificial Intelligence: a position paper on an application of
Compositional Quantum Cognition

71
Word Embedding: Definition
Word Embedding:
a term used for the representation of words for text analysis,
typically in the form of a real-valued vector that encodes the
meaning of the word such that the words that are closer in
the vector space are expected to be similar in meaning
from Wikipedia

72
Word Embedding
 Embedding:
 “embedded into a space”
 mapping from one space or structure to
another
 The standard way to represent meaning in
NLP
 Fine-grained model of meaning for
similarity

73
The Why: Sentiment Analysis
 Using words only:
 a feature is a word identity
 for example: "the previous word was ____" for one specific word
 requires the exact same word to be in training and
test

74
The Why: Sentiment Analysis
 Using embeddings:
 a feature is a word vector
 the previous word was vector [35, 22, 17]
 now in the test set we might see a similar
vector [34, 21, 14]
 we can generalize to similar but unseen words

75
Term-Document Matrix
 Each document is represented by a vector
of words

76
Term-Document Matrix
 Vectors are similar for the two comedies
 “As you like it” and “Twelfth Night”

 But comedies are different than the other


two
 more fools and wit and fewer battles
77
Term-Document Matrix
 Vectors are similar for the two comedies
 “As you like it” and “Twelfth Night”

 But comedies are different than the other


two
 more fools and wit and fewer battles
78
Document Vector Visualization

79
Words as Vectors
 battle is "the kind of word that occurs in
Julius Caesar and Henry V"

 fool is "the kind of word that occurs in


comedies, especially Twelfth Night"

80
Word-Word (Term-Context) Matrix
 Two words are similar in meaning if their
context vectors are similar

81
Document Vector Visualization

82
Document Vector Visualization

Note vector
length and
direction
83
Vector Dot / Scalar Product
Given two vectors a and b (N - vector space dimension):
a = [a1, a2, ..., aN] and b = [b1, b2, ..., bN]
their vector dot/scalar product is:
a · b = a1*b1 + a2*b2 + ... + aN*bN
Using matrix representation:
a · b = aᵀ b  (a row vector times a column vector)

84
Vector Dot / Scalar Product
 Vector dot/scalar product is a scalar:

 Vector dot/scalar:
 high values when the two vectors have large
values in the same dimensions
 useful similarity measure

85
Vectors and Dot / Scalar Product

86
Vector Dot / Scalar Product: Problem
 Dot product favors long vectors: higher if a vector is
longer (has higher values in many dimensions)
 Vector length:
|a| = sqrt(a1² + a2² + ... + aN²)
 Frequent words (of, the, you) have long vectors (since
they occur many times with other words).
 dot product overly favors frequent words
87
Alternative: Cosine Similarity
Euclidean distance Cosine similarity

88
Word Similarity | Cosine Similarity

cosine(v, w) = (v · w) / (|v| |w|)
Where: v and w are two different word vectors


89
Word Similarity | Cosine Similarity
 -1: vectors point in opposite directions
 +1: vectors point in same directions
 0: vectors are orthogonal

But since raw frequency values are non-negative, the


cosine for term-term matrix vectors ranges from 0–1
90
Word Similarity
 Two words are similar in meaning if their
context vectors are similar

91
Word Similarity Visualization

92
Word Similarity | Cosine Similarity
             pie    data   computer
cherry       442       8          2
digital        5    1683       1670
information    5    3982       3325

Low similarity:  cosine(cherry, information)
High similarity: cosine(digital, information)

93
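A sketch that reproduces this comparison from the raw counts in the table: cosine(cherry, information) comes out low (about 0.02) while cosine(digital, information) comes out high (about 0.99).

    import numpy as np

    def cosine(v, w):
        return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

    #                        pie   data  computer
    cherry      = np.array([ 442,    8,    2])
    digital     = np.array([   5, 1683, 1670])
    information = np.array([   5, 3982, 3325])

    print(cosine(cherry, information))   # low similarity (~0.017)
    print(cosine(digital, information))  # high similarity (~0.996)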
Cosine Similarity Visualization

94
Words as Vectors: Issues
 We saw how to build vectors to represent
words:
 one-hot encoding:
 binary, count, tf*idf
 Some problems
 Large dimensionality of word vectors
 Lack of meaningful relationships between
words

95
Vector Embeddings: Methods
 tf-idf
 popular in Information Retrieval
 sparse vectors
 word represented by (a simple function of) the
counts of nearby words

 Word2vec
 dense vectors
 representation is created by training a classifier to
predict whether a word is likely to appear nearby

96
Sparse vs. Dense Vectors
 Sparse vectors have a lot of values set to
zero.
[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2]
 Dense vector: most of the values are non-
zero.
 better use of storage
 carries more information
[3, 1, 5, 0, 1, 4, 9, 8, 7, 1, 1, 2, 2, 2, 2]
97
Sparse vs. Dense Vectors
 tf-idf vectors are typically:
 long (length 20,000 to 50,000)
 sparse (most elements are zero)

 What if we could learn vectors that are


 short (length 50-1000)
 dense (most elements are non-zero)

98
Short / Dense Vectors: Benefits
 Why short/dense vectors?
 short vectors may be easier to use as features in
machine learning (fewer weights to tune)
 dense vectors may generalize better than explicit
counts
 dense vectors may do better at capturing synonymy:
 car and automobile are synonyms; but are distinct
dimensions
 a word with car as a neighbor and a word with
automobile as a neighbor should be similar, but aren't
 In practice, they work better

99
Short/Dense Vectors: Methods
 “Neural Language Model”-inspired models
 Word2vec, GloVe

 Singular Value Decomposition (SVD)


 A special case of this is called LSA – Latent Semantic
Analysis
 Alternative to these "static embeddings":
 Contextual Embeddings (ELMo, BERT)
 Compute distinct embeddings for a word in its
context
 Separate embeddings for each token of a word
100
Word2Vec

101
Language Models: Application

we want to predict the "rest" of the query

102
N-gram Language Models
General Maximum Likelihood Estimation (MLE) of
an N-gram:

P(wN | wN-K+1 wN-K+2 ... wN-1) = C(wN-K+1 wN-K+2 ... wN-1 wN) / C(wN-K+1 wN-K+2 ... wN-1)

where:
wi - ith word / token

In MLE, the resulting parameter set maximizes the likelihood of the


training set T given the model M (i.e., P(T | M)).

103
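A sketch of the bigram case of this estimate, P(wi | wi-1) = C(wi-1 wi) / C(wi-1), on a tiny toy corpus (the corpus text is an illustrative assumption):

    from collections import Counter

    corpus = "thou shalt not bear false witness thou shalt not kill".split()

    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p_mle(word, prev):
        return bigrams[(prev, word)] / unigrams[prev]   # C(w_{i-1} w_i) / C(w_{i-1})

    print(p_mle("not", "shalt"))   # 2/2 = 1.0
    print(p_mle("bear", "not"))    # 1/2 = 0.5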
N-gram Language Models
General Maximum Likelihood Estimation (MLE) of
an N-gram:
P(wN | wN-K+1 wN-K+2 ... wN-1) = C(wN-K+1 wN-K+2 ... wN-1 wN) / C(wN-K+1 wN-K+2 ... wN-1)
        <---- thou shalt ---->
wN = "rest" | next query word

where:
wi - ith word / token

In MLE, the resulting parameter set maximizes the likelihood of the


training set T given the model M (i.e., P(T | M)).

104
N-gram Language Models
General Maximum Likelihood Estimation (MLE) of
an N-gram:
P(wN | wN-K+1 wN-K+2 ... wN-1) = C(wN-K+1 wN-K+2 ... wN-1 wN) / C(wN-K+1 wN-K+2 ... wN-1)
        <---- thou shalt ---->
wN = "rest" | next query word

where:
wi - ith word / token

Looks at PAST words to predict the NEXT word (the word with the highest P())!

In MLE, the resulting parameter set maximizes the likelihood of the


training set T given the model M (i.e., P(T | M)).

105
N-gram Language Models: Prediction
Given:
S: wordt-N ... wordt-2 wordt-1 _____

[Figure: the prior words wordt-N, wordt-N+1, ..., wordt-2, wordt-1 are the Input to the Model, whose Output is wordt.]

106
N-gram Language Models: Prediction
Given:
S: wordt-N ... wordt-2 wordt-1 _____

[Figure: the same diagram - the prior words wordt-N, ..., wordt-2, wordt-1 are the Features fed to the Model, and wordt is its Prediction.]

107
N-gram Language Models: Prediction
Given (N = 2):
S: thou shalt _____

[Figure: Input "thou shalt" goes into the Model; the Output is P(S) for every candidate next word, e.g.:]
anchor  P(S) = 0.0001
...
not     P(S) = 0.8
...
zebra   P(S) = 0.0002

108
N-gram Language Models: Prediction
Given (N = 2):
S: thou shalt not _____

[Figure: Input "shalt not" goes into the Model; the Output is P(S) for every candidate next word, e.g.:]
anchor  P(S) = 0.0001
...
bear    P(S) = 0.45
...
zebra   P(S) = 0.0002

109
N-gram Language Models: Prediction
Given (N = 2):
S: thou shalt not bear _____

[Figure: Input "not bear" goes into the Model; the Output is P(S) for every candidate next word, e.g.:]
anchor  P(S) = 0.0001
...
false   P(S) = 0.4
...
zebra   P(S) = 0.0002

110
N-gram Language Models: Prediction
Given (N = 2):
S: thou shalt not bear false _____

[Figure: Input "bear false" goes into the Model; the Output is P(S) for every candidate next word, e.g.:]
anchor  P(S) = 0.0001
...
witness P(S) = 0.75
...
zebra   P(S) = 0.0002

111
Trained Language Models: Prediction
Given input and a model (word embeddings):
S: thou shalt _____

[Figure: Input "thou shalt" -> look up the word embeddings -> calculate predictions -> project to the output vocabulary; the Output is P(S) for every candidate next word, e.g.:]
anchor  P(S) = 0.0001
...
not     P(S) = 0.8
...
zebra   P(S) = 0.0002

112
N-gram Language Models
N-gram language model will handle cases such as:
Tomorrow is ______

but not:
Tomorrow is ______ to be

where:
 context words
 a word to be predicted: target word
113
N-gram Language Models
N-gram language model will handle cases such as:
Tomorrow is ______

but not:
Tomorrow is ______ to be
(here the blank is the target word and the surrounding words are the context words)

114
Predicting the Missing Word
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be

How would you go about it?

115
Word2Vec: Idea

DON’T count - Predict!

116
Word2Vec: Idea
 Instead of counting how often each word w occurs near
"apricot"
 Train a classifier on a binary prediction task:
 Is w likely to show up near "apricot"?
 We don’t actually care about this task
 but we'll take the learned classifier weights as the word
embeddings
 Use self-supervision:
 A word c that occurs near “apricot” in the corpus acts as the
gold "correct answer" for supervised learning
 No need for human labels

117
Available Tools
 Word2vec (Mikolov et al)
https://code.google.com/archive/p/word2vec/

 GloVe (Pennington, Socher, Manning)


http://nlp.stanford.edu/projects/glove/

118
Word2Vec: the Approach
1. Treat the target word t and a neighboring context
word c as positive examples.
2. Randomly sample other words in the lexicon to
get negative examples
3. Use logistic regression to train a classifier to
distinguish those two cases
4. Use the learned classifier weights as the
embeddings

119
Word2Vec: the Approach
Given the set of positive and negative training
instances, and an initial set of embedding vectors

The goal of learning is to adjust those word vectors


such that we:
 maximize the similarity of the target word,
context word pairs (w, cpos) drawn from the
positive data
 minimize the similarity of the (w, cneg) pairs
drawn from the negative data.
120
Predicting the Missing Word
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be

Two approaches possible:


 use the context words to predict the target word
 use the target word to predict context words

121
Sliding Window
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be

Let’s generalize it a bit:


wordt-2 wordt-1 wordt wordt+1 wordt+2

122
Sliding Window
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be

Let’s generalize it even further:


wordt-N ... wordt-2 wordt-1 wordt wordt+1 wordt+2 ... wordt+N

123
Sliding Window
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be

But we don’t need to look at ALL words in:


wordt-N ... wordt-2 wordt-1 wordt wordt+1 wordt+2 ... wordt+N

We can reduce the size of the context:


wordt-N ... wordt-2 wordt-1 wordt wordt+1 wordt+2 ... wordt+N
sliding window +/- 2

124
Predicting the Missing Word
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be

Two approaches possible:


 use the context words to predict the target word:
Continuous Bag of Words model (CBOW)
 use the target word to predict context words:
Skip Gram model

125
CBOW Word2Vec
Given (window size 2):
wordt-N ... wordt-2 wordt-1 _____ wordt+1 wordt+2 ... wordt+N

[Figure: CBOW - the context words wordt-2, wordt-1, wordt+1, wordt+2 are the Input, their vectors are summed in the Projection layer, and the Output is the target word wordt.]

126
Skip Gram Word2Vec
Given (window size 2):
wordt-N ... wordt-2 wordt-1 _____ wordt+1 wordt+2 ... wordt+N

[Figure: Skip Gram - the target word wordt is the Input, it passes through the Projection layer, and the Output is the context words wordt-2, wordt-1, wordt+1, wordt+2.]

127
Skip Gram Word2Vec
Predict context given target word:
thou shalt _not_ bear false witness

[Figure: Skip Gram with target word "not" - "not" is the Input, it passes through the Projection layer, and the Output is the context words thou, shalt, bear, false.]

128
Skip Gram Word2Vec
Predict context given target word:
thou shalt _not_ bear false witness
[Figure: the same Skip Gram network - for target word "not" the Output layer scores each context word, e.g.:]
thou:   P(+|not, thou),  P(-|not, thou)
shalt:  P(+|not, shalt), P(-|not, shalt)
bear:   P(+|not, bear),  P(-|not, bear)
false:  P(+|not, false), P(-|not, false)

129
Word2Vec: the Approach
1. Treat the target word t and a neighboring context
word c as positive examples.
2. Randomly sample other words in the lexicon to
get negative examples
3. Use logistic regression to train a classifier to
distinguish those two cases
4. Use the learned classifier weights as the
embeddings

130
Word2Vec: the Approach
Given the set of positive and negative training
instances, and an initial set of embedding vectors

The goal of learning is to adjust those word vectors


such that we:
 maximize the similarity of the target word,
context word pairs (w, cpos) drawn from the
positive data
 minimize the similarity of the (w, cneg) pairs
drawn from the negative data.
131
Word2Vec: the Approach

132
Word2Vec: the Approach

133
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence: target word

…lemon, a [tablespoon of apricot jam, a] pinch…

context words

134
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4

135
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4

Goal 1: train a classifier that is given a candidate


(word, context word) pair: (apricot, jam), (apricot,
aardvark), etc.

136
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4

Goal 2: assign probabilities to every (word, context


word) pair:
P(+ | w, ci) and
P(- | w, ci) = 1 - P(+ | w, ci)

137
Target and Context Embeddings
[Figure: two embedding matrices, Target and Context, each with one row per vocabulary word (aardvark, ..., not, ..., shalt, ..., thou, ..., zebra / zone); each matrix has Vocabulary size |V| rows and Embedding size d columns.]

138
Intuition: Target & Context Similar
[Figure: the same Target and Context matrices (|V| rows, embedding size d).]

139
Cosine Similarity Visualization

Two vectors are similar if they have


a high dot product | cosine similarity
140
Intuition: Target & Context Similar
[Figure: the target row w (the row for "not") in the Target matrix and the context row c (the row for "shalt") in the Context matrix. Target & Context are similar when c · w is high.]

141
Intuition: Target & Context Similar
[Figure: the same rows w and c, with Similarity(c, w) = c · w.]

142
Intuition: Target & Context Similar
[Figure: the same rows w and c. Similarity(c, w) = c · w is NOT a PROBABILITY though!]

143
Intuition: Target & Context Similar
[Figure: the same rows w and c. To turn Similarity(c, w) = c · w into a probability, use the sigmoid function!]

144
Similarity → Probability

P(+ | w, c) = σ(c · w) = 1 / (1 + exp(-c · w))

P(- | w, c) = 1 - P(+ | w, c) = σ(-c · w) = 1 / (1 + exp(c · w))

145
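A sketch of these two formulas in code; the embedding values are made up.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def p_positive(w, c):
        return sigmoid(np.dot(c, w))     # P(+ | w, c) = sigma(c . w)

    w = np.array([0.2, -0.4, 0.7])       # hypothetical target embedding
    c = np.array([0.1, -0.3, 0.9])       # hypothetical context embedding
    print(p_positive(w, c))              # P(+ | w, c)
    print(1 - p_positive(w, c))          # P(- | w, c) = sigma(-c . w)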
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
P(+ | w, c1) P(+ | w, c2) P(+ | w, c3) P(+ | w, c4)
P(- | w, c1) P(- | w, c2) P(- | w, c3) P(- | w, c4)

OK, but we have lots of possible context words!

146
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
P(+ | w, c1) P(+ | w, c2) P(+ | w, c3) P(+ | w, c4)
P(- | w, c1) P(- | w, c2) P(- | w, c3) P(- | w, c4)

Assuming word independence, calculate:


P(+ | w, c1, c2, c3, c4) = P(+ | w, c1) · P(+ | w, c2) · P(+ | w, c3) · P(+ | w, c4)

147
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
P(+ | w, c1) P(+ | w, c2) P(+ | w, c3) P(+ | w, c4)
P(- | w, c1) P(- | w, c2) P(- | w, c3) P(- | w, c4)

In general:
P(+ | w, c1:L) = ∏ i=1..L P(+ | w, ci)

148
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
P(+ | w, c1) P(+ | w, c2) P(+ | w, c3) P(+ | w, c4)
P(- | w, c1) P(- | w, c2) P(- | w, c3) P(- | w, c4)

In general [with sums instead of products]:


log P(+ | w, c1:L) = Σ i=1..L log P(+ | w, ci)

149
Skip Gram Classifier: Summary
A probabilistic classifier, given
 a test target word w
 its context window of L words c1:L
Estimates probability that w occurs in this window
based on similarity of w (embeddings) to c1:L
(embeddings).

To compute this, we just need embeddings for all


the words.
150
Parameters: Target (W) and Context (C)

151
Word2Vec: the Approach
1. Treat the target word t and a neighboring context
word c as positive examples.
2. Randomly sample other words in the lexicon to
get negative examples
3. Use logistic regression to train a classifier to
distinguish those two cases
4. Use the learned classifier weights as the
embeddings

152
Word2Vec: Training
Assume a +/- 2 (L = 4) word window, given training
sentence:

…lemon, a [tablespoon of apricot jam, a] pinch…

Positive (+) examples:


(apricot, tablespoon),(apricot, of),(apricot, jam),(apricot, a)
Negative (-) examples: K per positive pair (typically double the number of (+) examples):
(apricot, aardvark),(apricot, my),(apricot, where),(apricot, coaxial)
(apricot, seven),(apricot, forever),(apricot, dear),(apricot, if)

153
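A sketch of generating these examples for the target word "apricot"; the small vocabulary and the uniform negative sampling (rather than the weighted unigram sampling word2vec actually uses) are simplifying assumptions.

    import random

    random.seed(0)
    sentence = "lemon a tablespoon of apricot jam a pinch".split()
    vocab = ["aardvark", "my", "where", "coaxial", "seven", "forever", "dear", "if"]

    def training_examples(tokens, i, window=2, k=2):
        target = tokens[i]
        context = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
        positives = [(target, c) for c in context]                   # (+) examples
        negatives = [(target, random.choice(vocab))                  # (-) examples,
                     for _ in range(k * len(positives))]             # k per (+) pair
        return positives, negatives

    pos, neg = training_examples(sentence, sentence.index("apricot"))
    print(pos)   # [('apricot', 'tablespoon'), ('apricot', 'of'), ('apricot', 'jam'), ('apricot', 'a')]
    print(neg)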
Word2Vec: the Approach
Given the set of positive and negative training
instances, and an initial set of embedding vectors

The goal of learning is to adjust those word vectors


such that we:
 maximize the similarity of the target word,
context word pairs (w, cpos) drawn from the
positive data
 minimize the similarity of the (w, cneg) pairs
drawn from the negative data.
154
Loss Function
Loss function for one w with cpos , cneg1 ...cnegk
Maximize the similarity of the target with the actual
context words (+), and minimize the similarity of the target
with the k negative sampled non-neighbor words (-).

155
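A sketch of this loss for one target word w, one positive context c_pos, and k sampled negative contexts, written as L = -[log σ(c_pos · w) + Σi log σ(-c_negi · w)] (the standard skip-gram negative-sampling form; the embedding values below are made up):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def neg_sampling_loss(w, c_pos, c_negs):
        # maximize similarity with the real neighbor, minimize it with the sampled ones
        loss = -np.log(sigmoid(np.dot(c_pos, w)))
        for c_neg in c_negs:
            loss -= np.log(sigmoid(-np.dot(c_neg, w)))
        return loss

    w = np.array([0.3, -0.2, 0.8])                    # hypothetical embeddings
    c_pos = np.array([0.25, -0.1, 0.9])
    c_negs = [np.array([-0.6, 0.5, 0.1]), np.array([0.0, 0.7, -0.4])]
    print(neg_sampling_loss(w, c_pos, c_negs))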
Classifier: Learning Process
 How to learn?
 use stochastic gradient descent

 Adjust the word weights to:


 make the positive pairs more likely
 and the negative pairs less likely,
 ... for the entire training set.

156
Gradient Descent: Single Step

157
Loss Function Derivatives

158
Gradient Descent: Updates
Start with randomly initialized W and C matrices
[Figure: the Target (W) and Context (C) matrices, each of size |V| x d (one row per vocabulary word, aardvark through zebra / zone), filled with random initial values.]

159
Gradient Descent: Updates
... then incrementally do updates using gradient descent steps scaled by the
learning rate η.

160
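A sketch of one such update step. The gradient expressions come from differentiating the negative-sampling loss above; treat them as an assumption if the course notation differs.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgd_step(w, c_pos, c_negs, eta=0.1):
        # gradients of the negative-sampling loss for one (w, c_pos, c_neg1..k) example
        grad_w = (sigmoid(np.dot(c_pos, w)) - 1) * c_pos
        new_c_pos = c_pos - eta * (sigmoid(np.dot(c_pos, w)) - 1) * w
        new_c_negs = []
        for c_neg in c_negs:
            grad_w += sigmoid(np.dot(c_neg, w)) * c_neg
            new_c_negs.append(c_neg - eta * sigmoid(np.dot(c_neg, w)) * w)
        return w - eta * grad_w, new_c_pos, new_c_negs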
Skip Gram Word2Vec: Summary
 Start with |V| random d-dimensional vectors as initial
embeddings
 Train a classifier based on embedding similarity
 Take a corpus and take pairs of words that co-occur as positive
examples
 Take pairs of words that don't co-occur as negative examples
 Train the classifier to distinguish these by slowly adjusting all
the embeddings to improve the classifier performance
 Throw away the classifier code and keep the embeddings.

161
Sliding Window Size
 Small windows (+/- 2) : nearest words are
syntactically similar words in same taxonomy
 Hogwarts nearest neighbors are other fictional
schools
 Sunnydale, Evernight, Blandings

 Large windows (+/- 5) : nearest words are


related words in same semantic field
 Hogwarts nearest neighbors are Harry Potter world:
 Dumbledore, half-blood, Malfoy
162
