CS585 Lecture, October 15th
2
Plan for Today
Introduction to Neural Networks
3
Introduction to Neural Networks
4
Main Machine Learning Categories
Supervised learning
Unsupervised learning
Reinforcement learning
5
Choosing Hypothesis / Model
Given a training set of N example input-output
(feature-label) pairs
(x1, y1), (x2, y2), ..., (xN, yN)
where each pair was generated by
y = f(x)
Ideally, we would like our model h(x) (hypothesis)
that approximates the true function f(x) to be:
h(x) = y = f(x) (consistent hypothesis)
6
Feedforward Neural Network
[Diagram: feedforward network: input features pass through successive layers of weights to produce the output]
7
ANN as a Complex Function
In ANNs hypotheses take form of complex algebraic circuits with
tunable connection strengths (weights).
[Diagram: successive layers of tunable weights]
8
Training Neural Networks: Intuition
For every training tuple (x, y) = (feature vector, label):
Run the forward computation to find the estimate ŷ
Run the backward computation to update the weights:
For every output node
Compute the loss L between the true y and the estimated ŷ
For every weight w from the hidden layer to the output layer
Update the weight
9
Back-propagation
Feed forward: compute z = f(x, y) using the current weights w1 and w2
Evaluate Loss: Loss = z - z_expected
Back-propagation: propagate the derivatives ∂Loss/∂z, ∂Loss/∂x, and ∂Loss/∂y backward
through the network and use them to update the weights w1 and w2
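To make the chain rule in the figure concrete, here is a minimal numeric sketch (not from the slides): it assumes f(x, y) = w1*x + w2*y as a stand-in for the f(x, y) box and uses the figure's Loss = z - z_expected; the input values and learning rate are made up.

```python
# Assumption (mine): f(x, y) = w1*x + w2*y stands in for the f(x, y) box.
w1, w2 = 0.5, -0.3          # current weights
x, y = 2.0, 1.0             # inputs
z_expected = 1.0            # desired output

# Feed forward
z = w1 * x + w2 * y         # z = f(x, y)
loss = z - z_expected       # Loss = z - z_expected

# Back-propagation via the chain rule
dloss_dz = 1.0              # d(z - z_expected)/dz
dloss_dw1 = dloss_dz * x    # dz/dw1 = x
dloss_dw2 = dloss_dz * y    # dz/dw2 = y
dloss_dx = dloss_dz * w1    # dz/dx = w1
dloss_dy = dloss_dz * w2    # dz/dy = w2

# Gradient-descent update with an illustrative learning rate
lr = 0.1
w1 -= lr * dloss_dw1
w2 -= lr * dloss_dw2
print(loss, dloss_dx, dloss_dy, w1, w2)
```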
11
NNs: Derivative of the Loss
[Diagram: feedforward network (features, weight layers, output); the derivative of the loss is propagated backward through the weight layers]
12
Neural Networks in NLP
Let’s consider the NLP modeling we explored
so far:
Classification
Language Modeling
Can we apply Neural Networks?
13
Logistic Regression Sentiment Analysis
[Diagram: logistic regression as a network: an input layer of features, one set of weights, and an output layer producing a BINARY answer]
14
Logistic Regression Sentiment Analysis
[Diagram: the same classifier with a hidden layer added: features, two sets of weights, and an output layer producing a BINARY answer]
15
Complex Feature Vector Relationships
Adding hidden layers can help capture non-linear relationships between features!
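As an illustration of this point (my own, not from the slides), the sketch below uses hand-picked weights to show that a single hidden layer lets a network represent XOR, a non-linear relationship between two binary features that a single logistic-regression layer cannot capture; all weight values are illustrative assumptions.

```python
import numpy as np

def relu(v):
    return np.maximum(0, v)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Four inputs: every combination of two binary features.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hand-picked (illustrative) weights for one hidden layer with two units.
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([0.0, -1.0])
W_out = np.array([1.0, -2.0])
b_out = -0.5

hidden = relu(X @ W_hidden + b_hidden)   # non-linear hidden features
scores = hidden @ W_out + b_out
print(sigmoid(scores) > 0.5)             # [False  True  True False] = XOR
```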
16
Word Embedding: Definition
Word Embedding:
a term used for the representation of words for text analysis,
typically in the form of a real-valued vector that encodes the
meaning of the word such that the words that are closer in
the vector space are expected to be similar in meaning
from Wikipedia
17
Exercise: Word2Vec
https://www.cs.cmu.edu/~dst/WordEmbeddingDemo/index.html
18
Embeddings as Input Features
[Diagram: word embeddings (features learned from data) are the input features, followed by weight layers and the output layer]
Multiclass output: add more output layer nodes + use softmax (instead of sigmoid)
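A minimal sketch of that change (my own, with made-up sizes and random values): concatenated word embeddings feed a hidden layer, and a softmax over several output nodes replaces the single sigmoid node.

```python
import numpy as np
rng = np.random.default_rng(0)

d, n_words, n_hidden, n_classes = 4, 3, 5, 3    # assumed (made-up) sizes
embeddings = rng.normal(size=(n_words, d))      # embeddings of the 3 input words
x = embeddings.reshape(-1)                      # concatenate into one feature vector

W1 = rng.normal(size=(n_words * d, n_hidden))   # input-to-hidden weights
W2 = rng.normal(size=(n_hidden, n_classes))     # hidden-to-output weights (one column per class)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

hidden = np.tanh(x @ W1)
probs = softmax(hidden @ W2)    # one probability per output node / class
print(probs, probs.sum())       # the probabilities sum to 1.0
```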
19
Embeddings as Input Features
20
Embeddings as Input Features
Assumption:
“3-word sentences”
21
Embeddings as Input Features
[Diagram: word embeddings (features learned from data) are the input features, followed by two weight layers and an output layer producing a BINARY answer]
22
Texts in Different Sizes: Ideas
Some simple solutions:
1. Make the input the length of the longest sample:
if a sample is shorter, pad it with zero embeddings
truncate if you get longer reviews at test time
2. Create a single "sentence embedding" (with the same
dimensionality as a word) to represent all the words:
take the mean of all the word embeddings, or
take the element-wise max of all the word embeddings
(for each dimension, pick the max value across all words)
(both options are sketched in code below)
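A short sketch of both options, using random vectors as stand-ins for word embeddings (the sizes are arbitrary assumptions):

```python
import numpy as np
rng = np.random.default_rng(0)

d, max_len = 4, 6                                 # assumed embedding size and max length
words = [rng.normal(size=d) for _ in range(3)]    # a 3-word review (random stand-in embeddings)

# Option 1: pad (or truncate) to a fixed length with zero embeddings.
padded = np.zeros((max_len, d))
padded[:min(len(words), max_len)] = words[:max_len]

# Option 2: one "sentence embedding" with the same dimensionality as a word.
stacked = np.stack(words)
mean_embedding = stacked.mean(axis=0)             # mean of the word embeddings
max_embedding = stacked.max(axis=0)               # element-wise max per dimension

print(padded.shape, mean_embedding.shape, max_embedding.shape)
```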
23
Language Models Revisited
Language Modeling: Calculating the probability of the next
word in a sequence given some history.
• N-gram based language models
• other: neural network-based?
24
Neural Language Model
25
Neural LM Better Than N-Gram LM
Training data:
We've seen: I have to make sure that the cat gets fed.
Never seen: dog gets fed
Test data:
I forgot to make sure that the dog gets ___
N-gram LM can't predict "fed"!
Neural LM can use similarity of "cat" and "dog" embeddings
to generalize and predict “fed” after dog
26
Training Neural Networks
27
Training Neural Networks: Intuition
For every training tuple (x, y) = (feature vector, label):
Run the forward computation to find the estimate ŷ
Run the backward computation to update the weights:
For every output node
Compute the loss L between the true y and the estimated ŷ
For every weight w from the hidden layer to the output layer
Update the weight
(a code sketch of this loop follows below)
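A compact sketch of the loop (my own minimal example, not the course code): a one-hidden-layer network trained with a forward pass, a cross-entropy-style backward pass, and per-example weight updates on a toy XOR task; the sizes, learning rate, and epoch count are arbitrary choices.

```python
import numpy as np
rng = np.random.default_rng(0)

# Toy data: 2 features, binary label (XOR, which needs the hidden layer).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

n_hidden, lr = 8, 0.5
W1 = rng.normal(scale=0.5, size=(2, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=n_hidden)
b2 = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(2000):
    for x_i, y_i in zip(X, y):              # for every training tuple (x, y)
        # Forward computation: find the estimate y_hat
        h = np.tanh(x_i @ W1 + b1)
        y_hat = sigmoid(h @ W2 + b2)
        # Backward computation: cross-entropy loss gradients, then weight updates
        d_out = y_hat - y_i                 # dLoss/d(output pre-activation)
        d_h = d_out * W2 * (1.0 - h ** 2)   # back-propagate through tanh
        W2 -= lr * d_out * h
        b2 -= lr * d_out
        W1 -= lr * np.outer(x_i, d_h)
        b1 -= lr * d_h

print(np.round(sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2), 2))  # should end up close to [0, 1, 1, 0]
```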
28
Back-propagation
Feed forward: compute z = f(x, y) using the current weights w1 and w2
Evaluate Loss: Loss = z - z_expected
Back-propagation: propagate the derivatives ∂Loss/∂z, ∂Loss/∂x, and ∂Loss/∂y backward
through the network and use them to update the weights w1 and w2
30
NNs: Derivative of the Loss
[Diagram: feedforward network (features, weight layers, output); the derivative of the loss is propagated backward through the weight layers]
31
Convolutional Neural Networks
The name Convolutional Neural Network (CNN) indicates that the
network employs a mathematical operation called convolution.
CNNs can reduce images (data grids) to a form that is easier to
process without losing the features that are critical for making a good
prediction.
32
Convolutional Neural Networks
[Diagram: CNN pipeline with convolution, Pooling, and Flattening stages]
33
Convolution: The Idea
3 x 3 Kernel / Filter
Source: https://commons.wikimedia.org/wiki/File:Convolutional_Neural_Network_NeuralNetworkFilter.gif
34
Kernel / Filter: The Idea
3 x 3 Kernel / Filter
Source: https://commons.wikimedia.org/wiki/File:Convolution_arithmetic_-_Padding_strides.gif
35
Convoluting Matrices
Convolution (and Convolutional Neural Networks) can be applied
to any grid-like data (tensors: matrices, vectors, etc.).
Example: convolving a 3x3 kernel with a 3x3 patch of data:
kernel         data           "overlay" (element-wise products)
0 1 0          0 2 3          0*0  1*2  0*3
1 1 1   conv   2 4 1          1*2  1*4  1*1
0 1 0          0 3 0          0*0  1*3  0*0
sum = 12
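The same overlay-and-sum computation in numpy (a small sketch; a deep learning library would normally do this), extended to slide the kernel over a larger made-up grid:

```python
import numpy as np

kernel = np.array([[0, 1, 0],
                   [1, 1, 1],
                   [0, 1, 0]])
data = np.array([[0, 2, 3],
                 [2, 4, 1],
                 [0, 3, 0]])

# "Overlay" the kernel on the data: multiply element-wise, then sum.
print((kernel * data).sum())                   # 12, as in the example above

# Sliding the kernel over a larger grid gives one such sum per position.
def slide_kernel(image, kernel):
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = (image[r:r + kh, c:c + kw] * kernel).sum()
    return out

image = np.arange(25).reshape(5, 5)            # a made-up 5x5 "image"
print(slide_kernel(image, kernel))             # a 3x3 map of overlay-and-sum values
```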
36
Selected Image Processing Kernels
37
Image Processing: Kernels / Filters
38
Applying Kernels / Filters
3 x 3 Kernel / Filter
39
Convolutional NN Kernels
In practice, Convolutional Neural Network kernels can be larger than
3x3 and are learned using back-propagation.
40
Convolution Layer 1
[Diagram: Kernel 1 applied to the input image]
41
Convolution Layer 1
[Diagram: Kernels 1 and 2 applied to the input image]
42
Convolution Layer 1
[Diagram: the original image is convolved with Kernels 1, 2, and 3 to produce the feature maps of Convolution 1]
43
Convolutional Neural Networks
44
Max Pooling Layer
[Diagram: the feature maps of Convolution 1 are reduced by a Max Pooling layer]
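A small sketch of 2x2 max pooling over a made-up feature map:

```python
import numpy as np

feature_map = np.array([[1, 3, 2, 0],
                        [4, 6, 1, 1],
                        [0, 2, 5, 7],
                        [1, 2, 3, 4]])

# 2x2 max pooling with stride 2: keep the largest value in each 2x2 block.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6 2]
                #  [2 7]]
```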
45
Convolutional Neural Networks
46
Convolution Layer 2
[Diagram: the pooled output of the first convolution layer ("original convolution after pooling") is convolved with Kernels A, B, and C to produce Convolution A, the second layer's feature maps]
47
Convolutional Neural Networks
48
Flattening
Final output of convolution layers is “flattened” to become a vector of features.
Convert to vector
Source: https://nikolanews.com/not-just-introduction-to-convolutional-neural-networks-part-1/
49
Recurrent Neural Networks
Recurrent Neural Networks (RNNs) allow cycles in the computational graph
(network). A network node (unit) can take its own output from an earlier step as
input (with delay introduced).
This enables an internal state (memory): inputs received earlier affect the RNN's
response to the current input.
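A minimal sketch of that recurrence (my own, with made-up sizes): the unit's output from the previous step is fed back as part of the current input, giving the network a simple internal state.

```python
import numpy as np
rng = np.random.default_rng(0)

d_in, d_hidden = 3, 4                                    # assumed (made-up) sizes
W_x = rng.normal(scale=0.3, size=(d_in, d_hidden))       # input weights
W_h = rng.normal(scale=0.3, size=(d_hidden, d_hidden))   # feedback (recurrent) weights
b = np.zeros(d_hidden)

inputs = rng.normal(size=(5, d_in))           # a sequence of 5 input vectors
h = np.zeros(d_hidden)                        # internal state / memory

for x_t in inputs:
    # The previous state h (the unit's earlier output) is fed back in with a delay.
    h = np.tanh(x_t @ W_x + h @ W_h + b)
    print(np.round(h, 2))
```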
50
Long Short-Term Memory (LSTM)
Long short-term memory (LSTM) is an artificial recurrent neural network architecture. Unlike
standard feedforward neural networks, an LSTM has feedback connections. Such a recurrent
neural network (RNN) can process not only single data points (such as images), but
also entire sequences of data (such as speech or video). This characteristic makes
LSTM networks well suited to processing and predicting sequential data.
51
Large Language Model (LLM)
A large language model (LLM) is a language model
consisting of a neural network with many
parameters (typically billions of weights or more),
trained on large quantities of unlabeled text using
self-supervised learning.
Source: Wikipedia
52
Generative Pre-trained Transformer 3
What is it?
Generative Pre-trained Transformer 3 (GPT-3) is an
autoregressive language model that uses deep learning
to produce human-like text. It is the third-generation
language prediction model in the GPT-n series (and the
successor to GPT-2) created by OpenAI, a San Francisco-
based artificial intelligence research laboratory.
Size:
175 billion machine learning parameters
~45 GB
Source: Wikipedia
53
Parameters? What Are Those?
[Diagram: the parameters are the network's weights; each weight w_ij connects node i in one layer to node j in the next]
54
Transformer Architecture
55
GPT-4 Architecture
Source: TheAiEdge.io
56
Self-Attention
In artificial neural networks, attention is a technique that is meant to mimic
cognitive attention. The effect enhances some parts of the input data while
diminishing other parts — the motivation being that the network should devote
more focus to the important parts of the data, even though they may be small.
Learning which part of the data is more important than another depends on the
context, and this is trained by gradient descent.
Source: Park et al. – “SANVis: Visual Analytics for Understanding Self-Attention Networks”
57
Generative Pre-trained Transformer 4
What is it?
Generative Pre-trained Transformer 4 (GPT-4) is a
multimodal large language model created by OpenAI. As a
transformer, GPT-4 was pretrained to predict the next
token (using both public data and "data licensed from
third-party providers"), and was then fine-tuned with
reinforcement learning from human and AI feedback for
human alignment and policy compliance.
Size:
1 trillion machine learning parameters
Source: Wikipedia
58
Large Language Models Data Sources
59
LLM Data Pre-Processing Pipeline
60
ChatGPT
What is it?
ChatGPT is a chatbot developed by OpenAI and released in
November 2022. It is built on top of OpenAI's GPT-3.5 and
GPT-4 families of large language models (LLMs) and has
been fine-tuned (an approach to transfer learning) using
both supervised and reinforcement learning techniques.
Source: Wikipedia
61
Transfer Learning
In transfer learning, experience with one
learning task helps an agent learn better on
another task.
62
Encoding Word Relationships:
Vector Representations
Word Embeddings
63
Challenge
We know word relationships exist.
How can we quantify them in an automated fashion?
How do we represent them in a numerical way?
How can we use them in computational models and processes?
64
Vector Semantics: Two Ideas
Idea 1:
Let's define the meaning of a word by its
distribution in language use (neighboring
words or grammatical environments)
Idea 2:
Let's define the meaning of a word as a
point in space
65
Bag of Words: Strings Representation
Some document:
I love this movie! It's sweet, but with satirical humor. The dialogue
is great and the adventure scenes are fun... It manages to be
whimsical and romantic while laughing at the conventions of the
fairy tale genre. I would recommend it to just about anyone. I've
seen it several times, and I'm always happy to see it again
whenever I have a friend who hasn't seen it yet! ...

Word:       Frequency:
it          6
I           5
the         4
to          3
and         3
seen        2
yet         1
whimsical   1
times       1
...         ...
68
Vector Semantics
The idea:
represent a word as a point in a
multidimensional semantic space that is
derived from the distributions of word
neighbors
69
Point in Space Based on Distribution
Each word = a vector
not just "good" or "word45"
Similar words: “nearby in semantic space"
We build this space automatically by seeing
which words are nearby in text
70
Vector Semantics: Words as Vectors
Source: Signorelli, Camilo & Arsiwalla, Xerxes. (2019). Moral Dilemmas for Artificial Intelligence: a position paper on an application of
Compositional Quantum Cognition
71
Word Embedding: Definition
Word Embedding:
a term used for the representation of words for text analysis,
typically in the form of a real-valued vector that encodes the
meaning of the word such that the words that are closer in
the vector space are expected to be similar in meaning
from Wikipedia
72
Word Embedding
Embedding:
“embedded into a space”
mapping from one space or structure to
another
The standard way to represent meaning in
NLP
Fine-grained model of meaning for
similarity
73
The Why: Sentiment Analysis
Using words only:
a feature is a word identity
for example: the feature "the previous word was 'terrible'"
74
The Why: Sentiment Analysis
Using embeddings:
a feature is a word vector
the previous word was vector [35, 22, 17]
now in the test set we might see a similar
vector [34, 21, 14]
we can generalize to similar but unseen words
75
Term-Document Matrix
Each document is represented by a vector
of words
76
Term-Document Matrix
Vectors are similar for the two comedies
“As you like it” and “Twelfth Night”
79
Words as Vectors
battle is "the kind of word that occurs in
Julius Caesar and Henry V"
80
Word-Word (Term-Context) Matrix
Two words are similar in meaning if their
context vectors are similar
81
Document Vector Visualization
82
Document Vector Visualization
Note vector length and direction
83
Vector Dot / Scalar Product
Given two vectors a and b (N - the vector space dimension):
a = [a1, a2, ..., aN] and b = [b1, b2, ..., bN]
their vector dot/scalar product is:
a · b = a1*b1 + a2*b2 + ... + aN*bN
84
Vector Dot / Scalar Product
The vector dot/scalar product is a scalar (a single number):
a · b = a1*b1 + a2*b2 + ... + aN*bN
The dot/scalar product:
has high values when the two vectors have large
values in the same dimensions
is a useful similarity measure
85
Vectors and Dot / Scalar Product
86
Vector Dot / Scalar Product: Problem
The dot product favors long vectors: it is higher if a vector is
longer (has higher values in many dimensions).
Vector length: |a| = sqrt(a1^2 + a2^2 + ... + aN^2)
88
Word Similarity | Cosine Similarity
cosine(a, b) = (a · b) / (|a| |b|)
91
Word Similarity Visualization
92
Word Similarity | Cosine Similarity
Counts of context words:
              pie   data  computer
cherry        442      8         2
digital         5   1683      1670
information     5   3982      3325

cherry vs. information: low similarity
digital vs. information: high similarity
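Computing the cosine similarity of these count vectors directly (a quick sketch using the counts from the table):

```python
import numpy as np

# Context-word counts (pie, data, computer) from the table above.
cherry      = np.array([442,    8,    2])
digital     = np.array([  5, 1683, 1670])
information = np.array([  5, 3982, 3325])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(cherry, information))    # low similarity
print(cosine(digital, information))   # high similarity
```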
93
Cosine Similarity Visualization
94
Words as Vectors: Issues
We saw how to build vectors to represent
words:
one-hot encoding:
binary, count, tf*idf
Some problems
Large dimensionality of word vectors
Lack of meaningful relationships between
words
95
Vector Embeddings: Methods
tf-idf
popular in Information Retrieval
sparse vectors
word represented by (a simple function of) the
counts of nearby words
Word2vec
dense vectors
representation is created by training a classifier to
predict whether a word is likely to appear nearby
96
Sparse vs. Dense Vectors
Sparse vectors have a lot of values set to
zero.
[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2]
Dense vector: most of the values are non-zero.
better use of storage
carries more information
[3, 1, 5, 0, 1, 4, 9, 8, 7, 1, 1, 2, 2, 2, 2]
97
Sparse vs. Dense Vectors
tf-idf vectors are typically:
long (length 20,000 to 50,000)
sparse (most elements are zero)
98
Short / Dense Vectors: Benefits
Why short/dense vectors?
short vectors may be easier to use as features in
machine learning (fewer weights to tune)
dense vectors may generalize better than explicit
counts
dense vectors may do better at capturing synonymy:
car and automobile are synonyms, but in sparse count
vectors they are distinct dimensions
a word with car as a neighbor and a word with
automobile as a neighbor should be similar, but aren't
In practice, dense vectors work better
99
Short/Dense Vectors: Methods
“Neural Language Model”-inspired models
Word2vec, GloVe
101
Language Models: Application
102
N-gram Language Models
General Maximum Likelihood Estimation (MLE) of
an N-gram:
P(wN | wN-K+1 wN-K+2 ... wN-1) = Count(wN-K+1 wN-K+2 ... wN-1 wN) / Count(wN-K+1 wN-K+2 ... wN-1)
where:
wi - the ith word / token
103
N-gram Language Models
General Maximum Likelihood Estimation (MLE) of
an N-gram:
Example: the history wN-K+1 ... wN-1 = "thou shalt", and wN = "rest" | the next query word
P(wN | wN-K+1 wN-K+2 ... wN-1) = Count(wN-K+1 wN-K+2 ... wN-1 wN) / Count(wN-K+1 wN-K+2 ... wN-1)
where:
wi - the ith word / token
104
N-gram Language Models
General Maximum Likelihood Estimation (MLE) of
an N-gram:
Example: the history wN-K+1 ... wN-1 = "thou shalt", and wN = "rest" | the next query word
P(wN | wN-K+1 wN-K+2 ... wN-1) = Count(wN-K+1 wN-K+2 ... wN-1 wN) / Count(wN-K+1 wN-K+2 ... wN-1)
where:
wi - the ith word / token
The model looks at PAST words to predict the NEXT word
(the NEXT word with the highest P()).
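A tiny sketch of this MLE estimate for a bigram model (K = 2) on a made-up corpus, counting how often each word follows a one-word history:

```python
from collections import Counter

# A made-up toy corpus.
corpus = ("thou shalt not bear false witness "
          "thou shalt not kill thou shalt love").split()

bigram_counts = Counter(zip(corpus, corpus[1:]))   # Count(w_{i-1} w_i)
history_counts = Counter(corpus[:-1])              # Count(w_{i-1})

def p(next_word, prev_word):
    # MLE: Count(prev_word next_word) / Count(prev_word)
    return bigram_counts[(prev_word, next_word)] / history_counts[prev_word]

print(p("not", "shalt"))    # 2/3: "shalt" is followed by "not" in 2 of its 3 occurrences
print(p("love", "shalt"))   # 1/3
```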
105
N-gram Language Models: Prediction
Given:
S: word_t-N ... word_t-2 word_t-1 _____
[Diagram: the words word_t-N, word_t-N+1, ..., word_t-2, word_t-1 are the Input to the Model; the Output is the predicted word_t]
106
N-gram Language Models: Prediction
Given:
S: word_t-N ... word_t-2 word_t-1 _____
[Diagram: the same setup, with the input words viewed as Features and word_t as the Prediction]
107
N-gram Language Models: Prediction
Given (N = 2):
S: thou shalt _____
[Diagram: "thou" and "shalt" are the Input to the Model; the Output assigns a probability to every word in the vocabulary, e.g.:
anchor: P(S) = 0.0001
...
not: P(S) = 0.8
...
zebra: P(S) = 0.0002]
108
N-gram Language Models: Prediction
Given (N = 2):
S: thou shalt not _____
[Diagram: "shalt" and "not" are the Input to the Model; the Output assigns a probability to every word in the vocabulary, e.g.:
anchor: P(S) = 0.0001
...
bear: P(S) = 0.45
...
zebra: P(S) = 0.0002]
109
N-gram Language Models: Prediction
Given (N = 2):
S: thou shalt not bear _____
[Diagram: "not" and "bear" are the Input to the Model; the Output assigns a probability to every word in the vocabulary, e.g.:
anchor: P(S) = 0.0001
...
false: P(S) = 0.4
...
zebra: P(S) = 0.0002]
110
N-gram Language Models: Prediction
Given (N = 2):
S: thou shalt not bear false _____
[Diagram: "bear" and "false" are the Input to the Model; the Output assigns a probability to every word in the vocabulary, e.g.:
anchor: P(S) = 0.0001
...
witness: P(S) = 0.75
...
zebra: P(S) = 0.0002]
111
Trained Language Models: Prediction
Given input and a model (word embeddings):
S: thou shalt _____
[Diagram: the input words "thou" and "shalt" are processed in three steps:
1. Look up the word embeddings
2. Calculate predictions
3. Project to the output vocabulary
The Output assigns a probability to every word in the vocabulary, e.g.:
anchor: P(S) = 0.0001
...
not: P(S) = 0.8
...
zebra: P(S) = 0.0002]
112
N-gram Language Models
An N-gram language model will handle cases such as:
Tomorrow is ______
but not:
Tomorrow is ______ to be
where:
the words around the blank are the context words
the word to be predicted (the blank) is the target word
113
N-gram Language Models
An N-gram language model will handle cases such as:
Tomorrow is ______
Tomorrow is ______ to be
(the words around the blank are the context words)
114
Predicting the Missing Word
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be
115
Word2Vec: Idea
116
Word2Vec: Idea
Instead of counting how often each word w occurs near
"apricot"
Train a classifier on a binary prediction task:
Is w likely to show up near "apricot"?
We don’t actually care about this task
but we'll take the learned classifier weights as the word
embeddings
Use self-supervision:
A word c that occurs near “apricot” in the corpus acts as the
gold "correct answer" for supervised learning
No need for human labels
117
Available Tools
Word2vec (Mikolov et al)
https://code.google.com/archive/p/word2vec/
118
Word2Vec: the Approach
1. Treat the target word t and a neighboring context
word c as positive examples.
2. Randomly sample other words in the lexicon to
get negative examples
3. Use logistic regression to train a classifier to
distinguish those two cases
4. Use the learned classifier weights as the
embeddings
119
Word2Vec: the Approach
Given the set of positive and negative training
instances, and an initial set of embedding vectors, the
goal is to adjust the embeddings so that target words are
more similar to their positive context words and less
similar to the negative examples.
121
Sliding Window
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be
122
Sliding Window
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be
123
Sliding Window
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be
124
Predicting the Missing Word
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be
125
CBOW Word2Vec
Given (window size 2):
wordt-N ... wordt-2 wordt-1 _____ wordt+1 wordt+2 ... wordt+N
[Diagram (CBOW): the embeddings of the context words word_t-2, word_t-1, word_t+1, word_t+2 are summed and used to predict the target word_t]
126
Skip Gram Word2Vec
Given (window size 2):
wordt-N ... wordt-2 wordt-1 _____ wordt+1 wordt+2 ... wordt+N
[Diagram (Skip-gram): the embedding of the target word_t is used to predict each of the context words word_t-2, word_t-1, word_t+1, word_t+2]
127
Skip Gram Word2Vec
Predict context given target word:
thou shalt _not_ bear false witness
[Diagram: the target word "not" is used to predict each of its context words: thou, shalt, bear, false]
128
Skip Gram Word2Vec
Predict context given target word:
thou shalt _not_ bear false witness
[Diagram: for the target word "not", the classifier scores each context word:
P(+ | not, thou),  P(- | not, thou)
P(+ | not, shalt), P(- | not, shalt)
P(+ | not, bear),  P(- | not, bear)
P(+ | not, false), P(- | not, false)]
129
Word2Vec: the Approach
1. Treat the target word t and a neighboring context
word c as positive examples.
2. Randomly sample other words in the lexicon to
get negative examples
3. Use logistic regression to train a classifier to
distinguish those two cases
4. Use the learned classifier weights as the
embeddings
130
Word2Vec: the Approach
Given the set of positive and negative training
instances, and an initial set of embedding vectors, the
goal is to adjust the embeddings so that target words are
more similar to their positive context words and less
similar to the negative examples.
132
Word2Vec: the Approach
133
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given a training
sentence with a target word w and its L context words c1 ... c4.
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
135
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
136
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
137
Target and Context Embeddings
[Diagram: two embedding matrices, the Target matrix and the Context matrix, each with one d-dimensional row per vocabulary word (aardvark, ..., not, shalt, thou, ..., zebra); each matrix is |V| x d (vocabulary size |V|, embedding size d)]
138
Intuition: Target & Context Similar
[Diagram: the Target and Context embedding matrices (|V| x d each), as on the previous slide]
139
Cosine Similarity Visualization
[Diagram: the target embedding w (the row for "not" in the Target matrix) and the context embedding c (the row for "shalt" in the Context matrix); target and context are similar when the dot product c · w is high]
141
Intuition: Target & Context Similar
[Diagram: the same two matrices; the dot product of the target row w ("not") and the context row c ("shalt") gives Similarity(c, w) = c · w]
142
Intuition: Target & Context Similar
[Diagram: as above, Similarity(c, w) = c · w. Note: this similarity score is NOT a probability, though!]
143
Intuition: Target & Context Similar
[Diagram: as above. To turn the similarity c · w into a probability, use the sigmoid function!]
144
Similarity Probability
P(+ | w, c) = σ(c · w) = 1 / (1 + exp(-c · w))
P(- | w, c) = 1 - P(+ | w, c) = σ(-c · w) = 1 / (1 + exp(c · w))
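In code (a small sketch with made-up embedding values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.2, 0.3])      # target embedding (made-up values)
c = np.array([0.4, -0.9, 0.1])      # context embedding (made-up values)

p_pos = sigmoid(c @ w)              # P(+ | w, c)
p_neg = sigmoid(-(c @ w))           # P(- | w, c) = 1 - P(+ | w, c)
print(p_pos, p_neg, p_pos + p_neg)  # the two probabilities sum to 1
```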
145
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
P(+ | w, c1) P(+ | w, c2) P(+ | w, c3) P(+ | w, c4)
P(- | w, c1) P(- | w, c2) P(- | w, c3) P(- | w, c4)
146
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
P(+ | w, c1) P(+ | w, c2) P(+ | w, c3) P(+ | w, c4)
P(- | w, c1) P(- | w, c2) P(- | w, c3) P(- | w, c4)
147
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
P(+ | w, c1) P(+ | w, c2) P(+ | w, c3) P(+ | w, c4)
P(- | w, c1) P(- | w, c2) P(- | w, c3) P(- | w, c4)
In general (treating the context words as independent):
P(+ | w, c1:L) = Π (i = 1 to L) σ(ci · w)
148
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
P(+ | w, c1) P(+ | w, c2) P(+ | w, c3) P(+ | w, c4)
P(- | w, c1) P(- | w, c2) P(- | w, c3) P(- | w, c4)
149
Skip Gram Classifier: Summary
A probabilistic classifier, given
a test target word w
its context window of L words c1:L
Estimates probability that w occurs in this window
based on similarity of w (embeddings) to c1:L
(embeddings).
151
Word2Vec: the Approach
1. Treat the target word t and a neighboring context
word c as positive examples.
2. Randomly sample other words in the lexicon to
get negative examples
3. Use logistic regression to train a classifier to
distinguish those two cases
4. Use the learned classifier weights as the
embeddings
152
Word2Vec: Training
Assume a +/- 2 (L = 4) word window, given training
sentence:
153
Word2Vec: the Approach
Given the set of positive and negative training
instances, and an initial set of embedding vectors, the
goal is to adjust the embeddings so that target words are
more similar to their positive context words and less
similar to the negative examples.
155
Classifier: Learning Process
How to learn?
use stochastic gradient descent
156
Gradient Descent: Single Step
157
Loss Function Derivatives
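The formulas behind this slide can be sketched as follows: the standard skip-gram negative-sampling loss for one target word w, one positive context c_pos, and k negative samples c_neg_1 ... c_neg_k, together with its derivatives (a reconstruction of the usual formulation, not necessarily the exact slide content):

```latex
% Skip-gram negative-sampling loss and its gradients (reconstruction).
L_{CE} = -\Big[\, \log \sigma(c_{pos} \cdot w) \;+\; \sum_{i=1}^{k} \log \sigma(-c_{neg_i} \cdot w) \Big]

\frac{\partial L_{CE}}{\partial c_{pos}}   = \big(\sigma(c_{pos} \cdot w) - 1\big)\, w
\qquad
\frac{\partial L_{CE}}{\partial c_{neg_i}} = \sigma(c_{neg_i} \cdot w)\, w

\frac{\partial L_{CE}}{\partial w} = \big(\sigma(c_{pos} \cdot w) - 1\big)\, c_{pos}
  \;+\; \sum_{i=1}^{k} \sigma(c_{neg_i} \cdot w)\, c_{neg_i}
```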
158
Gradient Descent: Updates
Start with randomly initialized W and C matrices
[Diagram: the Target (W) and Context (C) embedding matrices, each |V| x d (vocabulary size |V|, embedding size d), with one randomly initialized d-dimensional row per vocabulary word]
159
Gradient Descent: Updates
... then incrementally update the embeddings using the gradient descent rule:
θ ← θ - η * ∂Loss/∂θ     where η is the learning rate
160
Skip Gram Word2Vec: Summary
Start with |V| random d-dimensional vectors as initial
embeddings
Train a classifier based on embedding similarity
Take a corpus and take pairs of words that co-occur as positive
examples
Take pairs of words that don't co-occur as negative examples
Train the classifier to distinguish these by slowly adjusting all
the embeddings to improve the classifier performance
Throw away the classifier code and keep the embeddings (a compact sketch of the whole recipe follows below).
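Putting the recipe together, here is a compact, self-contained sketch of skip-gram training with negative sampling on a toy corpus (my own illustration, not reference code; the window size, embedding size, number of negatives, and learning rate are arbitrary choices):

```python
import numpy as np
rng = np.random.default_rng(0)

corpus = ("thou shalt not bear false witness "
          "thou shalt not kill thou shalt love thy neighbour").split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, d, window, k, lr = len(vocab), 8, 2, 2, 0.05   # arbitrary small settings

# Start with |V| random d-dimensional vectors as initial embeddings.
W = rng.normal(scale=0.1, size=(V, d))   # target embeddings
C = rng.normal(scale=0.1, size=(V, d))   # context embeddings

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(200):
    for t, word in enumerate(corpus):
        wi = idx[word]
        for j in range(max(0, t - window), min(len(corpus), t + window + 1)):
            if j == t:
                continue
            cpos = idx[corpus[j]]                # positive example: a co-occurring pair
            cnegs = rng.integers(0, V, size=k)   # negative examples: random words
                                                 # (a real implementation avoids true neighbors)
            grad_w = (sigmoid(C[cpos] @ W[wi]) - 1.0) * C[cpos]
            C[cpos] += lr * (1.0 - sigmoid(C[cpos] @ W[wi])) * W[wi]
            for cneg in cnegs:
                grad_w += sigmoid(C[cneg] @ W[wi]) * C[cneg]
                C[cneg] -= lr * sigmoid(C[cneg] @ W[wi]) * W[wi]
            W[wi] -= lr * grad_w                 # adjust the target embedding

# Throw away the classifier machinery; keep the embeddings.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print("thou ~ shalt:", round(cosine(W[idx["thou"]], W[idx["shalt"]]), 3))
print("thou ~ false:", round(cosine(W[idx["thou"]], W[idx["false"]]), 3))
```

In practice one would train on a large corpus with an optimized implementation such as the word2vec tool linked earlier.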
161
Sliding Window Size
Small windows (+/- 2): the nearest words are
syntactically similar words in the same taxonomy
Hogwarts' nearest neighbors are other fictional
schools:
Sunnydale, Evernight, Blandings