CS 224D: Deep Learning For NLP: Lecture Notes: Part I, Spring 2016
between words. With word vectors, we can quite easily encode this
ability in the vectors themselves (using distance measures such as
Jaccard, Cosine, Euclidean, etc).
2 Word Vectors
$$\frac{\sum_{i=1}^{k} \sigma_i}{\sum_{i=1}^{|V|} \sigma_i}$$
Applying SVD to X:
$$\underset{|V| \times |V|}{X} \;=\; \underset{|V| \times |V|}{\begin{bmatrix} | & | & \\ u_1 & u_2 & \cdots \\ | & | & \end{bmatrix}} \underset{|V| \times |V|}{\begin{bmatrix} \sigma_1 & 0 & \cdots \\ 0 & \sigma_2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}} \underset{|V| \times |V|}{\begin{bmatrix} - & v_1 & - \\ - & v_2 & - \\ & \vdots & \end{bmatrix}}$$

Reducing dimensionality by selecting the first k singular vectors:

$$\underset{|V| \times |V|}{\hat{X}} \;=\; \underset{|V| \times k}{\begin{bmatrix} | & | & \\ u_1 & u_2 & \cdots \\ | & | & \end{bmatrix}} \underset{k \times k}{\begin{bmatrix} \sigma_1 & 0 & \cdots \\ 0 & \sigma_2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}} \underset{k \times |V|}{\begin{bmatrix} - & v_1 & - \\ - & v_2 & - \\ & \vdots & \end{bmatrix}}$$
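As a concrete illustration, here is a minimal sketch (not from the notes) of this truncation on a toy co-occurrence matrix using numpy's SVD; the vocabulary, the counts, and the choice of the scaled left singular vectors as word vectors are assumptions made for the example. The last line computes the captured-variance ratio from the formula above.

```python
import numpy as np

vocab = ["the", "quick", "brown", "fox"]         # toy vocabulary, |V| = 4
X = np.array([[0, 2, 1, 1],                      # hypothetical co-occurrence counts
              [2, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

U, s, Vt = np.linalg.svd(X)                      # X = U diag(s) V^T

k = 2                                            # keep the first k singular vectors
X_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation of X
word_vectors = U[:, :k] * s[:k]                  # k-dimensional word vectors (one common choice)

variance_captured = s[:k].sum() / s.sum()        # ratio of singular values kept, as above
print(f"variance captured by first {k} dimensions: {variance_captured:.2f}")
```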
Both of these methods give us word vectors that are more than
sufficient to encode semantic and syntactic (part of speech) informa-
tion but are associated with many other problems:
• The dimensions of the matrix change very often (new words are
added very frequently and corpus changes in size).
Let us step back and try a new approach. Instead of computing and
storing global information about some huge dataset (which might be
billions of sentences), we can try to create a model that will be able
to learn one iteration at a time and eventually be able to encode the
probability of a word given its context.
We can set up this probabilistic model of known and unknown parameters and take one training example at a time in order to learn just a little bit of information for the unknown parameters based on the input, the output of the model, and the desired output of the model.

Context of a word: The context of a word is the set of C surrounding words. For instance, the C = 2 context of the word "fox" in the sentence "The quick brown fox jumped over the lazy dog" is {"quick", "brown", "jumped", "over"}.
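To make the definition concrete, here is a small sketch of extracting such a context window; the whitespace tokenization and the helper name `context` are just for illustration.

```python
def context(tokens, index, C=2):
    """Return the C words on each side of tokens[index], excluding the word itself."""
    left = tokens[max(0, index - C):index]
    right = tokens[index + 1:index + 1 + C]
    return left + right

tokens = "The quick brown fox jumped over the lazy dog".split()
print(context(tokens, tokens.index("fox"), C=2))   # ['quick', 'brown', 'jumped', 'over']
```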
At every iteration we run our model, evaluate the errors, and
follow an update rule that has some notion of penalizing the model
parameters that caused the error. This idea is a very old one dating
back to 1986. We call this method "backpropagating" the errors (see Rumelhart, D. E., Hinton, G. E., and Williams, R. J., "Learning representations by back-propagating errors", 1988).
$$P(w_1, w_2, \cdots, w_n)$$
We can take the unary language model approach and break apart this probability by assuming the word occurrences are completely independent. This gives the unigram model:

$$P(w_1, w_2, \cdots, w_n) = \prod_{i=1}^{n} P(w_i)$$

However, we know this is a bit ludicrous because we know the next word is highly contingent upon the previous sequence of words.
And the silly sentence example might actually score highly. So per-
haps we let the probability of the sequence depend on the pairwise
probability of a word in the sequence and the word next to it. We call
this the bigram model and represent it as:
$$P(w_1, w_2, \cdots, w_n) = \prod_{i=2}^{n} P(w_i \mid w_{i-1})$$

Again this is certainly a bit naive since we are only concerning ourselves with pairs of neighboring words rather than evaluating a
whole sentence, but as we will see, this representation gets us pretty
far along. Note that in the Word-Word Matrix with a context of size 1, we can basically learn these pairwise probabilities. But again, this would
require computing and storing global information about a massive
dataset.
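As a rough illustration (not part of the notes), the following sketch estimates unigram and bigram probabilities by maximum likelihood from raw counts on a hypothetical two-sentence corpus:

```python
from collections import Counter

corpus = [["the", "cat", "jumped", "over", "the", "puddle"],
          ["the", "dog", "jumped", "over", "the", "cat"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:]))
total = sum(unigrams.values())

def p_unigram(sentence):
    """P(w1, ..., wn) under the unigram (full independence) assumption."""
    p = 1.0
    for w in sentence:
        p *= unigrams[w] / total
    return p

def p_bigram(sentence):
    """P(w1, ..., wn) under the bigram assumption: product of P(wi | wi-1)."""
    p = 1.0
    for w1, w2 in zip(sentence, sentence[1:]):
        p *= bigrams[(w1, w2)] / unigrams[w1]
    return p

sent = ["the", "cat", "jumped", "over", "the", "puddle"]
print(p_unigram(sent), p_bigram(sent))
```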
Now that we understand how we can think about a sequence of
tokens having a probability, let us observe some example models that
could learn these probabilities.
model. We denote this n × 1 vector as vi. Similarly, U is the output word matrix. The j-th row of U is an n-dimensional embedded vector for word wj when it is an output of the model. We denote this row of U as uj. Note that we do in fact learn two vectors for every word wi (i.e. input word vector vi and output word vector ui).

• U ∈ R^{|V|×n}: Output word matrix
• ui: i-th row of U, the output vector representation of word wi
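A quick sketch of these two matrices, with hypothetical sizes n = 5 and |V| = 4 (the random initialization is an assumption for the example, not something specified here):

```python
import numpy as np

n, vocab_size = 5, 4                       # embedding dimension and |V| (toy sizes)
V = np.random.randn(n, vocab_size) * 0.01  # input word matrix:  v_i is the i-th column
U = np.random.randn(vocab_size, n) * 0.01  # output word matrix: u_i is the i-th row

i = 2
v_i = V[:, i]                              # input vector for word w_i
u_i = U[i, :]                              # output vector for the same word w_i
```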
1. We generate our one-hot word vectors (x^{(c−m)}, . . . , x^{(c−1)}, x^{(c+1)}, . . . , x^{(c+m)}) for the input context of size m.
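A small sketch of this step (only step 1 of the model appears in this excerpt; the toy token ids and vocabulary size are made up):

```python
import numpy as np

vocab_size = 10                    # hypothetical |V|
word_index = [3, 7, 1, 4, 0]       # token ids of a toy sentence
c, m = 2, 2                        # center position and context size

def one_hot(i, size):
    x = np.zeros(size)
    x[i] = 1.0
    return x

context_positions = [j for j in range(c - m, c + m + 1) if j != c]
X_context = [one_hot(word_index[j], vocab_size) for j in context_positions]
```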
• objective function
• gradients
• update rules
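For the last item, a generic sketch of the kind of update rule meant here (a plain stochastic gradient descent step; the learning rate and parameter shapes are placeholders):

```python
import numpy as np

learning_rate = 0.025
theta = np.random.randn(8)             # stand-in for the model parameters (V and U together)
grad = np.random.randn(8)              # stand-in for the gradient of the objective w.r.t. theta
theta = theta - learning_rate * grad   # SGD step: move against the gradient
```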
probability that (w, c) did not come from the corpus data. First, let’s
model P( D = 1|w, c) with the sigmoid function:
$$P(D = 1 \mid w, c, \theta) = \frac{1}{1 + e^{-v_c^T v_w}}$$
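A tiny sketch of this probability; the vectors here are random placeholders for v_c and v_w:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 5
v_c = np.random.randn(n) * 0.01    # context vector v_c
v_w = np.random.randn(n) * 0.01    # word vector v_w
p_in_corpus = sigmoid(v_c @ v_w)   # P(D = 1 | w, c, theta)
```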
Now, we build a new objective function that tries to maximize the
probability of a word and context being in the corpus data if it in-
deed is, and maximize the probability of a word and context not
being in the corpus data if it indeed is not. We take a simple maxi-
mum likelihood approach of these two probabilities. (Here we take θ to be the parameters of the model; in our case they are V and U.)
$$\log \sigma(u_{c-m+j}^{T} \cdot v_c) + \sum_{k=1}^{K} \log \sigma(-\tilde{u}_k^{T} \cdot v_c)$$
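For a single observed output word and K sampled negative words, this term can be computed as in the sketch below; all vectors are random placeholders, and in practice the negative samples ũ_k would be drawn from a noise distribution over the vocabulary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, K = 5, 10
v_c = np.random.randn(n) * 0.01        # center (input) vector v_c
u_pos = np.random.randn(n) * 0.01      # output vector of the observed word u_{c-m+j}
U_neg = np.random.randn(K, n) * 0.01   # output vectors of the K negative samples

term = np.log(sigmoid(u_pos @ v_c)) + np.sum(np.log(sigmoid(-U_neg @ v_c)))
```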