CS 224D: Deep Learning For NLP: Lecture Notes: Part II
Spring 2016
…representations since they are used in downstream subsystems (such as deep neural networks). To do this in practice, we will need to tune many hyperparameters in the Word2Vec subsystem (such as the dimension of the word vector representation). While the idealistic approach is to retrain the entire system after any parametric changes in the Word2Vec subsystem, this is impractical from an engineering standpoint because the machine learning system (in step 3) is typically a deep neural network with millions of parameters that takes very long to train. In such a situation, we would want to come up with a simple intrinsic evaluation technique which can provide a measure of "goodness" of the word-to-word-vector subsystem. Obviously, a requirement is that the intrinsic evaluation has a positive correlation with the final task performance.

Intrinsic evaluation:
• Fast to compute performance
• Helps understand subsystem
• Needs positive correlation with real task to determine usefulness
A popular example of such an intrinsic evaluation is the word vector analogy task: given an incomplete analogy a : b :: c : ?, we return the word d whose vector maximizes the cosine similarity with x_b − x_a + x_c:

$$d = \arg\max_i \frac{(x_b - x_a + x_c)^T x_i}{\lVert x_b - x_a + x_c \rVert}$$
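As a concrete illustration, here is a minimal NumPy sketch of this argmax; the names E, word2idx, and idx2word are hypothetical stand-ins for a word-vector matrix and its vocabulary mappings, and the rows of E are assumed to be unit-normalized so that the dot product is a cosine similarity.

```python
import numpy as np

def analogy(a, b, c, E, word2idx, idx2word):
    """Return the word d that best completes the analogy a : b :: c : d.

    E        -- (|V|, d) matrix of word vectors, rows assumed unit-normalized
    word2idx -- dict mapping a word to its row index in E
    idx2word -- list mapping a row index back to its word
    """
    # Query vector x_b - x_a + x_c, normalized as in the formula above
    q = E[word2idx[b]] - E[word2idx[a]] + E[word2idx[c]]
    q /= np.linalg.norm(q)

    # Cosine similarity of the query against every word vector
    scores = E @ q

    # Exclude the three input words so the answer is a genuinely new word
    for w in (a, b, c):
        scores[word2idx[w]] = -np.inf

    return idx2word[int(np.argmax(scores))]
```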
Thus, word vectors should not be retrained if the training data set
is small. If the training set is large, retraining may improve perfor-
mance.
$$-\sum_{i=1}^{N} \log \frac{\exp(W_{k(i)} \cdot x^{(i)})}{\sum_{c=1}^{C} \exp(W_c \cdot x^{(i)})}$$
The only difference above is that $k(i)$ is now a function that returns the correct class index for example $x^{(i)}$.
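The sketch below is a minimal NumPy rendering of this loss, assuming a (C, d) weight matrix W, an (N, d) matrix X whose rows are the input word vectors x^{(i)}, and an integer array y holding the correct class indices k(i); all names are illustrative.

```python
import numpy as np

def softmax_ce_loss(W, X, y):
    """Compute -sum_i log softmax(W x^{(i)})[k(i)].

    W -- (C, d) weight matrix, one row per class
    X -- (N, d) matrix of input word vectors
    y -- (N,) integer array of correct class indices
    """
    scores = X @ W.T                              # (N, C) logits W_c . x^{(i)}
    scores -= scores.max(axis=1, keepdims=True)   # shift for numerical stability
    log_Z = np.log(np.exp(scores).sum(axis=1, keepdims=True))
    log_probs = scores - log_Z                    # (N, C) log-softmax
    return -log_probs[np.arange(len(y)), y].sum()
```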
Let us now try to estimate the number of parameters that would be updated if we consider training both the model weights (W) and the word vectors (x). We know that a simple linear decision boundary
would require a model that takes in at least one d-dimensional input
word vector and produces a distribution over C classes. Thus, to
update the model weights, we would be updating C · d parameters.
If we update the word vectors for every word in the vocabulary V as well, then we would be updating as many as |V| word vectors, each of which is d-dimensional. Thus, the total number of parameters would be as many as C · d + |V| · d for a simple linear classifier:
$$\nabla_\theta J(\theta) =
\begin{bmatrix}
\nabla_{W_{\cdot 1}} \\
\vdots \\
\nabla_{W_{\cdot d}} \\
\nabla_{x_{\text{aardvark}}} \\
\vdots \\
\nabla_{x_{\text{zebra}}}
\end{bmatrix}$$
This is an extremely large number of parameters considering how simple the model’s decision boundary is; such a large number of parameters is highly prone to overfitting.
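To make the scale concrete (the numbers are purely illustrative), with d = 300-dimensional word vectors, a vocabulary of |V| = 100,000 words, and C = 5 classes:

$$C \cdot d + |V| \cdot d = 5 \cdot 300 + 100{,}000 \cdot 300 \approx 3 \times 10^{7},$$

i.e., roughly thirty million parameters for what is still only a linear classifier.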
To reduce overfitting risk, we introduce a regularization term which poses the Bayesian belief that the parameters (θ) should be small in magnitude (i.e., close to zero):
$$-\sum_{i=1}^{N} \log \frac{\exp(W_{k(i)} \cdot x^{(i)})}{\sum_{c=1}^{C} \exp(W_c \cdot x^{(i)})} + \lambda \sum_{k=1}^{C \cdot d + |V| \cdot d} \theta_k^2$$
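Continuing the earlier loss sketch (same hypothetical names, with E holding the |V| trainable word vectors), the regularizer simply adds λ times the sum of squares of all C · d + |V| · d parameters:

```python
def regularized_loss(W, X, y, E, lam):
    """Cross-entropy loss plus an L2 penalty on all C*d + |V|*d parameters.

    E   -- (|V|, d) matrix of word vectors, also being trained
    lam -- regularization strength lambda
    """
    data_loss = softmax_ce_loss(W, X, y)                # sketch defined above
    reg_loss = lam * (np.sum(W ** 2) + np.sum(E ** 2))  # lambda * sum_k theta_k^2
    return data_loss + reg_loss
```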
In practice, we usually classify the center word using a window of words around it, stacking the word vectors in the window into a single input vector. For a window of size 2 on each side:

$$x^{(i)}_{\text{window}} =
\begin{bmatrix}
x^{(i-2)} \\
x^{(i-1)} \\
x^{(i)} \\
x^{(i+1)} \\
x^{(i+2)}
\end{bmatrix}$$
As a result, when we evaluate the gradient of the loss with respect
to the words, we will receive gradients for the word vectors:
$$\delta_{\text{window}} =
\begin{bmatrix}
\nabla_{x^{(i-2)}} \\
\nabla_{x^{(i-1)}} \\
\nabla_{x^{(i)}} \\
\nabla_{x^{(i+1)}} \\
\nabla_{x^{(i+2)}}
\end{bmatrix}$$
In implementation, this gradient will of course need to be distributed back to update the corresponding word vectors.
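A minimal NumPy sketch of this bookkeeping step is shown below; delta_window, window_idx, and grad_E are hypothetical names for the concatenated window gradient, the five vocabulary indices in the window, and a gradient accumulator for the word-vector matrix.

```python
import numpy as np

def distribute_window_grad(delta_window, window_idx, grad_E, d):
    """Scatter the window gradient back onto the individual word vectors.

    delta_window -- (5*d,) gradient w.r.t. the concatenated window vector
    window_idx   -- the five vocabulary indices [i-2, i-1, i, i+1, i+2]
    grad_E       -- (|V|, d) gradient accumulator for word vectors (updated in place)
    d            -- word vector dimension
    """
    for slot, word_id in enumerate(window_idx):
        # Each word in the window owns a contiguous d-dimensional slice
        grad_E[word_id] += delta_window[slot * d:(slot + 1) * d]
```

Accumulating with += (rather than assigning) handles the case where the same word appears more than once in the window.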