08 Exercises Word2vec MUD SOLVED
Exercise 1
Let the word vectors of the N previous words be x_1, x_2, …, x_N, each a column vector
of dimension D, and let y be the one-hot vector for the current word. The network is
specified by the following equations:
x = [x_1; x_2; …; x_N]
h = tanh(Wx + b)
ŷ = softmax(Uh + d)
J = CE(y, ŷ) = −∑_i y_i log(ŷ_i)
where x ∈ ℝ^{ND}, W ∈ ℝ^{H×ND}, b ∈ ℝ^H, h ∈ ℝ^H, U ∈ ℝ^{|V|×H}, d ∈ ℝ^{|V|}, ŷ ∈ ℝ^{|V|}.
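As a concrete illustration of these equations, here is a minimal NumPy sketch of the forward pass; the dimensions N, D, H, the vocabulary size, the random parameters, and the target index are all made-up example values, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H, V = 3, 4, 5, 10            # context size, embedding dim, hidden dim, vocab size (example values)

x = rng.normal(size=(N * D, 1))     # concatenation of the N previous word vectors
W = rng.normal(size=(H, N * D))
b = rng.normal(size=(H, 1))
U = rng.normal(size=(V, H))
d = rng.normal(size=(V, 1))

h = np.tanh(W @ x + b)              # hidden layer
scores = U @ h + d
y_hat = np.exp(scores - scores.max())
y_hat /= y_hat.sum()                # softmax over the vocabulary

y = np.zeros((V, 1)); y[7] = 1.0    # one-hot target for the current word (index 7 is arbitrary)
J = -np.sum(y * np.log(y_hat))      # cross-entropy loss
print(J)
```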
SOLUTION
1a. CBOW is trained to predict a center word given a context window that extends
on both sides, while the word vectors learned by the NNLM do not capture the context to the
right of the word.
The CBOW model simply uses the sum of the context word vectors, while the NNLM
combines context words non-linearly. Thus the NNLM can learn to treat “not good to”
differently from “good to not”, etc.
The complexity can be reduced by using negative sampling to approximate the softmax, or
by using the hierarchical softmax.
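As a rough sketch of the negative-sampling idea (the function below is illustrative, with made-up vectors, not the exact word2vec implementation), the expensive softmax over the whole vocabulary is replaced by a logistic loss over the observed (center, context) pair plus a few sampled negative words:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_loss(u_context, v_center, neg_vectors):
    """Negative-sampling objective for one (center, context) pair.

    u_context, v_center: (D,) vectors; neg_vectors: (K, D) sampled negative context vectors.
    """
    pos = -np.log(sigmoid(u_context @ v_center))            # push the true pair together
    neg = -np.sum(np.log(sigmoid(-neg_vectors @ v_center)))  # push sampled negatives apart
    return pos + neg

rng = np.random.default_rng(0)
D, K = 50, 5                         # embedding dim and number of negatives (example values)
loss = neg_sampling_loss(rng.normal(size=D), rng.normal(size=D), rng.normal(size=(K, D)))
print(loss)
```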
Exercise 2
2a. We know that dense word vectors like the ones obtained with word2vec or GloVe
have many advantages over using sparse one-hot word vectors. Name a few.
2b. Also name at least two disadvantages of sparse vectors that are not solved by dense
vectors.
SOLUTION
2a. Models using dense word vectors generalize better to rare words than those using
sparse vectors.
Dense word vectors encode similarity between words, while sparse vectors do not.
Dense word vectors are easier to include as features in machine learning systems than
sparse vectors.
2b. Just like sparse representations, word2vec or GloVe do not have representations for
unseen words and hence do not help in generalization.
Also, there is only one representation per word, so polysemy is not solved.
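To make the similarity point from 2a concrete, the toy sketch below compares cosine similarities: any two distinct one-hot vectors always have similarity 0, while dense vectors can place related words close together. The dense vectors here are invented toy values, not real embeddings.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot (sparse) vectors: every pair of distinct words has similarity 0.
hotel = np.zeros(10); hotel[2] = 1.0
motel = np.zeros(10); motel[5] = 1.0
print(cosine(hotel, motel))        # 0.0 -- no notion of relatedness

# Toy dense vectors: related words can have high similarity.
hotel_d = np.array([0.9, 0.1, 0.4])
motel_d = np.array([0.8, 0.2, 0.5])
cat_d   = np.array([-0.7, 0.9, 0.0])
print(cosine(hotel_d, motel_d))    # close to 1
print(cosine(hotel_d, cat_d))      # much lower
```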
Exercise 3
Consider the following neural architecture. What is it learning? Can you explain exactly
which NLP task it is being trained for?
SOLUTION
The architecture is predicting the 4th word given the 3 previous words. This is an
architecture for the task of language modeling, predicting the probability of sequences
of words.
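To make "predicting the probability of sequences of words" concrete, a language model can score a whole sentence with the chain rule, multiplying the model's probability of each word given its three-word history. The next_word_probs argument below is a hypothetical stand-in for the trained network; the uniform toy model is only there so the snippet runs.

```python
import numpy as np

def sequence_log_prob(words, next_word_probs, context_size=3):
    """Sum of log P(w_t | w_{t-3}, w_{t-2}, w_{t-1}) over the sentence.

    next_word_probs(context) must return a dict mapping candidate words to
    probabilities; here it is assumed to wrap a trained next-word model.
    """
    total = 0.0
    for t in range(context_size, len(words)):
        context = words[t - context_size:t]
        probs = next_word_probs(context)
        total += np.log(probs[words[t]])
    return total

# Toy stand-in model: uniform over a tiny vocabulary (for illustration only).
vocab = ["the", "cat", "sat", "on", "mat"]
uniform = lambda context: {w: 1.0 / len(vocab) for w in vocab}
print(sequence_log_prob(["the", "cat", "sat", "on", "the", "mat"], uniform))
```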
Exercise 4
Developers A and B have each used the Word2Vec algorithm to obtain word embeddings for the same
vocabulary of words V.
In particular, developer A has obtained `context' vectors u_w^A and `center' vectors v_w^A for every
w in V, and developer B has obtained `context' vectors u_w^B and `center' vectors v_w^B for every
w in V.
For every pair of words w, w' in V, the inner product is the same in both models:
(u_w^A)^T v_{w'}^A = (u_w^B)^T v_{w'}^B. Does this mean that, for every word w in V, v_w^A = v_w^B? Discuss
your response.
SOLUTION
No. The Word2Vec model only optimizes the inner product between
word vectors for words in the same context.
One can rotate all word vectors by the same amount and the inner products will
still be the same. Alternatively, one can scale the set of context vectors by a
factor of k and the set of center vectors by a factor of 1/k. Such transformations
preserve the inner products, but the sets of vectors could be different.
Note that degenerate solutions (all-zero vectors, etc.) are discouraged.
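A small numeric check of this argument, using random made-up vectors: rotating all vectors by the same orthogonal matrix Q, and scaling the context vectors by k while scaling the center vectors by 1/k, leaves every inner product unchanged even though the vectors themselves differ.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, k = 6, 4, 3.0                            # vocab size, dimension, scale factor (example values)

U  = rng.normal(size=(V, D))                   # developer A's context vectors (one per row)
Vc = rng.normal(size=(V, D))                   # developer A's center vectors (one per row)

Q, _ = np.linalg.qr(rng.normal(size=(D, D)))   # random orthogonal (rotation/reflection) matrix

# Developer B's vectors: rotate everything by Q, scale contexts by k and centers by 1/k.
U_b = (U  @ Q) * k
V_b = (Vc @ Q) / k

print(np.allclose(U @ Vc.T, U_b @ V_b.T))      # True: every inner product u_w . v_w' agrees
print(np.allclose(Vc, V_b))                    # False: the center vectors themselves differ
```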