Vector Semantics 5
CS 440/ECE 448, Fall 2020
Margaret Fleck
To get good results from word2vec, the basic algorithm needs to be tweaked a bit. Similar tweaks are frequently required by other, related algorithms.
Tweak 1: Word2vec uses more negative training pairs than positive pairs, by a
factor of 2 to 20 (depending on the amount of training data available).
You might think that the positive and negative pairs should be roughly balanced. However,
that apparently doesn't work. One reason may be that the positive context words are
definite indications of similarity, whereas the negative words are random choices that may
be more neutral than actively negative.
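To make Tweak 1 concrete, here is a minimal Python sketch of generating k negative pairs for every positive pair. The function name, the uniform random choice of negative words, and the default k=5 are illustrative assumptions, not the actual word2vec implementation.

    import random

    def make_training_pairs(tokens, vocab, window=2, k=5):
        # For each focus word, emit one positive pair per in-window context
        # word, plus k negative pairs built from randomly chosen words.
        pairs = []
        for i, focus in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                pairs.append((focus, tokens[j], 1))    # positive pair
                for _ in range(k):                     # k negatives per positive
                    pairs.append((focus, random.choice(vocab), 0))
        return pairs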
Tweak 2: Positive training examples are weighted by 1/m, where m is the distance
between the focus and context word. That is, adjacent context words count for
more than words separated by a small gap.
The closer two words are, the more likely their relationship is strong. This is a common
heuristic in similar algorithms.
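Here is one way the 1/m weighting could be realized in code: a sketch that simply attaches the weight to each positive pair. All names are invented for illustration; the real implementation achieves a similar effect internally rather than storing explicit weights.

    def weighted_positive_pairs(tokens, window=5):
        # Each positive (focus, context) pair gets weight 1/m, where m is the
        # distance between the two words, so adjacent words count the most.
        weighted = []
        for i, focus in enumerate(tokens):
            for m in range(1, window + 1):
                for j in (i - m, i + m):
                    if 0 <= j < len(tokens):
                        weighted.append((focus, tokens[j], 1.0 / m))
        return weighted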
Tweak 3: Negative context words aren't chosen with their raw probabilities in the
training data. Instead, each word's probability is raised to the power α, and the
results are renormalized.

α is usually set to 0.75. But to see how this brings up the probabilities of rare words
compared to the common ones, it's a bit easier to look at α = 0.5, i.e. we're computing
the square root of the input. In the table below, you can see that large probabilities stay
large, but very small ones are increased by quite a lot. After this transformation, you need to
renormalize the values so that they again add up to one.
      x    x^0.75    sqrt(x)
    .99      .992      .995
     .9      .924      .949
     .1      .178      .316
    .01      .032      .1
This trick can also be used on PMI values (e.g. if using the methods from the previous
lecture).
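A small worked sketch of the α smoothing, assuming we start from raw word counts: raise each count to the power α, then renormalize. The function name and the toy corpus at the bottom are made up just to show the rare word gaining probability mass.

    from collections import Counter

    def smoothed_distribution(tokens, alpha=0.75):
        # Raise each raw count to the power alpha, then renormalize so the
        # smoothed values again form a probability distribution.
        counts = Counter(tokens)
        smoothed = {w: c ** alpha for w, c in counts.items()}
        total = sum(smoothed.values())
        return {w: s / total for w, s in smoothed.items()}

    # Toy corpus: raw probabilities are .99 and .01; the smoothed ones come
    # out to roughly .97 and .03, so the rare word gains probability mass.
    print(smoothed_distribution(["the"] * 99 + ["aardvark"]))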
Deletion, subsampling
Ah, but apparently the word2vec designers were still unhappy with the treatment of very
common and very rare words. So, when word2vec first reads the input training data, it
modifies the data as follows:

- Very rare words are deleted entirely.
- Very common words are subsampled: a random fraction of their occurrences is removed,
  with more aggressive removal for more frequent words.
This improves the balance between rare and common words. Also, deleting a word brings
the other words closer together, which improves the effectiveness of our context windows.
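A minimal sketch of this preprocessing step. The specific subsampling rule (discard an occurrence with probability 1 - sqrt(t/f(w)), with t around 10^-5) and the min_count threshold are the values usually quoted for word2vec, used here as assumptions; the notes above don't give exact settings.

    import math
    import random
    from collections import Counter

    def delete_and_subsample(tokens, min_count=5, t=1e-5):
        # Delete very rare words entirely; randomly discard occurrences of
        # very common words with probability 1 - sqrt(t / f(w)), where f(w)
        # is the word's relative frequency. min_count and t are assumed values.
        counts = Counter(tokens)
        total = len(tokens)
        kept = []
        for w in tokens:
            if counts[w] < min_count:
                continue                                  # rare word: delete
            f = counts[w] / total
            p_discard = max(0.0, 1.0 - math.sqrt(t / f))
            if random.random() >= p_discard:              # keep the rest
                kept.append(w)
        return kept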
Evaluation
The 2014 version of word2vec uses about 1 billion words of training data to build embeddings
for the basic task. For the word analogy tasks, the authors used an embedding with 1000
dimensions and about 33 billion words of training data. Performance on word analogies is
about 66%.
By comparison, children hear about 2-10 million words per year. Assuming the high end of
that range of estimates, they have heard about 170 million words by the time they take the SAT.
So the algorithm is performing well, but it still seems to be underperforming given the
amount of data it consumes.
A more recent embedding method, BERT Large, is trained using a 24-layer network with
340M parameters. This somewhat improves performance, but the training apparently can't be
reproduced on a standard GPU. Again, a direction for future research is to figure out why even
decent performance seems to require so much training data and compute power.