
CS 440/ECE 448
Fall 2020 Vector Semantics 5
Margaret Fleck

To get good results from word2vec, the basic algorithm needs to be tweaked a bit. Similar tweaks are
frequently required by other, related algorithms.

Building the set of training examples


In the basic algorithm, we consider the input (focus) words one by one. For each focus word,
we extract all words within +/- k positions as positive context words. We also randomly
generate a set of negative context words. This produces a set of positive pairs (w,c) and a set
of negative pairs (w,c') that are used to update the embeddings of w, c, and c'.

Tweak 1: Word2vec uses more negative training pairs than positive pairs, by a
factor of between 2 and 20 (depending on the amount of training data available).

You might think that the positive and negative pairs should be roughly balanced. However,
that apparently doesn't work. One reason may be that the positive context words are
definite indications of similarity, whereas the negative words are random choices that may
be more neutral than actively negative.
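
As a rough illustration, here is a minimal Python sketch of how such training pairs might be built. The function name, the default values, and the uniform choice of negatives are illustrative assumptions, not word2vec's actual implementation (which draws negatives from the smoothed distribution described below).

import random

def make_training_pairs(tokens, vocab, k=2, neg_ratio=5):
    # Build (focus, context, label) triples: label 1 for real context
    # words, label 0 for randomly chosen negative words.  neg_ratio
    # controls how many negatives are drawn per positive pair.
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
            if j == i:
                continue
            pairs.append((w, tokens[j], 1))      # positive pair
            for _ in range(neg_ratio):
                c_neg = random.choice(vocab)     # uniform here, for simplicity
                pairs.append((w, c_neg, 0))      # negative pair
    return pairs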

Tweak 2: Positive training examples are weighted by 1/m, where m is the distance
between the focus and context word. That is, adjacent context words matter more
than words with a bit of separation.

The closer two words are, the more likely it is that their relationship is strong. This is a common
heuristic in similar algorithms.
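
For instance, here is a small sketch of the 1/m weighting; the function name and window size are made up for illustration.

def weighted_positive_pairs(tokens, k=2):
    # Each positive (focus, context) pair gets weight 1/m, where m is
    # the number of positions separating the two words.
    pairs = []
    for i, w in enumerate(tokens):
        for m in range(1, k + 1):
            if i - m >= 0:
                pairs.append((w, tokens[i - m], 1.0 / m))
            if i + m < len(tokens):
                pairs.append((w, tokens[i + m], 1.0 / m))
    return pairs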

Smoothing negative context counts


For a fixed focus word w, negative context words are picked with a probability based on
how often words occur in the training data. However, if we compute P(c) = count(c)/N (N is
total words in data), rare words aren't picked often enough as context words. So instead we
replace each raw count count(c) with count(c)^α. The probabilities used for selecting
negative training examples are computed from these smoothed counts.

α is usually set to 0.75. But to see how this brings up the probabilities of rare words
compared to the common ones, it's a bit easier if you look at α = 0.5, i.e. we're computing
the square root of the input. In the table below, you can see that large probabilities stay
large, but very small ones are increased by quite a lot. After this transformation, you need to
normalize the numbers so that the probabilities add up to one again.


x        x^0.75    √x
.99      .992      .995
.9       .924      .949
.1       .178      .316
.01      .032      .1
.0001    .001      .01

This trick can also be used on PMI values (e.g. if using the methods from the previous
lecture).
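
Here is a small sketch of the smoothing computation. The function name and the toy counts are just for illustration, and numpy is assumed only for convenience.

import numpy as np

def negative_sampling_probs(counts, alpha=0.75):
    # Replace each raw count with count**alpha, then renormalize so the
    # smoothed values form a probability distribution again.
    smoothed = np.asarray(counts, dtype=float) ** alpha
    return smoothed / smoothed.sum()

# Toy example: the rare word's share of the distribution goes up.
counts = [9900, 99, 1]
print(counts[2] / sum(counts))                         # raw:      0.0001
print(negative_sampling_probs(counts, alpha=0.75)[2])  # smoothed: ~0.001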

Deletion, subsampling
Ah, but apparently the word2vec authors were still unhappy with the treatment of very common and very rare
words. So, when it first reads the input training data, word2vec modifies the text as follows:

very rare words are deleted from the text, and


very common words are deleted with a probability that increases with how frequent
they are.

This improves the balance between rare and common words. Also, deleting a word brings
the other words closer together, which improves the effectiveness of our context windows.
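
A hedged sketch of this preprocessing step is below. The discard probability 1 - sqrt(t/f(w)) is the formula from the word2vec paper; the threshold t, the min_count cutoff, and the function name are typical illustrative choices, not something fixed by the lecture.

import math, random

def preprocess(tokens, counts, min_count=5, t=1e-5):
    # counts: dict mapping each word to its raw frequency in the corpus.
    total = sum(counts.values())
    kept = []
    for w in tokens:
        if counts[w] < min_count:            # delete very rare words outright
            continue
        f = counts[w] / total                # relative frequency of w
        p_discard = max(0.0, 1.0 - math.sqrt(t / f))
        if random.random() < p_discard:      # common words are dropped more often
            continue
        kept.append(w)
    return kept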

Evaluation
The 2014 version of word2vec uses 1 billion words of training data to build embeddings for the basic task.

For the word analogy tasks, they used an embedding with 1000 dimensions and about 33
billion words of training data. Performance on word analogies is about 66%.

By comparison: children hear about 2-10 million words per year. Assuming the high end of
that range of estimates, they've heard about 170 million words by the time they take the SAT.
So the algorithm is performing well, but still seems to be underperforming given the
amount of data it's consuming.

A more recent embedding method, BERT large, is trained using a 24-layer network with
340M parameters. This has somewhat improved performance, but apparently can't be
reproduced on a standard GPU. Again, a direction for future research is to figure out why OK
performance seems to require so much training data and compute power.


Some follow-on papers


Original Mikolov et al. papers:

Efficient Estimation of Word Representations in Vector Space


Distributed Representations of Words and Phrases and their Compositionality

Goldberg and Levy papers (easier and more explicit)

word2vec Explained
Dependency-Based Word Embeddings
Neural Word Embedding as Implicit Matrix Factorization
Improving Distributional Similarity with Lessons Learned from Word Embeddings
