Using Word Embeddings For Text Search and Retrieval

Purdue University
Preamble
In the year 2013, a group of researchers at Google created a revolution in the text
processing community with the publication of the word2vec neural network for
generating numerical representations for the words in a text corpus. Here are their
papers on that discovery:
https://arxiv.org/abs/1301.3781
https://arxiv.org/pdf/1310.4546.pdf
To elaborate, if two different words, word_i and word_j, are frequently surrounded by the same set of other words (called the context words), the word embedding vectors for word_i and word_j would be numerically close.
Preamble (contd.)
What’s amazing is that when you look at the similarity clusters produced by
word2vec, they create a powerful illusion that a computer has finally solved the
mystery of how to automatically learn the semantics of the words.
My goal in this lecture is to introduce you to the word2vec neural network and to
then present some of my lab’s research on using the Word2vec embeddings in the
context of software engineering for automatic bug localization.
Importance of Text Search and Retrieval
You could say that instantaneous on-line search has now become an
important part of our reflexive behavior.
Before ending this section, I just wanted to mention quickly that the
most important annual research forums on general text retrieval are
the ACM’s SIGIR conference and the NIST’s TREC workshops. Here
are the links to these meetings:
https://sigir.org/sigir2019/
https://trec.nist.gov/pubs/call2020.html
How the Word Embeddings are Learned in Word2vec
Word2vec
I’ll explain the word2vec neural network with the help of the figure
shown on the next slide.
Each input, a V-element one-hot vector representing one word of the vocabulary, goes through the first linear layer where it is multiplied by a matrix, denoted W_{V×N}, of learnable parameters.
Word2vec (contd.)
Figure: The SkipGram model for generating the word2vec embeddings for a text corpus.
Word2vec (contd.)
Overall, in the word2vec network, a V-element tensor at the input goes into an N-element tensor in the middle layer and, eventually, back into a V-element final output tensor.
Here is a very important point about the first linear operation on the
input in the word2vec network: In all of the DL implementations you
have seen so far, a torch.nn.Linear layer was always followed by a
nonlinear activation, but that’s NOT the case for the first invocation
of torch.nn.Linear in the word2vec network.
The reason for this emphasis is that the sole purpose of the first linear layer is merely to serve as a projection operator.
But what does that mean?
To understand what we mean by a projection operator, you have to come to terms with the semantics of the matrix W_{V×N}. And that takes us to the next slide.
Word2vec (contd.)
You see, the matrix W_{V×N} is actually meant to be a stack of the word embeddings that we are interested in. The i-th row of this matrix stands for the N-element word embedding for the i-th word in a sorted list of the vocabulary.
Given the one-hot vector for, say, the i-th vocab word at the input, the purpose of multiplying this vector with the matrix W_{V×N} is simply to "extract" the current value of the embedding for this word and to then present it to the neural layer that follows.
You could say that, for the i-th word at the input, the role of the W_{V×N} matrix is to project the current value of the word's embedding into the neural layer that follows. It's for this reason that the middle layer of the network is known as the projection layer.
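To make the projection idea concrete, here is a minimal PyTorch sketch (the vocabulary size, embedding size, and word index below are made-up toy values) showing that multiplying a one-hot vector by W_{V×N} does nothing more than pull out the corresponding row of the matrix:

import torch
V, N = 6, 4                              # toy vocabulary size and embedding size (made-up)
W = torch.randn(V, N)                    # the V x N matrix whose rows are the word embeddings
i = 2                                    # index of the input word in the sorted vocabulary
one_hot = torch.zeros(V)
one_hot[i] = 1.0                         # one-hot vector for the i-th vocab word
projected = one_hot @ W                  # "multiplying" by W merely extracts the i-th row
print(torch.allclose(projected, W[i]))   # True: identical to a plain row lookup

This is also why practical implementations often replace the first linear layer with a plain table lookup such as torch.nn.Embedding, since no nonlinearity ever intervenes.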
In case you are wondering about the size of N vis-a-vis that of V, that's a hyperparameter of the network whose value is best set by trying out different values for N and choosing the best.
Word2vec (contd.)
After the projection layer, the rest of the word2vec network, as shown in the SkipGram figure earlier, is standard stuff. You have a linear neural layer with torch.nn.Softmax as the activation function.
To emphasize, the learnable weights in the N×V matrix W′, along with the activation that follows, constitute the only neural layer in word2vec.
You can reach the documentation on the activation function (torch.nn.Softmax) through the main webpage for torch.nn. This activation function is listed under the category "Nonlinear activations (Other)" at the "torch.nn" webpage.
To summarize, word2vec is a single-layer neural network that uses a
projection layer as its front end.
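Here is one way to express that summary as code, a minimal sketch under my own naming and with made-up layer sizes (not the authors' implementation): a bias-free linear projection from V to N, followed by the single bias-free linear layer from N to V with torch.nn.Softmax as its activation:

import torch
import torch.nn as nn
class SkipGramSketch(nn.Module):
    """A minimal sketch of the word2vec network described above."""
    def __init__(self, vocab_size, embedding_size):
        super().__init__()
        # the projection "layer": just the matrix W_{VxN}, no bias and no activation
        self.projection = nn.Linear(vocab_size, embedding_size, bias=False)
        # the only true neural layer: the N x V matrix W' followed by Softmax
        self.output = nn.Linear(embedding_size, vocab_size, bias=False)
        self.softmax = nn.Softmax(dim=-1)
    def forward(self, one_hot_word):
        embedding = self.projection(one_hot_word)    # extracts the word's current embedding
        scores = self.output(embedding)              # unnormalized scores over the vocabulary
        return self.softmax(scores)                  # probabilities over the vocabulary
net = SkipGramSketch(vocab_size=5000, embedding_size=300)    # made-up sizes
probs = net(torch.nn.functional.one_hot(torch.tensor(17), num_classes=5000).float())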
While the SkipGram figure shown earlier is my visualization of how the data flows forward in a word2vec network, a more common depiction of this network is shown on the next slide.
Word2vec (contd.)
This figure is from the following publication by Shayan Akbar and
myself:
https://engineering.purdue.edu/RVL/Publications/Akbar_SCOR_Source_Code_Retrieval_2019_MSR_paper.pdf
Figure: A more commonly used depiction for the SkipGram model for generating the word2vec embeddings for a vocabulary
Softmax as the Activation Function in Word2vec
Although the Softmax formula, softmax(x_j) = e^{x_j} / Σ_k e^{x_k}, looks a lot like the cross-entropy formula, the two functions carry very different meanings. This has nothing to do with the fact that the cross-entropy formula requires you to take the negative log of a ratio of exactly that form.
The cross-entropy formula presented in the Object Detection lecture focuses specifically on just the output node that is supposed to carry the true class label of the input, notwithstanding the fact that all the output nodes appear in the denominator used to normalize the value at that one node.
On the other hand, the Softmax formula shown above places equal
focus on all the output nodes. That is, the values at all the nodes are
normalized by the same denominator.
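If a numerical illustration helps, the following sketch (the logits are arbitrary made-up values) shows the contrast: torch.nn.Softmax normalizes every output node with the same denominator, while torch.nn.CrossEntropyLoss uses that normalization only to report the negative log-probability of the node holding the true class label:

import torch
import torch.nn as nn
logits = torch.tensor([[2.0, 0.5, -1.0]])      # arbitrary raw scores at three output nodes
true_label = torch.tensor([0])                 # suppose node 0 holds the true class label
probs = nn.Softmax(dim=1)(logits)              # all three nodes share the same denominator
print(probs, probs.sum())                      # roughly [0.786, 0.175, 0.039]; sums to 1
loss = nn.CrossEntropyLoss()(logits, true_label)
print(loss, -torch.log(probs[0, 0]))           # the two values agree: only node 0's probability matters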
Training the Word2vec Network
That raises the issue of how to present the context words at the
output for the calculation of the loss.
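One common arrangement, shown in the sketch below with a made-up toy corpus and my own variable names, is to pair each center word with every word within a fixed window on either side; each such (center, context) pair then becomes one training example whose desired output is the index of the context word:

tokens = "the quick brown fox jumps over the lazy dog".split()     # toy corpus (made-up)
word_to_index = {w: k for k, w in enumerate(sorted(set(tokens)))}  # sorted vocabulary
half_window = 2                                                    # context words taken on each side
pairs = []
for pos, center in enumerate(tokens):
    for offset in range(-half_window, half_window + 1):
        ctx = pos + offset
        if offset == 0 or ctx < 0 or ctx >= len(tokens):
            continue                                               # skip the center word and corpus edges
        pairs.append((word_to_index[center], word_to_index[tokens[ctx]]))
print(pairs[:5])                                                   # (center index, context index) training pairs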
With the Softmax probabilities at the output, the loss contributed by the input word whose vocabulary index is i is obtained from the negative log of the predicted probabilities of its context words:

$$
\begin{aligned}
\text{Loss}_i \;&=\; - \sum_{j \in context(i)} \frac{1}{2W+1} \cdot \log_2 \frac{e^{\,{w'}_j^T w_i}}{\sum_k e^{\,{w'}_k^T w_i}} \\[4pt]
&=\; - \sum_{j \in context(i)} \log_2 e^{\,{w'}_j^T w_i} \;+\; \log_2 \sum_k e^{\,{w'}_k^T w_i} \qquad \text{(ignoring inconsequential terms)} \\[4pt]
&=\; - \sum_{j \in context(i)} {w'}_j^T w_i \;+\; \log_2 \sum_k e^{\,{w'}_k^T w_i}
\end{aligned}
$$
Given the simple form for the loss shown above, it is easy to write the formulas for the gradients of the loss with respect to all the learnable parameters. To see how that would work, here is a rewrite of the final equation of the derivation above:
$$
\text{Loss}_i \;=\; - \sum_{j \in context(i)} {w'}_j^T w_i \;+\; \log_2 \sum_k e^{\,{w'}_k^T w_i}
$$
where the subscript i on Loss means that this is the loss when the word whose position index in the vocab is i is fed into the network.
As shown further below, the form presented above lends itself readily to the calculation of the gradients of the loss for updating the learnable weights.
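To make the formula concrete, here is a small sketch that evaluates Loss_i exactly as written above; the matrix names, sizes, and word indices are made-up stand-ins for the quantities being learned:

import math
import torch
V, N = 50, 8                              # made-up vocabulary size and embedding size
W  = torch.randn(V, N)                    # projection matrix: row i holds the embedding w_i
Wp = torch.randn(V, N)                    # rows of W' feed the output layer
i = 17                                    # vocabulary index of the input word (made-up)
context = [15, 16, 18, 19]                # vocabulary indices of its context words (made-up)
dots = Wp @ W[i]                          # w'_k^T w_i for every word k in the vocabulary
loss_i = -dots[context].sum() + torch.logsumexp(dots, dim=0) / math.log(2.0)   # log2 = ln / ln(2)
print(loss_i)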
For arbitrary values for the indices s and t, we get the following
expression for the gradients of the loss with respect to the elements of
the matrix W :
$$
\frac{\partial\, \text{Loss}_i}{\partial w_{st}} \;=\; - \sum_{j \in context(i)} \frac{\partial \left( {\vec{w}'}_j^{\,T}\, \vec{w}_i \right)}{\partial w_{st}} \;+\; \frac{1}{\sum_k e^{\,{\vec{w}'}_k^{\,T} \vec{w}_i}} \cdot \frac{\partial \left( \sum_k e^{\,{\vec{w}'}_k^{\,T} \vec{w}_i} \right)}{\partial w_{st}}
$$
where I have introduced the arrowed vector notation \vec{w}_i so that you can distinguish between the elements of the matrix and the row and column vectors of the same matrix.
You will end up with a similar form for the loss gradients ∂Loss_i/∂w′_{st}.
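Rather than hand-coding those gradients, a PyTorch implementation would normally let autograd produce them. The following sketch repeats the toy setup from the previous code fragment, backpropagates Loss_i, and checks the hand-derived gradient for row i of W against what autograd computes; as the derivation implies, only row i of W receives a nonzero gradient for this input word:

import math
import torch
V, N = 50, 8                                   # same made-up toy sizes as before
W  = torch.randn(V, N, requires_grad=True)     # projection matrix
Wp = torch.randn(V, N, requires_grad=True)     # the matrix W'
i, context = 17, [15, 16, 18, 19]              # made-up input word and context indices
dots = Wp @ W[i]                               # w'_k^T w_i for every k
loss_i = -dots[context].sum() + torch.logsumexp(dots, dim=0) / math.log(2.0)
loss_i.backward()                              # autograd fills in dLoss_i/dW and dLoss_i/dW'
analytic = -Wp[context].sum(dim=0) + (torch.softmax(dots, dim=0) @ Wp) / math.log(2.0)
print(torch.allclose(W.grad[i], analytic, atol=1e-5))   # True: matches the hand-derived gradient
print(W.grad.abs().sum(dim=1).nonzero())                # only row i of W is touched by this input word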
Using Word2vec for Improving the Quality of Text Retrieval
Figure: Some examples of word clusters obtained through the similarity of their embeddings.
Figure: Additional examples of software-centric word similarities based on their learned embeddings.
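Similarity lists like the ones in these figures are generated by ranking the entire vocabulary by the cosine similarity between the learned embedding vectors. Here is a hedged sketch of that ranking step; the vocabulary and the random embedding matrix below are made-up stand-ins for a trained model:

import torch
def most_similar(word, vocab, embeddings, topn=5):
    """Return the topn vocab words whose embeddings are closest to that of `word` by cosine similarity."""
    idx = vocab.index(word)
    normed = torch.nn.functional.normalize(embeddings, dim=1)   # unit-length embedding vectors
    sims = normed @ normed[idx]                                 # cosine similarity with every vocab word
    sims[idx] = -1.0                                            # exclude the query word itself
    return [vocab[int(k)] for k in torch.topk(sims, topn).indices]
vocab = ["file", "read", "write", "open", "close", "buffer"]    # made-up vocabulary
embeddings = torch.randn(len(vocab), 300)                       # stand-in for the learned W matrix
print(most_similar("read", vocab, embeddings))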
These results are from the publication by Shayan Akbar and myself cited earlier:
https://engineering.purdue.edu/RVL/Publications/Akbar_SCOR_Source_Code_Retrieval_2019_MSR_paper.pdf