Embeddings
It is a big encoder-decoder model: the input sequence goes into a large encoding block that produces rich embeddings for each token, and those embeddings feed the decoding block to produce an output.
1. One-Hot Vectors:
This technique consists of having a vector where each column corresponds
to a word in the vocabulary.
V = { Hardin, likes, to, watch, movies, Tessa, too, also, football, games}
Each word is then represented as a vector with one column per vocabulary word:
Hardin likes to watch movies Tessa too also football games
Each word has a value of 1 in its own column and a value of 0 in all the other columns; for example, “watch” is encoded as (0, 0, 0, 1, 0, 0, 0, 0, 0, 0).
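As a quick illustration, here is a minimal Python sketch of how such one-hot vectors could be built for this vocabulary (the one_hot helper is just for illustration):

```python
# Build a one-hot vector for a word: 1 in its own column, 0 everywhere else.
vocabulary = ["Hardin", "likes", "to", "watch", "movies",
              "Tessa", "too", "also", "football", "games"]

def one_hot(word, vocab):
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

print(one_hot("Hardin", vocabulary))  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(one_hot("watch", vocabulary))   # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```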
There are different ways to create vocabularies, depending on the nature of the problem.
Methods to create vocabularies:
a. Stemming:
It is the process of truncating words to a common root named “stem”.
• All related word forms share the same stem, but the stem may not be a real word in the language
• It’s easy to compute, but it has the drawback of producing non-real words as stems
Connections, Connected, Connection, Connecting -----> Connect (stem)
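As a sketch, stemming can be done with NLTK's PorterStemmer (assuming NLTK is installed; other stemmers may truncate slightly differently):

```python
# Reduce related word forms to a common stem with the Porter algorithm.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connections", "connected", "connection", "connecting"]:
    print(word, "->", stemmer.stem(word))
# All four forms reduce to the same stem, "connect".
```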
b. Lemmatization:
Lemmatization consists of reducing words to their canonical, dictionary, or citation form
named “lemma”.
• Words may or may not share their root with the lemma, but the lemma is always a real word
• This is typically a more complex process, as it requires large databases or dictionaries to look up lemmas
Playing, Played -----> Dictionary -----> Play (lemma)
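A minimal sketch using NLTK's WordNet lemmatizer (this assumes NLTK and its wordnet data are installed, e.g. via nltk.download('wordnet')):

```python
# Look up the dictionary form (lemma) of each word; the POS tag 'v' marks them as verbs.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ["playing", "played", "plays"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))
# All three forms map to the real dictionary word "play".
```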
2. Word Embeddings:
Word embeddings are just vectors of real numbers representing words.
They usually capture word context, semantic similarity, and relationships with other words.
Applying semantic arithmetic to word embeddings gives results like the following (a detailed explanation of word analogies: https://kawine.github.io/blog/nlp/2019/06/21/word-analogies.html).
Notice in the male-female graph that the distance between “king” and “man” is very similar to the distance between “queen” and “woman”. That difference corresponds to the concept of “royalty”.
We could even approximate the vector for “king” by applying these operations: king ≈ queen - woman + man.
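A toy sketch of that arithmetic with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions; the numbers below are purely illustrative):

```python
# Toy 3-dimensional "embeddings" chosen by hand, only to illustrate the arithmetic.
import numpy as np

emb = {
    "man":   np.array([0.9, 0.1, 0.2]),
    "woman": np.array([0.9, 0.9, 0.2]),
    "queen": np.array([0.2, 0.9, 0.8]),
    "king":  np.array([0.2, 0.1, 0.8]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

approx_king = emb["queen"] - emb["woman"] + emb["man"]
print(cosine(approx_king, emb["king"]))  # ~1.0: the result lands on "king"
```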
3. Word2Vec:
The key concept to understand how it works is the distributional hypothesis in linguistics: “words that are used and occur in the same contexts tend to purport similar meanings”.
The architecture is a very simple shallow neural network with three layers: input (one-
hot vector), hidden (with N units of our choice), and output (one-hot vector).
The training will minimize the difference between the expected output vectors and the predicted ones, leaving the embeddings as a by-product in a weight matrix.
So we’ll have two weight matrices: Matrix W between the input and hidden layer, and Matrix W' between the hidden
and output layer.
After training, the N × V weight matrix (hidden units × vocabulary size) will have one column per word; those columns are the embeddings for each word.
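A minimal NumPy sketch of the shapes involved (V words, N hidden units of our choice); it only shows the forward pass for one input word, not the training loop:

```python
# Minimal sketch of the Word2Vec network shapes; no training loop is shown.
import numpy as np

V, N = 10, 5                            # vocabulary size, embedding size of our choice
W = np.random.rand(V, N) * 0.01         # weights between input and hidden layer (V x N)
W_prime = np.random.rand(N, V) * 0.01   # weights between hidden and output layer (N x V)

x = np.zeros(V)
x[3] = 1.0                              # one-hot vector for the 4th vocabulary word

hidden = x @ W                          # simply selects row 3 of W (N values)
scores = hidden @ W_prime               # one score per vocabulary word (V values)
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

# After training, each word has an N-dimensional vector: column i of W_prime
# (or row i of W) can be taken as the embedding of word i.
embedding_word_3 = W_prime[:, 3]
```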
A. CBOW Architecture:
The whole concept of CBOW is that we know the context of a word (the surrounding words), and our goal is
to predict that word.
Suppose we take the sentence “Hardin likes to watch movies” and decide to use a window of 3 (meaning we’ll pick one word before the target word and one word after it).
These are the training examples we’d use:
Hardin ____ to (Expected word: "likes")
likes ____ watch (Expected word: "to")
to ____ movies (Expected word: "watch")
…..
All of these words are encoded as one-hot vectors.
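A small sketch of how those (context, target) training pairs can be generated from the sentence with this window:

```python
# Generate CBOW training pairs: the two surrounding words predict the middle word.
sentence = "Hardin likes to watch movies".split()

for i in range(1, len(sentence) - 1):
    context = [sentence[i - 1], sentence[i + 1]]   # one word before, one word after
    target = sentence[i]
    print(context, "->", target)
# ['Hardin', 'to'] -> likes
# ['likes', 'watch'] -> to
# ['to', 'movies'] -> watch
```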
B. Skip-Gram Architecture:
The only difference in this approach is that we use the target word as the input and the context (the surrounding words) as the output to predict.
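The same sentence sketched as Skip-gram training pairs, with the target word as input and each surrounding word as a separate expected output:

```python
# Generate Skip-gram training pairs: the middle word predicts each surrounding word.
sentence = "Hardin likes to watch movies".split()

for i in range(1, len(sentence) - 1):
    for context_word in (sentence[i - 1], sentence[i + 1]):
        print(sentence[i], "->", context_word)
# likes -> Hardin
# likes -> to
# to -> likes
# to -> watch
# ...
```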
4. GloVe:
GloVe embeddings are derived from the probabilities that two words appear together; simply put, two embeddings end up similar when their words frequently appear together.
A detailed explanation: https://nlp.stanford.edu/projects/glove/
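As a rough sketch, the raw word-word co-occurrence counts that GloVe-style embeddings are fitted to can be gathered like this (the real model also weights the pairs and works on a much larger corpus; the window size here is an arbitrary choice):

```python
# Count how often pairs of words appear near each other (a tiny co-occurrence matrix).
from collections import Counter

corpus = ["Hardin likes to watch movies",
          "Tessa likes to watch football games too"]
window = 2
cooc = Counter()

for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                cooc[(word, words[j])] += 1

print(cooc[("likes", "to")])   # 2: words that appear together often get high counts
```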
5. FastText:
It is an extension of Word2Vec that represents words with character n-grams rather than treating each word as a single unit.
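A sketch of the character n-grams FastText works with (FastText actually combines several n-gram lengths, typically 3 to 6, plus the whole word; only n = 3 is shown here):

```python
# Split a word into overlapping character n-grams, with '<' and '>' as boundary markers.
def char_ngrams(word, n=3):
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("watch"))
# ['<wa', 'wat', 'atc', 'tch', 'ch>']
# The word's vector is built from the vectors of these n-grams,
# so even unseen or misspelled words can still get an embedding.
```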
6. ELMo:
ELMo solves the problem of a word having different meanings depending on the sentence: it produces a different embedding for the same word in different contexts.
Consider the semantic difference of the word “cell” in: “He went to the prison cell with his phone” and “He
went to extract blood cell samples”.