
Basics Of Transformers

22 April 2024 22:20



Breakdown Parts Of Transformer:


1. Input
2. Encoder Stack
3. Decoder Stack
4. Output

Encoders can’t operate directly with text but with vectors. The text must be converted into tokens, which are part of a fixed vocabulary. After that, tokens are converted into embedding vectors by using a fixed representation like word2vec, etc.

As we are processing the sequence all at once, we need to know the position of tokens in the sequence. To address this problem, the transformer adds a positional encoding vector to each token embedding, obtaining a special embedding with positional information. These vectors are ready to be used by the encoders.

General Structure Of Transformers (Language Translation: English To French):

It is a big encoder-decoder model where the input sequence gets into a big encoding block, obtaining rich embeddings for each token, which will feed the decoding block to obtain an output.

Each of the encoding/decoding blocks actually contains many stacked encoders/decoders. That way, the initial layers capture more basic patterns, whereas the last layers can detect more sophisticated ones, similar to convolutional networks.
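To make the positional encoding step above concrete, here is a minimal NumPy sketch of the sinusoidal positional encoding used in the original Transformer paper; the sequence length and embedding size below are arbitrary toy values, not anything fixed by these notes.

import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encodings: even dimensions use sine, odd use cosine.
    positions = np.arange(seq_len)[:, np.newaxis]        # shape (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]             # shape (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                     # shape (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

# The encoding is simply added to the token embeddings before the first encoder:
token_embeddings = np.random.randn(10, 64)   # 10 tokens, 64-dimensional embeddings (toy values)
inputs_with_position = token_embeddings + positional_encoding(10, 64)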

Simple Raw SIMULATION

Converting Input Text to Vectors (Types of Techniques):

1. One-Hot Vectors:
This technique consists of having a vector where each column corresponds
to a word in the vocabulary.

(1) Hardin likes to watch movies. Tessa likes movies too.


(2) Tessa also likes to watch football games.

Vocabulary would look like:

V = { Hardin, likes, to, watch, movies, Tessa, too, also, football, games}

We will have this vector, where each column corresponds to a word in the vocabulary:
Hardin likes to watch movies Tessa too also football games

Each word will have a value of 1 for its own column and a value of 0 for all the other
columns.

The vector for the word “Hardin” would be:

Hardin likes to watch movies Tessa too also football games


1 0 0 0 0 0 0 0 0 0

The vector for the word “likes” would be:


Hardin likes to watch movies Tessa too also football games
0 1 0 0 0 0 0 0 0 0
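As a quick illustration, a minimal Python sketch of this one-hot encoding over the toy vocabulary above (the vocabulary and words are taken from the example; the helper function is just for illustration):

# One-hot vectors for a fixed vocabulary (toy example from above).
vocabulary = ["Hardin", "likes", "to", "watch", "movies",
              "Tessa", "too", "also", "football", "games"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # 1 in the word's own column, 0 in all the other columns.
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("Hardin"))  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(one_hot("likes"))   # [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]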

There are different ways to create vocabularies, depending on the nature of the problem.
Methods to create vocabularies:

1. Stop Words Removal:


Stop Words are those we ignore because they are not relevant to our problem.
Typically we’ll remove the most common ones in the language: “the”, “in”, “a”, and so on.
This helps to obtain a smaller vocabulary and represent words with fewer dimensions (a small code sketch after this list shows these steps).
2. Reducing Inflectional Forms:
A popular and useful technique is reducing the explosion of inflectional
forms (verbs in different tenses, gender inflections, etc.) to a common word preserving the essential meaning.
There are two main specific techniques to achieve this: Stemming & Lemmatization.

a. Stemming:
It is the process of truncating words to a common root named “stem”.
• All inflected forms of a word will share the same stem, but the stem could be a word outside the language
• It’s easy to calculate but has the drawback of producing non-real words as stems

Connections, Connected, Connection, Connecting -----> Connect

b. Lemmatization:
Lemmatization consists of reducing words to their canonical, dictionary, or citation form
named “lemma”.
• Words could or could not share the same root as the lemma, but the lemma is a real word
• Typically this is a more complicated process, as it requires big databases or dictionaries to find lemmas

Playing, Played -----> Dictionary -----> Play

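A minimal sketch of these vocabulary-reduction steps using NLTK (an assumption: NLTK is installed and its stopwords and WordNet data have been downloaded; the tokens reuse the toy sentence above):

# Assumes: pip install nltk, plus nltk.download("stopwords") and nltk.download("wordnet").
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

stop_words = set(stopwords.words("english"))
tokens = ["Hardin", "likes", "to", "watch", "movies"]

# 1. Stop words removal: drop very common, low-information words like "to".
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)                                  # e.g. ['Hardin', 'likes', 'watch', 'movies']

# 2a. Stemming: truncate to a common root that may not be a real word.
stemmer = PorterStemmer()
print(stemmer.stem("connections"))               # e.g. 'connect'

# 2b. Lemmatization: reduce to a dictionary form (a real word).
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("played", pos="v"))   # 'play'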

2. Word Embeddings:
Word embeddings are just vectors of real numbers representing words.
They usually capture word context, semantic similarity, and relationship with other words.
Applying semantic arithmetic to word embeddings would give the following (a proper explanation is at https://kawine.github.io/blog/nlp/2019/06/21/word-analogies.html):

Notice in the Male-Female graph, the distance between “king” and “man” is very similar to the distance between “queen” and “woman”. That difference would correspond to the concept of “royalty”.

We could even approximate the vector for “king” by applying these operations: vec(“king”) ≈ vec(“queen”) − vec(“woman”) + vec(“man”).
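A tiny NumPy sketch of this analogy arithmetic; the 3-dimensional vectors below are invented toy values chosen so the analogy works out exactly, not real learned embeddings:

import numpy as np

# Toy embeddings, purely for illustration; real ones have hundreds of dimensions.
embeddings = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "queen": np.array([0.8, 0.1, 0.9]),
    "man":   np.array([0.2, 0.9, 0.1]),
    "woman": np.array([0.2, 0.1, 0.9]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Approximate "king" as queen - woman + man (the "royalty" direction is preserved).
approx_king = embeddings["queen"] - embeddings["woman"] + embeddings["man"]
print(cosine_similarity(approx_king, embeddings["king"]))  # 1.0 for these toy values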

3. Word2Vec:
The key concept to understand how it works is the distributional hypothesis in
linguistics: “words that are used and occur in the same contexts tend to purport similar meanings”.

There are two methods in Word2vec to find a vector representing a word:


• Continuous Bag Of Words (CBOW): the model predicts the current word from the surrounding context words
• Continuous skip-gram: the model uses the current word to predict the surrounding context words

The architecture is a very simple shallow neural network with three layers: input (one-
hot vector), hidden (with N units of our choice), and output (one-hot vector).
The training will minimize the difference between the expected output vectors and the
predicted ones, leaving the embeddings as a side-product in a weights matrix:

So we’ll have two weight matrices: Matrix W between the input and hidden layer, and Matrix W' between the hidden
and output layer.
After training, the weights matrix of size N × V will have one column per word in the vocabulary; these columns are the embeddings for each word.
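As a concrete usage example, here is a minimal sketch with the gensim library (an assumption: gensim 4.x is available; the toy corpus and parameter values are arbitrary choices, not part of these notes). The sg flag switches between the two training methods:

# Assumes: pip install gensim (4.x API).
from gensim.models import Word2Vec

sentences = [
    ["hardin", "likes", "to", "watch", "movies", "tessa", "likes", "movies", "too"],
    ["tessa", "also", "likes", "to", "watch", "football", "games"],
]

# sg=0 trains CBOW (predict the word from its context);
# sg=1 trains continuous skip-gram (predict the context from the word).
cbow_model = Word2Vec(sentences, vector_size=50, window=1, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=50, window=1, min_count=1, sg=1)

print(cbow_model.wv["likes"])                # the 50-dimensional embedding for "likes"
print(cbow_model.wv.most_similar("movies"))  # nearest words by cosine similarity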
A. CBOW Architecture:

The whole concept of CBOW is that we know the context of a word (the surrounding words), and our goal is
to predict that word.

For example, imagine we train with the former text:

Hardin likes to watch movies. Tessa likes movies too.

V = { Hardin, likes, to, watch, movies, Tessa, too}

and we decide to use a window of 3 (meaning we’ll pick one word before the target
word and one word after it).
These are the training examples we’d use:
Hardin ____ to (Expected word: "likes")
likes ____ watch (Expected word: "to")
to ____ movies (Expected word: "watch")

…..
all encoded as one-hot vectors:
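A short sketch of how these (context, target) pairs could be generated with a window of 3; the tokenization below is deliberately simplified (no punctuation handling) and purely illustrative:

# Generate CBOW training pairs: (context words) -> target word.
tokens = ["Hardin", "likes", "to", "watch", "movies", "Tessa", "likes", "movies", "too"]

pairs = []
for i in range(1, len(tokens) - 1):
    context = [tokens[i - 1], tokens[i + 1]]   # one word before and one word after
    target = tokens[i]
    pairs.append((context, target))

print(pairs[0])  # (['Hardin', 'to'], 'likes')
print(pairs[1])  # (['likes', 'watch'], 'to')
# Each word would then be one-hot encoded (as above) before feeding the network.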



Then we will train this network as any other neural classifier, and we would obtain our word embeddings in the
weights matrix.

B. Skip-Gram Architecture:

The only difference in this approach is that we’d use the word as input and the context (surrounding words) as
output (predictions).

Therefore, we’d generate these examples:


"likes" (Expected context: ["John", "to"])
"to" (Expected context: ["likes", "watch"])
"watch" (Expected context: ["to", "movies"])
...

one-hot vectors would look like:
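Mirroring the CBOW sketch above, the same toy pair generation with input and output swapped for skip-gram:

# Generate skip-gram training pairs: word -> (context words).
tokens = ["Hardin", "likes", "to", "watch", "movies", "Tessa", "likes", "movies", "too"]

pairs = [(tokens[i], [tokens[i - 1], tokens[i + 1]]) for i in range(1, len(tokens) - 1)]

print(pairs[0])  # ('likes', ['Hardin', 'to'])
print(pairs[1])  # ('to', ['likes', 'watch'])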

Other Word Embedding Methods:

1. GloVe:

GloVe embeddings relate to the probabilities that two words appear together. Or simply put: embeddings are
similar when their words appear together often.
Link to proper Explanation: https://nlp.stanford.edu/projects/glove/

2. FastText:
It is an extension of Word2vec consisting of using n-grams of characters instead of whole
words.
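A minimal sketch of FastText with gensim (an assumption: gensim 4.x; parameters and corpus are arbitrary toy values). Because vectors are built from character n-grams, even a word unseen during training gets an embedding:

# Assumes: pip install gensim (4.x API).
from gensim.models import FastText

sentences = [
    ["hardin", "likes", "to", "watch", "movies"],
    ["tessa", "also", "likes", "to", "watch", "football", "games"],
]

# Character n-grams between min_n and max_n are used instead of whole words only.
model = FastText(sentences, vector_size=50, window=3, min_count=1, min_n=3, max_n=5)

print(model.wv["watching"])  # out-of-vocabulary word, built from its character n-grams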

3. ELMo:
ELMo solves the problem of having a word with different meanings depending on the sentence.
Consider the semantic difference of the word “cell” in: “He went to the prison cell with his phone” and “He
went to extract blood cell samples”.

