
GENERATIVE AI AND LARGE LANGUAGE MODELS

SCSB4014

UNIT – 2

UNIT 2: TRANSFORMER

Data preprocessing: Tokenization, Embedding, Positional Encoding - Transformer Architecture: Attention mechanism, Types of Attention, Normalization - LayerNorm, RMSNorm, Encoder, Decoder, Feed Forward Network and Softmax - Model scalability - Parameters, Layers, and Performance - Application: Time Series Data, Sequence Based Data, Text and Vision.
Data pre-processing: Tokenization, Embedding, Positional Encoding

Data preprocessing is the process of cleaning, organizing, and transforming raw data so that
it can be more easily analyzed, understood, and used in machine learning models. The goal of
data preprocessing is to improve data quality and eliminate issues like missing values so that
the data can be used for machine learning.

Data preprocessing can help identify patterns, make predictions, and inform decision-
making. It can also increase the accuracy and efficiency of machine learning models.

Steps

Data preprocessing can include tasks like:

1. Removing incorrect or irrelevant data


2. Handling missing values
3. Smoothing noisy data
4. Extracting specific features from images

There are several different tools and methods used for preprocessing data, including the
following:

1. sampling, which selects a representative subset from a large population of data;

2. transformation, which manipulates raw data to produce a single input;

3. denoising, which removes noise from data;

4. imputation, which synthesizes statistically relevant data for missing values;

5. normalization, which rescales data values to a common range so that features are comparable; and

6. feature extraction, which pulls out a relevant feature subset that is significant in a
particular context.
Consider a computer scientist and a colleague who are given a legal database. The task: when provided with a case, efficiently find previous cases that are relevant, so that colleagues can quickly locate laws, precedents, or specific legal interpretations related to the provided case. Initially, one might consider employing a traditional keyword-based search, which involves breaking down the case document into keywords or phrases and searching for exact matches in the database. However, the shortcomings of this method soon become apparent: the complexity and jargon-rich nature of legal language make it challenging. For instance, a lawyer searching for "intellectual property infringement" might overlook cases that use different terminology, such as "patent breach" or "copyright violation". A more effective solution lies in semantic search powered by vector embeddings. This unit aims to demonstrate how tokenization, vector embedding, and positional encoding are pivotal tools in overcoming these challenges.

Tokenization

Tokenization is a data preprocessing technique that breaks down text into smaller units,
called tokens, so that text becomes easier for machines to analyze and process. It's a
fundamental step in Natural Language Processing (NLP) and is used in many applications,
including search engines, machine translation, and speech recognition.

It breaks down unstructured text data into smaller units called tokens. A single token can
range from a single character or individual word to much larger textual units.
Tokenization in NLP
Natural Language Processing (NLP) is a subfield of computer science, artificial intelligence,
information engineering, and human-computer interaction. This field focuses on how to
program computers to process and analyze large amounts of natural language data. It is
difficult to perform as the process of reading and understanding languages is far more
complex than it seems at first glance.
Tokenization is a critical step in many NLP tasks, including text processing, language
modelling, and machine translation. The process involves splitting a string, or text into a list
of tokens. One can think of tokens as parts like a word is a token in a sentence, and a
sentence is a token in a paragraph.
Tokenization involves using a tokenizer to segment unstructured data and natural language
text into distinct chunks of information, treating them as different elements. The tokens
within a document can be used as a vector, transforming an unstructured text document into a
numerical data structure suitable for machine learning. This rapid conversion enables the
immediate utilization of these tokenized elements by a computer to initiate practical actions
and responses. Alternatively, they may serve as features within a machine learning pipeline,
prompting more sophisticated decision-making processes or behaviors.

Types of Tokenization
Tokenization can be classified into several types based on how the text is segmented. Here
are some types of tokenization:

Word Tokenization:
Word tokenization divides the text into individual words. Many NLP tasks use this approach,
in which words are treated as the basic units of meaning.

Input: "Tokenization is an important NLP task."


Output: ["Tokenization", "is", "an", "important", "NLP", "task", "."]

Sentence Tokenization:
The text is segmented into sentences during sentence tokenization. This is useful for tasks
requiring individual sentence analysis or processing.

Input: "Tokenization is an important NLP task. It helps break down text into smaller units."
Output: ["Tokenization is an important NLP task.", "It helps break down text into smaller
units."]
Subword Tokenization:
Subword tokenization entails breaking down words into smaller units, which can be
especially useful when dealing with morphologically rich languages or rare words.

Input: "tokenization"
Output: ["token", "ization"]

Character Tokenization:
This process divides the text into individual characters. This can be useful for modelling
character-level language.

Input: "Tokenization"
Output: ["T", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n"]

Tokenization is one stage in text mining pipelines that converts raw text data into a
structured format for machine processing. It's necessary for other preprocessing techniques,
and, therefore, is often (one of) the first preprocessing steps in NLP pipelines. For
example, stemming and lemmatization reduce morphological variants to one base word
form (for example, running, runs, and runner become run). These text normalization
techniques work only on tokenized text, because they need some method for identifying
individual words.

Readily implemented and conceptually simple, tokenization is a crucial step in preparing


text data sets for neural network and transformer architectures. Tokenization is used in
building deep learning models like large language models (LLM), as well as conducting
various NLP tasks, such as sentiment analysis and word embeddings. For example, GPT uses
a tokenization method called byte-pair encoding (BPE).

The different types of tokenization essentially denote varying levels of granularity in the
tokenization process. Word tokenization is the most common type used in introductions to
tokenization, and it divides raw text into word-level units. Subword tokenization delimits
text beneath the word level; wordpiece tokenization breaks text into partial word units (for
example, starlight becomes star and light), and character tokenization divides raw text into
individual characters (for example, letters, digits, and punctuation marks). Other
tokenization methods, such as sentence tokenization, divide text above the word level.

The Python Natural Language Toolkit (NLTK) can be used to walk through tokenizing .txt files at
different levels of granularity, for example on an open-access Asian religious texts file sourced
largely from Project Gutenberg. The focus here is on tokenization as a means of preparing raw
text data for use in machine learning models and NLP tasks. Other libraries and packages, such
as Keras and Gensim, also come with tokenization algorithms, and transformer architectures
such as BERT implement their own tokenization. The example below, however, uses Python NLTK.
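As a minimal sketch of NLTK-based tokenization (assuming NLTK is installed and its "punkt" tokenizer models have been downloaded; the example string is invented for illustration), word- and sentence-level tokenization look like this:

```python
# Minimal NLTK tokenization sketch; assumes nltk is installed and that the
# "punkt" models are available (newer NLTK releases may also need "punkt_tab").
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)  # one-time download of the tokenizer models

text = "Tokenization is an important NLP task. It helps break down text into smaller units."

print(sent_tokenize(text))  # sentence-level tokens
print(word_tokenize(text))  # word-level tokens
```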

Tokenizers are tools or libraries created for this purpose. Hugging Face has developed and
open-sourced the transformers library, which simplifies the process of using a tokenizer from a
pre-trained model. For consistent preprocessing of input data, it is important to replicate the
method used during the model's pre-training phase, which requires downloading the relevant
information from the Hugging Face Model Hub. Accomplishing this is straightforward with the
AutoTokenizer class and its from_pretrained() method. By specifying the model's checkpoint
name, this method automatically retrieves and caches the model's tokenizer data, ensuring it is
downloaded only once during the initial execution of the code.
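A minimal sketch of that workflow, assuming the transformers library is installed and using "bert-base-uncased" purely as an illustrative checkpoint name:

```python
# Minimal Hugging Face tokenizer sketch; "bert-base-uncased" is only an
# illustrative checkpoint, any model name from the Hub works the same way.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization is an important NLP task."
print(tokenizer.tokenize(text))      # subword tokens produced by WordPiece
print(tokenizer(text)["input_ids"])  # token ids ready to feed into the model
```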
Need of Tokenization
Tokenization is a crucial step in text processing and natural language processing (NLP) for
several reasons.
 Effective Text Processing: Tokenization reduces the size of raw text so that it can be
handled more easily for processing and analysis.
 Feature extraction: Text data can be represented numerically for algorithmic
comprehension by using tokens as features in machine learning models.
 Language Modelling: Tokenization in NLP facilitates the creation of organized
representations of language, which is useful for tasks like text generation and language
modelling.
 Information Retrieval: Tokenization is essential for indexing and searching in systems
that store and retrieve information efficiently based on words or phrases.
 Text Analysis: Tokenization is used in many NLP tasks, including sentiment
analysis and named entity recognition, to determine the function and context of
individual words in a sentence.
 Vocabulary Management: By generating a list of distinct tokens that stand in for words in
the dataset, tokenization helps manage a corpus’s vocabulary.
 Task-Specific Adaptation: Tokenization can be customized to meet the needs of
particular NLP tasks, meaning that it will work best in applications such as summarization
and machine translation.
 Preprocessing Step: This essential preprocessing step transforms unprocessed text into a
format appropriate for additional statistical and computational analysis.

Techniques for Tokenization


We have discussed how to perform tokenization using the NLTK library. Tokenization can also
be implemented with the following methods and libraries:
Spacy: spaCy is an NLP library that provides robust tokenization capabilities.
BERT tokenizer: BERT uses the WordPiece tokenizer, a type of subword tokenizer, for tokenizing
input text.
Regular expressions: Using regular expressions allows for more fine-grained control over
tokenization, and the pattern can be customized based on specific requirements.
Byte-Pair Encoding: Byte Pair Encoding (BPE) is a data compression algorithm that has also
found applications in the field of natural language processing, specifically for tokenization. It
is a subword tokenization technique that works by iteratively merging the most frequent
pairs of consecutive bytes (or characters) in a given corpus.
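To make the idea of iterative pair merging concrete, here is a toy, illustrative sketch of a single BPE merge step in plain Python (the corpus and frequencies are invented for the example; production tokenizers such as Hugging Face tokenizers or SentencePiece implement this far more efficiently):

```python
# Toy BPE sketch: find the most frequent adjacent symbol pair and merge it.
# The corpus and counts are invented purely for illustration.
from collections import Counter

vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w", "e", "s", "t"): 6}

def most_frequent_pair(vocab):
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, vocab):
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # merge the pair into one symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

best = most_frequent_pair(vocab)   # ('w', 'e') is the most frequent pair in this toy corpus
vocab = merge_pair(best, vocab)
print(best, vocab)
```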
SentencePiece: SentencePiece is another subword tokenization algorithm commonly used
for natural language processing tasks. It is designed to be language-agnostic and works by
iteratively merging frequent sequences of characters or subwords in a given corpus.

Tokenization Use Cases

Tokenization serves as the backbone for a myriad of applications in the digital realm,
enabling machines to process and understand vast amounts of text data. By breaking down
text into manageable chunks, tokenization facilitates more efficient and accurate data
analysis. Here are some prominent use cases, along with real-world applications:

Search engines

When we type a query into a search engine like Google, it employs tokenization to dissect
the input. This breakdown helps the engine sift through billions of documents to present
the most relevant results.

Machine translation

Tools such as Google Translate utilize tokenization to segment sentences in the source
language. Once tokenized, these segments can be translated and then reconstructed in the
target language, ensuring the translation retains the original context.

Speech recognition

Voice-activated assistants like Siri or Alexa rely heavily on tokenization. When a user poses a
question or command, the spoken words are first converted into text. This text is then
tokenized, allowing the system to process and act upon the request.
Sentiment analysis in reviews

Tokenization plays a crucial role in extracting insights from user-generated content, such as
product reviews or social media posts. For instance, a sentiment analysis system for e-
commerce platforms might tokenize user reviews to determine whether customers are
expressing positive, neutral, or negative sentiments. For example:

 The review: "This product is amazing, but the delivery was late."

 After tokenization: ["This", "product", "is", "amazing", ",", "but", "the", "delivery", "was",
"late", "."]

The tokens "amazing" and "late" can then be processed by the sentiment model to assign
mixed sentiment labels, providing actionable insights for businesses.

Chatbots and virtual assistants

Tokenization enables chatbots to understand and respond to user inputs effectively. For
example, a customer service chatbot might tokenize the query:

"I need to reset my password but can't find the link."

Which is tokenized as: ["I", "need", "to", "reset", "my", "password", "but", "can't", "find",
"the", "link"].
This breakdown helps the chatbot identify the user's intent ("reset password") and respond
appropriately, such as by providing a link or instructions.

Tokenization Challenges

Navigating the intricacies of human language, with its nuances and ambiguities, presents a
set of unique challenges for tokenization. Here's a deeper dive into some of these obstacles,
along with recent advancements that address them:

Ambiguity

Language is inherently ambiguous. Consider the sentence "Flying planes can be dangerous."
Depending on how it's tokenized and interpreted, it could mean that the act of piloting
planes is risky or that planes in flight pose a danger. Such ambiguities can lead to vastly
different interpretations.

Implementing Tokenization

The landscape of Natural Language Processing offers many tools, each tailored to specific
needs and complexities. Here's a guide to some of the most prominent tools and
methodologies available for tokenization:

NLTK (Natural Language Toolkit). A stalwart in the NLP community, NLTK is a comprehensive
Python library that caters to a wide range of linguistic needs. It offers both word and
sentence tokenization functionalities, making it a versatile choice for beginners and seasoned
practitioners alike.

Spacy. A modern and efficient alternative to NLTK, Spacy is another Python-based NLP
library. It boasts speed and supports multiple languages, making it a favorite for large-scale
applications.

BERT tokenizer. Emerging from the BERT pre-trained model, this tokenizer excels in context-
aware tokenization. It's adept at handling the nuances and ambiguities of language, making it
a top choice for advanced NLP projects.
Advanced techniques.

Byte-Pair Encoding (BPE). An adaptive tokenization method, BPE tokenizes based on the most
frequent byte pairs in a text. It's particularly effective for languages that form meaning by
combining smaller units.

SentencePiece. An unsupervised text tokenizer and detokenizer mainly for Neural Network-
based text generation tasks. It handles multiple languages with a single model and can
tokenize text into subwords, making it versatile for various NLP tasks.

Hugging Face Transformers

One of the most popular tools for NLP tasks, the Hugging Face Transformers library provides
a seamless integration with PyTorch, making it ideal for both research and production. This
library includes advanced tokenizers designed to work with state-of-the-art transformer
models like BERT, GPT, and RoBERTa. Key features include:

Fast tokenizers: Built using Rust, these tokenizers offer significant speed improvements,
enabling faster pre-processing for large datasets.

Support for subword tokenization: The library supports Byte-Pair Encoding (BPE),
WordPiece, and Unigram tokenization, ensuring efficient handling of out-of-vocabulary
words and complex languages.

Built-in pretrained tokenizers: Each model in the Hugging Face Transformers library comes
with a corresponding pretrained tokenizer, ensuring compatibility and ease of use. For
instance, the BERT tokenizer splits text into subwords, making it adept at handling language
nuances.

Word Embeddings
Embeddings are numeric representations of words in a lower-dimensional space, capturing
semantic and syntactic information. They play a vital role in Natural Language Processing
(NLP) tasks. This article explores traditional and neural approaches, such as TF-IDF,
Word2Vec, and GloVe, offering insights into their advantages and disadvantages.
Understanding pre-trained word embeddings provides a comprehensive picture of their
applications in various NLP scenarios. In the traditional "one-hot" representation, each word is
a vector whose dimension equals the cardinality of the vocabulary. To reduce dimensionality,
stop words are usually removed, and stemming, lemmatizing, etc. are applied to normalize the
features on which the NLP task is performed.

Word Embedding in NLP


Word Embedding is an approach for representing words and documents. Word Embedding
or Word Vector is a numeric vector input that represents a word in a lower-dimensional
space. It allows words with similar meanings to have a similar representation.
Word Embeddings are a method of extracting features out of text so that we can input
those features into a machine learning model to work with text data. They try to preserve
syntactical and semantic information. The methods such as Bag of Words
(BOW), CountVectorizer and TFIDF rely on the word count in a sentence but do not save any
syntactical or semantic information. In these algorithms, the size of the vector is the number
of elements in the vocabulary, and since most of the elements are zero, the result is a sparse matrix.
Large input vectors will mean a huge number of weights which will result in high
computation required for training. Word Embeddings give a solution to these problems.
Need for Word Embedding
 To reduce dimensionality
 To use a word to predict the words around it.
 Inter-word semantics must be captured.
Uses of Word Embeddings
 They are used as input to machine learning models.
 Take the words -> get their numeric representation -> use it in training or inference.
 To represent or visualize any underlying patterns of usage in the corpus that was used to
train them.
Prior to supplying data to a machine learning or deep learning model, it is essential to
transform text or images into numerical formats as part of the preprocessing step. Let's see
how to preprocess a simple text into tensors using the TensorFlow framework. Tensors are
simply numerical representations of any kind of data.
1. Text vectorization
Before we proceed, let's understand what a token is. A token is a minimal unit of text, which
can be a word, subword, character, or even a group of words (n-grams).
2. Embedding:
Embedding converts the vectorized output of textual data into dense, continuous vectors.
This often helps the language model to capture semantic meaning and contextual
relationships within the text.
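As a minimal, illustrative sketch (assuming TensorFlow 2.x; the sentences and layer sizes are invented for the example), the two steps above can be chained as follows:

```python
# Minimal text-to-tensor sketch with TensorFlow/Keras (illustrative data only).
import tensorflow as tf

texts = tf.constant(["the cat sat on the mat", "the dog sat on the log"])

# Step 1 - Text vectorization: map each token to an integer id.
vectorizer = tf.keras.layers.TextVectorization(max_tokens=100, output_sequence_length=8)
vectorizer.adapt(texts)
token_ids = vectorizer(texts)              # integer tensor of shape (2, 8)

# Step 2 - Embedding: map each id to a dense, trainable vector.
embedding = tf.keras.layers.Embedding(input_dim=100, output_dim=16)
vectors = embedding(token_ids)             # float tensor of shape (2, 8, 16)

print(token_ids.numpy())
print(vectors.shape)
```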
Importance of High-Quality Training Data
The importance of high-quality training data in machine learning lies in the fact that it
directly impacts the accuracy and reliability of machine learning models. For a model to
accurately learn patterns and make predictions, it needs to be trained on large volumes of
diverse, accurate, and unbiased data. If the data used for training is low-quality or contains
inaccuracies and biases, it will produce less accurate and potentially biased predictions.

The quality of datasets being used to train models applies to every type of AI model,
including Foundation Models, such as ChatGPT and Google’s BERT. The Washington Post
took a closer look at the vast datasets being used to train some of the world’s most popular
and powerful large language models (LLMs). In particular, the article reviewed the content
of Google’s C4 dataset, finding that quality and quantity are equally important, especially
when training LLMs.

In image recognition tasks, if the training data used to teach the model contains images with
inaccurate or incomplete labels, then the model may not be able to recognize or classify
similar images in its predictions accurately.
At the same time, if the training data is biased towards certain groups or demographics,
then the model may learn and replicate those biases, leading to unfair or discriminatory
treatment of certain groups. For instance, Google, too, succumbed to bias traps in a recent
incident where its Vision AI model generated racist outcomes.

Approaches for Text Representation


Traditional Approach
The conventional method involves compiling a list of distinct terms, giving each one a unique
integer value (an id), and then replacing each word in the sentence with its id. Every
vocabulary word is handled as a feature in this scheme, so a large vocabulary results in an
extremely large feature size. Common traditional methods include:
One-Hot Encoding
One-hot encoding is a simple method for representing words in natural language
processing (NLP). In this encoding scheme, each word in the vocabulary is represented as a
unique vector, where the dimensionality of the vector is equal to the size of the
vocabulary. The vector has all elements set to 0, except for the element corresponding to
the index of the word in the vocabulary, which is set to 1.
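A minimal sketch of one-hot word vectors (the vocabulary here is invented for the example):

```python
# Minimal one-hot encoding sketch with NumPy; the vocabulary is illustrative.
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))   # vector length equals the vocabulary size
    vec[index[word]] = 1.0       # 1 only at the word's own index
    return vec

print(one_hot("cat"))  # [0. 1. 0. 0. 0.]
```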
Bag of Words (BoW)
Bag-of-Words (BoW) is a text representation technique that represents a document as an
unordered set of words and their respective frequencies. It discards the word order and
captures the frequency of each word in the document, creating a vector representation.
While BoW is a simple and interpretable representation, below disadvantages highlight its
limitations in capturing certain aspects of language structure and semantics:
 BoW ignores the order of words in the document, leading to a loss of sequential
information and context making it less effective for tasks where word order is crucial,
such as in natural language understanding.
 BoW representations are often sparse, with many elements being zero resulting in
increased memory requirements and computational inefficiency, especially when
dealing with large datasets.
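As a minimal sketch of the BoW representation described above, using scikit-learn's CountVectorizer (the documents are invented for the example):

```python
# Minimal Bag-of-Words sketch with scikit-learn; documents are illustrative.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat.", "The dog sat on the log."]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)        # sparse document-term count matrix

print(vectorizer.get_feature_names_out())   # learned vocabulary
print(bow.toarray())                        # word counts per document
```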
Term frequency-inverse document frequency (TF-IDF)
Term Frequency-Inverse Document Frequency, commonly known as TF-IDF, is a numerical
statistic that reflects the importance of a word in a document relative to a collection of
documents (corpus). It is widely used in natural language processing and information
retrieval to evaluate the significance of a term within a specific document in a larger
corpus. TF-IDF consists of two components:
 Term Frequency (TF): Term Frequency measures how often a term (word) appears in a
document. It is calculated using the formula:

TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)

 Inverse Document Frequency (IDF): Inverse Document Frequency measures the
importance of a term across a collection of documents. It is calculated using the formula:

IDF(t) = log(N / number of documents containing term t), where N is the total number of documents in the corpus.

The TF-IDF score for a term t in a document d is then given by multiplying the TF and IDF
values:

TF-IDF(t, d) = TF(t, d) × IDF(t)

The higher the TF-IDF score for a term in a document, the more important that term is to
that document within the context of the entire corpus. This weighting scheme helps in
identifying and extracting relevant information from a large collection of documents, and
it is commonly used in text mining, information retrieval, and document clustering.
TF-IDF is a widely used technique in information retrieval and text mining, but its
limitations should be considered, especially when dealing with tasks that require a deeper
understanding of language semantics. For example:
 TF-IDF treats words as independent entities and doesn’t consider semantic
relationships between them. This limitation hinders its ability to capture contextual
information and word meanings.
 Sensitivity to Document Length: Longer documents tend to have higher overall term
frequencies, potentially biasing TF-IDF towards longer documents.
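A minimal sketch with scikit-learn's TfidfVectorizer, which combines the TF and IDF computations above (documents are invented for the example; scikit-learn applies smoothing and normalization, so the values differ slightly from the plain formulas):

```python
# Minimal TF-IDF sketch with scikit-learn; documents are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The cat sat on the mat.", "The dog sat on the log."]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)        # sparse matrix of TF-IDF weights

print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))          # higher weight = more document-specific term
```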
Neural Approach
Word2Vec
Word2Vec is a neural approach for generating word embeddings. It belongs to the family
of neural word embedding techniques and specifically falls under the category of
distributed representation models. It is a popular technique in natural language
processing (NLP) that is used to represent words as continuous vector spaces. Developed
by a team at Google, Word2Vec aims to capture the semantic relationships between
words by mapping them to high-dimensional vectors. The underlying idea is that words
with similar meanings should have similar vector representations. In Word2Vec every
word is assigned a vector. We start with either a random vector or one-hot vector.
There are two neural embedding methods for Word2Vec, Continuous Bag of Words
(CBOW) and Skip-gram.
Continuous Bag of Words(CBOW)

Continuous Bag of Words (CBOW) is a type of neural network architecture used in the
Word2Vec model. The primary objective of CBOW is to predict a target word based on its
context, which consists of the surrounding words in a given window. Given a sequence of
words in a context window, the model is trained to predict the target word at the center
of the window.
CBOW is a feedforward neural network with a single hidden layer. The input layer
represents the context words, and the output layer represents the target word. The
hidden layer contains the learned continuous vector representations (word embeddings)
of the input words.
The architecture is useful for learning distributed representations of words in a continuous
vector space.
Fig: 1
The hidden layer contains the continuous vector representations (word embeddings) of
the input words.
 The weights between the input layer and the hidden layer are learned during training.
 The dimensionality of the hidden layer represents the size of the word embeddings
(the continuous vector space).
Skip-Gram
The Skip-Gram model learns distributed representations of words in a continuous vector
space. The main objective of Skip-Gram is to predict context words (words surrounding a
target word) given a target word. This is the opposite of the Continuous Bag of Words
(CBOW) model, where the objective is to predict the target word based on its context. It is
shown that this method produces more meaningful embeddings.
After applying the above neural embedding methods we get trained vectors of each word
after many iterations through the corpus. These trained vectors preserve syntactical or
semantic information and are converted to lower dimensions. The vectors with similar
meaning or semantic information are placed close to each other in space.
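A minimal, illustrative sketch of training such embeddings with the gensim library (toy corpus and hyperparameters chosen only for demonstration; sg=0 selects CBOW and sg=1 selects Skip-gram):

```python
# Minimal Word2Vec training sketch with gensim (toy corpus, illustrative settings).
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]

# sg=0 -> CBOW, sg=1 -> Skip-gram.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["cat"][:5])            # first few dimensions of the learned vector
print(model.wv.most_similar("cat"))   # words closest to "cat" in the embedding space
```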

Pretrained Word-Embedding
Pre-trained word embeddings are representations of words that are learned from large
corpora and are made available for reuse in various natural language processing (NLP)
tasks. These embeddings capture semantic relationships between words, allowing the
model to understand similarities and relationships between different words in a
meaningful way.
GloVe
GloVe is trained on global word co-occurrence statistics. It leverages the global context to
create word embeddings that reflect the overall meaning of words based on their co-
occurrence probabilities. In this method, we take the corpus, iterate through it, and record the
co-occurrence of each word with the other words in the corpus, producing a co-occurrence
matrix. Words which occur next to each other get a value of 1; if they are one word apart, 1/2;
if two words apart, 1/3; and so on.
Advantages and Disadvantages of Word Embeddings
Advantages
 It is much faster to train than hand-built models like WordNet (which uses graph
embeddings).
 Almost all modern NLP applications start with an embedding layer.
 It Stores an approximation of meaning.
Disadvantages
 It can be memory intensive.
 It is corpus dependent. Any underlying bias will have an effect on the model.
 It cannot distinguish between homophones.
Eg: brake/break, cell/sell, weather/whether etc.

Positional encoding

Positional encoding is a data preprocessing technique that adds information about the order
of words in a sequence to a model. In detail, a position-dependent signal is added to each
word embedding of an input sequence to help the model incorporate the order of words. The
output of positional encoding has the same dimension as the embedding layer.
Importance of Positional Encodings
Positional encodings are crucial in Transformer models for several reasons:
 Preserving Sequence Order: Transformer models process tokens in parallel, lacking
inherent knowledge of token order. Positional encodings provide the model with
information about the position of tokens in the sequence, ensuring that the model can
differentiate between tokens based on their position. This is essential for tasks where
word order matters, such as language translation and text generation.
 Maintaining Contextual Information: In natural language processing tasks, the
meaning of a word often depends on its position in the sentence. For example, in the
sentence "The cat sat on the mat," the word "cat" plays a different role than in "The
mat sat on the cat."
 Enhancing Generalization: By incorporating positional information, transformer
models can generalize better across sequences of different lengths. This is particularly
important for tasks where the length of the input sequence varies, such as document
summarization or question answering. Positional encodings enable the model to
handle input sequences of varying lengths without sacrificing performance.
 Mitigating Symmetry: Without positional encodings, the self-attention mechanism in
Transformer models would treat tokens symmetrically, potentially leading to
ambiguous representations. Positional encodings introduce an asymmetry into the
model, ensuring that tokens at different positions are treated differently, thereby
improving the model's ability to capture long-range dependencies.

Example of Positional Encoding:


Let's consider a simple example to illustrate the concept of positional encoding in the
context of a Transformer model.
Suppose we have a Transformer model tasked with translating English sentences into
French. One of the sentences in English is:

"The cat sat on the mat."

Before the sentence is fed into the Transformer model, it undergoes tokenization, where
each word is converted into a token. Let's assume the tokens for this sentence are:

["The", "cat" , "sat", "on", "the" ,"mat"]


Next, each token is mapped to a high-dimensional vector representation through an
embedding layer. These embeddings encode semantic information about the words in the
sentence. However, they lack information about the order of the words.
Embeddings={E1,E2,E3,E4,E5,E6}
where each Ei is a 4-dimensional vector.
This is where positional encoding comes into play. To ensure that the model understands
the order of the words in the sequence, positional encodings are added to the word
embeddings. These encodings provide each token with a unique positional representation.
These positional encodings are added element-wise to the word embeddings. The
resulting vectors contain both semantic and positional information, allowing the
Transformer model to understand not only the meaning of each word but also its position
in the sequence.
This example illustrates how positional encoding ensures that the Transformer model can
effectively process and understand input sequences by incorporating information about
the order of the tokens.

They enable Transformer models to effectively process and understand input sequences,
leading to improved performance across a wide range of natural language processing
tasks.

Positional Encoding Layer in Transformers


The Positional Encoding layer in Transformers plays a critical role by providing necessary
positional information to the model. This is particularly important because the
Transformer architecture, unlike RNNs or LSTMs, processes input sequences in parallel
and lacks inherent mechanisms to account for the sequential order of tokens. The
mathematical intuition behind the Positional Encoding layer in Transformers is centred on
enabling the model to incorporate information about the order of tokens in a sequence.
Positional encodings utilize a specific mathematical formula to generate a unique
encoding for each position in the input sequence. Here’s a closer look at the methodology:
 Formula for Positional Encoding: For each position p in the sequence, and for each pair
of dimensions 2i and 2i+1 in the encoding vector:

PE(p, 2i) = sin(p / 10000^(2i / d_model))
PE(p, 2i+1) = cos(p / 10000^(2i / d_model))
Code Implementation of Positional Encoding in Transformers

The positional_encoding function generates a positional encoding matrix that is widely used in
models like the Transformer to give the model information about the relative or absolute
position of tokens in a sequence. A sketch of such a function is shown below, followed by a
breakdown of what each part does.
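A minimal sketch of such a function, written with NumPy and TensorFlow under the standard sine/cosine formulation (names and shapes match the breakdown that follows; this is an illustrative implementation, not the only possible one):

```python
# Positional encoding sketch (sine on even dimensions, cosine on odd dimensions).
import numpy as np
import tensorflow as tf

def positional_encoding(position, d_model):
    pos = np.arange(position)[:, np.newaxis]    # shape (position, 1)
    dims = np.arange(d_model)[np.newaxis, :]    # shape (1, d_model)

    # Each position index divided by 10000 raised to (2 * index / d_model).
    angle_rads = pos / np.power(10000, (2 * (dims // 2)) / np.float32(d_model))

    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])   # even indices: sine
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])   # odd indices: cosine

    pos_encoding = angle_rads[np.newaxis, ...]          # shape (1, position, d_model)
    return tf.cast(pos_encoding, dtype=tf.float32)

print(positional_encoding(position=50, d_model=128).shape)  # (1, 50, 128)
```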

1. Function Parameters:
 position: Total positions or length of the sequence.
 d_model: Dimensionality of the model's output.
2. Generating the Base Matrix:
 angle_rads: Creates a matrix where rows represent sequence positions and
columns represent feature dimensions. Values are scaled by dividing each position
index by 10000 raised to (2 * index / d_model).
3. Applying Sine and Cosine Functions:
 Even indices: Apply the sine function to encode positions.
 Odd indices: Apply the cosine function for a phase-shifted encoding.
4. Creating the Positional Encoding Tensor:
 The matrix is expanded to match input shape expectations of models like
Transformers and cast to tf.float32.
5. Output:
 Returns a TensorFlow tensor of shape (1, position, d_model), ready to be added to
input embeddings to incorporate positional information.

Transformer Architecture: Attention mechanism


The transformer model is a neural network architecture that made a radical shift in the
field of machine learning. At the time of writing, transformer variants have long dominated
popular performance leaderboards in almost every natural language processing task. What is
more, recent transformer-like architectures have become the state of the art in the computer
vision field as well.
What Are Transformers?
Transformers were first developed to solve the problem of sequence transduction, or
neural machine translation, which means they are meant to solve any task that transforms
an input sequence to an output sequence. This is why they are called “Transformers”.
But let’s start from the beginning.
What Are Transformer Models?
A transformer model is a neural network that learns the context of sequential data and
generates new data out of it.
To put it simply:
A transformer is a type of artificial intelligence model that learns to understand and
generate human-like text by analyzing patterns in large amounts of text data.
It is composed of two main parts:
The encoder takes in our input and outputs a matrix representation of that input. For
instance, the English sentence “How are you?”
The decoder takes in that encoded representation and iteratively generates an output. In
our example, the translated sentence “¿Cómo estás?”


The Encoder WorkFlow


The encoder is a fundamental component of the Transformer architecture. The primary
function of the encoder is to transform the input tokens into contextualized
representations. Unlike earlier models that processed tokens independently, the
Transformer encoder captures the context of each token with respect to the entire
sequence.
Its structure is composed as follows:
Fig: 3

STEP 1 - Input Embeddings


The encoder begins by converting input tokens - words or subwords - into vectors using
embedding layers. These embeddings capture the semantic meaning of the tokens and convert
them into numerical vectors. The embedding itself happens only in the bottom-most encoder.
All the encoders receive a list of vectors, each of size 512 (fixed size). In the bottom
encoder, these are the word embeddings, but in the other encoders, they are the output of the
encoder directly below them.
Fig: 4

STEP 2 - Positional Encoding


Since Transformers do not have a recurrence mechanism like RNNs, they use positional
encodings added to the input embeddings to provide information about the position of
each token in the sequence. This allows them to understand the position of each word
within the sentence.
To do so, the researchers suggested employing a combination of various sine and cosine
functions to create positional vectors, enabling the use of this positional encoder for
sentences of any length.
In this approach, each dimension is represented by unique frequencies and offsets of the
wave, with the values ranging from -1 to 1, effectively representing each position.

Fig: 5
STEP 3 - Stack of Encoder Layers
The Transformer encoder consists of a stack of identical layers (6 in the original
Transformer model).
The encoder layer serves to transform all input sequences into a continuous, abstract
representation that encapsulates the learned information from the entire sequence. This
layer comprises two sub-modules:
A multi-headed attention mechanism.
A fully connected network.
Additionally, it incorporates residual connections around each sublayer, which are then
followed by layer normalization.

Fig: 6

Attention:
Attention, in general, refers to the ability to focus on one thing and ignore other things
that seem irrelevant at the time. In machine learning, this concept is applied by teaching
the model to focus on certain parts of the input data and disregard others to better solve
the task at hand.
In tasks like machine translation, for example, the input data is a sequence of some text.
When we humans read a piece of text, it seems natural to attend to some parts more than
others. Usually, it’s the who, when, and where part of a sentence that captures our
attention. Since this is a skill we develop from birth, we don’t acknowledge its importance.
But without it, we wouldn’t be able to contextualize.
For instance, if we see the word bank, in our heads, we might think about a financial
institution or a place where blood donations are stored, or even a portable battery. But if
we read the sentence ” I am going to the bank to apply for a loan”, we immediately catch
up on what bank is mentioned. This is because we implicitly attended to a few clues. From
the “going to” part, we understood that a bank is a place in this context, and from the
“apply for a loan” part, we understood that one can receive a loan there.
The whole sentence gives out information that adds up to create a mental picture of what
a bank is. Suppose a machine could do the same thing as we do. In that case, most of the
significant natural language processing problems, such as words with multiple meanings,
sentences with multiple grammatical structures, and uncertainty about what a pronoun
refers to, would be solved.
The Birth of Transformers
The Transformer architecture, introduced in the paper “Attention Is All You Need” by
Vaswani et al. in 2017, redefined the game. It relied on the self-attention mechanism to
process sequences in parallel, making it highly efficient. This was the birth of the
Transformer model.
The Building Blocks of Attention Mechanisms
Self-Attention: The Basics
Self-attention, also known as scaled dot-product attention, is a mechanism that allows a
Transformer to weigh the importance of different words in a sentence when processing a
specific word. It can be likened to a spotlight focusing on different sentence parts as the
model processes each word. This mechanism is mathematically defined as follows:
Query, Key, and Value: For a given word, the self-attention mechanism computes three
vectors: Query (Q), Key (K), and Value (V). These vectors are learned during training.
Attention Scores: The model calculates attention scores by taking the dot product of the
Query vector for the current word and the Key vectors for all the words in the input
sequence. These scores indicate how much focus each word should receive.
Softmax and Scaling: The attention scores are passed through a softmax function to get a
probability distribution. This distribution is then used to weigh the Value vectors, deciding
how much each word’s information should contribute to the current word’s
representation.
Weighted Sum: Finally, the Value vectors are weighted by the attention scores and
summed to create the new representation of the current word.
Multi-Head Attention
In practice, Transformers use what is known as multi-head attention. Instead of relying on
a single attention mechanism, the model uses multiple heads or sets of Query, Key, and
Value vectors. Each head can focus on different input parts, capturing different aspects of
word relationships.
Positional Encoding
One challenge with self-attention is that it doesn’t inherently capture the order of words
in a sequence. To address this, Transformers incorporate positional encoding into their
input embeddings. Positional encodings are added to the word embeddings, allowing the
model to consider the position of each word in the sequence.
Self-Attention:
The self-attention mechanism is at the core of what makes Transformers powerful. Here
are some reasons why it’s so essential:

Long-Range Dependencies

Self-attention can capture relationships between words that are far apart in a sequence.
In contrast, RNNs struggle with long-range dependencies because information must flow
step by step.

Parallelization

Traditional sequence models like RNNs process data sequentially, one step at a time. Self-
attention, on the other hand, can process the entire sequence in parallel, making it more
computationally efficient.

Adaptability

The attention mechanism is not limited to language processing. It can be adapted for
various tasks and domains. For instance, in computer vision, self-attention mechanisms
can capture relationships between pixels in an image.
Attention Mechanisms in Real-Life

BERT: The Language Understanding Transformer

The BERT model, developed by Google, uses self-attention to pre-train on a massive text
corpus. BERT has set new benchmarks in various NLP tasks, from sentiment analysis to
text classification.

GPT-3: Language Generation at Scale

OpenAI’s GPT-3 is one of the largest language models in existence. It uses self-attention to
generate coherent and contextually relevant text, making it ideal for applications like
chatbots and language translation.

Image Analysis

The power of attention mechanisms isn’t limited to text. In computer vision, models like
the Vision Transformer have demonstrated that self-attention can capture complex
relationships between pixels in an image, enabling state-of-the-art image recognition.

Potential and Pitfalls

Model Size

Large-scale models with multiple heads and layers can become computationally
expensive. This can limit the accessibility of these models to a broader range of
applications.

Interpretability

The internal workings of attention mechanisms can be challenging to interpret.


Understanding why a model made a specific prediction can be challenging, especially in
critical applications like healthcare.

A Primer on Transformers

While far from perfect, transformers are our best current solution to
contextualization. The type of attention used in them is called self-attention. This
mechanism relates different positions of a single sequence to compute a representation of
the same sequence. It is instrumental in machine reading, abstractive summarization, and
even image description generation.

Since they were used initially for machine translation, the Transformers are based on the
encoder-decoder architecture, meaning that they have two major components. The first
component is an encoder which takes a sequence as input and transforms it into a state
with a fixed shape. The second component is the decoder. It maps the encoded state of a
fixed shape to an output sequence.

Transformers and attention mechanisms have revolutionized the field of deep learning,
offering a powerful way to process sequential data and capture long-range dependencies.
 Attention mechanisms are crucial in transformers, allowing different tokens to be
weighted based on their importance, enhancing model context and output quality.
 Transformers operate on self-attention, enabling the capture of long-range
dependencies without sequential processing.
 Multi-head attention in transformers enhances model performance by allowing the
model to focus on different aspects of the input data simultaneously.
 Transformers outperform RNNs and LSTMs in handling sequential data due to their
parallel processing capabilities.
 Applications of transformers span across NLP, computer vision, and state-of-the-art
model development.

Transformer Architecture: Attention Mechanism

Attention Mechanism has been a powerful tool for improving the performance of Deep
Learning and NLP models by allowing them to extract the most relevant and important
information from data, giving them the ability to simulate cognitive abilities of humans.
This section explores the main types of Attention Mechanism models and the main approaches
to Attention. The Attention Mechanism enhances the performance of models by introducing
the ability to mimic cognitive attention the way humans do, in order to make relevant
predictions by understanding the context of the given data. Attention can be defined as
'Memory per unit of time'.
It is used in Deep Learning models to selectively focus on certain parts of the input and
assign weights to them based on their relevance to the current task, such that the model
can assign more resource and 'attention' to the most important parts of the input while
ignoring the less relevant parts. In Natural Language Processing (NLP), attention
mechanisms have been particularly successful in improving the performance of machine
translation, text summarization, and sentiment analysis models.

Fig: 7

The encoder’s role is to meticulously extract features from the input sequence. This is
achieved through a series of layers, each comprising a multi-head attention mechanism
followed by a feed-forward neural network. These layers are further enhanced with
normalization and residual connections to ensure stability during training. Remarkably,
the entire sequence is processed in parallel, which is a stark departure from the sequential
processing of traditional recurrent neural networks (RNNs).
A lot is going on in the diagram, but for our purposes, it’s only worth noticing that the
encoder module here is painted in blue while the decoder is in green. We can also see that
both the encoder and decoder modules use a layer called Multi-Head Attention. Let’s also
forget about the multi-headed part for now and focus only on what is inside one “head”.

The main component is called scaled dot-product attention, and it's very elegant in that it
achieves so much with just a few linear algebra operations. It's made up of three matrices, Q,
K, and V, called the query, key, and value matrices, each with its own dimension (d_k for the
queries and keys, d_v for the values).
The concept of using queries, keys, and values is directly inspired by how databases work.
Each database has its data values indexed by keys, and users can retrieve the data
by making a query.

The self-attention operation is very similar, except that there isn’t a user or a controller
issuing the query, but it’s learned from the data. By the use of backpropagation, the
neural network updates its Q, K, and V matrices in order to mimic a user-database
interaction. To prove that this is possible, let’s reimagine the retrieval process as a vector
dot product:

output = α · v = Σ_i α_i v_i

where α is a one-hot vector consisting of only ones and zeroes, and v is a vector with the
values we’re retrieving.

In this case, the vector α is the de facto query, because the output will consist only of
the values of v at the positions where α is 1:

Fig: 8
Now let’s remove the restriction for the query vector and allow float values between 0
and 1. By doing that, we would get a weighted proportional retrieval of the values:

Fig: 9

Scaled-Dot Product Attention Mechanism

The scaled dot-product attention uses vector multiplication in exactly the same way. To
obtain the final weights on the values, first the dot product of the query with all keys is
computed and then divided by √d_k. Then a softmax function is applied.

In practice, however, these vector multiplications happen simultaneously because the
queries, keys, and values are packed together into the matrices Q, K, and V, as already
mentioned. The final computation is, therefore:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

The formula can also be viewed as the following diagram:


Fig: 10

The steps involved are as follows:

1. Input Sequence (Query, Key and Value)- The input sequence is transformed into
three vectors: query, key, and value. These vectors are learned during training and
represent different aspects of the input sequence.
2. First Matrix Multiplication- Computes the similarity between the Query Matrix and
the Key Matrix by performing a dot product operation. The resulting score
represents the relevance of each element in the input sequence to the current
state of the model.
3. Scaling of Matrix- To prevent the scores from becoming too large and to avoid the
vanishing gradient problem, the matrix is scaled, which stabilizes the gradients
during training.
4. SoftMax Function- The scaled scores are passed through a softmax function to
normalize them to ensure that the attention weights sum up to 1. It is used to
convert a set of numbers into a probability distribution and enables the model to
learn which parts of the input are most relevant to the current task.
5. Second Matrix Multiplication- The output from the SoftMax function is then
multiplied with the Value Matrix which is the final output of the Attention Layer.
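A minimal NumPy sketch that follows the five steps above (toy shapes; a real implementation would also handle batching and masking):

```python
# Scaled dot-product attention sketch in NumPy (no batching or masking).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # steps 2-3: dot-product similarity, then scaling
    weights = softmax(scores)         # step 4: normalize scores into attention weights
    return weights @ V                # step 5: weighted sum of the value vectors

# Toy example: 4 tokens, dimension 8 for queries, keys and values.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```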
Multi-Head Attention Mechanism
Instead of computing a single attention over the large matrices Q, K and V, it is
found to be more efficient to divide them into multiple matrices of smaller dimensions
and perform scaled dot-product attention on each of those smaller matrices. Each attention head
can focus on calculating different types of relationships between the inputs and create
specific contextualized embeddings. As shown in the diagram below, these embeddings
can then be concatenated and put through an ordinary linear neural network layer,
together forming the final output of the so-called Multi-Headed Attention module. As it
turns out, this approach not only improves the model's performance but improves training
stability as well:

Fig: 11

The decoder, on the other hand, is tasked with generating the output sequence. It mirrors
the encoder’s structure but includes an additional layer of cross-attention that allows it to
focus on relevant parts of the input sequence as it produces the output. The advantages of
the Transformer architecture are manifold. Its parallel processing capabilities drastically
accelerate training and inference times. Coupled with self-attention, the architecture adeptly
handles long-range dependencies, capturing intricate relationships within the data that span
considerable sequence lengths.

Obtain Query, Key and Value matrices- First choose h, the number of heads; a value of 8 is
commonly used. Then obtain h sets of Q, K and V matrices, giving the triples (Q1, K1, V1)
through (Qh, Kh, Vh).

Scaled-Dot Product Attention- We perform scaled-dot product attention for each triple
(Qi, Ki, Vi), for i = 1 to h.

Concatenation and Linear Transformation- The resulting attention outputs are
concatenated (Concat) and passed through a learnable linear
transformation to obtain a final representation.

MultiHead attention uses multiple attention heads to attend to different parts of the input
sequence which allows the model to learn different relationships between the different parts
of the input sequence. For example, the model can learn to attend to the local context of a
word, as well as the global context of the entire input sequence.

Additive-Attention Mechanism

This type of mechanism is used in neural networks to learn long-range dependencies


between different parts of a sequence. It works by computing a weighted sum of the hidden
states of the encoder, where the weights are determined by how relevant each hidden state
is to the current decoding step. First, the encoder reads the input sequence and produces a
sequence of hidden states which represent the encoder's understanding of the input
sequence.

The decoder then begins to generate the output sequence, one word at a time. The decoder
takes the previous word, the current hidden state, and the attention weights as input. The
attention weights are used to compute a weighted sum of the encoder's hidden states and
this weighted sum is then used to update the decoder's hidden state. The decoder then
generates the next word, and the process is repeated until the decoder generates the end-of-
sequence token.
Dynamic-Convolution Attention

Dynamic convolution attention was first introduced in the paper "Dynamic Convolution:
Attention over Convolution Kernels" by Chen et al. (2019). It increases model complexity
without increasing the depth or width of a network. Instead of using a single kernel per layer,
dynamic convolution clusters multiple parallel convolution kernels in a dynamic method
based upon their attentions, which are dependent on the input. Aggregation of multiple
kernels is not only computationally efficient due to the small kernel size, but it also has
representation power as these kernels are aggregated in a non-linear fashion via attention. It
can be integrated easily into existing network architectures.

Entity-Aware Attention

Entity-aware attention was first introduced in the paper "Entity-Aware Attention for Relation
Extraction" by Sun et al. (2019). In this paper, the authors showed that entity-aware
attention could be used to improve the performance of relation extraction models. It is an
attention mechanism that focuses on specific entities or entities' interactions within a
sequence or a context.

In traditional attention mechanisms, attention weights are typically computed based on the
similarity between the query and key vectors derived from the input sequence. However, in
entity-aware attention, the attention mechanism takes into account the entities mentioned
in the input sequence and their respective roles using Named Entity Recognition (NER) or
Entity Linking.

Location-Based Attention

Location-based attention, also known as Positional Attention, was first introduced in the
paper "Spatial Attention in Natural Language Processing" by Xu et al. (2015). It takes into
account the relative position or location of elements within a sequence or context using
Learned Position Embeddings or Sinusoidal Encoding. It is commonly used in models that
process sequential data, such as natural language processing (NLP) tasks, where the position
of words or tokens can carry important information.
This can be particularly useful in tasks where the order or position of elements matters, such
as machine translation or text generation, as it helps the model capture sequential
dependencies and generate more accurate and coherent outputs.

Global Attention

While scaled dot-product and multi-head attention are methods for computing attention, there are two main approaches to how attention is applied: global attention and local attention. Global attention attends to the entire input sequence, while local attention attends only to a subset of it.

Global attention can capture long-range dependencies between different parts of the input
sequence and contextual information in the input sequence, as it allows the model to attend
to all input elements at each decoding step. It can be computationally expensive as it has to
attend to the entire sequence. It is typically used for tasks that require the model to consider
the entire input sequence, such as text summarization.

Local Attention

Local attention can be more effective for tasks that require the model to focus on a specific
part of the input sequence. Local attention is easier to train as it only has to learn to attend
to a subset of the input sequence. It is typically used for tasks that require the model to focus
on a specific part of the input sequence, such as machine translation.
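One simple way to realise local attention is with a banded mask that blocks positions outside a fixed window; the sketch below is illustrative, and the window size is an assumption.

```python
# A simple sketch of local attention via a banded mask: each position may only
# attend to neighbours within a fixed window (the window size is an assumption).
import numpy as np

def local_attention_mask(seq_len, window=2):
    idx = np.arange(seq_len)
    # True where |i - j| <= window; attention outside the band is blocked
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = local_attention_mask(6, window=1)
# scores outside the band would be set to a large negative value before softmax,
# so each position only attends to its local neighbourhood
```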

Importance of Attention Mechanisms

The advent of attention mechanisms has been nothing short of revolutionary in the realm of
deep learning. Attention allows models to dynamically focus on pertinent parts of the input
data, akin to the way humans pay attention to certain aspects of a visual scene or
conversation. This selective focus is particularly crucial in tasks where context is key, such as
language understanding or image recognition.

In the context of transformers, attention mechanisms serve to weigh the influence of different input tokens when producing an output. This is not merely a replication of human attention but an enhancement, enabling machines to surpass human performance in certain tasks. Consider the following points that underscore the importance of attention mechanisms:

 They provide a means to handle variable-sized inputs by focusing on the most relevant
parts.

 Attention-based models can capture long-range dependencies that earlier models like
RNNs struggled with.

 They facilitate parallel processing of input data, leading to significant improvements in computational efficiency.

Encoder-Decoder Model

At the heart of the encoder-decoder model lies a symphony of sequential data translation,
where the encoder processes the input sequence and distills it into a fixed-length
representation, often referred to as the context vector. This vector serves as a condensed
summary of the input, capturing its essence for the decoder to interpret.

Fig: 12
The decoder, mirroring the encoder's structure, is initialized with the context vector as its starting hidden state. It generates the first token of the output sequence and then produces each subsequent token, with every prediction influenced by the previously generated tokens and by the context vector. This iterative process continues until an end-of-sequence token is produced or a predefined maximum sequence length is reached.

Types of Attention

Here are some types of attention mechanisms used in Transformer architecture:

1. Scaled Dot-Product Attention

The Scaled Dot-Product Attention is the fundamental building block of the Transformer's
attention mechanism. It involves three main components: queries (Q), keys (K), and values
(V). The attention score is computed as the dot product of the query and key vectors,
scaled by the square root of the dimension of the key vectors. This score is then passed
through a softmax function to obtain the attention weights, which are used to compute a
weighted sum of the value vectors.
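A compact NumPy sketch of this computation (with optional masking) might look as follows; it mirrors the formula softmax(QK^T / sqrt(d_k))V rather than any particular library implementation.

```python
# A compact NumPy sketch of scaled dot-product attention, mirroring
# softmax(QK^T / sqrt(d_k)) V; not tied to any particular library implementation.
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key similarities
    if mask is not None:
        scores = np.where(mask, scores, -1e9)          # block disallowed positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax -> attention weights
    return weights @ V                                 # weighted sum of values
```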

2. Multi-Head Attention

Multi-Head Attention enhances the model's ability to focus on different parts of the input
sequence simultaneously. It involves multiple attention heads, each with its own set of
query, key, and value matrices. The outputs of these heads are concatenated and linearly
transformed to produce the final output. This allows the model to capture different
features and dependencies in the input sequence.

3. Self-Attention
Self-Attention, also known as intra-attention, allows the model to consider different
positions of the same sequence when computing the representation of a word. In the
context of the Transformer, self-attention is applied in both the encoder and decoder layers.
It enables the model to capture long-range dependencies and relationships within the input
sequence.

It is a mechanism that allows a model to attend to different parts of the same input
sequence. This is done by computing a weighted sum of the input sequence, where the
weights are determined by how relevant each part of the sequence is to the current task.

The basic idea behind self-attention is to compute attention weights for each word/token in
a sequence with respect to all other words/tokens in the same sequence. These attention
weights indicate the importance or relevance of each word/token to the others.

4. Encoder-Decoder Attention
Encoder-Decoder Attention, also known as cross-attention, is used in the decoder layers of
the Transformer. It allows the decoder to focus on relevant parts of the input sequence
(encoded by the encoder) when generating each word of the output sequence. This type
of attention ensures that the decoder has access to the entire input sequence, helping it
produce more accurate and contextually appropriate translations.
5. Causal or Masked Self-Attention
Causal or Masked Self-Attention is used in the decoder to ensure that the prediction for a
given position only depends on the known outputs at positions before it. This is crucial for
tasks like language modeling, where future tokens should not be visible during training.
The attention scores for future tokens are masked out, ensuring that the model cannot
look ahead.
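A minimal NumPy sketch of this masking: a lower-triangular mask keeps only positions at or before the current one, and masked scores are pushed toward a very large negative value before the softmax.

```python
# A minimal NumPy sketch of causal masking: a lower-triangular mask keeps only
# positions at or before the current one, so the model cannot look ahead.
import numpy as np

seq_len = 5
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
scores = np.random.randn(seq_len, seq_len)
scores = np.where(causal_mask, scores, -1e9)           # mask out future positions
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
# row i now assigns (near-)zero attention weight to any position j > i
```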

Normalization - LayerNorm, RMSNorm

In machine learning, normalization is a statistical technique with various applications. There are two main forms of normalization, namely data normalization and activation normalization. Data normalization (or feature scaling) includes methods that rescale input data so that the features have the same range, mean, variance, or other statistical properties.

Normalization is often used to:

1. increase the speed of training convergence,
2. reduce sensitivity to variations and feature scales in input data,
3. reduce overfitting,
4. and produce better model generalization to unseen data.

Normalization techniques are often theoretically justified as reducing covariate shift, smoothing optimization landscapes, and increasing regularization, though they are mainly justified by empirical success.

Normalization has been a very crucial step during data preprocessing in traditional Machine
Learning that involves scaling the features of a dataset to a similar range to ensure that no
particular feature dominates the learning process due to its scale. Some common data
normalization techniques include Min-Max scaling and Z-score normalization. In neural
networks, weights are also normalized to ensure that the weights do not become too large or
too small which can lead to numerical instability during training. In deep learning,
normalization can be applied in two key areas:

1. Input Data: The data that feeds into the neural network (e.g., features like f1, f2, f3) can be normalized before entering the network. This step is similar to what is done in traditional machine learning.

2. Hidden Layer Activations: We can also normalize the activations (outputs) of hidden layers in the neural network. This is often done to stabilize and accelerate training, especially in deep networks.

Let's look at why normalization is so important in deep learning. As a neural network is trained and its weights are updated, some weights can become very large. When that happens, the activations tied to those weights also become large, making it harder for the model to learn effectively: training slows down and can become unstable.
Normalization helps fix this by keeping activations within a stable range. This not only makes the training process more stable but also speeds it up, allowing the model to learn more efficiently.

Another big benefit of normalization is that it prevents a problem called internal covariate shift. This happens when the input data's distribution changes as it moves through the layers of the network, which can confuse the model. By normalizing the activations, we keep these distributions consistent, so the model can keep learning without getting thrown off.

Layer Normalization

Layer Normalization directly estimates the normalization statistics from the summed inputs
to the neurons within a hidden layer so the normalization does not introduce any new
dependencies between training cases. It works well for RNNs and improves both the training
time and the generalization performance of several existing RNN models. More recently, it
has been used with Transformer models.

For a standard feed-forward neural network, LayerNorm standardizes the weight-summed inputs a to the neurons of a layer as

ā(i) = g(i) · (a(i) − μ) / σ,  with μ = (1/n) Σ a(i) and σ² = (1/n) Σ (a(i) − μ)²,

where a(i) is the ith value of the vector a, ā(i) is the ith value of the normalized vector ā, g is the gain parameter used to re-scale the standardized summed inputs, and μ and σ² are the mean and variance statistics, respectively, estimated from the raw summed inputs a.
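A minimal NumPy sketch of this computation for a single vector of summed inputs (the bias term used in some formulations is omitted here):

```python
# A minimal NumPy sketch of LayerNorm for a single vector of summed inputs a,
# matching the statistics above (bias term omitted).
import numpy as np

def layer_norm(a, g, eps=1e-5):
    mu = a.mean()                        # mean of the summed inputs
    sigma = a.std()                      # standard deviation of the summed inputs
    return g * (a - mu) / (sigma + eps)  # re-centre, standardize and re-scale
```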

A well-known explanation of the success of LayerNorm is its re-centering and re-scaling invariance property. The former makes the model insensitive to shift noise on both inputs and weights, and the latter keeps the output representations intact when both inputs and weights are randomly scaled. The authors of RMSNorm hypothesize that the re-scaling invariance, rather than the re-centering invariance, is the reason for LayerNorm's success. LayerNorm stabilizes the training of deep neural networks by regularizing neuron dynamics within a layer via mean and variance statistics. Due to its simplicity and the fact that it requires no dependencies among training cases, LayerNorm has been widely applied to different neural architectures, and in some cases it has been found essential for successfully training a model. However, as networks grow larger and deeper, its computational overhead becomes severe. As a result, the efficiency gain from faster and more stable training (in terms of the number of training steps) is counter-balanced by an increased computational cost per training step, which diminishes the net efficiency. One major feature of LayerNorm that is widely regarded as contributing to this stabilization is its re-centering invariance property. The RMSNorm paper argues that this mean normalization does not reduce the variance of hidden states or model gradients, and hypothesizes that it has little impact on the success of LayerNorm.

Root Mean Square Layer Normalization

Root Mean Square Normalization, or RMSNorm, regularizes the summed inputs to a neuron in one layer according to the root mean square (RMS), giving the model a re-scaling invariance property and an implicit learning-rate adaptation ability. This is because the scale of the activations influences the magnitude of weight updates during training; RMSNorm effectively adjusts the learning rate based on the RMS value of the inputs, contributing to stable training and faster convergence. Hence, RMSNorm focuses only on re-scaling invariance and regularizes the summed inputs simply according to the RMS statistic:

ā(i) = (a(i) / RMS(a)) · g(i),  where RMS(a) = sqrt((1/n) Σ a(i)²).

Intuitively, RMSNorm simplifies LayerNorm by removing the mean statistic entirely, at the cost of sacrificing the invariance that mean normalization affords. When the mean of the summed inputs is zero, RMSNorm is exactly equal to LayerNorm. Through various experiments, the authors have shown that this property is not fundamental to the success of LayerNorm and that RMSNorm is similarly or more effective.
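A corresponding NumPy sketch of RMSNorm, again for a single vector of summed inputs:

```python
# A NumPy sketch of RMSNorm: the mean statistic is dropped and only the
# root mean square of the summed inputs is used for re-scaling.
import numpy as np

def rms_norm(a, g, eps=1e-8):
    rms = np.sqrt(np.mean(a ** 2) + eps)
    return g * a / rms

# If the summed inputs a happen to have zero mean, rms_norm(a, g) gives the
# same result as LayerNorm (up to the epsilon term).
```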
Encoder, Decoder

Encoders and decoders are key parts of the Transformer model, a popular deep learning architecture used for natural language processing (NLP). Encoders in Transformers are neural network layers that process the input sequence and produce a continuous representation, or embedding, of the input. The decoder then uses these embeddings to generate the output sequence. The encoder typically consists of multiple self-attention and feed-forward layers, allowing the model to process and understand the input sequence effectively.

Fig: 13
The transformer encoder architecture typically consists of multiple layers, each of which
includes a self-attention mechanism and a feed-forward neural network. The encoder
processes the input sequence and produces a continuous representation, or embedding, of
the input, which is then passed to the decoder to generate the output sequence.

The decoder likewise consists of multiple layers, each including a self-attention mechanism and a feed-forward network. It uses the embeddings produced by the encoder, together with its own internal states, to generate the output sequence. Its self-attention layers allow it to consider the context and dependencies between different parts of the output when generating the data.
Need for an Encoder

The Transformer architecture is a neural network model that uses an encoder to process and
encode input data.

The encoder plays several important roles, such as:

Extracting useful features and patterns from complex and unstructured input data, which is crucial for understanding the input and making predictions.

Capturing relationships between different parts of the input through self-attention mechanisms. This allows the model to consider the context and dependencies between different parts of the input when making predictions, which is important for tasks such as machine translation, where the meaning of a word can depend on the words around it.

The encoder also plays a crucial role in the encoder-decoder architecture, a common structure used in natural language processing tasks such as machine translation and language generation.

In this architecture, the encoder processes and encodes the input data, and the decoder
generates the output data based on the encoded representation.

The encoded representation provided by the encoder serves as the “context” for the
decoder, which helps it generate more accurate and coherent output by capturing the
essence of the input.

Overall, the encoder is a vital component of the Transformer architecture. It is instrumental in the model's ability to process and understand complex input data by providing a compact representation that captures the essence of the input.

In the first step of the Transformer process, the input data is transformed into a vector
representation using an embedding layer. This layer maps each word in the input to a fixed-
length vector, which helps the model capture the meaning and context of the words.

After the input data has been embedded, it is augmented with positional encoding.
The positional encoding is a series of sinusoidal functions that encode the relative position of
each word in the input sequence, and it is added to the input embeddings element-wise. This
is necessary because the transformer has no inherent understanding of the order of the input
sequence, and the position of each word in the sequence can affect its meaning.
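The sinusoidal encoding can be sketched in NumPy as follows; the resulting matrix is simply added to the embedding matrix element-wise.

```python
# A NumPy sketch of the sinusoidal positional encoding; the resulting matrix is
# added element-wise to the token embeddings.
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                       # positions 0..seq_len-1
    i = np.arange(d_model)[None, :]                         # embedding dimensions
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])                    # sine on even dimensions
    pe[:, 1::2] = np.cos(angle[:, 1::2])                    # cosine on odd dimensions
    return pe

# embeddings = embeddings + positional_encoding(seq_len, d_model)
```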

The next step in the Transformer process is the multi-headed attention stage. This is where
the transformer differs significantly from other models, using self-attention mechanisms to
capture dependencies between different parts of the input data. This enables the transformer
to capture the input data's long-range dependencies and contextual relationships.

Feed Forward Network and Softmax

The Transformer model revolutionizes language processing with its unique architecture,
which includes a crucial component known as the Feedforward Network (FFN). Positioned
within both the encoder and decoder modules of the Transformer, the FFN plays a vital role
in refining the data processed by the attention mechanisms.

Fig: 14
The FFN within both the encoder and decoder of the Transformer is constructed as a fully
connected, position-wise network. This design means that each position in the input
sequence is processed separately but in the same manner, which is crucial for maintaining
the positional integrity of the input data.

Key Characteristics of the FFN

1. Fully Connected Layers:

The FFN comprises two linear (fully connected) layers that transform the input data. The first layer expands the input dimension from d_model = 512 to a larger dimension d_ff = 2048, and the second layer projects it back to d_model.

2. Activation Function:

A Rectified Linear Unit (ReLU) activation function is applied between these two linear layers.
This function is defined as ReLU(x)=max(0,x) and is used to introduce non-linearity into the
model, helping it to learn more complex patterns.

3. Position-wise Processing:

Despite the sequential nature of the input data, each position (i.e., each word’s
representation in a sentence) is processed independently with the same FFN. This is akin to
applying the same transformation across all positions, ensuring uniformity in extracting
features from different parts of the input sequence.

Mathematical Representation

The operations within the FFN can be mathematically described by the following equations:

FFN(x) = max(0, xW1 + b1)W2 + b2

Where:

W1 and W2 are the weight matrices for the first and second linear layers, respectively.

b1 and b2 are the biases for these layers.

The ReLU activation is applied element-wise after the first linear transformation.

Example of FFN Processing


Consider a simplified example where the input x is a vector representing a single word’s
output from the post-LN stage:

x = [0.5, −0.2, 0.1, …] (a 512-dimensional vector)

The first layer of the FFN transforms this vector into a higher 2048-dimensional space, adds a
bias, and applies the ReLU activation:

x′=max(0,xW1+b1)

Assuming non-negative outputs from ReLU for simplicity, the second layer then projects this
vector back down to the original 512-dimensional space:

FFN output=x′W2+b2

This output is then normalized by a subsequent post-LN step and either fed into the next
layer of the encoder or used as part of the input to the multi-head attention layer in the
decoder.
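A PyTorch-style sketch of this position-wise feed-forward network, using the dimensions quoted above (d_model = 512, d_ff = 2048):

```python
# A PyTorch-style sketch of the position-wise feed-forward network with the
# dimensions quoted above (d_model = 512, d_ff = 2048).
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # expand: 512 -> 2048
        self.linear2 = nn.Linear(d_ff, d_model)   # project back: 2048 -> 512
        self.relu = nn.ReLU()

    def forward(self, x):                         # x: (batch, seq_len, d_model)
        return self.linear2(self.relu(self.linear1(x)))

out = PositionwiseFFN()(torch.randn(2, 10, 512)) # shape (2, 10, 512)
```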

Softmax

Softmax is a mathematical function that takes a set of numbers and transforms them into a
set of probabilities between 0 and 1. These probabilities always add up to 1. It’s commonly
used in neural networks to convert raw scores from the network into a probability
distribution.
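A small NumPy sketch of softmax, including the max-subtraction trick for numerical stability that is discussed later in this section:

```python
# A small NumPy sketch of softmax with max-subtraction for numerical stability.
import numpy as np

def softmax(x):
    x = x - x.max()          # subtracting the max keeps exp() from overflowing
    e = np.exp(x)
    return e / e.sum()       # probabilities that sum to 1

print(softmax(np.array([2.0, 1.0, 0.1])))   # approx. [0.659, 0.242, 0.099]
```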

Softmax in Transformers

Transformers rely on a mechanism called self-attention to understand the relationships between words in a sequence. This mechanism involves calculating scores that indicate how relevant each word is to the current word being processed.

Softmax comes into play after these scores are computed. It takes these scores, which can be
any number, and converts them into a probability distribution. This distribution tells the
model how likely it is that each word is important in the current context.

This function plays a crucial role in the model's attention mechanism.

Fig: 15
Softmax is commonly employed to convert raw attention scores into a probability
distribution, ensuring that the sum of attention weights equals 1. This normalization allows
the model to effectively focus on certain parts of the input sequence.

In most classification networks, softmax is used only as the last layer in order to generate the final probabilities used for classification, and thus represents only a small fraction of computation time and energy. However, this is no longer true for Transformer networks, which use softmax as a key component of the attention mechanism. For these networks, softmax can become a significant bottleneck. The softmax operation is inefficient on current hardware for two main reasons. First, softmax requires the use of the exponential function. Exponential functions tend to require large look-up tables (LUTs) to compute the result through the use of Taylor expansions. This is particularly true for general-purpose hardware such as CPUs and GPGPUs, which cater to exponential computations with high accuracy requirements due to their use in various scientific computing applications. This large area and power overhead makes it difficult to instantiate a large number of these units. Second, in order to improve training stability, deep neural networks typically use a numerically stable softmax, which subtracts the max of the vector on which softmax is being performed to ensure that the result does not blow up to infinity. However, this stability comes at a cost, as calculating the max introduces an additional pass through the vector, incurring latency and memory overheads.

Model scalability - Parameters, Layers, and Performance

Scaling up the number of parameters in a transformer model can improve its performance and make it more capable of understanding and generating complex text. For example, GPT-3 has 175 billion parameters, while GPT-4 is widely reported, though not officially confirmed, to have over a trillion.

Using larger datasets

Using larger datasets to train a transformer model can improve its performance.

Allocating more computational resources

Using more computational resources to train a transformer model can improve its
performance.

Using distillation

Knowledge from a larger model can be transferred to a smaller model using distillation. This
can be useful because larger models are more expensive and slower to use.

Using TokenFormer

TokenFormer is an architecture that treats model parameters as tokens, allowing for efficient
scaling without retraining from scratch.

The scale of a transformer model, which is determined by the number of parameters, the size of the dataset, and the computational resources used for training, has a greater impact on model loss than the model's architectural structure.

Application: Time Series Data, Sequence Based Data, Text and Vision.

Transformers treat time series data as a sequence of values, with each value representing a
time step. The model uses an encoder-decoder architecture, where the encoder takes in the
time series history and the decoder predicts future values. The decoder uses an attention
mechanism to learn which parts of the history are most useful for making predictions.
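As a rough illustration (not any specific published model), a PyTorch encoder-decoder for univariate forecasting could be sketched as follows; the history and horizon lengths, model width, and head count are arbitrary choices.

```python
# A rough PyTorch sketch of a Transformer encoder-decoder for univariate time
# series forecasting. Illustrative only; the history/horizon lengths, width and
# head count are arbitrary choices, not values from a published model.
import torch
import torch.nn as nn

class TimeSeriesTransformer(nn.Module):
    def __init__(self, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.input_proj = nn.Linear(1, d_model)        # scalar value -> d_model
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.output_proj = nn.Linear(d_model, 1)       # d_model -> scalar forecast

    def forward(self, history, decoder_inputs):
        # history: (batch, hist_len, 1), decoder_inputs: (batch, horizon, 1)
        src = self.input_proj(history)
        tgt = self.input_proj(decoder_inputs)
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.output_proj(out)                   # (batch, horizon, 1)

model = TimeSeriesTransformer()
forecast = model(torch.randn(8, 48, 1), torch.randn(8, 12, 1))  # (8, 12, 1)
```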

Benefits

Transformers can capture temporal patterns across different time scales, which can help
provide a more comprehensive understanding of the data.

Examples

Some examples of transformer-based time series models include:

Autoformer: Uses a decomposition layer to capture seasonality and trend-cycle components, and an auto-correlation mechanism to improve performance.

Informer: Uses distillation to extract active data points and pass them to the next encoder
layer, which can help reduce memory usage.

Applications

Transformers have been used in various aspects of time series analysis, such as time series forecasting. Their attention mechanism allows them to capture complex temporal patterns and dependencies, potentially outperforming traditional methods that struggle with long-term dependencies and sequential processing.

Temporal Convolution vs. Self-Attention for Time Series Forecasting

In time series forecasting, two main approaches have been prevalent: temporal
convolutional networks (TCNs) and recurrent neural networks (RNNs), such as LSTM and
GRU. TCNs leverage convolutional layers to capture local patterns within the data, while
RNNs process sequences recursively, retaining memory of past states.

Transformers, on the other hand, employ self-attention mechanisms to weigh the importance of each element in the sequence with respect to all other elements. This allows transformers to capture both local and global dependencies simultaneously, making them particularly well-suited for time series forecasting tasks where long-range dependencies are crucial.
The self-attention mechanism in transformers enables them to capture temporal patterns
across varying time scales, providing a more comprehensive understanding of the underlying
data dynamics. Additionally, transformers are inherently parallelizable, leading to faster
training times compared to sequential models like RNNs.

How Transformers Can Improve Time Series?

Using multi-head attention enabled by transformers could help improve the way time series
models handle long-term dependencies, offering benefits over current approaches. To give
an idea of how well transformers work for long dependencies, think of the long and detailed
responses that ChatGPT can generate in language-based models. Applying multi-head
attention to time series could produce similar benefits by allowing one head to focus on long-
term dependencies while another head focuses on short-term dependencies. We believe
transformers could make it possible for time series models to predict as many as 1,000 data
points into the future, if not more.

The Quadratic Complexity Issue

The way transformers calculate multi-head self-attention is problematic for time series. Because every data point in a series must be compared with every other data point in the series, the cost of computing attention grows quadratically with the length of the input. This is called quadratic complexity, and it creates a computational bottleneck when dealing with long sequences.

The Spacetimeformer Architecture

Spacetimeformer proposes a new way to represent inputs. Temporal attention models like
Informer represent the value of multiple variables per time step in a single input token,
which fails to consider spatial relationships between features. Graph attention models make it possible to represent relationships between features manually, but they rely on hardcoded graphs that cannot change over time. Spacetimeformer combines both temporal and spatial attention
methods, creating an input token to represent the value of a single feature at a given time.
This helps the model understand more about the relationship between space, time, and
value information.
In transformers, the softmax function is commonly used as part of the mechanism for
calculating attention scores, which are critical for the self-attention mechanism that forms
the basis of the model. It is essential for several reasons:

1. Attention Weights: Transformers use attention mechanisms to weigh the importance of different input tokens when generating an output. Softmax is used to convert the raw attention scores, often called “logits,” into a probability distribution over the input tokens. This distribution assigns higher attention weights to more relevant tokens and lower weights to less relevant ones.

2. Probability Distribution: Softmax ensures that the attention scores are transformed
into a valid probability distribution, with all values between 0 and 1 and the sum
equal to 1. This property is important for correctly weighing the input tokens while
taking into account their relative importance.

3. Stabilizing Gradients: The softmax function has a smooth gradient, which makes it
easier to train deep neural networks like transformers using techniques like
backpropagation. It helps with gradient stability during training, making it easier for
the model to learn and adjust its parameters.

4. The softmax function is typically applied to the raw attention scores obtained from the dot product of query and key vectors in the self-attention mechanism. The attention weight given to input token i for a particular query token is computed as:

weight(i) = exp(Q · K(i) / √dk) / Σj exp(Q · K(j) / √dk)

Here, Q represents the query vector, K represents the key vectors of the input tokens, and
the exponential function (exp) is used to transform the raw scores into positive values. The
denominator ensures that the resulting values form a probability distribution.
In summary, the softmax function is a crucial component of transformers that enables them
to learn how to weigh input tokens based on their relevance to the current context, making
the model’s self-attention mechanism effective in capturing dependencies and relationships
in the data.

Finally, because softmax squashes the raw attention scores into a bounded, normalized range, it helps keep the attention outputs and gradients well-behaved, reducing the risk of exploding or vanishing gradient problems during training.
