SCSB4014
UNIT 2: TRANSFORMER
Data preprocessing is the process of cleaning, organizing, and transforming raw data so that
it can be more easily analyzed, understood, and used in machine learning models. The goal of
data preprocessing is to improve data quality and eliminate issues like missing values so that
the data can be used for machine learning.
Data preprocessing can help identify patterns, make predictions, and inform decision-
making. It can also increase the accuracy and efficiency of machine learning models.
Steps
There are several tools and methods used for preprocessing data. One of them is feature
extraction, which pulls out a relevant feature subset that is significant in a particular context.
Consider a computer scientist and a legal colleague who are given a legal database. The task:
when provided with a case, efficiently find previous cases that are relevant, so that colleagues
can quickly locate laws, precedents, or specific legal interpretations related to the case at
hand. Initially, they might consider employing a traditional keyword-based search. This
involves breaking down the case document into key words or phrases and searching for exact
matches in the database. However, they will soon recognize the shortcomings of this method.
The complexity and jargon-rich nature of legal language make it challenging. For instance, a
lawyer searching for “intellectual property infringement” might overlook cases that use
different terminology, such as “patent breach” or “copyright violation”. A more effective
solution lies in semantic search powered by vector embeddings. The rest of this unit
demonstrates how tokenization, vector embeddings, and positional encoding are pivotal
tools in overcoming these challenges.
Tokenization
Tokenization is a data preprocessing technique that breaks down text into smaller units,
called tokens, making the text easier for machines to analyze and process. It's a fundamental
step in Natural Language Processing (NLP) and is used in many applications, including search
engines, machine translation, and speech recognition.
It breaks down unstructured text data into smaller units called tokens. A single token can
range from a single character or individual word to much larger textual units.
Tokenization in NLP
Natural Language Processing (NLP) is a subfield of computer science, artificial intelligence,
information engineering, and human-computer interaction. This field focuses on how to
program computers to process and analyze large amounts of natural language data. It is
difficult to perform as the process of reading and understanding languages is far more
complex than it seems at first glance.
Tokenization is a critical step in many NLP tasks, including text processing, language
modelling, and machine translation. The process involves splitting a string, or text into a list
of tokens. One can think of tokens as parts like a word is a token in a sentence, and a
sentence is a token in a paragraph.
Tokenization involves using a tokenizer to segment unstructured data and natural language
text into distinct chunks of information, treating them as different elements. The tokens
within a document can be used as a vector, transforming an unstructured text document into a
numerical data structure suitable for machine learning. This rapid conversion enables the
immediate utilization of these tokenized elements by a computer to initiate practical actions
and responses. Alternatively, they may serve as features within a machine learning pipeline,
prompting more sophisticated decision-making processes or behaviors.
Types of Tokenization
Tokenization can be classified into several types based on how the text is segmented. Here
are some types of tokenization:
Word Tokenization:
Word tokenization divides the text into individual words. Many NLP tasks use this approach,
in which words are treated as the basic units of meaning.
Sentence Tokenization:
The text is segmented into sentences during sentence tokenization. This is useful for tasks
requiring individual sentence analysis or processing.
Input: "Tokenization is an important NLP task. It helps break down text into smaller units."
Output: ["Tokenization is an important NLP task.", "It helps break down text into smaller
units."]
Subword Tokenization:
Subword tokenization entails breaking down words into smaller units, which can be
especially useful when dealing with morphologically rich languages or rare words.
Input: "tokenization"
Output: ["token", "ization"]
Character Tokenization:
This process divides the text into individual characters. This can be useful for modelling
character-level language.
Input: "Tokenization"
Output: ["T", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n"]
Tokenization is one stage in text mining pipelines that converts raw text data into a
structured format for machine processing. It's necessary for other preprocessing techniques,
and, therefore, is often (one of) the first preprocessing steps in NLP pipelines. For
example, stemming and lemmatization reduce morphological variants to one base word
form (for example, running, runs, and runner become run). These text normalization
techniques only work on tokenized text because they need some method for identifying
individual words.
The different types of tokenization essentially denote varying levels of granularity in the
tokenization process. Word tokenization is the most common type used in introductions to
tokenization, and it divides raw text into word-level units. Subword tokenization delimits
text beneath the word level; wordpiece tokenization breaks text into partial word units (for
example, starlight becomes star and light), and character tokenization divides raw text into
individual characters (for example, letters, digits, and punctuation marks). Other
tokenization methods, such as sentence tokenization, divide text above the word level.
As an example, the Python Natural Language Toolkit (NLTK) can be used to tokenize .txt files
at different levels of granularity, for instance using an open-access Asian religious texts file
sourced largely from Project Gutenberg. The focus here is on tokenization as a means to
prepare raw text data for use in machine learning models and NLP tasks. Other libraries and
packages, such as Keras and Gensim, also come with tokenization algorithms. Transformer
architectures such as BERT can also implement tokenization. However, this section focuses
on tokenization with Python NLTK.
Tokenizers are tools or libraries created for this purpose. Hugging Face has developed and
open-sourced the Transformers library, which simplifies the process of using a tokenizer
from a pre-trained model. For consistent preprocessing of input data, it's important to
replicate the method used during the model's pre-training phase. This requires downloading
the relevant information from the Hugging Face Model Hub. Accomplishing this is
straightforward with the AutoTokenizer class and its from_pretrained() method. By
specifying the model's checkpoint name, this function automatically retrieves and caches the
model's tokenizer data, ensuring it's downloaded only once during the initial execution of
the code.
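A minimal sketch of this workflow, assuming the transformers library is installed and using "bert-base-uncased" as an example checkpoint name:

    from transformers import AutoTokenizer

    # Retrieves and caches the tokenizer files for the chosen checkpoint the
    # first time this runs; later calls reuse the local cache.
    checkpoint = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    encoded = tokenizer("Tokenization is an important NLP task.")
    print(encoded["input_ids"])                 # token ids fed to the model
    print(tokenizer.tokenize("tokenization"))   # subword pieces, e.g. ['token', '##ization']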
Need of Tokenization
Tokenization is a crucial step in text processing and natural language processing (NLP) for
several reasons.
Effective Text Processing: Tokenization reduces the size of raw text so that it can be
handled more easily for processing and analysis.
Feature extraction: Text data can be represented numerically for algorithmic
comprehension by using tokens as features in machine learning models.
Language Modelling: Tokenization in NLP facilitates the creation of organized
representations of language, which is useful for tasks like text generation and language
modelling.
Information Retrieval: Tokenization is essential for indexing and searching in systems
that store and retrieve information efficiently based on words or phrases.
Text Analysis: Tokenization is used in many NLP tasks, including sentiment
analysis and named entity recognition, to determine the function and context of
individual words in a sentence.
Vocabulary Management: By generating a list of distinct tokens that stand in for words in
the dataset, tokenization helps manage a corpus’s vocabulary.
Task-Specific Adaptation: Tokenization can be customized to meet the needs of
particular NLP tasks, meaning that it will work best in applications such as summarization
and machine translation.
Preprocessing Step: This essential preprocessing step transforms unprocessed text into a
format appropriate for additional statistical and computational analysis.
Tokenization serves as the backbone for a myriad of applications in the digital realm,
enabling machines to process and understand vast amounts of text data. By breaking down
text into manageable chunks, tokenization facilitates more efficient and accurate data
analysis. Here are some prominent use cases, along with real-world applications:
Search engines
When we type a query into a search engine like Google, it employs tokenization to dissect
the input. This breakdown helps the engine sift through billions of documents and present
the most relevant results.
Machine translation
Tools such as Google Translate utilize tokenization to segment sentences in the source
language. Once tokenized, these segments can be translated and then reconstructed in the
target language, ensuring the translation retains the original context.
Speech recognition
Voice-activated assistants like Siri or Alexa rely heavily on tokenization. When a user poses a
question or command, the spoken words are first converted into text. This text is then
tokenized, allowing the system to process and act upon the request.
Sentiment analysis in reviews
Tokenization plays a crucial role in extracting insights from user-generated content, such as
product reviews or social media posts. For instance, a sentiment analysis system for e-
commerce platforms might tokenize user reviews to determine whether customers are
expressing positive, neutral, or negative sentiments. For example:
The review: "This product is amazing, but the delivery was late."
After tokenization: ["This", "product", "is", "amazing", ",", "but", "the", "delivery", "was",
"late", "."]
The tokens "amazing" and "late" can then be processed by the sentiment model to assign
mixed sentiment labels, providing actionable insights for businesses.
Tokenization enables chatbots to understand and respond to user inputs effectively. For
example, a customer service chatbot might tokenize the query "I need to reset my password
but can't find the link", which is tokenized as: ["I", "need", "to", "reset", "my", "password",
"but", "can't", "find", "the", "link"].
This breakdown helps the chatbot identify the user's intent ("reset password") and respond
appropriately, such as by providing a link or instructions.
Tokenization Challenges
Navigating the intricacies of human language, with its nuances and ambiguities, presents a
set of unique challenges for tokenization. Here's a deeper dive into some of these obstacles,
along with recent advancements that address them:
Ambiguity
Language is inherently ambiguous. Consider the sentence "Flying planes can be dangerous."
Depending on how it's tokenized and interpreted, it could mean that the act of piloting
planes is risky or that planes in flight pose a danger. Such ambiguities can lead to vastly
different interpretations.
Implementing Tokenization
The landscape of Natural Language Processing offers many tools, each tailored to specific
needs and complexities. Here's a guide to some of the most prominent tools and
methodologies available for tokenization:
NLTK (Natural Language Toolkit). A stalwart in the NLP community, NLTK is a comprehensive
Python library that caters to a wide range of linguistic needs. It offers both word and
sentence tokenization functionalities, making it a versatile choice for beginners and seasoned
practitioners alike.
spaCy. A modern and efficient alternative to NLTK, spaCy is another Python-based NLP
library. It boasts speed and supports multiple languages, making it a favorite for large-scale
applications.
BERT tokenizer. Emerging from the BERT pre-trained model, this tokenizer excels in context-
aware tokenization. It's adept at handling the nuances and ambiguities of language, making it
a top choice for advanced NLP projects.
Advanced techniques.
Byte-Pair Encoding (BPE). An adaptive tokenization method, BPE tokenizes based on the most
frequent byte pairs in a text. It's particularly effective for languages that form meaning by
combining smaller units.
SentencePiece. An unsupervised text tokenizer and detokenizer mainly for Neural Network-
based text generation tasks. It handles multiple languages with a single model and can
tokenize text into subwords, making it versatile for various NLP tasks.
One of the most popular tools for NLP tasks, the Hugging Face Transformers library provides
a seamless integration with PyTorch, making it ideal for both research and production. This
library includes advanced tokenizers designed to work with state-of-the-art transformer
models like BERT, GPT, and RoBERTa. Key features include:
Fast tokenizers: Built using Rust, these tokenizers offer significant speed improvements,
enabling faster pre-processing for large datasets.
Support for subword tokenization: The library supports Byte-Pair Encoding (BPE),
WordPiece, and Unigram tokenization, ensuring efficient handling of out-of-vocabulary
words and complex languages.
Built-in pretrained tokenizers: Each model in the Hugging Face Transformers library comes
with a corresponding pretrained tokenizer, ensuring compatibility and ease of use. For
instance, the BERT tokenizer splits text into subwords, making it adept at handling language
nuances.
Word Embeddings
Embeddings are numeric representations of words in a lower-dimensional space, capturing
semantic and syntactic information. They play a vital role in Natural Language Processing
(NLP) tasks. This section explores traditional and neural approaches, such as TF-IDF,
Word2Vec, and GloVe, along with their advantages and disadvantages, and discusses
pre-trained word embeddings and their applications in various NLP scenarios. In the
traditional "one-hot" representation, each word is a vector whose dimension equals the
cardinality of the vocabulary. To reduce dimensionality, stop words are usually removed, and
stemming, lemmatization, and similar normalization steps are applied to the features on
which the NLP task will be performed.
The quality of datasets being used to train models applies to every type of AI model,
including Foundation Models, such as ChatGPT and Google’s BERT. The Washington Post
took a closer look at the vast datasets being used to train some of the world’s most popular
and powerful large language models (LLMs). In particular, the article reviewed the content
of Google’s C4 dataset, finding that quality and quantity are equally important, especially
when training LLMs.
In image recognition tasks, if the training data used to teach the model contains images with
inaccurate or incomplete labels, then the model may not be able to recognize or classify
similar images in its predictions accurately.
At the same time, if the training data is biased towards certain groups or demographics,
then the model may learn and replicate those biases, leading to unfair or discriminatory
treatment of certain groups. For instance, Google, too, succumbed to bias traps in a recent
incident where its Vision AI model generated racist outcomes.
The TF-IDF score for a term t in a document d is then given by multiplying the TF and IDF
values:
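Written out, with N denoting the number of documents in the corpus, a common formulation is:

    TF(t, d) = (number of times t appears in d) / (total number of terms in d)
    IDF(t) = log(N / (number of documents containing t))
    TF-IDF(t, d) = TF(t, d) × IDF(t)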
The higher the TF-IDF score for a term in a document, the more important that term is to
that document within the context of the entire corpus. This weighting scheme helps in
identifying and extracting relevant information from a large collection of documents, and
it is commonly used in text mining, information retrieval, and document clustering.
TF-IDF is a widely used technique in information retrieval and text mining, but its
limitations should be considered, especially when dealing with tasks that require a deeper
understanding of language semantics. For example:
Lack of Semantic Understanding: TF-IDF treats words as independent entities and doesn’t consider semantic
relationships between them. This limitation hinders its ability to capture contextual
information and word meanings.
Sensitivity to Document Length: Longer documents tend to have higher overall term
frequencies, potentially biasing TF-IDF towards longer documents.
Neural Approach
Word2Vec
Word2Vec is a neural approach for generating word embeddings. It belongs to the family
of neural word embedding techniques and specifically falls under the category of
distributed representation models. It is a popular technique in natural language
processing (NLP) that is used to represent words as continuous vector spaces. Developed
by a team at Google, Word2Vec aims to capture the semantic relationships between
words by mapping them to high-dimensional vectors. The underlying idea is that words
with similar meanings should have similar vector representations. In Word2Vec every
word is assigned a vector. We start with either a random vector or one-hot vector.
There are two neural embedding methods for Word2Vec, Continuous Bag of Words
(CBOW) and Skip-gram.
Continuous Bag of Words(CBOW)
Continuous Bag of Words (CBOW) is a type of neural network architecture used in the
Word2Vec model. The primary objective of CBOW is to predict a target word based on its
context, which consists of the surrounding words in a given window. Given a sequence of
words in a context window, the model is trained to predict the target word at the center
of the window.
CBOW is a feedforward neural network with a single hidden layer. The input layer
represents the context words, and the output layer represents the target word. The
hidden layer contains the learned continuous vector representations (word embeddings)
of the input words.
The architecture is useful for learning distributed representations of words in a continuous
vector space.
Fig: 1
The hidden layer contains the continuous vector representations (word embeddings) of
the input words.
The weights between the input layer and the hidden layer are learned during training.
The dimensionality of the hidden layer represents the size of the word embeddings
(the continuous vector space).
Skip-Gram
The Skip-Gram model learns distributed representations of words in a continuous vector
space. The main objective of Skip-Gram is to predict context words (words surrounding a
target word) given a target word. This is the opposite of the Continuous Bag of Words
(CBOW) model, where the objective is to predict the target word based on its context. It is
shown that this method produces more meaningful embeddings.
After applying the above neural embedding methods we get trained vectors of each word
after many iterations through the corpus. These trained vectors preserve syntactical or
semantic information and are converted to lower dimensions. The vectors with similar
meaning or semantic information are placed close to each other in space.
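As a minimal sketch, Word2Vec embeddings can be trained with the Gensim library; the toy corpus and parameter values below are illustrative assumptions:

    from gensim.models import Word2Vec

    # Toy corpus: a list of tokenized sentences
    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"],
    ]

    # sg=0 selects CBOW, sg=1 selects Skip-gram
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

    print(model.wv["cat"])               # 50-dimensional vector for "cat"
    print(model.wv.most_similar("cat"))  # nearest words in the embedding space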
Pretrained Word-Embedding
Pre-trained word embeddings are representations of words that are learned from large
corpora and are made available for reuse in various natural language processing (NLP)
tasks. These embeddings capture semantic relationships between words, allowing the
model to understand similarities and relationships between different words in a
meaningful way.
GloVe
GloVe is trained on global word co-occurrence statistics. It leverages the global context to
create word embeddings that reflect the overall meaning of words based on their co-
occurrence probabilities. In this method, we take the corpus, iterate through it, and record
the co-occurrence of each word with every other word in the corpus, giving a co-occurrence
matrix. Words that occur next to each other get a value of 1; if they are one word apart, 1/2;
if two words apart, 1/3; and so on.
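A small sketch of building such a distance-weighted co-occurrence matrix is shown below; the toy corpus and window size are assumptions, and only the counting step is illustrated, not GloVe's training objective:

    from collections import defaultdict

    corpus = ["the", "cat", "sat", "on", "the", "mat"]
    window = 3  # how many neighbours to the right we look at

    cooc = defaultdict(float)
    for i, word in enumerate(corpus):
        for offset in range(1, window + 1):
            j = i + offset
            if j < len(corpus):
                # weight 1/offset: adjacent pairs add 1, pairs one word apart add 1/2, ...
                cooc[(word, corpus[j])] += 1.0 / offset
                cooc[(corpus[j], word)] += 1.0 / offset

    print(dict(cooc))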
Advantages and Disadvantages of Word Embeddings
Advantages
It is much faster to train than hand-built models like WordNet (which uses graph
embeddings).
Almost all modern NLP applications start with an embedding layer.
It stores an approximation of meaning.
Disadvantages
It can be memory intensive.
It is corpus dependent. Any underlying bias will have an effect on the model.
It cannot distinguish between homophones.
E.g., brake/break, cell/sell, weather/whether, etc.
Positional encoding
Positional encoding is a data preprocessing technique that adds information about the order
of words in a sequence to a model; it provides positional information to the model. In detail,
a position-dependent signal is added to each word embedding of an input sequence to help
the model incorporate the order of words. The output of positional
encoding has the same dimension as the embedding layer.
Importance of Positional Encodings
Positional encodings are crucial in Transformer models for several reasons:
Preserving Sequence Order: Transformer models process tokens in parallel, lacking
inherent knowledge of token order. Positional encodings provide the model with
information about the position of tokens in the sequence, ensuring that the model can
differentiate between tokens based on their position. This is essential for tasks where
word order matters, such as language translation and text generation.
Maintaining Contextual Information: In natural language processing tasks, the
meaning of a word often depends on its position in the sentence. For example, in the
sentence "The cat sat on the mat," the word "cat" plays a different role than in "The
mat sat on the cat."
Enhancing Generalization: By incorporating positional information, transformer
models can generalize better across sequences of different lengths. This is particularly
important for tasks where the length of the input sequence varies, such as document
summarization or question answering. Positional encodings enable the model to
handle input sequences of varying lengths without sacrificing performance.
Mitigating Symmetry: Without positional encodings, the self-attention mechanism in
Transformer models would treat tokens symmetrically, potentially leading to
ambiguous representations. Positional encodings introduce an asymmetry into the
model, ensuring that tokens at different positions are treated differently, thereby
improving the model's ability to capture long-range dependencies.
Before a sentence is fed into the Transformer model, it undergoes tokenization, where each
word is converted into a token. For the example sentence "The cat sat on the mat", the
tokens are ["The", "cat", "sat", "on", "the", "mat"], and a positional encoding is added to the
embedding of each token. Positional encodings thus enable Transformer models to
effectively process and understand input sequences, leading to improved performance
across a wide range of natural language processing tasks. The positional encoding itself can
be implemented as a function with the following steps:
1. Function Parameters:
position: Total positions or length of the sequence.
d_model: Dimensionality of the model's output.
2. Generating the Base Matrix:
angle_rads: Creates a matrix where rows represent sequence positions and
columns represent feature dimensions. Values are scaled by dividing each position
index by 10000 raised to (2 * index / d_model).
3. Applying Sine and Cosine Functions:
Even indices: Apply the sine function to encode positions.
Odd indices: Apply the cosine function for a phase-shifted encoding.
4. Creating the Positional Encoding Tensor:
The matrix is expanded to match input shape expectations of models like
Transformers and cast to tf.float32.
5. Output:
Returns a TensorFlow tensor of shape (1, position, d_model), ready to be added to
input embeddings to incorporate positional information.
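A sketch of the function described by these steps, following the common TensorFlow-style implementation (exact details may differ from the original source code this outline was taken from):

    import numpy as np
    import tensorflow as tf

    def positional_encoding(position, d_model):
        # Base matrix: rows are sequence positions, columns are feature dimensions,
        # each position index scaled by 10000 raised to (2i / d_model)
        pos = np.arange(position)[:, np.newaxis]   # shape (position, 1)
        i = np.arange(d_model)[np.newaxis, :]      # shape (1, d_model)
        angle_rads = pos / np.power(10000, (2 * (i // 2)) / np.float32(d_model))

        # Sine on even indices, cosine on odd indices (phase-shifted encoding)
        angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
        angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

        # Expand to shape (1, position, d_model) and cast to tf.float32
        pos_encoding = angle_rads[np.newaxis, ...]
        return tf.cast(pos_encoding, dtype=tf.float32)

    # Usage: added to input embeddings of shape (batch, seq_len, d_model)
    pe = positional_encoding(position=50, d_model=512)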
Fig: 5
STEP 3 - Stack of Encoder Layers
The Transformer encoder consists of a stack of identical layers (6 in the original
Transformer model).
The encoder layer serves to transform all input sequences into a continuous, abstract
representation that encapsulates the learned information from the entire sequence. This
layer comprises two sub-modules:
A multi-headed attention mechanism.
A fully connected network.
Additionally, it incorporates residual connections around each sublayer, which are then
followed by layer normalization.
Fig: 6
Attention:
Attention, in general, refers to the ability to focus on one thing and ignore other things
that seem irrelevant at the time. In machine learning, this concept is applied by teaching
the model to focus on certain parts of the input data and disregard others to better solve
the task at hand.
In tasks like machine translation, for example, the input data is a sequence of some text.
When we humans read a piece of text, it seems natural to attend to some parts more than
others. Usually, it’s the who, when, and where part of a sentence that captures our
attention. Since this is a skill we develop from birth, we don’t acknowledge its importance.
But without it, we wouldn’t be able to contextualize.
For instance, if we see the word bank, in our heads, we might think about a financial
institution or a place where blood donations are stored, or even a portable battery. But if
we read the sentence "I am going to the bank to apply for a loan", we immediately catch
on to which bank is meant. This is because we implicitly attended to a few clues. From
the "going to" part, we understood that a bank is a place in this context, and from the
"apply for a loan" part, we got that one can receive a loan there.
The whole sentence gives out information that adds up to create a mental picture of what
a bank is. Suppose a machine could do the same thing we do. In that case, most of
the significant natural language processing problems, such as words with multiple meanings,
sentences with multiple grammatical structures, and uncertainty about what a pronoun
refers to, would be solved.
The Birth of Transformers
The Transformer architecture, introduced in the paper “Attention Is All You Need” by
Vaswani et al. in 2017, redefined the game. It relied on the self-attention mechanism to
process sequences in parallel, making it highly efficient. This was the birth of the
Transformer model.
The Building Blocks of Attention Mechanisms
Self-Attention: The Basics
Self-attention, also known as scaled dot-product attention, is a mechanism that allows a
Transformer to weigh the importance of different words in a sentence when processing a
specific word. It can be likened to a spotlight focusing on different sentence parts as the
model processes each word. This mechanism is mathematically defined as follows:
Query, Key, and Value: For a given word, the self-attention mechanism computes three
vectors: Query (Q), Key (K), and Value (V). These vectors are learned during training.
Attention Scores: The model calculates attention scores by taking the dot product of the
Query vector for the current word and the Key vectors for all the words in the input
sequence. These scores indicate how much focus each word should receive.
Softmax and Scaling: The attention scores are passed through a softmax function to get a
probability distribution. This distribution is then used to weigh the Value vectors, deciding
how much each word’s information should contribute to the current word’s
representation.
Weighted Sum: Finally, the Value vectors are weighted by the attention scores and
summed to create the new representation of the current word.
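Putting these steps together, the scaled dot-product attention from the original Transformer paper can be written compactly as:

    Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

where d_k is the dimensionality of the key vectors.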
Multi-Head Attention
In practice, Transformers use what is known as multi-head attention. Instead of relying on
a single attention mechanism, the model uses multiple heads or sets of Query, Key, and
Value vectors. Each head can focus on different input parts, capturing different aspects of
word relationships.
Positional Encoding
One challenge with self-attention is that it doesn’t inherently capture the order of words
in a sequence. To address this, Transformers incorporate positional encoding into their
input embeddings. Positional encodings are added to the word embeddings, allowing the
model to consider the position of each word in the sequence.
Self-Attention:
The self-attention mechanism is at the core of what makes Transformers powerful. Here
are some reasons why it’s so essential:
Long-Range Dependencies
Self-attention can capture relationships between words that are far apart in a sequence.
In contrast, RNNs struggle with long-range dependencies because information must flow
step by step.
Parallelization
Traditional sequence models like RNNs process data sequentially, one step at a time. Self-
attention, on the other hand, can process the entire sequence in parallel, making it more
computationally efficient.
Adaptability
The attention mechanism is not limited to language processing. It can be adapted for
various tasks and domains. For instance, in computer vision, self-attention mechanisms
can capture relationships between pixels in an image.
Attention Mechanisms in Real-Life
The BERT model, developed by Google, uses self-attention to pre-train on a massive text
corpus. BERT has set new benchmarks in various NLP tasks, from sentiment analysis to
text classification.
OpenAI’s GPT-3 is one of the largest language models in existence. It uses self-attention to
generate coherent and contextually relevant text, making it ideal for applications like
chatbots and language translation.
Image Analysis
The power of attention mechanisms isn’t limited to text. In computer vision, models like
the Vision Transformer have demonstrated that self-attention can capture complex
relationships between pixels in an image, enabling state-of-the-art image recognition.
Model Size
Large-scale models with multiple heads and layers can become computationally
expensive. This can limit the accessibility of these models to a broader range of
applications.
Interpretability
A Primer on Transformers
While far from perfect, transformers are our best current solution to
contextualization. The type of attention used in them is called self-attention. This
mechanism relates different positions of a single sequence to compute a representation of
the same sequence. It is instrumental in machine reading, abstractive summarization, and
even image description generation.
Since they were used initially for machine translation, the Transformers are based on the
encoder-decoder architecture, meaning that they have two major components. The first
component is an encoder which takes a sequence as input and transforms it into a state
with a fixed shape. The second component is the decoder. It maps the encoded state of a
fixed shape to an output sequence. Here is a diagram:
Transformers and attention mechanisms have revolutionized the field of deep learning,
offering a powerful way to process sequential data and capture long-range dependencies.
Attention mechanisms are crucial in transformers, allowing different tokens to be
weighted based on their importance, enhancing model context and output quality.
Transformers operate on self-attention, enabling the capture of long-range
dependencies without sequential processing.
Multi-head attention in transformers enhances model performance by allowing the
model to focus on different aspects of the input data simultaneously.
Transformers outperform RNNs and LSTMs in handling sequential data due to their
parallel processing capabilities.
Applications of transformers span across NLP, computer vision, and state-of-the-art
model development.
Attention Mechanism has been a powerful tool for improving the performance of Deep
Learning and NLP models by allowing them to extract the most relevant and important
information from data, giving them the ability to simulate cognitive abilities of humans.
The following walks through the main types of Attention Mechanism models and the main
approaches to Attention. The Attention Mechanism enhances the performance of models by
introducing the ability to mimic cognitive attention the way humans do, in order to make
relevant predictions by understanding the context of the given data. Attention can be
defined as 'memory per unit of time'.
It is used in Deep Learning models to selectively focus on certain parts of the input and
assign weights to them based on their relevance to the current task, such that the model
can assign more resource and 'attention' to the most important parts of the input while
ignoring the less relevant parts. In Natural Language Processing (NLP), attention
mechanisms have been particularly successful in improving the performance of machine
translation, text summarization, and sentiment analysis models.
Fig: 7
The encoder’s role is to meticulously extract features from the input sequence. This is
achieved through a series of layers, each comprising a multi-head attention mechanism
followed by a feed-forward neural network. These layers are further enhanced with
normalization and residual connections to ensure stability during training. Remarkably,
the entire sequence is processed in parallel, which is a stark departure from the sequential
processing of traditional recurrent neural networks (RNNs).
A lot is going on in the diagram, but for our purposes, it’s only worth noticing that the
encoder module here is painted in blue while the decoder is in green. We can also see that
both the encoder and decoder modules use a layer called Multi-Head Attention. Let’s also
forget about the multi-headed part for now and focus only on what is inside one “head”.
The main component is called scaled dot-product attention and it’s very elegant in that it
achieves so much with just a few linear algebra operations. It’s made up of three
matrices, Q, K, and V, which are called the query, key, and value matrices, each with its own
dimension.
The concept of using queries, keys, and values is directly inspired by how databases work.
Each database storage has its data values indexed by keys, and users can retrieve the data
by making a query.
The self-attention operation is very similar, except that there isn’t a user or a controller
issuing the query, but it’s learned from the data. By the use of backpropagation, the
neural network updates its Q, K, and V matrices in order to mimic a user-database
interaction. To prove that this is possible, let’s reimagine the retrieval process as a vector
dot product:
where α is a one-hot vector consisting of only ones and zeroes, and v is a vector with the
values we’re retrieving.
In this case, the vector α is the de facto query, because the output of the dot product α · v
will consist only of the values of v where α is 1:
Fig: 8
Now let’s remove the restriction for the query vector and allow float values between 0
and 1. By doing that, we would get a weighted proportional retrieval of the values:
Fig: 9
The scaled dot-product attention uses vector multiplication in the same exact way. To
obtain the final weights on the values, first the dot product of the query with all keys is
computed and then divided by sqrt(d_k), the square root of the key dimension. Then a
softmax function is applied.
1. Input Sequence (Query, Key and Value)- The input sequence is transformed into
three vectors: query, key, and value. These vectors are learned during training and
represent different aspects of the input sequence.
2. First Matrix Multiplication- Computes the similarity between the Query Matrix and
the Key Matrix by performing a dot product operation. The resulting score
represents the relevance of each element in the input sequence to the current
state of the model.
3. Scaling of Matrix- To prevent the scores from becoming too large and to avoid the
vanishing gradient problem, the matrix is scaled to stabilize the gradients during
training.
4. SoftMax Function- The scaled scores are passed through a softmax function to
normalize them to ensure that the attention weights sum up to 1. It is used to
convert a set of numbers into a probability distribution and enables the model to
learn which parts of the input are most relevant to the current task.
5. Second Matrix Multiplication- The output from the SoftMax function is then
multiplied with the Value Matrix which is the final output of the Attention Layer.
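A compact NumPy sketch of these five steps is given below; the toy shapes are assumptions, and a full implementation would also handle batching and masking:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # numerically stable softmax
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T                 # step 2: similarity between queries and keys
        scores = scores / np.sqrt(d_k)   # step 3: scaling
        weights = softmax(scores)        # step 4: normalize into a probability distribution
        return weights @ V               # step 5: weighted sum of the values

    # Toy example: 4 tokens, key/query dimension 8, value dimension 8
    Q = np.random.randn(4, 8); K = np.random.randn(4, 8); V = np.random.randn(4, 8)
    out = scaled_dot_product_attention(Q, K, V)   # shape (4, 8)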
Multi-Head Attention Mechanism
Instead of performing and obtaining a single attention result on the large matrices Q, K and V,
it is found to be more efficient to divide them into multiple matrices of smaller dimensions
and perform scaled dot-product attention on each of those smaller matrices. Each attention module
can focus on calculating different types of relationships between the inputs and create
specific contextualized embeddings. As shown in the diagram above, these embeddings
can then be concatenated and put through an ordinary linear neural network layer,
together making the final output of the so-called Multi-Headed Attention Module. As it
turns out, this approach not only improves the model’s performance but improves training
stability as well:
Fig: 11
The decoder, on the other hand, is tasked with generating the output sequence. It mirrors
the encoder’s structure but includes an additional layer of cross-attention that allows it to
focus on relevant parts of the input sequence as it produces the output. The advantages of
the Transformer architecture are manifold. Its parallel processing capabilities drastically
accelerate training and inference times. Coupled with self-attention, the architecture adeptly
handles long-range dependencies, capturing intricate relationships within the data that span
considerable sequence lengths.
Obtain Query, Key and Value matrices- Obtain the value 'h', which is the number of heads; a
value of 8 is commonly used. Then we obtain the same number of sets of Q, K and V
matrices, i.e. the triples (Q1, K1, V1) through (Qh, Kh, Vh).
Scaled-Dot Product Attention- We perform scaled dot-product attention for each triple
(Qi, Ki, Vi), for i = 1 to h.
MultiHead attention uses multiple attention heads to attend to different parts of the input
sequence which allows the model to learn different relationships between the different parts
of the input sequence. For example, the model can learn to attend to the local context of a
word, as well as the global context of the entire input sequence.
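A minimal, self-contained NumPy sketch of multi-head attention; the head count, dimensions, and random projection matrices stand in for what would be learned parameters in a real model:

    import numpy as np

    def attention(Q, K, V):
        # Scaled dot-product attention with a numerically stable softmax
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)
        return weights @ V

    def multi_head_attention(X, num_heads=8, d_model=512):
        d_head = d_model // num_heads
        heads = []
        for _ in range(num_heads):
            # In a trained model these projections are learned; random placeholders
            # are used here only to show the shapes involved.
            W_q = np.random.randn(d_model, d_head)
            W_k = np.random.randn(d_model, d_head)
            W_v = np.random.randn(d_model, d_head)
            heads.append(attention(X @ W_q, X @ W_k, X @ W_v))
        concatenated = np.concatenate(heads, axis=-1)   # (seq_len, d_model)
        W_o = np.random.randn(d_model, d_model)         # final linear layer
        return concatenated @ W_o

    X = np.random.randn(10, 512)       # 10 tokens, model dimension 512
    out = multi_head_attention(X)      # shape (10, 512)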
Additive-Attention Mechanism
The decoder then begins to generate the output sequence, one word at a time. The decoder
takes the previous word, the current hidden state, and the attention weights as input. The
attention weights are used to compute a weighted sum of the encoder's hidden states and
this weighted sum is then used to update the decoder's hidden state. The decoder then
generates the next word, and the process is repeated until the decoder generates the end-of-
sequence token.
Dynamic-Convolution Attention
Dynamic convolution attention was first introduced in the paper "Dynamic Convolution:
Attention over Convolution Kernels" by Chen et al. (2019). It increases model complexity
without increasing the depth or width of a network. Instead of using a single kernel per layer,
dynamic convolution clusters multiple parallel convolution kernels in a dynamic method
based upon their attentions, which are dependent on the input. Aggregation of multiple
kernels is not only computationally efficient due to the small kernel size, but it also has
representation power as these kernels are aggregated in a non-linear fashion via attention. It
can be integrated easily into existing network architectures.
Entity-Aware Attention
Entity-aware attention was first introduced in the paper "Entity-Aware Attention for Relation
Extraction" by Sun et al. (2019). In this paper, the authors showed that entity-aware
attention could be used to improve the performance of relation extraction models. It is an
attention mechanism that focuses on specific entities or entities' interactions within a
sequence or a context.
In traditional attention mechanisms, attention weights are typically computed based on the
similarity between the query and key vectors derived from the input sequence. However, in
entity-aware attention, the attention mechanism takes into account the entities mentioned
in the input sequence and their respective roles using Named Entity Recognition (NER) or
Entity Linking.
Location-Based Attention
Location-based attention, also known as Positional Attention, was first introduced in the
paper "Spatial Attention in Natural Language Processing" by Xu et al. (2015). It takes into
account the relative position or location of elements within a sequence or context using
Learned Position Embeddings or Sinusoidal Encoding. It is commonly used in models that
process sequential data, such as natural language processing (NLP) tasks, where the position
of words or tokens can carry important information.
This can be particularly useful in tasks where the order or position of elements matters, such
as machine translation or text generation, as it helps the model capture sequential
dependencies and generate more accurate and coherent outputs.
Global Attention
While Scaled-Dot Product and Multi-Head Attention are methods to compute attention,
there are mainly two approaches to how the Attention itself is applied, i.e. Global Attention
and Local Attention. Global attention attends to all of the input sequence, while local
attention only attends to a subset of the input sequence.
Global attention can capture long-range dependencies between different parts of the input
sequence and contextual information in the input sequence, as it allows the model to attend
to all input elements at each decoding step. It can be computationally expensive as it has to
attend to the entire sequence. It is typically used for tasks that require the model to consider
the entire input sequence, such as text summarization.
Local Attention
Local attention can be more effective for tasks that require the model to focus on a specific
part of the input sequence. Local attention is easier to train as it only has to learn to attend
to a subset of the input sequence. It is typically used for tasks that require the model to focus
on a specific part of the input sequence, such as machine translation.
The advent of attention mechanisms has been nothing short of revolutionary in the realm of
deep learning. Attention allows models to dynamically focus on pertinent parts of the input
data, akin to the way humans pay attention to certain aspects of a visual scene or
conversation. This selective focus is particularly crucial in tasks where context is key, such as
language understanding or image recognition.
They provide a means to handle variable-sized inputs by focusing on the most relevant
parts.
Attention-based models can capture long-range dependencies that earlier models like
RNNs struggled with.
Encoder-Decoder Model
At the heart of the encoder-decoder model lies a symphony of sequential data translation,
where the encoder processes the input sequence and distills it into a fixed-length
representation, often referred to as the context vector. This vector serves as a condensed
summary of the input, capturing its essence for the decoder to interpret.
Fig: 12
The decoder, mirroring the encoder’s structure, awakens with the context vector, infusing
its initial hidden state. It embarks on a generative quest, conjuring the first token of the
output sequence, and continues to weave the subsequent tokens, each prediction delicately
influenced by the previously materialized tokens and the persistent whisper of the context
vector. This iterative dance persists until the narrative is complete, signaled by an end-of-
sequence token or the bounds of a predefined sequence length.
Types of Attention
1. Scaled Dot-Product Attention
The Scaled Dot-Product Attention is the fundamental building block of the Transformer's
attention mechanism. It involves three main components: queries (Q), keys (K), and values
(V). The attention score is computed as the dot product of the query and key vectors,
scaled by the square root of the dimension of the key vectors. This score is then passed
through a softmax function to obtain the attention weights, which are used to compute a
weighted sum of the value vectors.
2. Multi-Head Attention
Multi-Head Attention enhances the model's ability to focus on different parts of the input
sequence simultaneously. It involves multiple attention heads, each with its own set of
query, key, and value matrices. The outputs of these heads are concatenated and linearly
transformed to produce the final output. This allows the model to capture different
features and dependencies in the input sequence.
3. Self-Attention
Self-Attention, also known as intra-attention, allows the model to consider different
positions of the same sequence when computing the representation of a word. In the
context of the Transformer, self-attention is applied in both the encoder and decoder layers.
It enables the model to capture long-range dependencies and relationships within the input
sequence.
It is a mechanism that allows a model to attend to different parts of the same input
sequence. This is done by computing a weighted sum of the input sequence, where the
weights are determined by how relevant each part of the sequence is to the current task.
The basic idea behind self-attention is to compute attention weights for each word/token in
a sequence with respect to all other words/tokens in the same sequence. These attention
weights indicate the importance or relevance of each word/token to the others.
4. Encoder-Decoder Attention
Encoder-Decoder Attention, also known as cross-attention, is used in the decoder layers of
the Transformer. It allows the decoder to focus on relevant parts of the input sequence
(encoded by the encoder) when generating each word of the output sequence. This type
of attention ensures that the decoder has access to the entire input sequence, helping it
produce more accurate and contextually appropriate translations.
5. Causal or Masked Self-Attention
Causal or Masked Self-Attention is used in the decoder to ensure that the prediction for a
given position only depends on the known outputs at positions before it. This is crucial for
tasks like language modeling, where future tokens should not be visible during training.
The attention scores for future tokens are masked out, ensuring that the model cannot
look ahead.
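A small sketch of how such a causal mask can be built and applied to the attention scores before the softmax (the sequence length and scores here are illustrative):

    import numpy as np

    seq_len = 5
    scores = np.random.randn(seq_len, seq_len)   # raw attention scores

    # Upper-triangular mask: position i may not attend to positions j > i
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)        # masked scores become ~ -infinity

    # After softmax, masked positions receive (almost) zero attention weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    print(np.round(weights, 3))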
Normalization has been a very crucial step during data preprocessing in traditional Machine
Learning that involves scaling the features of a dataset to a similar range to ensure that no
particular feature dominates the learning process due to its scale. Some common data
normalization techniques include Min-Max scaling and Z-score normalization. In neural
networks, weights are also normalized to ensure that the weights do not become too large or
too small which can lead to numerical instability during training. In deep learning,
normalization can be applied in two key areas:
1. Input Data: The data fed into the neural network (e.g., features like f1, f2, f3)
can be normalized before entering the network. This step is similar to what is
done in traditional machine learning.
2. Hidden Layer Activations: We can also normalize the activations (outputs) of the
hidden layers in the neural network. This is often done to stabilize and accelerate
training, especially in deep networks.
Let’s dive into why normalization is so important in deep learning. As we train a neural
network and update the weights, some of them can start getting really big. When that happens,
the activations tied to those weights also become large, making it harder for the model to
learn effectively. It slows things down and can cause problems in training.
Normalization helps fix this by keeping activations within a stable range. This not only makes
the training process more stable but also speeds it up, allowing the model to learn more
efficiently.
Another big benefit of normalization is that it prevents a problem called internal covariate
shift. This happens when the input data’s distribution changes as it moves through the layers
of the network, which can confuse the model. By normalizing the activations, we keep
things consistent, so the model can keep learning without getting thrown off.
Layer Normalization
Layer Normalization directly estimates the normalization statistics from the summed inputs
to the neurons within a hidden layer so the normalization does not introduce any new
dependencies between training cases. It works well for RNNs and improves both the training
time and the generalization performance of several existing RNN models. More recently, it
has been used with Transformer models.
In LayerNorm, applied here to a standard feed-forward neural network, a denotes the vector
of weight-summed inputs to the neurons of a layer, a(i) is its ith value, â is the corresponding
normalized (re-scaled) vector, g is the gain parameter used to re-scale the standardized
summed inputs, and μ and σ² are the mean and variance statistics, respectively, estimated
from the raw summed inputs a.
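Concretely, the standard LayerNorm computation over the n summed inputs of a layer is:

    μ = (1/n) Σ a(i)
    σ² = (1/n) Σ (a(i) − μ)²
    â(i) = g(i) × (a(i) − μ) / σ

(the sums run over i = 1 … n; a bias term is sometimes added after re-scaling).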
Root Mean Square Normalization or RMSNorm regularizes the summed inputs to a neuron in
one layer according to root mean square (RMS) giving the model re-scaling invariance
property and implicit learning rate adaptation ability. This is because the scale of the
activations influences the magnitude of weight updates during training and hence RMSNorm
adjusts the learning rate based on the RMS value of the inputs contributing to stable training
and faster convergence. Hence, RMSNorm only focuses on re-scaling invariance and
regularizes the summed inputs simply according to the root mean square (RMS) statistic as
given below.
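The RMS statistic and the resulting normalization are:

    RMS(a) = sqrt((1/n) Σ a(i)²)
    â(i) = g(i) × a(i) / RMS(a)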
Intuitively, RMSNorm simplifies LayerNorm by totally removing the mean statistic from
LayerNorm at the cost of sacrificing the invariance that mean normalization affords. When
the mean of the summed inputs is zero, RMSNorm is exactly equal to LayerNorm. Through
various experiments, the authors have shown that this property is not fundamental to the
success of LayerNorm and that RMSNorm is similarly or more effective.
Encoder, Decoder
Encoders and decoders are key parts of the Transformer model, a popular deep learning
model used for natural language processing (NLP). Encoders in Transformers are neural
network layers that process the input sequence and produce a continuous representation, or
embedding, of the input. The
decoder then uses these embeddings to generate the output sequence. The encoder typically
consists of multiple self-attention and feed-forward layers, allowing the model to process
and understand the input sequence effectively.
Fig: 13
The transformer encoder architecture typically consists of multiple layers, each of which
includes a self-attention mechanism and a feed-forward neural network. The encoder
processes the input sequence and produces a continuous representation, or embedding, of
the input, which is then passed to the decoder to generate the output sequence.
The decoder also typically consists of multiple layers, including self-attention mechanisms
and feed-forward networks, that generate the output data based on the context provided by
the encoder. It uses the embeddings produced by the encoder, together with its own internal
states, to generate the output sequence; its self-attention layers allow it to consider the
context and dependencies between different parts of the output when generating the data.
Need of an Encoder
The Transformer architecture is a neural network model that uses an encoder to process and
encode input data.
Extracting useful features and patterns from complex and unstructured input data is crucial for
understanding the input and making predictions.
The encoder also plays a crucial role in the encoder-decoder architecture, a common
structure used in natural language processing tasks such as machine translation and
language generation.
In this architecture, the encoder processes and encodes the input data, and the decoder
generates the output data based on the encoded representation.
The encoded representation provided by the encoder serves as the “context” for the
decoder, which helps it generate more accurate and coherent output by capturing the
essence of the input.
In the first step of the Transformer process, the input data is transformed into a vector
representation using an embedding layer. This layer maps each word in the input to a fixed-
length vector, which helps the model capture the meaning and context of the words.
After the input data has been embedded, it is augmented with positional encoding.
The positional encoding is a series of sinusoidal functions that encode the relative position of
each word in the input sequence, and it is added to the input embeddings element-wise. This
is necessary because the transformer has no inherent understanding of the order of the input
sequence, and the position of each word in the sequence can affect its meaning.
The next step in the Transformer process is the multi-headed attention stage. This is where
the transformer differs significantly from other models, using self-attention mechanisms to
capture dependencies between different parts of the input data. This enables the transformer
to capture the input data's long-range dependencies and contextual relationships.
The Transformer model revolutionizes language processing with its unique architecture,
which includes a crucial component known as the Feedforward Network (FFN). Positioned
within both the encoder and decoder modules of the Transformer, the FFN plays a vital role
in refining the data processed by the attention mechanisms.
Fig: 14
The FFN within both the encoder and decoder of the Transformer is constructed as a fully
connected, position-wise network. This design means that each position in the input
sequence is processed separately but in the same manner, which is crucial for maintaining
the positional integrity of the input data.
1. Linear Layers:
The FFN comprises two linear (fully connected) layers that transform the input data. The first
layer expands the input dimension from dmodel=512 to a larger dimension dff=2048, and the
second layer projects it back to dmodel.
2. Activation Function:
A Rectified Linear Unit (ReLU) activation function is applied between these two linear layers.
This function is defined as ReLU(x)=max(0,x) and is used to introduce non-linearity into the
model, helping it to learn more complex patterns.
3. Position-wise Processing:
Despite the sequential nature of the input data, each position (i.e., each word’s
representation in a sentence) is processed independently with the same FFN. This is akin to
applying the same transformation across all positions, ensuring uniformity in extracting
features from different parts of the input sequence.
Mathematical Representation
The operations within the FFN can be mathematically described by the following equations:
FFN(x)=max(0,xW1+b1)W2+b2
Where:
W1 and W2 are the weight matrices, and b1 and b2 the bias vectors, of the first and second
linear layers, respectively.
The ReLU activation is applied element-wise after the first linear transformation.
As a worked example, take an input vector x = [0.5, −0.2, 0.1, …] (512-dimensional).
The first layer of the FFN transforms this vector into a higher 2048-dimensional space, adds a
bias, and applies the ReLU activation:
x′=max(0,xW1+b1)
Assuming non-negative outputs from ReLU for simplicity, the second layer then projects this
vector back down to the original 512-dimensional space:
FFN output=x′W2+b2
This output is then normalized by a subsequent post-LN step and either fed into the next
layer of the encoder or used as part of the input to the multi-head attention layer in the
decoder.
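A minimal NumPy sketch of this position-wise feed-forward block, with random placeholders standing in for the learned weights:

    import numpy as np

    d_model, d_ff = 512, 2048
    W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
    W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

    def ffn(x):
        # First layer expands to d_ff and applies ReLU, second projects back to d_model.
        h = np.maximum(0, x @ W1 + b1)
        return h @ W2 + b2

    x = np.random.randn(10, d_model)   # 10 positions, each processed independently but identically
    out = ffn(x)                       # shape (10, 512)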
Softmax
Softmax is a mathematical function that takes a set of numbers and transforms them into a
set of probabilities between 0 and 1. These probabilities always add up to 1. It’s commonly
used in neural networks to convert raw scores from the network into a probability
distribution.
Softmax in Transformers
Softmax comes into play after these scores are computed. It takes these scores, which can be
any number, and converts them into a probability distribution. This distribution tells the
model how likely it is that each word is important in the current context.
Fig: 15
Softmax is commonly employed to convert raw attention scores into a probability
distribution, ensuring that the attention weights sum to 1. This normalization allows the
model to focus on certain parts of the input sequence. An open question is whether less
constraining activation functions could be used instead, leaving the optimization process to
decide how to allocate attention, much as tanh and other activations are used in other layers
of a neural network.
In most deep neural networks, softmax is used only as the final layer, to produce the output
probabilities for classification, and so it accounts for only a small fraction of the computation
time and energy. This is no longer true for Transformer networks, which use softmax as a key
component of the attention mechanism; for these networks, softmax can become a
significant bottleneck. The softmax operation is inefficient on current hardware for two main
reasons. First, softmax requires the exponential function. Exponential functions tend to
require large look-up tables (LUTs) to compute the result through Taylor expansions. This is
particularly true for general-purpose hardware such as CPUs and GPGPUs, which cater to
exponential computations with high accuracy requirements because of their use in scientific
computing applications. The resulting area and power overhead makes it difficult to
instantiate a large number of these units.
Second, to improve training stability, deep neural networks typically use a numerically stable
softmax, which subtracts the maximum of the vector before exponentiation so that the
exponentials do not overflow. This stability comes at a cost: computing the maximum
introduces an additional pass through the vector, incurring latency and memory overheads.
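The difference between the naive and the numerically stable formulation can be seen in a
few lines of NumPy (the score values below are made up to force an overflow):

    import numpy as np

    def naive_softmax(x):
        e = np.exp(x)             # overflows for large scores: exp(1000) -> inf
        return e / e.sum()

    def stable_softmax(x):
        m = x.max()               # the extra pass over the vector mentioned above
        e = np.exp(x - m)         # the largest exponent is now exp(0) = 1
        return e / e.sum()

    scores = np.array([1000.0, 999.0, 998.0])
    print(naive_softmax(scores))  # [nan nan nan] after an overflow warning
    print(stable_softmax(scores)) # approximately [0.665, 0.245, 0.090]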
Scaling up the number of parameters in a transformer model can improve its performance
and make it more capable of understanding and generating complex text. For example, GPT-3
has 175 billion parameters, and GPT-4 is widely reported to be substantially larger, although
its exact parameter count has not been disclosed.
Using larger datasets
Training a transformer model on larger and more diverse datasets can improve its
performance.
Using more computational resources
Devoting more computational resources to training can likewise improve performance.
Using distillation
Knowledge from a larger model can be transferred to a smaller model using distillation. This
can be useful because larger models are more expensive and slower to use.
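A common way to implement this is to train the small "student" model to match the large
"teacher" model's softened output distribution. The PyTorch sketch below shows one such
distillation loss; the temperature value, tensor shapes and random logits are illustrative
placeholders only.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # KL divergence between the teacher's and the student's softened distributions
        t = temperature
        student_log_probs = F.log_softmax(student_logits / t, dim=-1)
        teacher_probs = F.softmax(teacher_logits / t, dim=-1)
        return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

    teacher_logits = torch.randn(4, 10)                      # stand-in for the large model's outputs
    student_logits = torch.randn(4, 10, requires_grad=True)  # stand-in for the small model's outputs
    loss = distillation_loss(student_logits, teacher_logits)
    loss.backward()                                          # gradients flow only into the student
    print(loss.item())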
Using TokenFormer
TokenFormer is an architecture that treats model parameters as tokens, allowing for efficient
scaling without retraining from scratch.
The scale of a transformer model, determined by the number of parameters, the size of the
training dataset, and the computational resources used for training, tends to have a greater
impact on model loss than the model's architectural structure.
Applications: Time Series Data, Sequence-Based Data, Text and Vision
Transformers treat time series data as a sequence of values, with each value representing a
time step. The model uses an encoder-decoder architecture, where the encoder takes in the
time series history and the decoder predicts future values. The decoder uses an attention
mechanism to learn which parts of the history are most useful for making predictions.
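A rough sketch of this encoder-decoder setup, using PyTorch's built-in nn.Transformer, is
shown below. The layer sizes, sequence lengths and random data are placeholders, and
positional encodings are omitted for brevity.

    import torch
    import torch.nn as nn

    d_model = 64
    model = nn.Transformer(d_model=d_model, nhead=4,
                           num_encoder_layers=2, num_decoder_layers=2)
    input_proj = nn.Linear(1, d_model)    # embed each scalar time step
    output_proj = nn.Linear(d_model, 1)   # map decoder outputs back to values

    history = torch.randn(48, 8, 1)       # (time steps, batch, 1): the series history
    future_in = torch.randn(12, 8, 1)     # decoder inputs, e.g. shifted known values

    src = input_proj(history)             # encoder input: (48, 8, d_model)
    tgt = input_proj(future_in)           # decoder input: (12, 8, d_model)
    out = model(src, tgt)                 # the decoder attends over the encoded history
    forecast = output_proj(out)           # (12, 8, 1): predicted future values
    print(forecast.shape)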
Benefits
Transformers can capture temporal patterns across different time scales, which can help
provide a more comprehensive understanding of the data.
Examples
Informer: Uses distillation to extract active data points and pass them to the next encoder
layer, which can help reduce memory usage.
Applications
Transformers have been used in various aspects of time series analysis, such as time series
forecasting
In time series forecasting, two main approaches have been prevalent: temporal
convolutional networks (TCNs) and recurrent neural networks (RNNs), such as LSTM and
GRU. TCNs leverage convolutional layers to capture local patterns within the data, while
RNNs process sequences recursively, retaining memory of past states.
Multi-head attention, as used in transformers, could improve the way time series models
handle long-term dependencies, offering benefits over current approaches. To get a sense of
how well transformers handle long dependencies, consider the long and detailed responses
that ChatGPT can generate in language tasks. Applying multi-head attention to time series
could yield similar benefits by allowing one head to focus on long-term dependencies while
another head focuses on short-term dependencies. In principle, this could allow time series
models to predict as many as 1,000 data points into the future, if not more.
The way transformers compute multi-head self-attention is problematic for time series.
Because every data point in a sequence must be compared with every other data point, each
point added to the input increases the attention computation quadratically rather than
linearly. This is called quadratic complexity, and it creates a computational bottleneck when
dealing with long sequences.
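The growth is easy to observe by timing the n × n score matrix alone (the sequence lengths
and head dimension below are arbitrary):

    import time
    import numpy as np

    d = 64
    rng = np.random.default_rng(0)
    for n in (1000, 2000, 4000):                   # doubling the sequence length each time
        Q = rng.normal(size=(n, d)).astype(np.float32)
        K = rng.normal(size=(n, d)).astype(np.float32)
        start = time.perf_counter()
        scores = Q @ K.T                           # the n x n attention score matrix
        elapsed = time.perf_counter() - start
        print(n, scores.shape, f"{elapsed:.4f}s")  # time and memory grow roughly 4x per doubling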
Spacetimeformer proposes a new way to represent inputs. Temporal attention models like
Informer represent the values of all variables at a time step in a single input token, which
fails to consider spatial relationships between features. Graph attention models allow
relationships between features to be specified manually, but they rely on hardcoded graphs
that cannot change over time. Spacetimeformer combines both temporal and spatial
attention methods, creating an input token for the value of a single feature at a given time,
as sketched below. This helps the model understand more about the relationships between
space, time, and value information.
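A toy illustration of this flattening, with made-up array sizes and plain indices standing in for
the learned time and variable embeddings a real model would use:

    import numpy as np

    T, N = 4, 3                                   # 4 time steps, 3 variables
    series = np.arange(T * N, dtype=float).reshape(T, N)

    tokens = []
    for t in range(T):
        for n in range(N):
            # one token per (time step, variable) pair: (value, time index, variable index)
            tokens.append((series[t, n], t, n))

    tokens = np.array(tokens)
    print(tokens.shape)                           # (12, 3): T*N tokens instead of T tokens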
In transformers, the softmax function is commonly used as part of the mechanism for
calculating attention scores, which are critical for the self-attention mechanism that forms
the basis of the model. It is essential for several reasons:
1. Probability Distribution: Softmax ensures that the attention scores are transformed
   into a valid probability distribution, with all values between 0 and 1 and a sum equal
   to 1. This property is important for correctly weighing the input tokens according to
   their relative importance.
2. Stabilizing Gradients: The softmax function has a smooth gradient, which makes it
   easier to train deep neural networks like transformers with backpropagation. It helps
   with gradient stability during training, making it easier for the model to learn and
   adjust its parameters.
3. Attention Weight Computation: The softmax function is applied to the raw attention
   scores obtained from the scaled dot product of the query and key vectors in the self-
   attention mechanism. For a given query token, the attention weight assigned to the
   i-th input token is:
   weight_i = exp(q · k_i / √d_k) / Σ_j exp(q · k_j / √d_k)
Here, q is the query vector of the current token, the k_i are the key vectors of the input
tokens, d_k is the dimension of the key vectors, and the exponential function (exp)
transforms the raw scores into positive values. The denominator ensures that the resulting
values form a probability distribution.
In summary, the softmax function is a crucial component of transformers that enables them
to learn how to weigh input tokens based on their relevance to the current context, making
the model’s self-attention mechanism effective in capturing dependencies and relationships
in the data.
Finally, note that the softmax itself does not prevent exploding or vanishing gradients;
rather, the attention scores are divided by √d_k before the softmax is applied, which keeps
the function out of regions with extremely small gradients and helps keep training stable.