
Deep Learning

Dr. Irfan Yousuf


Institute of Data Science, UET, Lahore
(Week 12; April 06, 2025)
Outline
• Generative AI (GenAI)
• Transformers
Generative AI
• Generative AI is a type of artificial intelligence technology
that generates new text, audio, video, or any other type of
content by using architectures like Generative Adversarial
Networks (GANs) or Variational Autoencoders (VAEs).

• It learns patterns from existing training data and produces new and unique output that resembles real-world data.
Generative AI
• Generative AI (GenAI) broadly describes machine learning (ML) models or algorithms that can create new content.

• The technology behind OpenAI’s chatbot ChatGPT is generative AI. This technology serves as the brain of ChatGPT and enables it to generate responses like a real person.

• In contrast, other types of AI, like classification and regression models, focus on analyzing or making predictions on input data. In simple terms, Generative AI is all about creation, while other AI types are about analysis or prediction.
How does Generative AI work?
• Generative AI starts from prompts in various formats, such as text, images, videos, designs, or musical notes, with diverse algorithms creating essays, problem solutions, or realistic fakes in response.

• Initially, using generative AI was complex, involving API submissions and specialized tools in languages like Python. However, user experiences have evolved, allowing plain-language requests.
Generative AI Models
• Generative AI models seamlessly integrate a diverse array of
AI algorithms to comprehend and process content.
• Under text generation, various natural language processing
techniques transform raw characters—letters, punctuation,
and words—into elements such as sentences, parts of speech,
entities, and actions. These intricacies are then skillfully
represented as vectors through the application of multiple
encoding methods.
• Likewise, when it comes to images, they undergo a
transformative process, emerging as diverse visual elements,
also captured as vectors.
Generative AI Models
• Once a comprehensive representation of the world is
established, developers leverage specific neural networks to
generate fresh content based on queries or prompts.

• Techniques such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), featuring both a decoder and an encoder, demonstrate efficacy in crafting realistic human faces, generating synthetic data for AI training, or even replicating specific individuals.
Generative AI Models
• Recent strides in transformer technology, embodied by
Google’s Bidirectional Encoder Representations from
Transformers (BERT), OpenAI’s GPT, and Google DeepMind’s
AlphaFold, have substantially expanded the capabilities of
neural networks.
Transformers
• Transformers are a type of neural network architecture that
transforms or changes an input sequence into an output
sequence.

• They do this by learning context and tracking relationships between sequence components.

• For example, consider this input sequence: "What is the color of the sky?" The transformer model uses an internal mathematical representation that identifies the relevancy and relationship between the words color, sky, and blue. It uses that knowledge to generate the output: "The sky is blue."
Transformers
• Early deep learning models that focused extensively on
natural language processing (NLP) tasks aimed at getting
computers to understand and respond to natural human
language. They guessed the next word in a sequence based on
the previous word.

• To understand better, consider the autocomplete feature in your smartphone. It makes suggestions based on the frequency of word pairs that you type. For example, if you frequently type "I am fine," your phone autosuggests fine after you type am.
Sequential Processing
• Traditional neural networks that deal with data sequences
often use an encoder/decoder architecture pattern.

• The encoder reads and processes the entire input data sequence, such as an English sentence, and transforms it into a compact mathematical representation.

• This representation is a summary that captures the essence of the input. Then, the decoder takes this summary and, step by step, generates the output sequence, which could be the same sentence translated into French.
Sequential Processing
• This process happens sequentially, which means that it has to
process each word or part of the data one after the other.

• The process is slow and can lose some finer details over long
distances.
Self-attention mechanism
• Transformer models modify this process by incorporating
something called a self-attention mechanism.

• Instead of processing data in order, the mechanism enables the model to look at different parts of the sequence all at once and determine which parts are most important.

• Imagine that you're in a busy room and trying to listen to someone talk. You focus on the voice that matters and filter out the background noise; self-attention lets the model focus on the relevant tokens in the same way.

• It's also more effective than sequential processing, especially when dealing with long pieces of text where context from far back might influence the meaning of what's coming next.
Transformer Architecture
• Every text-generative Transformer consists of these three key
components:

1. Embedding: Text input is divided into smaller units called tokens, which can be words or subwords. These tokens are converted into numerical vectors called embeddings, which capture the semantic meaning of words.
Transformer Architecture
2. Transformer Block is the fundamental building block of
the model that processes and transforms the input data. Each
block includes:
Attention Mechanism, the core component of the
Transformer block. It allows tokens to communicate with
other tokens, capturing contextual information and
relationships between words.

MLP (Multilayer Perceptron) Layer, a feed-forward network that operates on each token independently. While the goal of the attention layer is to route information between tokens, the goal of the MLP is to refine each token's representation.
Transformer Architecture
3. Output Probabilities: The final linear and softmax layers
transform the processed embeddings into probabilities,
enabling the model to make predictions about the next token in
a sequence.
Transformer Architecture: Input Embeddings
• This stage converts the input sequence into the mathematical
domain that algorithms understand.
• At first, the input sequence is broken down into a series of
tokens or individual sequence components.
• For instance, if the input is a sentence, the tokens are words.
• Embedding then transforms the token sequence into a
mathematical vector sequence.
• The vectors carry semantic and syntactic information, represented as numbers, and their attributes are learned during the training process.
Transformer Architecture: Input Embeddings
• Consider the example prompt: “Data visualization empowers users to”.
• This input needs to be converted into a format that the model
can understand and process. That is where embedding comes
in: it transforms the text into a numerical representation
that the model can work with. To convert a prompt into
embedding, we need to 1) tokenize the input, 2) obtain token
embeddings, 3) add positional information, and finally 4) add
up token and position encodings to get the final embedding.
Transformer Architecture: Input Embeddings
• Step 1: Tokenization:
• Tokenization is the process of breaking down the input text
into smaller, more manageable pieces called tokens. Each token can be a word or a subword. The words "Data" and
"visualization" correspond to unique tokens, while the word
"empowers" is split into two tokens. The full vocabulary of
tokens is decided before training the model: GPT-2's
vocabulary has 50,257 unique tokens.
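
• As a rough sketch of this step (assuming the Hugging Face transformers package and its pretrained "gpt2" tokenizer, neither of which is part of these slides), tokenization can be reproduced in a few lines of Python:

# Sketch: tokenizing the example prompt with a GPT-2 tokenizer.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
prompt = "Data visualization empowers users to"
token_ids = tokenizer.encode(prompt)                  # integer IDs into the vocabulary
tokens = tokenizer.convert_ids_to_tokens(token_ids)   # "empowers" splits into subword tokens
print(tokens)
print(len(tokenizer))                                 # vocabulary size: 50257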
Transformer Architecture: Input Embeddings
• Step 2: Token Embedding
• GPT-2 (small) represents each token in the vocabulary as a
768-dimensional vector; the dimension of the vector depends
on the model. These embedding vectors are stored in a matrix
of shape (50,257, 768), containing approximately 39 million
parameters! This extensive matrix allows the model to assign
semantic meaning to each token.
Transformer Architecture: Input Embeddings
• Step 3: Positional Encoding
• The Embedding layer also encodes information about each
token's position in the input prompt. Positional encoding adds
information to each token's embedding to indicate its position
in the sequence. Different models use various methods for
positional encoding. GPT-2 trains its own positional
encoding matrix from scratch, integrating it directly into the
training process.
Transformer Architecture: Input Embeddings
• Step 4: Final Embedding
• Finally, we sum the token and positional encodings to get the
final embedding representation. This combined
representation captures both the semantic meaning of the
tokens and their position in the input sequence.
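
• A minimal PyTorch sketch of Steps 2-4 (the token IDs below are placeholders rather than real GPT-2 IDs; the dimensions follow GPT-2 small):

import torch
import torch.nn as nn

vocab_size, max_len, d_model = 50257, 1024, 768
token_emb = nn.Embedding(vocab_size, d_model)   # Step 2: learned token-embedding matrix (50,257 x 768)
pos_emb = nn.Embedding(max_len, d_model)        # Step 3: learned positional-encoding matrix (GPT-2 style)

token_ids = torch.tensor([[10, 20, 30, 40, 50, 60]])        # placeholder IDs from the tokenizer
positions = torch.arange(token_ids.size(1)).unsqueeze(0)    # position indices 0, 1, 2, ...

x = token_emb(token_ids) + pos_emb(positions)   # Step 4: final embedding, shape (1, seq_len, 768)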
Transformer Architecture: Input Embeddings
Transformer Architecture
Transformer Architecture: Transformer block
• The core of the Transformer's processing lies in the
Transformer block, which comprises multi-head self-
attention and a Multi-Layer Perceptron layer. Most models
consist of multiple such blocks that are stacked sequentially
one after the other. The token representations evolve through
layers, from the first block to the last one, allowing the model
to build up a complex understanding of each token. This
layered approach leads to higher-order representations of the
input. The GPT-2 (small) model we are examining consists of
12 such blocks.
Transformer Architecture: Transformer block
• For instance, consider the sentences "Speak no lies" and "He
lies down."

• In both sentences, the meaning of the word lies can’t be understood without looking at the words next to it.

• The words speak and down are essential to understand the correct meaning. Self-attention enables the grouping of relevant tokens for context.
Transformer Architecture: Transformer block
• The self-attention mechanism enables the model to focus on
relevant parts of the input sequence, allowing it to capture
complex relationships and dependencies within the data.
Transformer Architecture: Transformer block
• Step 1: Query, Key, and Value Matrices
• Each token's embedding vector is transformed into three
vectors: Query (Q), Key (K), and Value (V). These vectors
are derived by multiplying the input embedding matrix with
learned weight matrices for Q, K, and V.
Transformer Architecture: Transformer block
• Query (Q) is the search text you type in the search engine
bar. This is the token you want to "find more information
about".
• Key (K) is the title of each web page in the search result
window. It represents the possible tokens the query can
attend to.
• Value (V) is the actual content of web pages shown. Once
we matched the appropriate search term (Query) with the
relevant results (Key), we want to get the content (Value) of
the most relevant pages.
By using these QKV values, the model can calculate attention
scores, which determine how much focus each token should
receive when generating predictions.
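
• A hedged sketch of Step 1 in PyTorch (the weight matrices below are randomly initialized stand-ins for the learned projections; d_model follows GPT-2 small):

import torch
import torch.nn as nn

d_model = 768
W_q = nn.Linear(d_model, d_model, bias=False)   # learned projection for Queries
W_k = nn.Linear(d_model, d_model, bias=False)   # learned projection for Keys
W_v = nn.Linear(d_model, d_model, bias=False)   # learned projection for Values

x = torch.randn(1, 6, d_model)                  # (batch, seq_len, d_model) embeddings from the previous stage
Q, K, V = W_q(x), W_k(x), W_v(x)                # each has shape (1, 6, 768)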
Transformer Architecture: Transformer block
• Step 2: Multi-Head Splitting
• Query, Key, and Value vectors are split into multiple heads—
in GPT-2 (small)'s case, into 12 heads. Each head processes a
segment of the embeddings independently, capturing
different syntactic and semantic relationships. This design
facilitates parallel learning of diverse linguistic features,
enhancing the model's representational power.
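
• Continuing the sketch above, the 768-dimensional Q, K, and V vectors can be split into 12 heads of 64 dimensions each:

n_heads = 12
d_head = d_model // n_heads                     # 768 / 12 = 64
batch, seq_len = Q.shape[0], Q.shape[1]

def split_heads(t):
    # (batch, seq_len, 768) -> (batch, 12, seq_len, 64)
    return t.view(batch, seq_len, n_heads, d_head).transpose(1, 2)

Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)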
Transformer Architecture: Transformer block
• Step 3: Masked Self-Attention
• In each head, we perform masked self-attention calculations.
This mechanism allows the model to generate sequences by
focusing on relevant parts of the input while preventing
access to future tokens.
Transformer Architecture: Transformer block
• Step 3: Masked Self-Attention
• Attention Score: The dot product of Query and Key
matrices determines the alignment of each query with each
key, producing a square matrix that reflects the relationship
between all input tokens.

• Masking: A mask is applied to the upper triangle of the attention matrix to prevent the model from accessing future tokens, setting these values to negative infinity. The model needs to learn how to predict the next token without “peeking” into the future.
Transformer Architecture: Transformer block
• Step 3: Masked Self-Attention
• Softmax: After masking, the attention scores are converted into probabilities by the softmax operation, which exponentiates each score and normalizes each row. Each row of the matrix then sums to one and indicates the relevance of every token to its left.
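
• The three sub-steps above can be sketched as follows, continuing the per-head tensors Qh and Kh from the previous sketch:

import math

# Attention Score: dot product of Queries and Keys, scaled by the head dimension.
scores = Qh @ Kh.transpose(-2, -1) / math.sqrt(d_head)        # (batch, 12, seq_len, seq_len)

# Masking: set the upper triangle to negative infinity so tokens cannot see the future.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

# Softmax: convert the masked scores into probabilities; each row sums to one.
weights = torch.softmax(scores, dim=-1)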
Transformer Architecture: Transformer block
• Step 4: Output and Concatenation
• The model uses the masked self-attention scores and
multiplies them with the Value matrix to get the final output
of the self-attention mechanism. GPT-2 has 12 self-attention
heads, each capturing different relationships between tokens.
The outputs of these heads are concatenated and passed
through a linear projection.
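
• A sketch of Step 4, continuing from the attention weights above (W_o is a randomly initialized stand-in for the learned output projection):

out = weights @ Vh                                            # (batch, 12, seq_len, 64)
out = out.transpose(1, 2).reshape(batch, seq_len, d_model)    # concatenate the 12 heads back to 768
W_o = nn.Linear(d_model, d_model)                             # learned linear projection
attn_output = W_o(out)                                        # passed on to the MLP layer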
Transformer Architecture: Transformer block
• MLP: Multi-Layer Perceptron
• After the multiple heads of self-attention capture the diverse
relationships between the input tokens, the concatenated
outputs are passed through the Multilayer Perceptron (MLP)
layer to enhance the model's representational capacity. The
MLP block consists of two linear transformations with a
GELU activation function in between.
Transformer Architecture: Transformer block
• The GELU (Gaussian Error Linear Unit) is an activation
function widely used in modern deep learning models,
particularly transformer-based models like BERT, GPT, and
T5.
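
• A minimal sketch of the MLP block described above (the 4x expansion to 3,072 follows GPT-2 small; the block is applied to each token's representation independently):

import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(768, 4 * 768),   # first linear transformation: expand 768 -> 3072
    nn.GELU(),                 # Gaussian Error Linear Unit activation
    nn.Linear(4 * 768, 768),   # second linear transformation: project back to 768
)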
Transformer Architecture: Output Probabilities
• After the input has been processed through all Transformer
blocks, the output is passed through the final linear layer to
prepare it for token prediction. This layer projects the final
representations into a 50,257-dimensional space, where every
token in the vocabulary has a corresponding value called
logit. Any token can be the next word, so this process allows
us to simply rank these tokens by their likelihood of being
that next word. We then apply the softmax function to
convert the logits into a probability distribution that sums to
one. This will allow us to sample the next token based on its
likelihood.
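
• A sketch of this final step (the unembedding layer below is a randomly initialized stand-in; in GPT-2 its weights are shared with the token-embedding matrix):

import torch
import torch.nn as nn

vocab_size, d_model = 50257, 768
lm_head = nn.Linear(d_model, vocab_size, bias=False)   # final linear layer

h_last = torch.randn(1, d_model)            # processed representation of the last token
logits = lm_head(h_last)                    # (1, 50257): one logit per vocabulary token
probs = torch.softmax(logits, dim=-1)       # probability distribution that sums to one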
Transformer Architecture: Output Probabilities
• The final step is to generate the next token by sampling from this distribution. The temperature hyperparameter plays a critical role in this process. Mathematically speaking, it is a very simple operation: the model's output logits are simply divided by the temperature before the softmax is applied:
• temperature = 1: Dividing logits by one has no effect on the
softmax outputs.
• temperature < 1: Lower temperature makes the model more
confident and deterministic by sharpening the probability
distribution, leading to more predictable outputs.
• temperature > 1: Higher temperature creates a softer
probability distribution, allowing for more randomness in the
generated text – what some refer to as model “creativity”.
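
• A sketch of temperature scaling, reusing the logits from the previous sketch:

def sample_next_token(logits, temperature=1.0):
    # Dividing the logits by the temperature sharpens (< 1) or flattens (> 1)
    # the probability distribution before sampling.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # ID of the sampled next token

next_id = sample_next_token(logits, temperature=0.7)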
Transformer Architecture: Output Probabilities
• In addition, the sampling process can be further refined using
top-k and top-p parameters:

• top-k sampling: Limits the candidate tokens to the top k tokens with the highest probabilities, filtering out less likely options.

• top-p sampling: Considers the smallest set of tokens whose cumulative probability exceeds a threshold p, ensuring that only the most likely tokens contribute while still allowing for diversity.
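
• A hedged sketch of top-k filtering applied to the logits from the earlier sketch (top-p works analogously on the sorted cumulative probabilities):

import torch

def top_k_filter(logits, k=50):
    # Keep only the k highest logits; all others get probability zero after softmax.
    kth_value = torch.topk(logits, k).values[..., -1, None]
    return logits.masked_fill(logits < kth_value, float("-inf"))

probs = torch.softmax(top_k_filter(logits, k=50), dim=-1)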
2017 Conference on Neural Information
Processing Systems (NeurIPS)
Attention
• An attention function can be described as mapping a query
and a set of key-value pairs to an output, where the query,
keys, values, and output are all vectors.
• The output is computed as a weighted sum of the values,
where the weight assigned to each value is computed by a
compatibility function of the query with the corresponding
key.
• The key/value/query concept is analogous to retrieval
systems. For example, when we search for videos on
YouTube, the search engine will map our query (text in the
search bar) against a set of keys (video title, description, etc.)
associated with candidate videos in their database, then
present us the best matched videos (values).
Attention: Scaled Dot-Product Attention
• We call our particular attention "Scaled Dot-Product
Attention".

• The input consists of queries and keys of dimension d_k, and values of dimension d_v.

• We compute the dot products of the query with all keys, divide each by √d_k, and apply a softmax function to obtain the weights on the values.
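
• Written out, this is the scaled dot-product attention formula from the paper:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V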
Attention: Scaled Dot-Product Attention
• The dot product between the query (“apple”) and each key (“cat,” “apple,” “tree,” “juice”) determines how relevant each word in the English sentence is to “apple.” Higher dot products indicate higher relevance.

• These scores are then used to weight the corresponding values (“chat,” “pomme,” “arbre,” “jus”) when generating the translation for “apple.”
Attention: Scaled Dot-Product Attention
• Queries are the set of vectors you want to calculate attention for.
• Keys are the set of vectors you want to calculate attention against.
• As a result of the dot-product multiplication, you get a set of weights a (also vectors) showing how strongly each query attends to the Keys. You then multiply these weights by the Values to get the resulting set of vectors.
Transformer Architecture
Sequence Modeling
RNNs to Transformers
Useful Links
• https://poloclub.github.io/transformer-explainer/

• https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention
Summary
• Generative AI

• Transformers
