Sequential Modelling
Week 11
Need for sequential modelling
• Fully Connected Network (FCN)
• Fixed input dimension
• e.g.: input [x1 x2 x3 x4 … xn]
• If input size is < n → pad the remaining positions with zeros
• If input size is > n → the extra input data is ignored
• Convolutional Neural Network (CNN)
• Carry spatial information
• Good for image data
FCN and CNN:
▪ Output for a given snapshot
▪ Next set of inputs is treated as a new snapshot
▪ Fixed input dimension
▪ Does not carry memory
Need for sequential modelling (cont’d)
• Motivation for sequential models:
➢ Time series data, which typically shows:
   o Periodic cycles
   o Trends
   o Regularity
   o Sudden spikes/drops
➢ Examples of time series data:
   • Video
   • Autonomous vehicle: object state
   • Electric circuit
   • Temperature variation
   • Stock price
➢ Natural Language:
o Email auto complete
o Translation (e.g.: English to French)
o Sentiment analysis
Need for sequential modelling (cont’d)
NLP: varying input size
• "Today is the coolest temperature in Windsor"
   → token IDs: 1 2 3 4 5 6 7
• "The historical average temperature in November is 12 degree Celsius"
   → token IDs: 1 8 9 5 6 10 2 11 13 14
Tokenization is the process of breaking down text into smaller, manageable pieces called "tokens."
Word tokenization – ["I", "love", "NLP"]
Character tokenization – ["N", "L", "P"]
A token ID is a numerical identifier assigned to each token during the tokenization process.
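Below is a minimal sketch, in plain Python, of word- and character-level tokenization and token-ID assignment as described above; the sentence and the ID mapping are illustrative only (real tokenizers build their vocabulary from a full training corpus, as covered later in this section).

```python
# Word- and character-level tokenization, plus a toy token-to-ID mapping.
# The sentence and the ID assignment are illustrative; real tokenizers build
# their vocabulary from a full training corpus.
text = "Today is the coolest temperature in Windsor"

word_tokens = text.split()        # ['Today', 'is', 'the', 'coolest', ...]
char_tokens = list("NLP")         # ['N', 'L', 'P']

# Assign each unique word token a numerical identifier (token ID).
token_to_id = {tok: i + 1 for i, tok in enumerate(dict.fromkeys(word_tokens))}
ids = [token_to_id[tok] for tok in word_tokens]

print(word_tokens)
print(char_tokens)
print(ids)                        # [1, 2, 3, 4, 5, 6, 7]
```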
Need for sequential modelling (cont’d)
• Example 1 – loan approval: inputs (salary, credit score, experience, age) are mapped to an outcome (loan granted, loan rejected, or needs verification). Here the sequence (order) of the inputs does not matter.
• Example 2 – sentiment analysis: the words of the sentence "I like this dish" are mapped to a sentiment (positive, negative, or neutral). Here the sequence of the words matters.
• Problems when a fixed-size network is used for sequential data:
   • Varying input size
   • Too much computation
   • No parameter sharing
Recurrent Neural Network (RNN)
• The network is unrolled (unwrapped) across time steps (see the sketch below)
• Simple RNN: one hidden layer
• Deep RNN: many hidden layers
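Below is a minimal sketch of a simple RNN forward pass in NumPy; the layer sizes, weights, and input are illustrative assumptions, not values from the slides.

```python
import numpy as np

# Minimal forward pass of a simple (single hidden layer) RNN.
# All sizes and random weights below are illustrative assumptions.
input_size, hidden_size, seq_len = 4, 8, 5

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden (recurrent) weights
b_h = np.zeros(hidden_size)

x_seq = rng.normal(size=(seq_len, input_size))      # one input vector per time step
h = np.zeros(hidden_size)                           # initial hidden state

# Unrolling over time: the same weights are reused at every step
# (parameter sharing), and h carries memory from one step to the next.
for x_t in x_seq:
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)

print(h.shape)  # (8,) -> final hidden state summarizing the sequence
```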
Issues with RNN
• Vanishing gradient
• Exploding gradient
Long Short-Term Memory (LSTM)
• LSTMs introduce special units called memory cells to store information across time steps in a sequence. These
cells can maintain their state (memory) over a longer period of time than traditional RNN units.
• The memory cells are controlled by three gates: input gate, forget gate, and output gate. These gates allow
LSTMs to decide which information to keep, which to discard, and which new information to add.
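As a hedged illustration, the sketch below uses PyTorch's nn.LSTM, which implements the memory cells and the input/forget/output gates internally; the layer sizes and the random input are illustrative assumptions.

```python
import torch
import torch.nn as nn

# PyTorch's nn.LSTM implements the memory cell and the input/forget/output
# gates internally. Layer sizes and the random input are illustrative.
lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)

x = torch.randn(2, 5, 4)             # (batch=2, time steps=5, features=4)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # (2, 5, 8) -> hidden state at every time step
print(h_n.shape)     # (1, 2, 8) -> final hidden state
print(c_n.shape)     # (1, 2, 8) -> final cell state (the long-term memory
                     #              controlled by the three gates)
```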
Gated Recurrent Unit (GRU)
• GRUs are similar to Long Short-Term Memory (LSTMs) but have a simpler structure and fewer parameters,
making them computationally more efficient.
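A rough sketch of the "fewer parameters" point: comparing the parameter counts of an LSTM layer and a GRU layer of the same size (the sizes are illustrative assumptions).

```python
import torch.nn as nn

# Comparing parameter counts of an LSTM and a GRU layer of the same size.
# The sizes are illustrative; the point is that a GRU, with three gate weight
# blocks instead of the LSTM's four, has roughly 3/4 of the parameters.
def n_params(module):
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=64, hidden_size=128)
gru = nn.GRU(input_size=64, hidden_size=128)

print("LSTM parameters:", n_params(lstm))
print("GRU parameters: ", n_params(gru))
```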
Large Language Models (LLM)
• Language Models:
➢ Basic NLP tasks (answering questions, translation, sentiment analysis)
• LLM is a form of Generative Artificial Intelligence (GenAI – Able to generate new content)
• LLM is a Neural Network designed to
➢ Understand
➢ Generate
➢ Respond
to human-like text
• Deep NN trained on massive (large) amounts of data
Why do we call them Large Language Models?
• Trained on a massive amount of data
• Billions of parameters
Large Language Models (cont’d.)
LLM vs Earlier NLP (or simple LM) Models
• NLP/LM:
➢ very specific tasks (e.g., translation, sentiment analysis)
➢ Not able to write an email from given instructions
• LLM:
➢ Can do a wide range of NLP tasks
➢ Able to write an email for a given set of instructions, and more
• Why are LLMs so good compared to earlier NLP/LM models?
TRANSFORMER ARCHITECTURE
➢ Not all LLMs are transformers
➢ Not all transformers are LLMs
Large Language Models (cont’d.)
• Generative Artificial Intelligence (GenAI): generates new content
• LLMs typically deal with text, but do they have to be limited to text only?
NO
• GPT-4 is a multimodal model that can process text and images; however, it is referred to as an LLM because its primary focus and fundamental design are around text-based tasks.
• Waymo's multimodal end-to-end model refers to their integrated approach for autonomous driving, where
multiple types of data inputs (camera, radar and lidar) are processed together to make driving decisions.
Use Cases of LLM
o Machine translation: LLMs can be used to translate text from one language to another.
o Content generation: LLMs can generate new text, such as fiction, articles, and even computer
code.
o Sentiment analysis: LLMs can be used to analyze the sentiment of a piece of text, such as
determining whether it is positive, negative, or neutral.
o Text summarization: LLMs can be used to summarize a long piece of text, such as an article or a
document.
o Chatbots and virtual assistants: LLMs can be used to power chatbots and virtual assistants,
such as OpenAI's ChatGPT or Google's Gemini (formerly called Bard).
o Knowledge retrieval: LLMs can be used to retrieve knowledge from vast volumes of text in
specialized areas such as medicine or law.
Stages of Building LLMs
Note: pretraining carries a huge computational cost (e.g., GPT-3's training cost is approximately 4.6 million dollars).
▪ Stage 1: Implementing the LLM architecture and the data preparation process. This stage involves preparing and sampling the text data and understanding the basic mechanisms behind LLMs.
▪ Stage 2: Pretraining an LLM to create a foundation model. This stage involves pretraining the LLM on unlabeled data, typically a large, diverse data set (also known as a general data set).
▪ Stage 3: Fine-tuning the foundation model to become a personal assistant or text classifier. This stage involves fine-tuning the pretrained LLM on labeled data, which can be either an instruction dataset or a dataset with class labels.
Why is fine-tuning important?
• Train on your specific data set
• Customize for your application or organization (e.g., health care, airline, law firm, educational institute, etc.)
Simplified Transformer Architecture
• An encoder that processes the input text
and produces an embedding representation
(a numerical representation that captures
many different factors in different
dimensions) of the text
• Encodes input text into vectors
• A decoder that uses this representation to generate the translated text one word at a time
• Generates output text from the encoded vectors
Self-attention mechanism:
• Key part of transformers that allows the model to weigh the importance of different words/tokens relative to each other (sketched below)
• Enables the model to capture long-range dependencies
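Below is a minimal sketch of scaled dot-product self-attention, the core operation of the transformer; the sequence length, embedding size, and random projection matrices are illustrative assumptions, not the exact configuration of any specific LLM.

```python
import numpy as np

# Scaled dot-product self-attention in NumPy. The sequence length, embedding
# size, and random projection matrices are illustrative assumptions.
def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d_model = 5, 16                    # 5 tokens, 16-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))     # token embeddings

W_q = rng.normal(size=(d_model, d_model))   # query projection (learned in practice)
W_k = rng.normal(size=(d_model, d_model))   # key projection
W_v = rng.normal(size=(d_model, d_model))   # value projection

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Every token attends to every other token: the attention weights express how
# important each token is relative to the others, regardless of distance.
scores = Q @ K.T / np.sqrt(d_model)
weights = softmax(scores)                   # (5, 5) attention weights
out = weights @ V                           # (5, 16) context-aware representations

print(weights.shape, out.shape)
```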
Transformer Architecture
Attention Is All You Need
https://arxiv.org/pdf/1706.03762
BERT Vs GPT Architecture
• Bidirectional encoder representations from
transformers (BERT): the encoder segment
exemplifies BERT-like LLMs, which focus on
masked word prediction and are primarily
used for tasks like text classification
• Predict hidden words in a given
sentence
• Generative pre-trained transformers (GPT):
the decoder segment showcases GPT-like
LLMs, designed for generative tasks and
producing coherent text sequences
• Generate new words
GPT Architecture
• The GPT architecture employs only the
decoder portion of the original transformer.
• It is designed for unidirectional, left-to-right
processing, making it well suited for text
generation and next-word prediction tasks.
• Generates text in an iterative fashion, one word at a time.
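Below is a toy sketch of this iterative, one-token-at-a-time generation loop; the vocabulary and the stand-in scoring function are made up for illustration and do not come from a real trained model.

```python
import numpy as np

# Toy version of the left-to-right generation loop: at each step the model
# scores the next token given the tokens generated so far, and the chosen
# token is appended to the context. The vocabulary and scoring function
# below are stand-ins for illustration, not a real trained model.
vocab = ["the", "coolest", "temperature", "in", "Windsor", "<|endoftext|>"]
rng = np.random.default_rng(0)

def next_token_scores(context_ids):
    # Placeholder for a decoder-only transformer forward pass.
    return rng.normal(size=len(vocab))

context = [0]                               # start from the token "the"
for _ in range(5):                          # generate at most 5 more tokens
    next_id = int(np.argmax(next_token_scores(context)))   # greedy decoding
    context.append(next_id)
    if vocab[next_id] == "<|endoftext|>":
        break

print(" ".join(vocab[i] for i in context))
```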
GPT Architecture (cont’d.)
Working with text
Embedding
Vector Embedding
Words corresponding to similar concepts often appear close to each other in the embedding space. For instance, different types of birds appear closer to each other in the embedding space than to countries and cities.
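A minimal sketch of this idea using cosine similarity; the 2-D vectors are made-up illustrative values, not real learned embeddings.

```python
import numpy as np

# Cosine similarity between made-up 2-D "embeddings": the two birds end up far
# more similar to each other than either is to a country. Real embeddings are
# learned and have hundreds or thousands of dimensions.
embeddings = {
    "sparrow": np.array([0.9, 0.8]),
    "eagle":   np.array([0.85, 0.75]),
    "canada":  np.array([-0.7, 0.2]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["sparrow"], embeddings["eagle"]))   # close to 1
print(cosine_similarity(embeddings["sparrow"], embeddings["canada"]))  # much lower
```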
Tokenizing Texts
Here, we split an input text into
individual tokens, which are either
words or special characters, such as
punctuation characters.
Converting Tokens into Token IDs
We build a vocabulary by tokenizing the
entire text in a training dataset into
individual tokens. These individual
tokens are then sorted alphabetically,
and duplicate tokens are removed. The
unique tokens are then aggregated into
a vocabulary that defines a mapping
from each unique token to a unique
integer value. The depicted vocabulary is
purposefully small and contains no
punctuation or special characters for
simplicity.
Converting Tokens into Token IDs (cont’d.)
Starting with a new text sample, we tokenize the
text and use the vocabulary to convert the text
tokens into token IDs. The vocabulary is built from
the entire training set and can be applied to the
training set itself and any new text samples. The
depicted vocabulary contains no punctuation or
special characters for simplicity.
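Below is a minimal sketch of building such a vocabulary and applying it to a new text sample; the training text and the whitespace/punctuation tokenization rule are illustrative assumptions.

```python
import re

# Build a vocabulary from a (tiny, illustrative) training text, then use it to
# convert a new text sample into token IDs, following the steps above.
training_text = "Today is the coolest day. The coolest day in Windsor."

# Split on whitespace and punctuation, keeping punctuation as its own token.
tokens = [t for t in re.split(r"([,.]|\s)", training_text) if t.strip()]

# Unique tokens, sorted alphabetically, each mapped to a unique integer ID.
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
print(vocab)

# The same vocabulary is applied to a new text sample.
sample = "The coolest day"
sample_ids = [vocab[t] for t in sample.split()]
print(sample_ids)
```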
Adding special context tokens
We add special tokens to a vocabulary to deal with
certain contexts. For instance, we add
an <|unk|> token to represent new and unknown
words that were not part of the training data and
thus not part of the existing vocabulary.
Furthermore, we add an <|endoftext|> token that
we can use to separate two unrelated text sources.
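A minimal sketch of how these special tokens can be used; the tiny vocabulary below is illustrative.

```python
# Handling unknown words and document boundaries with special tokens.
# The tiny vocabulary below is illustrative.
vocab = {"Hello": 0, "world": 1, "<|unk|>": 2, "<|endoftext|>": 3}

def encode(text):
    # Any word not found in the vocabulary is replaced by <|unk|>.
    return [vocab.get(tok, vocab["<|unk|>"]) for tok in text.split()]

doc1 = "Hello world"
doc2 = "Hello Windsor"                    # "Windsor" is not in the vocabulary

# <|endoftext|> separates the two unrelated text sources.
ids = encode(doc1) + [vocab["<|endoftext|>"]] + encode(doc2)
print(ids)   # [0, 1, 3, 0, 2]
```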
Byte Pair Encoding (BPE)
The BPE tokenizer was used to train LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT.
BPE tokenizers break down unknown words into
subwords and individual characters. This way, a BPE
tokenizer can parse any word and doesn’t need to
replace unknown words with special tokens, such
as <|unk|>.
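A short sketch using the tiktoken package's GPT-2 encoding (assuming tiktoken is installed); the example string is illustrative.

```python
# Requires the tiktoken package (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("gpt2")       # the BPE tokenizer used by GPT-2

text = "Hello, do you like tea? <|endoftext|> someunknownPlace"
ids = enc.encode(text, allowed_special={"<|endoftext|>"})
print(ids)

# Unknown words are split into known subwords/characters, so decoding
# reproduces the original text without any <|unk|> placeholder.
print(enc.decode(ids))
```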