Chapter 1
NLP vs LLMs vs Generative AI

Aspect | NLP | LLMs | Generative AI
Definition | Field focused on processing human language | NLP using deep learning | Advanced models within AI that generate new content
Examples | Tokenization, sentiment analysis, text summarization | GPT-3, BERT, T5 | GPT-3, DALL-E, VQ-VAE
GPT-4 vs the Human Brain
• GPT-4 (reported estimates of roughly 1.2 trillion parameters; the true figure is not officially disclosed): a powerful tool for specific tasks requiring fast processing and large-scale data analysis.
• Human brain (roughly 80 billion neurons): unmatched in creativity, emotional intelligence, and adaptive learning.

LLM Terms
• Foundation LLM (Large Language Model): AI models trained on massive datasets to understand and generate human-like text.
• Transformer architecture: A neural network architecture that uses self-attention mechanisms to process sequences of data, foundational
for many LLMs.
• Attention Mechanism: A technique within transformers where each word in a sentence is weighted by its relevance to other words,
allowing the model to capture context more effectively.
• Parameters: Numeric values within the model that determine its learning capability (e.g., GPT-3 has 175 billion parameters).
• Token: Smallest unit of text (words, characters, or subwords) processed by the model.
• Context Window: The amount of text the model can consider at once, typically defined by the number of tokens. For example, Microsoft Copilot has a 4096-token context window when working with OpenAI GPT-4o, which lets it handle more text than GPT-3's 2048-token window.
• Pre-Training: The initial phase where a model is trained on a vast dataset to learn general language patterns.
• Fine-tuning: Adapting a pre-trained model to a specific task or dataset for improved performance.
• Prompt Engineering: Crafting specific input queries to elicit desired outputs from the model.
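To make the Token and Context Window terms above concrete, here is a minimal sketch using the Hugging Face GPT-2 tokenizer (chosen only because it is small and freely downloadable; the 1024-token limit it reports is GPT-2's own context window, not Copilot's or GPT-4o's):

from transformers import AutoTokenizer

# Small, freely available tokenizer used purely for illustration
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Prompt engineering means crafting the input so the model gives better answers."
token_ids = tokenizer.encode(text)                   # text -> token IDs
tokens = tokenizer.convert_ids_to_tokens(token_ids)  # IDs -> subword strings

print(tokens)      # the subword pieces the model actually sees
print(token_ids)   # the numeric IDs fed into the model
print(f"{len(token_ids)} tokens used out of a {tokenizer.model_max_length}-token context window")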
• Prompt engineering is a technique aimed at optimizing how queries are presented to the AI model to elicit better responses, rather than training users.

Word Weighting: TF-IDF vs Attention
• Traditional word weighting (e.g., TF-IDF) in NLP: assigns each word a static weight based on how often it occurs in a document and how rare it is across the corpus.
• Attention mechanism in LLMs: assigns dynamic weights based on the context of the word in relation to the other words in the sentence.
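A small sketch of the contrast, using scikit-learn's TfidfVectorizer (scikit-learn is an assumption here, not part of the chapter's toolkit). TF-IDF produces one frequency-based weight per word, with no notion of which neighbouring words give it its meaning:

from sklearn.feature_extraction.text import TfidfVectorizer

# Two short "documents"; note that "cool" appears in both, in very different contexts
docs = [
    "the climate is very cool",
    "the new phone looks very cool",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

for doc, row in zip(docs, tfidf.toarray()):
    weights = dict(zip(vectorizer.get_feature_names_out(), row))
    print(doc, "->", {w: round(s, 3) for w, s in weights.items() if s > 0})

# TF-IDF weights "cool" purely by frequency statistics; an attention mechanism
# would weight it differently depending on whether it sits next to "climate" or "phone".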
Fine-Tuning in LLMs vs Optimization in ML
• While fine-tuning is a form of optimization focused on adapting a pre-trained model to a specific task, optimization in machine learning encompasses a variety of techniques aimed at improving model performance by adjusting both hyperparameters and parameters.
• Fine-tuning a pre-trained BERT model for a specific NLP task is an example of optimization: you adjust the model's weights by continuing training on task-specific data, which optimizes the model for that particular task (as sketched below).
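A minimal sketch of what such fine-tuning could look like with the Hugging Face Trainer API. The IMDB dataset, the 2,000-example slice, and the hyperparameters are illustrative assumptions, not a prescribed recipe:

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# A small slice of a public sentiment dataset, split for training and evaluation
dataset = (load_dataset("imdb", split="train")
           .shuffle(seed=42)      # the raw split is ordered by label, so shuffle first
           .select(range(2000))   # small slice to keep the sketch cheap
           .train_test_split(test_size=0.1))

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Convert raw review text into fixed-length token IDs
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

# Start from pre-trained BERT weights and add a fresh 2-class classification head
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="bert-finetuned-sentiment",
                         num_train_epochs=1, per_device_train_batch_size=16)

# Continuing training on task-specific data adjusts (optimizes) the pre-trained weights
Trainer(model=model, args=args,
        train_dataset=dataset["train"], eval_dataset=dataset["test"]).train()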
About Parameters
In a simple neural network with one input layer, one hidden layer, and one output layer:
• If the input layer has 3 neurons, the hidden layer has 4 neurons, and the output layer has 1 neuron, the number of parameters would be:
• Weights: (3 inputs × 4 hidden neurons) + (4 hidden neurons × 1 output neuron)
• Biases: 4 biases for the hidden layer + 1 bias for the output layer
• Total Parameters: (3 × 4) + (4 × 1) + 4 + 1 = 21
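To sanity-check the arithmetic, the same network can be built in PyTorch and its parameters counted (PyTorch is an assumption here; any framework would give the same total):

import torch.nn as nn

# 3 inputs -> 4 hidden neurons -> 1 output, matching the example above
model = nn.Sequential(
    nn.Linear(3, 4),   # weights: 3 x 4 = 12, biases: 4
    nn.ReLU(),
    nn.Linear(4, 1),   # weights: 4 x 1 = 4,  biases: 1
)

total = sum(p.numel() for p in model.parameters())
print(total)  # 21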
FFN vs CNN vs RNN in LLMs

Network Type | Usage in LLMs | Primary Application
Feed-Forward Networks (FFNs) | Integral part of Transformer layers | Non-linear transformations in Transformers
Convolutional Neural Networks (CNNs) | Rarely used in LLMs | Image processing, local dependency capture in text
Recurrent Neural Networks (RNNs) | Not used in modern LLMs, replaced by Transformers | Sequential data processing, temporal dependencies
Aspect | RNNs | Transformers
Processing | Sequential (one step at a time) | Parallel (entire sequence at once)
Long-Range Dependencies | Struggles with very long sequences | Efficiently captures long-range dependencies
Training Speed | Slower due to sequential nature | Faster due to parallelization
Complexity | Simpler architecture | More complex due to self-attention
Resource Usage | Less computationally intensive | Requires more computational resources
Applications | Time series, smaller sequences | NLP tasks, large datasets, long sequences
Qns?
• What comes next after generative AI?
• Can autonomous AI be used for real-time issues on a spaceship carrying astronauts?
• What would be the output of sentiment analysis in classic NLP versus an LLM for the statement "the climate is very cool"?
• What are GANs, VAEs, and Transformers?
• Is there a weighting of words in a sentence in NLP?
• How does TF-IDF differ from the attention mechanism in LLMs?
• Is the self-attention mechanism part of the Transformer?
• What is the context window in an LLM?
• Is prompt engineering meant to train the user, or to help the model understand the query better?
• How many parameters do GPT-3 and GPT-4 have?
• Approximately how many "parameters" (neurons) can the human brain handle?
• Is GPT-4 better than the human brain?
• Are Transformers and the brain similar?
• Are neurons and parameters similar?
Imagine a Transformer as a highly specialized calculator for text, capable of processing language efficiently and
accurately. In contrast, the human brain is like a vast, adaptive network, capable of creativity, emotions, and
holistic understanding.
Transformers are powerful tools that help AI understand and generate human language by looking at all parts of a sentence
together and figuring out how they relate to each other.
Hallucination and Biased Output

Hallucinations (confidently presenting inaccurate or nonsensical information) - Example
• Input Prompt: "Tell me about the moon landing in 1969."
• Hallucinated Response: "In 1969, astronauts from the fictional
country of Zogonia landed on the moon. They discovered a hidden
civilization of moon people who communicated using light signals.
The Zogonian astronauts brought back samples of moon cheese,
which became a popular delicacy on Earth."
• Clearly, this response is completely fabricated and not based on any
factual information. It's important for AI users to verify the
information from trusted sources, especially when it comes to
historical events or scientific facts.
Biased answer
• Ethical Considerations: LLMs may inadvertently produce biased or harmful outputs, reflecting biases in their training data, for example generating gender-stereotypical job suggestions like "Women are better suited for nursing than engineering."
• "Which is better, Android or iOS?"
• Biased Response: "iOS is definitely superior to Android in every way. iPhones have better
build quality, more consistent updates, and overall, it's the only platform worth using.
Android phones are just too fragmented and laggy."
• This response is biased because it expresses a strong preference for one operating system
over the other without acknowledging any potential strengths of the Android platform. It's
important for AI to remain neutral and present balanced information, allowing users to form
their own opinions.
• LLMs also face other challenges, including data privacy and security concerns, bias and fairness issues, and high computational resource requirements.
Training an LLM
• Data Collection and Filtering:
• Large Text Data: The process begins with gathering a vast amount of unstructured text data from various sources such as books, articles, and web content.
• Quality Filtering: This data is then filtered to ensure high quality. Typically, 1-3% of the original tokens (words or characters) are filtered out to improve the training dataset's quality.
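As a rough illustration of quality filtering, the toy heuristics below keep documents that are long enough, mostly alphabetic, and not exact duplicates. Real LLM pipelines use far more elaborate filters; the thresholds here are arbitrary assumptions:

# Toy quality filter: length check, character-composition check, and exact-duplicate removal
raw_docs = [
    "A well-written article about climate science and its methods.",
    "click here!!! $$$",
    "A well-written article about climate science and its methods.",  # duplicate
]

seen = set()
filtered = []
for doc in raw_docs:
    alpha_ratio = sum(c.isalpha() or c.isspace() for c in doc) / max(len(doc), 1)
    if len(doc.split()) >= 5 and alpha_ratio > 0.8 and doc not in seen:
        filtered.append(doc)
        seen.add(doc)

print(f"kept {len(filtered)} of {len(raw_docs)} documents")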
• Embedding and Transformation:
• Embedding Layer + Transformer: The filtered text data is passed through an embedding layer and a transformer within the Large Language Model (LLM). This step converts the text into numerical representations (embeddings) that capture the semantic meaning of the words.
• Token Representation Example:
• Token String: words like 'The' and 'teacher'.
• Token ID: numerical IDs assigned to each token (e.g., 37 for 'The' and 3145 for 'teacher').
• Embedding / Vector Representation: each token is converted into a dense vector that captures its semantic meaning (e.g., [-0.0513, -0.0584, 0.0230, ...] for 'The').
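A minimal sketch of the token string -> token ID -> embedding chain using BERT from the Transformers library (the exact IDs and vector values depend on the tokenizer and checkpoint, so they will not match the example numbers above):

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The teacher explains the lesson", return_tensors="pt")
ids = inputs["input_ids"][0]

# Token string -> token ID
for tok, tok_id in zip(tokenizer.convert_ids_to_tokens(ids), ids.tolist()):
    print(f"{tok!r:12} -> id {tok_id}")

# Token ID -> dense embedding vector (showing the first few dimensions)
embeddings = model.get_input_embeddings()(inputs["input_ids"])
print(embeddings.shape)       # (1, sequence_length, 768) for bert-base
print(embeddings[0, 1, :5])   # embedding of the first real token after [CLS]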
• Model Types: Three types of models are used in LLMs:
• Encoder-Only Models: focus on understanding the input data.
• Encoder-Decoder Models: use both encoders and decoders to understand and generate data.
• Decoder-Only Models: focus on generating data from the given input.
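A brief sketch that loads one commonly used public checkpoint of each type through the Transformers Auto classes (the specific model names are just familiar examples, not the only options):

from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

# Encoder-only: understands input (classification, embeddings, ...)
encoder_only = AutoModel.from_pretrained("bert-base-uncased")

# Decoder-only: generates text continuing a prompt
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")

# Encoder-decoder: reads an input sequence and generates an output sequence
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

for name, m in [("encoder-only", encoder_only),
                ("decoder-only", decoder_only),
                ("encoder-decoder", encoder_decoder)]:
    print(name, type(m).__name__, sum(p.numel() for p in m.parameters()), "parameters")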
• Hardware: The training process involves the use of GPUs (Graphics Processing Units) for the large-scale parallel computation required.
Inside the Transformer Encoder
• Self-Attention Mechanism: The encoder applies the self-attention mechanism to the input vectors. This step helps the model weigh the importance of each word in the context of the entire input sequence and allows it to focus on relevant parts of the input when forming an understanding.
• Query, Key, and Value Vectors: Each word's embedding is transformed into three vectors: query (Q), key (K), and value (V). The attention scores are calculated as the dot product of the query and key vectors, which are then scaled and passed through a softmax function to obtain attention weights. These weights are used to compute a weighted sum of the value vectors.
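The same computation written out as a NumPy sketch; the random projection matrices stand in for the learned weights that produce Q, K, and V in a real model:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8          # 4 tokens, 8-dimensional embeddings

X = rng.normal(size=(seq_len, d_model))  # token embeddings
W_q = rng.normal(size=(d_model, d_k))    # learned projections (random here)
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v      # query, key, value vectors

scores = Q @ K.T / np.sqrt(d_k)          # scaled dot products
weights = softmax(scores, axis=-1)       # attention weights, each row sums to 1
output = weights @ V                     # weighted sum of the value vectors

print(weights.round(2))                  # how much each token attends to the others
print(output.shape)                      # (4, 8): one context-aware vector per token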
• Feed-Forward Neural Network: After self-attention, the output goes through a feed-forward neural network (FFN). This step helps in learning more complex patterns and relationships in the data. The FFN is applied to each position separately and identically.
• Add & Norm: The outputs of the self-attention and FFN layers are added to their inputs (residual connections) and then normalized. This helps in stabilizing the training process and preserving the learned information.
• Stacking Layers: The encoder consists of multiple layers, each containing the self-attention mechanism, feed-forward network, and normalization steps. Stacking these layers helps in learning hierarchical representations and capturing complex dependencies.
RNN, LSTM, GRU and Transformers
Transformers use a mechanism called self-attention that allows them to process all tokens in the input sequence (i.e., the full context) simultaneously. RNNs, in contrast, handle long-range dependencies poorly and offer no parallelism across the sequence.
Transformer Architecture
Why are RNNs replaced, but FFNs kept, in Transformers?
• The absence of RNNs and the presence of feed-forward networks (FFNs) in Transformers are key design choices that contribute to the efficiency and effectiveness of the model: self-attention takes over the role of mixing information across positions, while the FFN transforms each position independently, so no sequential recurrence is needed (see the sketch below).
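A small PyTorch sketch (an assumption, since the chapter has not fixed a framework) of the position-wise FFN that remains inside every Transformer layer: the same two linear layers applied at each position, with no recurrence:

import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """The two-layer feed-forward block used inside each Transformer layer."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),  # expand
            nn.ReLU(),
            nn.Linear(d_ff, d_model),  # project back
        )

    def forward(self, x):
        # x: (batch, seq_len, d_model); the same weights are applied at every position,
        # so all positions can be processed in parallel.
        return self.net(x)

ffn = PositionwiseFFN()
tokens = torch.randn(2, 10, 512)   # batch of 2 sequences, 10 tokens each
print(ffn(tokens).shape)           # torch.Size([2, 10, 512])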
Tutorial 1 (06.01.2025):
Using the Hugging Face Transformers library, create a Python script to perform sentiment analysis on a given set of movie reviews, analyze the sentiment of each review, and print the results with confidence scores.

Hugging Face Transformers
• Library: Hugging Face Transformers is a library that provides pre-trained models and tools for various NLP tasks. It includes implementations of many state-of-the-art models like BERT, GPT-2, RoBERTa, and more.
• Model Hub: Hugging Face also hosts a model hub where you can find and share pre-trained models.
• pipeline: a high-level API provided by the Hugging Face Transformers library. It simplifies the use of various NLP models for specific tasks, such as sentiment analysis, text generation, question answering, and more.

pip install transformers

from transformers import pipeline

# Load pre-trained sentiment-analysis pipeline
classifier = pipeline('sentiment-analysis')

# Sample text for sentiment analysis
texts = [
    "I love this product! It's absolutely amazing.",
    "I'm very disappointed with the service.",
    "The movie was okay, not great but not terrible either."
]

# Perform sentiment analysis on the sample text
results = classifier(texts)

# Print results
for text, result in zip(texts, results):
    print(f"Text: {text}\nSentiment: {result['label']}, Confidence: {result['score']:.4f}\n")
Lab Exercise
• vocab = {'I': 0, 'am': 1, 'a': 2, 'student': 3, '<pad>': 4}
• sentence = ['I', 'am', 'a', 'student']
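A minimal sketch of where this lab might start, assuming the task is to map the sentence to token IDs and pad it to a fixed length (the max_len of 6 is an arbitrary choice for illustration):

vocab = {'I': 0, 'am': 1, 'a': 2, 'student': 3, '<pad>': 4}
sentence = ['I', 'am', 'a', 'student']

max_len = 6  # arbitrary fixed length for this illustration

# Map each token to its ID, then pad the sequence to max_len with '<pad>' IDs
token_ids = [vocab[token] for token in sentence]
token_ids += [vocab['<pad>']] * (max_len - len(token_ids))

print(token_ids)  # [0, 1, 2, 3, 4, 4]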