Code Explanation
AI project 2 Explanation
Amanuel Ayalew...ATE/3871/13
Abdurahman Mohammed...ATE/8901/13
BPE.ipynb
First, we start by setting up a special kind of tokenizer known as Byte Pair Encoding (BPE). This
method helps us efficiently break down text into tokens, especially when we have large amounts
of text. We import the necessary tools from a library specifically designed for tokenization tasks.
We set up our tokenizer to recognize spaces as breaks between words and define a few special
symbols that help in understanding the structure of sentences, like the beginning or end of a
sentence.
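A minimal sketch of this setup, assuming the Hugging Face tokenizers library is the one being used (the special token names are illustrative):

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace

    # An (initially untrained) BPE model with an unknown-token placeholder.
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

    # Treat whitespace as the boundary between words before BPE merges apply.
    tokenizer.pre_tokenizer = Whitespace()

    # Special symbols that mark sentence structure (names are illustrative).
    special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]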
Next, we prepare the data needed to train our tokenizer. We do this by reading a file that
contains text, and then we create a new file where we only keep 5% of the original content. This
smaller file is easier and faster for training our tokenizer without needing to process the entire
large file.
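One way the reduced file could be produced (the paths and the exact sampling strategy are illustrative; the notebook may sample differently):

    import random

    # Keep roughly 5% of the lines from the full corpus.
    with open("openwebtext.txt", "r", encoding="utf-8") as src, \
         open("openwebtext_5pct.txt", "w", encoding="utf-8") as dst:
        for line in src:
            if random.random() < 0.05:
                dst.write(line)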
After preparing the data, we proceed to train our tokenizer. We specify that we want our
tokenizer to learn from the smaller file we created and recognize up to 50,000 different tokens.
This step is crucial as it teaches the tokenizer the different pieces of text it should recognize.
Once training is complete, we save the tokenizer so we can use it later without needing to
retrain it from scratch.
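A sketch of the training and saving step, continuing from the setup above (file names are illustrative):

    from tokenizers.trainers import BpeTrainer

    # Learn up to 50,000 tokens from the reduced corpus.
    trainer = BpeTrainer(vocab_size=50000, special_tokens=special_tokens)
    tokenizer.train(files=["openwebtext_5pct.txt"], trainer=trainer)

    # Persist the trained tokenizer so it can be reloaded without retraining.
    tokenizer.save("bpe_tokenizer.json")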
We then demonstrate how to use our trained tokenizer. We load it from the file where we saved
it and show how it can convert a simple sentence like "Hello, world!" into tokens and then back
into the original text. This shows that our tokenizer can understand and process text as
expected.
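Reloading the saved tokenizer and round-tripping a sentence might look like this:

    from tokenizers import Tokenizer

    tokenizer = Tokenizer.from_file("bpe_tokenizer.json")
    encoding = tokenizer.encode("Hello, world!")
    print(encoding.tokens)                 # the sub-word tokens
    print(tokenizer.decode(encoding.ids))  # back to (roughly) the original text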
Additionally, we explore how to use a tokenizer that someone else has already trained and made available for others to use. This pre-trained tokenizer is distributed through Hugging Face, whose libraries offer many tools for text processing. We show that it works in much the same way as ours but uses a well-known, widely used setup.
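A brief sketch of loading such a pre-trained tokenizer, assuming the GPT-2 tokenizer from the Transformers library (the notebook may use a different checkpoint):

    from transformers import AutoTokenizer

    pretrained = AutoTokenizer.from_pretrained("gpt2")
    ids = pretrained.encode("Hello, world!")
    print(ids)                     # token ids under the pre-trained vocabulary
    print(pretrained.decode(ids))  # and back to text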
bigram.ipynb
The notebook begins by setting up the PyTorch environment so that all subsequent operations run on the best hardware available, whether CPU or GPU. This setup is essential for leveraging the full capabilities of PyTorch when processing large datasets efficiently.
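A minimal version of this setup:

    import torch

    # Use the GPU when one is available, otherwise fall back to the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"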
Next, the notebook delves into reading and processing a classic text, "The Wizard of Oz," to
extract unique characters and determine the size of the vocabulary. This step is critical as it
transitions the raw text into a form that can be numerically analyzed by machine learning
algorithms, highlighting the initial preprocessing steps required in any data science workflow.
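A sketch of this step (the file name is illustrative):

    # Read the corpus and collect its distinct characters.
    with open("wizard_of_oz.txt", "r", encoding="utf-8") as f:
        text = f.read()

    chars = sorted(set(text))
    vocab_size = len(chars)
    print(vocab_size)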
Further, the notebook introduces methods to encode and decode the text—transforming
characters into integers and vice versa. This encoding is crucial for preparing the data for
training, as machine learning models inherently work with numerical data.
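Continuing the sketch above, the mappings and the encode/decode helpers might look like:

    # Character-to-integer and integer-to-character lookup tables.
    string_to_int = {ch: i for i, ch in enumerate(chars)}
    int_to_string = {i: ch for i, ch in enumerate(chars)}

    encode = lambda s: [string_to_int[c] for c in s]             # text -> integers
    decode = lambda ids: "".join(int_to_string[i] for i in ids)  # integers -> text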
An essential part of the notebook is the batch processing setup, which illustrates how to create
mini-batches of data. This method is a standard practice in training machine learning models,
allowing for efficient memory usage and faster processing by breaking down the dataset into
manageable parts.
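A sketch of such a batching function, assuming the encoded text is already a one-dimensional tensor of integers:

    import torch

    def get_batch(data, block_size, batch_size):
        # Pick random starting offsets and cut out input/target pairs; the targets
        # are the inputs shifted one character to the right.
        ix = torch.randint(len(data) - block_size, (batch_size,))
        x = torch.stack([data[i:i + block_size] for i in ix])
        y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
        return x, y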
The notebook then walks through the full process of building and training a language model using PyTorch, a powerful machine learning library. Here's a step-by-step breakdown of that workflow:
The notebook begins by importing necessary Python modules, including PyTorch for model
building and training. It sets up a computing environment to use GPU if available, which
accelerates the computation needed for training neural networks. Key parameters such as batch
size, block size, number of iterations, and learning rate are defined. These parameters control
how the model will learn from the data, including how much data it processes at once and how
quickly it updates its learning.
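Representative values for these parameters (the notebook's exact numbers may differ):

    batch_size = 32       # sequences processed in parallel
    block_size = 128      # context length in characters
    max_iters = 3000      # total training iterations
    learning_rate = 3e-4  # optimizer step size
    eval_iters = 100      # iterations between loss reports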
Data Preparation
Character data is loaded from a vocabulary file to create a mapping of characters to integers
and vice versa. This is crucial for converting text data into a numerical format that a machine
learning model can process, as models do not understand text directly.
Functions are defined to load data in small chunks using memory mapping, which allows the
notebook to handle large text files efficiently by only loading parts of the file into memory as
needed. Another function prepares mini-batches of this data for the training process, organizing
the data into small sets that the model will learn from iteratively.
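A sketch of the memory-mapped loader, reusing the encode helper from the character mappings above (the file name is illustrative, and the notebook may pick chunks differently):

    import mmap
    import random
    import torch

    def get_random_chunk(filename, block_size, batch_size):
        # Memory-map the file so only the bytes actually read are pulled into RAM.
        with open(filename, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                start = random.randint(0, len(mm) - block_size * batch_size)
                chunk = mm[start:start + block_size * batch_size]
        text = chunk.decode("utf-8", errors="ignore")
        # encode() is the character-to-integer mapping defined earlier.
        return torch.tensor(encode(text), dtype=torch.long)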
Model Building
The notebook constructs a neural network model inspired by the architecture of GPT
(Generative Pre-trained Transformer). It includes multiple components such as self-attention
heads, multi-head attention, feedforward networks, and layer normalization. These components
work together to process and learn from the text data effectively.
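A condensed sketch of these pieces, assuming hyperparameters such as n_embd, n_head, and block_size as introduced earlier (the notebook's actual classes carry more detail):

    import torch
    import torch.nn as nn
    from torch.nn import functional as F

    class Head(nn.Module):
        """One head of causal self-attention."""
        def __init__(self, n_embd, head_size, block_size, dropout=0.2):
            super().__init__()
            self.key = nn.Linear(n_embd, head_size, bias=False)
            self.query = nn.Linear(n_embd, head_size, bias=False)
            self.value = nn.Linear(n_embd, head_size, bias=False)
            self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
            self.dropout = nn.Dropout(dropout)

        def forward(self, x):
            B, T, C = x.shape
            k, q, v = self.key(x), self.query(x), self.value(x)
            # Scaled dot-product attention with a causal (lower-triangular) mask.
            wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
            wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
            wei = self.dropout(F.softmax(wei, dim=-1))
            return wei @ v

    class Block(nn.Module):
        """Transformer block: multi-head attention followed by a feed-forward net."""
        def __init__(self, n_embd, n_head, block_size):
            super().__init__()
            head_size = n_embd // n_head
            self.heads = nn.ModuleList(
                [Head(n_embd, head_size, block_size) for _ in range(n_head)])
            self.proj = nn.Linear(n_embd, n_embd)
            self.ffwd = nn.Sequential(
                nn.Linear(n_embd, 4 * n_embd), nn.ReLU(), nn.Linear(4 * n_embd, n_embd))
            self.ln1 = nn.LayerNorm(n_embd)
            self.ln2 = nn.LayerNorm(n_embd)

        def forward(self, x):
            # Residual connections around the attention and feed-forward sub-layers.
            attn = torch.cat([h(self.ln1(x)) for h in self.heads], dim=-1)
            x = x + self.proj(attn)
            x = x + self.ffwd(self.ln2(x))
            return x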
Training Loop
A training loop is implemented where the model learns from the data through a series of
iterations. During each iteration, the model processes a batch of data, calculates how well it is
performing (loss), and adjusts itself to improve. Periodic evaluation checks how well the model
is learning and provides feedback on its progress.
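Stripped to its essentials, and assuming a model whose forward pass returns (logits, loss) plus the get_batch helper and hyperparameters sketched earlier, the loop might look like:

    import torch

    # model and train_data are assumed to exist as described above.
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

    for step in range(max_iters):
        xb, yb = get_batch(train_data, block_size, batch_size)
        logits, loss = model(xb, yb)           # forward pass and loss
        optimizer.zero_grad(set_to_none=True)  # reset gradients
        loss.backward()                        # backpropagate
        optimizer.step()                       # update the parameters
        if step % eval_iters == 0:
            print(f"step {step}: loss {loss.item():.4f}")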
Text Generation
Finally, the model's ability to generate text is demonstrated. The model uses the learned
patterns in the text to predict and generate new text sequences based on a given prompt. This
showcases the practical application of the model as a text generator.
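Prompting the trained model might then look like the following, reusing the encode/decode helpers and device from the sketches above (generate is assumed to extend the given context one token at a time):

    context = torch.tensor([encode("Hello")], dtype=torch.long, device=device)
    print(decode(model.generate(context, max_new_tokens=200)[0].tolist()))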
data extract.py
This Python script is designed to process and extract text from compressed files, specifically
those in `.xz` format, commonly found in datasets like OpenWebText. It focuses on handling
large datasets efficiently by leveraging concurrent processing, and it aims to construct a
vocabulary from the unique characters found in the dataset. Here’s a detailed breakdown of its
workflow:
process_file(args): This function takes a tuple containing the directory, filename, output file path,
and a vocabulary set. It reads text from a compressed `.xz` file, writes it to an output file, and
collects unique characters from the text.
xz_files_in_dir(directory): Lists all .xz files in a given directory, ensuring that only files (and not
directories) are included.
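A sketch of these two helpers, assuming lzma for .xz decompression (the real script may coordinate the output file and the vocabulary set differently):

    import lzma
    import os

    def process_file(args):
        directory, filename, output_file, vocab = args
        # Decompress one .xz archive, append its text to the output file,
        # and record every distinct character that appears in it.
        path = os.path.join(directory, filename)
        with lzma.open(path, "rt", encoding="utf-8") as infile:
            text = infile.read()
        with open(output_file, "a", encoding="utf-8") as outfile:
            outfile.write(text)
        vocab.update(set(text))
        return vocab

    def xz_files_in_dir(directory):
        # Only regular files ending in .xz, not sub-directories.
        return [f for f in os.listdir(directory)
                if f.endswith(".xz") and os.path.isfile(os.path.join(directory, f))]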
File Processing:
The script identifies all .xz files in the specified folder_path. It then splits these files into training
and validation sets, usually with a 90/10 split. Before processing, it ensures that the output files
are empty by opening them in write mode and immediately closing them.
Training Files: The script processes the training files in parallel, extracting text and updating the
training vocabulary.
Validation Files: Similarly, it processes the validation files and updates the validation vocabulary.
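Continuing that sketch, the parallel pass over the training files might be organized as follows, assuming concurrent.futures for the worker pool (folder and file names are illustrative):

    from concurrent.futures import ProcessPoolExecutor

    folder_path = "openwebtext"                   # directory containing the .xz files
    files = xz_files_in_dir(folder_path)
    split = int(0.9 * len(files))                 # 90/10 train/validation split
    train_files, val_files = files[:split], files[split:]

    vocab_train = set()
    with ProcessPoolExecutor() as pool:
        args = [(folder_path, f, "output_train.txt", set()) for f in train_files]
        for result in pool.map(process_file, args):
            vocab_train.update(result)            # merge the characters each worker found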
After processing, the script combines the vocabularies from the training and validation datasets.
It sorts the combined vocabulary and writes each character to a vocabulary file (`vocab.txt`).
This file will contain every unique character found across the entire dataset.
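A sketch of that final step, assuming vocab_train and vocab_val were collected as above:

    # Merge the two character sets and write one character per line to vocab.txt.
    vocab = sorted(vocab_train | vocab_val)
    with open("vocab.txt", "w", encoding="utf-8") as vfile:
        for ch in vocab:
            vfile.write(ch + "\n")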
The repository also includes a script for interacting with the trained model. Vocabulary data is loaded from a file, creating mappings (`string_to_int` and `int_to_string`) to convert characters to integers and back. These mappings are crucial for processing text data, allowing the model to handle it numerically.
Self-Attention Heads: Allow the model to weigh the importance of different words relative to
others in a sentence.
Feedforward Neural Network: Processes the output from the attention mechanisms to derive the
next set of outputs.
Normalization and Dropout: Used to stabilize and regularize the learning process.
Each component is designed to capture different aspects of the language, enabling the model to
generate coherent and contextually appropriate text.
Loading and Interacting with the Model: The script loads pre-trained model parameters, allowing the model to generate text without additional training. It then enters an interactive loop where users can input prompts, and the model generates text based on these prompts. This showcases the model's ability to apply learned language patterns to generate new sentences that follow logically from the given text.
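A sketch of that interactive loop, assuming the encode/decode mappings and a generate method as described:

    # Read a prompt, generate a completion, print it, and repeat.
    while True:
        prompt = input("Prompt:\n")
        context = torch.tensor([encode(prompt)], dtype=torch.long, device=device)
        completion = model.generate(context, max_new_tokens=150)
        print(decode(completion[0].tolist()))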
Text Generation:
The generation process involves the model predicting the next character in a sequence
repeatedly until it has built a full response. The script uses a softmax function to convert the
model outputs into probabilities and selects the next character based on these probabilities.
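The core of that generation step might look like this sketch, assuming the model's forward pass returns logits of shape (batch, time, vocab_size):

    import torch
    from torch.nn import functional as F

    @torch.no_grad()
    def generate(model, idx, max_new_tokens, block_size):
        # idx is a (batch, time) tensor of token indices already in the sequence.
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]                     # crop to the context window
            logits, _ = model(idx_cond)
            logits = logits[:, -1, :]                           # look at the last time step
            probs = F.softmax(logits, dim=-1)                   # logits -> probabilities
            idx_next = torch.multinomial(probs, num_samples=1)  # sample the next character
            idx = torch.cat((idx, idx_next), dim=1)             # append it and continue
        return idx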
train.py
Data Handling
Vocabulary and Encoding: The script reads a vocabulary file to map each character to a unique
integer, facilitating the model's handling of text data as tensors, which are required for training
neural networks.
Data Loading Using Memory Mapping: It employs memory-mapped files to efficiently load large
text datasets, ensuring that only the necessary parts of the data are loaded into memory,
reducing the overall memory footprint.
Model Definition
Model Components: The script defines several key components of the Transformer architecture:
Self-Attention Heads: These allow the model to weigh different parts of the input differently,
enhancing its ability to focus on relevant information.
Multi-Head Attention: It aggregates information from multiple attention heads, capturing various
aspects of the context.
Feedforward Network: Each transformer block contains a feedforward network that processes
the output from the attention heads.
Layer Normalization and Dropout: These components are crucial for stabilizing training and preventing overfitting.
Model Assembly: The model comprises multiple layers of the transformer block, each
contributing to the model's ability to understand and generate language.
Training Process
Batch Processing: The script defines functions to fetch batches of processed data, suitable for
training.
Loss Calculation and Optimization: Utilizes a custom training loop with a specified optimizer
(AdamW), calculates the loss, and adjusts the model parameters based on the gradient
information.
Model Interaction and Persistence
Text Generation: After training, the model can generate text based on a given prompt, demonstrating its capability to apply learned language patterns.
Model Saving: The trained model parameters are saved to a file, allowing for later reuse without retraining from scratch.
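A sketch of the save/load step using PyTorch's own serialization (the script may use a different mechanism, and the file name is illustrative):

    # Save the learned parameters after training ...
    torch.save(model.state_dict(), "model-01.pt")

    # ... and restore them later without retraining.
    model.load_state_dict(torch.load("model-01.pt", map_location=device))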
References
https://github.com/Amanuel-Ayal3w/Transformer_based_language_model
Andrej Karpathy's full playlist was a huge inspiration and a great resource.
"Attention Is All You Need" (the paper), along with this video explanation: https://youtu.be/K9j5GrH71iU?si=_dhiNgI4EVFrd4VH