
Addis Ababa Institute of Technology

AI project 2 Explanation

Transformer Based Language Model

Amanuel Ayalew ... ATE/3871/13
Abdurahman Mohammed ... ATE/8901/13
BPE.ipynb

Creating a Tokenizer with Byte Pair Encoding (BPE)

First, we start by setting up a special kind of tokenizer known as Byte Pair Encoding (BPE). This
method helps us efficiently break down text into tokens, especially when we have large amounts
of text. We import the necessary tools from a library specifically designed for tokenization tasks.
We set up our tokenizer to recognize spaces as breaks between words and define a few special
symbols that help in understanding the structure of sentences, like the beginning or end of a
sentence.
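A minimal sketch of such a setup with the Hugging Face `tokenizers` library follows; the exact special-token names are assumptions rather than the project's actual choices.

```python
# Sketch of a BPE tokenizer setup (special-token names are assumptions).
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace

# Start from an empty BPE model with an unknown-token placeholder.
tokenizer = Tokenizer(BPE(unk_token="<unk>"))

# Split on whitespace so merges never cross word boundaries.
tokenizer.pre_tokenizer = Whitespace()

# Special symbols that mark sentence structure (assumed names).
special_tokens = ["<unk>", "<s>", "</s>", "<pad>"]
```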

Preparing Data for Tokenizer Training

Next, we prepare the data needed to train our tokenizer. We do this by reading a file that
contains text, and then we create a new file where we only keep 5% of the original content. This
smaller file is easier and faster for training our tokenizer without needing to process the entire
large file.
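An illustrative sketch of this sampling step; the file names and the line-based 5% cut are assumptions.

```python
# Keep roughly 5% of a large text file for faster tokenizer training.
sample_ratio = 0.05

with open("openwebtext.txt", "r", encoding="utf-8") as src:
    lines = src.readlines()

keep = lines[: int(len(lines) * sample_ratio)]  # first 5% of the lines

with open("openwebtext_5pct.txt", "w", encoding="utf-8") as dst:
    dst.writelines(keep)
```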

Training and Saving the Tokenizer

After preparing the data, we proceed to train our tokenizer. We specify that we want our
tokenizer to learn from the smaller file we created and recognize up to 50,000 different tokens.
This step is crucial as it teaches the tokenizer the different pieces of text it should recognize.
Once training is complete, we save the tokenizer so we can use it later without needing to
retrain it from scratch.
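A sketch of this training and saving step; the 50,000-token vocabulary matches the description, while the file names and special tokens are placeholders.

```python
# Train a BPE tokenizer on the reduced file and save it for later reuse.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=50000,
                     special_tokens=["<unk>", "<s>", "</s>", "<pad>"])
tokenizer.train(files=["openwebtext_5pct.txt"], trainer=trainer)

# Persist the learned merges and vocabulary so training is a one-time cost.
tokenizer.save("bpe_tokenizer.json")
```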

Using the Tokenizer

We then demonstrate how to use our trained tokenizer. We load it from the file where we saved
it and show how it can convert a simple sentence like "Hello, world!" into tokens and then back
into the original text. This shows that our tokenizer can understand and process text as
expected.
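A sketch of this round trip, assuming the save path used above.

```python
# Load the saved tokenizer and round-trip a short sentence.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("bpe_tokenizer.json")

encoded = tokenizer.encode("Hello, world!")
print(encoded.tokens)   # sub-word pieces
print(encoded.ids)      # their integer ids

decoded = tokenizer.decode(encoded.ids)
print(decoded)          # should read back as the original sentence
```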

Exploring Pre-trained Tokenizers

Additionally, we explore how to use a tokenizer that someone else has already trained and
made available for others to use. This pre-trained tokenizer is part of a library called Hugging
Face, which offers many tools for text processing. We show how this tokenizer works in a similar
way to ours but uses a well-known and widely used setup.
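A sketch using GPT-2's tokenizer from the `transformers` library as the assumed example of a widely used, pre-trained setup.

```python
# Load a widely used pre-trained tokenizer and round-trip the same sentence.
from transformers import AutoTokenizer

pretrained = AutoTokenizer.from_pretrained("gpt2")

ids = pretrained.encode("Hello, world!")
print(ids)
print(pretrained.decode(ids))
```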

Timing Data Transfer to the GPU


Lastly, we touch upon a technical aspect important in speeding up computations—using a GPU
(Graphics Processing Unit), which is a specialized processor that handles data much faster than
a regular CPU (Central Processing Unit). We demonstrate how to measure the time it takes to
move data to the GPU, which is useful for understanding and optimizing performance in tasks
that require processing large amounts of data quickly.
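A sketch of this timing measurement; the synchronize call matters because GPU copies are asynchronous from the CPU's point of view.

```python
# Time a host-to-GPU copy of a large tensor of token ids.
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
data = torch.randint(0, 50000, (1_000_000,))

start = time.time()
data_gpu = data.to(device)
if device == "cuda":
    torch.cuda.synchronize()  # wait for the copy to actually finish
print(f"Transfer took {time.time() - start:.6f} seconds")
```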

bigram .ipynb

The notebook begins by setting up the PyTorch environment so that all subsequent operations are optimized for performance, whether they run on a CPU or a GPU. This setup is essential for leveraging the full capabilities of PyTorch when processing large datasets efficiently.

Next, the notebook delves into reading and processing a classic text, "The Wizard of Oz," to
extract unique characters and determine the size of the vocabulary. This step is critical as it
transitions the raw text into a form that can be numerically analyzed by machine learning
algorithms, highlighting the initial preprocessing steps required in any data science workflow.
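A sketch of this step; the file name is an assumption based on the description.

```python
# Read the text and extract its character-level vocabulary.
with open("wizard_of_oz.txt", "r", encoding="utf-8") as f:
    text = f.read()

chars = sorted(set(text))   # every unique character in the text
vocab_size = len(chars)
print(vocab_size)
```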

Further, the notebook introduces methods to encode and decode the text—transforming
characters into integers and vice versa. This encoding is crucial for preparing the data for
training, as machine learning models inherently work with numerical data.
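Continuing that sketch, the encode/decode mappings described here might look like this; the names `encode` and `decode` are the conventional ones, assumed here.

```python
# Map characters to integers and back, then define encode/decode helpers.
string_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_string = {i: ch for i, ch in enumerate(chars)}

encode = lambda s: [string_to_int[c] for c in s]
decode = lambda ids: "".join(int_to_string[i] for i in ids)

print(encode("Oz"))          # a short list of integers
print(decode(encode("Oz")))  # "Oz"
```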

An essential part of the notebook is the batch processing setup, which illustrates how to create
mini-batches of data. This method is a standard practice in training machine learning models,
allowing for efficient memory usage and faster processing by breaking down the dataset into
manageable parts.
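A minimal sketch of such a batching function, continuing from the encoding sketch above; the batch and block sizes are illustrative.

```python
# Sample random (input, target) blocks from the encoded text.
import torch

batch_size, block_size = 32, 8
data = torch.tensor(encode(text), dtype=torch.long)

def get_batch(data):
    # Pick random starting positions, then slice out consecutive blocks.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # shifted by one
    return x, y

xb, yb = get_batch(data)
print(xb.shape, yb.shape)  # (32, 8) each
```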

gpt-v1.ipynb (Generative Pre-Trained Transformer)

This Jupyter notebook is structured to illustrate the process of building and training a language model using PyTorch, a powerful machine learning library. Here's a step-by-step breakdown of the notebook's purpose and workflow:

Environment Setup and Model Parameters

The notebook begins by importing necessary Python modules, including PyTorch for model
building and training. It sets up a computing environment to use GPU if available, which
accelerates the computation needed for training neural networks. Key parameters such as batch
size, block size, number of iterations, and learning rate are defined. These parameters control
how the model will learn from the data, including how much data it processes at once and how
quickly it updates its learning.
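An illustrative parameter block of this kind might look as follows; the specific values are placeholders rather than the notebook's actual settings.

```python
# Device selection plus the key hyperparameters that control training.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

batch_size = 64        # sequences processed in parallel
block_size = 128       # context length the model sees at once
max_iters = 3000       # number of training iterations
learning_rate = 3e-4   # step size for the optimizer
eval_interval = 500    # how often to check the loss
```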
Data Preparation

Character data is loaded from a vocabulary file to create a mapping of characters to integers
and vice versa. This is crucial for converting text data into a numerical format that a machine
learning model can process, as models do not understand text directly.

Data Loading and Batch Processing

Functions are defined to load data in small chunks using memory mapping, which allows the
notebook to handle large text files efficiently by only loading parts of the file into memory as
needed. Another function prepares mini-batches of this data for the training process, organizing
the data into small sets that the model will learn from iteratively.
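A sketch of this memory-mapped loading, under the assumption that the data lives in a large plain-text file; the file handling details and sizes are illustrative.

```python
# Read a random window of a large file via mmap, then build a mini-batch.
import mmap
import random
import torch

block_size, batch_size = 128, 64

def get_random_chunk(filename, chunk_size):
    with open(filename, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = random.randint(0, len(mm) - chunk_size)
            # Only this slice is pulled into memory, not the whole file.
            chunk = mm[start:start + chunk_size].decode("utf-8", errors="ignore")
    return chunk

def get_batch(filename, encode):
    text = get_random_chunk(filename, block_size * batch_size * 4)
    data = torch.tensor(encode(text), dtype=torch.long)
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y
```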

Model Building

The notebook constructs a neural network model inspired by the architecture of GPT
(Generative Pre-trained Transformer). It includes multiple components such as self-attention
heads, multi-head attention, feedforward networks, and layer normalization. These components
work together to process and learn from the text data effectively.
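A condensed sketch of these components (a single attention head plus a residual transformer block), with illustrative dimensions rather than the notebook's actual settings.

```python
# One causal self-attention head and a pre-norm transformer block.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, n_head, block_size, dropout = 128, 4, 128, 0.2

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5      # scaled dot-product
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # causal mask
        wei = self.dropout(F.softmax(wei, dim=-1))
        return wei @ v

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        head_size = n_embd // n_head
        self.heads = nn.ModuleList([Head(head_size) for _ in range(n_head)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.ReLU(), nn.Linear(4 * n_embd, n_embd)
        )
        self.ln1, self.ln2 = nn.LayerNorm(n_embd), nn.LayerNorm(n_embd)

    def forward(self, x):
        # Multi-head attention with a residual connection, then feed-forward.
        attn = torch.cat([h(self.ln1(x)) for h in self.heads], dim=-1)
        x = x + self.proj(attn)
        x = x + self.ffwd(self.ln2(x))
        return x
```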

Training Loop

A training loop is implemented where the model learns from the data through a series of
iterations. During each iteration, the model processes a batch of data, calculates how well it is
performing (loss), and adjusts itself to improve. Periodic evaluation checks how well the model
is learning and provides feedback on its progress.
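A minimal sketch of such a loop, assuming `model`, `get_batch`, and the hyperparameters are defined as in the surrounding notebook and that the model's forward pass returns `(logits, loss)`.

```python
# Training loop: forward pass, loss, backward pass, parameter update.
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for step in range(max_iters):
    xb, yb = get_batch("train")            # one mini-batch of inputs/targets
    logits, loss = model(xb, yb)           # assumed to return (logits, loss)

    optimizer.zero_grad(set_to_none=True)  # reset gradients
    loss.backward()                        # backpropagate
    optimizer.step()                       # update the weights

    if step % eval_interval == 0:
        print(f"step {step}: train loss {loss.item():.4f}")
```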

Text Generation

Finally, the model's ability to generate text is demonstrated. The model uses the learned
patterns in the text to predict and generate new text sequences based on a given prompt. This
showcases the practical application of the model as a text generator.
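A sketch of this autoregressive generation step; it assumes the model's forward pass returns `(logits, loss)` as in the training sketch above.

```python
# Generate text token by token: predict, sample, append, repeat.
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]        # crop to the context window
        logits, _ = model(idx_cond)            # predictions for every position
        logits = logits[:, -1, :]              # keep only the last time step
        probs = F.softmax(logits, dim=-1)      # convert to probabilities
        idx_next = torch.multinomial(probs, num_samples=1)  # sample one token
        idx = torch.cat((idx, idx_next), dim=1)
    return idx

# Example: start from a single token and generate 100 more.
# context = torch.zeros((1, 1), dtype=torch.long)
# print(decode(generate(model, context, 100, block_size)[0].tolist()))
```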
data extract.py

This Python script is designed to process and extract text from compressed files, specifically
those in `.xz` format, commonly found in datasets like OpenWebText. It focuses on handling
large datasets efficiently by leveraging concurrent processing, and it aims to construct a
vocabulary from the unique characters found in the dataset. Here’s a detailed breakdown of its
workflow:

Define Helper Functions:

Several helper functions are defined:

process_file(args): This function takes a tuple containing the directory, filename, output file path,
and a vocabulary set. It reads text from a compressed `.xz` file, writes it to an output file, and
collects unique characters from the text.

xz_files_in_dir(directory): Lists all .xz files in a given directory, ensuring that only files (and not
directories) are included.

process_files_in_parallel(files, folder_path, output_file): Manages the parallel processing of files. It creates a set for the vocabulary and uses a `ProcessPoolExecutor` to process files concurrently, updating the vocabulary with unique characters from each file.

File Processing:

The script identifies all .xz files in the specified folder_path. It then splits these files into training
and validation sets, usually with a 90/10 split. Before processing, it ensures that the output files
are empty by opening them in write mode and immediately closing them.
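A sketch of that split and reset step, continuing from the helper sketch above; the output file names are placeholders.

```python
# Split the .xz files 90/10 into train/validation and reset the output files.
files = xz_files_in_dir(folder_path)          # folder_path assumed defined
split_index = int(len(files) * 0.9)
files_train, files_val = files[:split_index], files[split_index:]

# Truncate the output files so each run starts from empty files.
for path in ("output_train.txt", "output_val.txt"):
    open(path, "w").close()
```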

Parallel Processing of Files:

Training Files: The script processes the training files in parallel, extracting text and updating the
training vocabulary.

Validation Files: Similarly, it processes the validation files and updates the validation vocabulary.

Combine and Save Vocabulary:

After processing, the script combines the vocabularies from the training and validation datasets.
It sorts the combined vocabulary and writes each character to a vocabulary file (`vocab.txt`).
This file will contain every unique character found across the entire dataset.
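A sketch of this final step, assuming `vocab_train` and `vocab_val` are the character sets returned by the parallel processing above.

```python
# Merge the two character sets and write one character per line to vocab.txt.
vocab = sorted(vocab_train | vocab_val)
with open("vocab.txt", "w", encoding="utf-8") as f:
    for ch in vocab:
        f.write(ch + "\n")
```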

Chatbot.py (inspired by ChatGPT)


Loading and Preparing Vocabulary:

- Vocabulary data is loaded from a file, creating mappings (`string_to_int` and `int_to_string`)
to convert characters to integers and back. These mappings are crucial for processing text data,
allowing the model to handle it numerically.

Defining the Model Architecture:

The model includes several layers typical of a transformer architecture:

Self-Attention Heads: Allow the model to weigh the importance of different words relative to
others in a sentence.

Multi-Head Attention: Combines multiple attention mechanisms to improve context understanding.

Feedforward Neural Network: Processes the output from the attention mechanisms to derive the
next set of outputs.

Normalization and Dropout: Used to stabilize and regularize the learning process.

Each component is designed to capture different aspects of the language, enabling the model to
generate coherent and contextually appropriate text.

Loading and Interacting with the Model:

The script loads pre-trained model parameters, allowing the model to generate text without additional training. It then enters an interactive loop where users can input prompts, and the model generates text based on these prompts. This showcases the model's ability to apply learned language patterns to generate new sentences that follow logically from the given text.

Text Generation:

The generation process involves the model predicting the next character in a sequence
repeatedly until it has built a full response. The script uses a softmax function to convert the
model outputs into probabilities and selects the next character based on these probabilities.
train.py

Data Handling

Vocabulary and Encoding: The script reads a vocabulary file to map each character to a unique
integer, facilitating the model's handling of text data as tensors, which are required for training
neural networks.

Data Loading Using Memory Mapping: It employs memory-mapped files to efficiently load large
text datasets, ensuring that only the necessary parts of the data are loaded into memory,
reducing the overall memory footprint.

Model Definition

Model Components: The script defines several key components of the Transformer architecture:

Self-Attention Heads: These allow the model to weigh different parts of the input differently,
enhancing its ability to focus on relevant information.

Multi-Head Attention: It aggregates information from multiple attention heads, capturing various
aspects of the context.

Feedforward Network: Each transformer block contains a feedforward network that processes
the output from the attention heads.

Layer Normalization and Dropout: These components are crucial for stabilizing training and preventing overfitting.

Model Assembly: The model comprises multiple layers of the transformer block, each
contributing to the model's ability to understand and generate language.

Training Process

Batch Processing: The script defines functions to fetch batches of processed data, suitable for
training.

Loss Calculation and Optimization: Utilizes a custom training loop with a specified optimizer
(AdamW), calculates the loss, and adjusts the model parameters based on the gradient
information.

Evaluation: Periodically evaluates the model's performance on validation data to monitor progress and adjust training as needed.
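A sketch of such a periodic evaluation helper, assuming `model` and `get_batch` are defined as elsewhere in the script and that the forward pass returns `(logits, loss)`.

```python
# Average the loss over several batches from each split, without gradients.
import torch

@torch.no_grad()
def estimate_loss(model, eval_iters=200):
    model.eval()                       # switch off dropout for evaluation
    out = {}
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            _, loss = model(xb, yb)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()                      # back to training mode
    return out
```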

Model Interaction and Persistence

Text Generation: After training, the model can generate text based on a given prompt, demonstrating its capability to apply learned language patterns.

Model Saving: The trained model parameters are saved to a file, allowing for later reuse without retraining from scratch.
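A minimal sketch of the save/reload step; the checkpoint file name is a placeholder.

```python
# Persist the trained weights to disk.
import torch

torch.save(model.state_dict(), "model.pt")

# Later, the same architecture can be rebuilt and the weights reloaded:
# model.load_state_dict(torch.load("model.pt", map_location=device))
```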
References

GitHub repository link:
https://github.com/Amanuel-Ayal3w/Transformer_based_language_model

Andrej Karpathy's full playlist was a huge inspiration and a great resource.

The "Attention Is All You Need" paper, and this video explanation of it:
https://youtu.be/K9j5GrH71iU?si=_dhiNgI4EVFrd4VH

We also used some of Andrej Karpathy's code in this project.

The data is from kaggle.com.
