Hugging Face Repo Project Report
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE & ENGINEERING
by
CHINMAYI H [4VV20CS024]
Dr. K. Paramesha, Professor, Dept. of CS & E, VVCE, Mysuru
Mr. Prakash M, CEO, NeuroFlares, Mysuru
2023-24
Vidyavardhaka College of Engineering,
Gokulam 3rd Stage, Mysuru – 570002,
Department of Computer Science and Engineering,
CERTIFICATE
This is to certify that the internship report entitled “Exploring the Depths: Diving into GPTQ Llama’s capabilities in Artificial Intelligence” has been successfully completed by Chinmayi H (4VV20CS024).
Signature of the internship guide
Signature of the external guide
Signature of the HoD
1)
2)
ACKNOWLEDGEMENT
The Internship would not have been possible without the guidance, assistance, and
suggestions of many individuals. I would like to express my deep sense of gratitude and
indebtedness to each one who has helped me to make this Internship a success.
I heartily thank my beloved Principal, Dr. B Sadashive Gowda, for his wholehearted support.
I thank the Head of the Department of Computer Science and Engineering, VVCE, for their constant encouragement and support.
I sincerely thank my internship guide, Dr. K. Paramesha, Professor, Department of Computer Science and Engineering, VVCE, for his encouragement and advice.
I gratefully thank my external guide, Mr. Prakash M, CEO, NeuroFlares Pvt Ltd, for his guidance and support throughout the internship.
In the end, I extend my gratitude towards my family members and friends for their constant support and encouragement.
CHINMAYI H (4VV20CS024)
ABSTRACT
In this internship we worked on a Hugging Face repository named TheBloke, and the project we explored is Vicuna 7B quantized using GPTQ to 4 bits with a group size of 128. This project aims to run these GPTQ models in text-generation-webui. The quantization process reduces the precision of the model's parameters while minimizing the loss of performance. The quantized model is provided in two files, with “vicuna-7B-GPTQ-4bit-128g.safetensors” being the recommended choice. These files were created using the latest GPTQ code and require the latest GPTQ-for-LLaMa to be integrated into text-generation-webui for usage. To utilize the quantized model for text generation tasks, we cloned the GPTQ-for-LLaMa and text-generation-webui repositories, created symbolic links between them, and installed the quantized model into the web UI's models directory.
The next project was to learn how to train a new language model for Esperanto from scratch using Transformers and Tokenizers. This project entails training a "small" language model, comprising 84 million parameters, on the constructed language Esperanto, with a focus on fine-tuning it for part-of-speech tagging.
LIST OF FIGURES
Figure 3.1 The chats with the Vicuna model quantized using GPTQ-for-LLaMa
Figure 3.4 The model could also analyse conversations between two or more people
Chapter 1
ABOUT THE COMPANY
Mission: Our mission is to provide quality assurance to clients, with maximum effort driven towards customer satisfaction, and to create more employment opportunities.
Vision: Our vision is to innovate and automate industrial systems for better-quality products, increased productivity, and efficient use of materials. We believe in providing unique solutions in the most efficient way, with a robust and structured methodology and a gradual evolution from a hard-work to a smart-work culture.
NeuroFlares has a dream of evolving into a global IT company, ensuring that the solutions being delivered include best practice in IT within the chosen area of technology. NeuroFlares India has utilized its expertise and skills to keep pace with the surging need for technological breakthroughs in society, and has accomplished this with absolute dedication and perseverance.
It has provided solutions in the private and public sectors, ranging from small-scale industries to large businesses, and has also provided solutions to banks, manufacturing companies, entertainment industries, etc. The company is known for automation of applications, quality of software, and on-time delivery.
NeuroFlares was started with the aim of helping customers and businesses to provide unique and improved services without impacting quality, at a cost-effective price. It is a one-point engineering consulting company that can act as a guide in any project, with a focus and aim of cost saving without compromising quality. Its services include:
• Web services
• Gaming applications
• Native Android applications
• Native desktop applications
• VR/MR applications
• 3D modelling and FEM automation
• UI/UX design
• Artificial Intelligence with Image Processing
Web Services
NeuroFlares offers web solutions and services to help customers reach a wider customer base. The web is a new and different medium for communication and requires a different viewpoint and skill set to use it in the most effective way. Clients need web consulting to get more return on the investment in their websites, and the company helps them arrive at the most effective solution through:
• Website Development
• Web Multimedia
• Web Promotion
• Web hosting
• E-commerce
Gaming Applications
NeuroFlares offers game development for PC, Android, and the web, including background music composition and 2D/3D asset creation. Its Game Corner will entertain users with the best gaming experience possible, providing arcade games, shooting games, strategy games, sports games, adventure games, etc. The company builds everything from simple 2D games to complex multiplayer applications, targeting not only children but also youths, adults, and anyone interested in leisure and games.
Native Android and Desktop Applications
The company develops native Android apps and native desktop apps based on client needs, for PCs, mobiles, and tablets, with Artificial Intelligence support. Its prior work includes, but is not limited to, automatic app updating and downloading from the server to maintain app privacy.
The company has offered many kinds of apps: map-related apps, finance apps, e-commerce and shopping-cart apps, retail and fashion apps, education apps, travel apps, food and restaurant apps, real-estate and home-automation apps, and many more, and has delivered more than 30 projects.
Chapter 2
TRAINING PROGRAM
The internship duration was six months, from 16 August 2023 to 16 February 2024. In these six months we explored three major projects, all from Hugging Face repositories.
The first and major one is the quantization of the Vicuna 7B model using GPTQ-for-LLaMa. The second one was about exploring and learning how to train a model from scratch; the model was an Esperanto model built using tokenizers and transformers. Drawing on similarities with that work, we also tried to fine-tune a model. That is the third project, where we tried to understand how to fine-tune an already pretrained model.
Vicuna 7B
To utilize the quantized model for text generation tasks, integration with the text-
generation-webui is necessary. The process involves cloning the GPTQ-for-LLaMa and
text-generation-webui repositories, creating symbolic links between them, and installing
the quantized model into the web UI's models directory. Additionally, the dependencies for
both repositories must be installed to ensure seamless operation.
The quantized Vicuna 7B model offers a more resource-efficient alternative to the original
model, suitable for deployment in environments with limited computational resources. By
following the provided instructions and integrating the model into the text-generation-
webui, users can leverage advanced AI capabilities for text generation tasks while
optimizing resource utilization. This project contributes to making sophisticated AI
technology more accessible and applicable in real-world scenarios.
EsperBERTo
In this second project we are training a "small" language model, comprising 84 million
parameters, on the constructed language Esperanto, with a focus on fine-tuning it for part-
of-speech tagging. The process begins with acquiring a corpus of Esperanto text, combining
portions of the OSCAR corpus from INRIA with the Leipzig Corpora Collection to create
a training dataset of 3 GB. Subsequently, a byte-level Byte-pair encoding (BPE) tokenizer,
akin to GPT-2, is trained with a vocabulary size of 52,000, featuring special tokens similar
to RoBERTa to facilitate effective language modeling tasks. The language model is then
trained from scratch on a masked language modeling (MLM) task using the transformers
library, with custom hyperparameters optimized for training efficiency. Evaluation of the
model's performance involves utilizing the FillMaskPipeline to assess its ability to predict
masked tokens, including more complex prompts to gauge its semantic understanding.
Following successful training, the model undergoes fine-tuning for part-of-speech tagging
using annotated Esperanto POS tags in the CoNLL-2003 format. Finally, the trained model
is shared with the community, accompanied by a comprehensive README.md model card
detailing its description, training parameters, evaluation results, intended uses, and
limitations, thereby contributing to the broader NLP community and showcasing the
versatility of advanced language modeling techniques.
Fine-Tuning a Pretrained Model
The project aims to demonstrate the process of fine-tuning pretrained language models using the Transformers library, focusing on three approaches: PyTorch with the Trainer API, TensorFlow with Keras, and a native PyTorch training loop. It begins by emphasizing the
advantages of using pretrained models, such as reducing computation costs and carbon
footprint, before delving into the fine-tuning process. The tutorial walks through the steps
of preparing a dataset, specifically the Yelp Reviews dataset, for training. It then proceeds
to explain how to fine-tune a pretrained model using each of the mentioned frameworks.
For PyTorch, it showcases the use of the Trainer class provided by Transformers, which
streamlines the training process with various options for hyperparameters and training
features.
In TensorFlow with Keras, it demonstrates how to load, compile, and fit a model using the
Keras API, as well as how to use the prepare_tf_dataset method to convert datasets into a
format compatible with Keras. Lastly, for native PyTorch, it outlines how to manually post-
process tokenized datasets, create DataLoaders, set up optimizer and learning rate
scheduler, and implement the training loop. Throughout the tutorial, the focus remains on
fine-tuning pretrained models for sequence classification tasks, offering insights into best
practices and optimizations for each framework.
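As a minimal sketch of the Trainer-based approach described above, following the steps in the Hugging Face fine-tuning tutorial (the model checkpoint, subset sizes, and output directory here are illustrative):

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Load the Yelp Reviews dataset and tokenize it
dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

tokenized = dataset.map(tokenize, batched=True)
# Use small subsets so the example runs quickly
small_train = tokenized["train"].shuffle(seed=42).select(range(1000))
small_eval = tokenized["test"].shuffle(seed=42).select(range(1000))

# Yelp Reviews has five star-rating classes
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

training_args = TrainingArguments(output_dir="test_trainer", num_train_epochs=1)
trainer = Trainer(model=model, args=training_args,
                  train_dataset=small_train, eval_dataset=small_eval)
trainer.train()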
Chapter 3
LEARNING EXPERIENCES
Vicuna 7B
Later on, we were introduced to the models in Generative AI, one of which is the Vicuna 7B model. It is a chat assistant trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT.
The primary use of Vicuna is research on large language models and chatbots. The primary
intended users of the model are researchers and hobbyists in natural language processing,
machine learning, and artificial intelligence. Vicuna v0 is fine-tuned from LLaMA with
supervised instruction fine-tuning. The training data is around 70K conversations collected
from ShareGPT. Vicuna is evaluated with standard benchmarks, human preference, and
LLM-as-a-judge.
Llama (Large Language Model Meta AI) is a family of autoregressive large language
models (LLMs), released by Meta AI starting in February 2023.
Four model sizes were trained for the first version of LLaMA: 7, 13, 33, and 65 billion
parameters. LLaMA's developers reported that the 13B parameter model's performance on
most NLP benchmarks exceeded that of the much larger GPT-3 (with 175B parameters)
and that the largest model was competitive with state of the art models such
as PaLM and Chinchilla. Whereas the most powerful LLMs had generally been accessible only through limited APIs (if at all), Meta released LLaMA's model weights to the research community under a non-commercial license. Within a week of LLaMA's release, its weights were leaked to the public on 4chan via BitTorrent.
In July 2023, Meta released several models as Llama 2, with 7, 13, and 70 billion parameters. Llama 2 is a suite of pretrained language models, while Llama 2-Chat is a fine-tuned chatbot trained using reinforcement learning from human feedback.
GPTQ for LLaMa: Quantization reduces the precision of a model's weights, for example from 16-bit floating point to 4-bit integers, so that the model occupies far less memory. "GPTQ for LLaMa" is about applying this quantization process specifically to LLaMA-family models. It is a way of making these large language models more lightweight and easier to work with for researchers and developers within the LLaMa community, while still maintaining their ability to understand and generate human-like text. Through these six months we were introduced to many models, and we quantized the Vicuna model using GPTQ-for-LLaMa.
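As a rough illustration of what 4-bit quantization with a group size of 128 means for a weight matrix, the following is a simplified round-to-nearest sketch in Python; it is not the actual GPTQ algorithm, which additionally compensates the quantization error layer by layer, and the function and array names are illustrative.

import numpy as np

def quantize_rtn_4bit(weights, group_size=128):
    # Toy round-to-nearest 4-bit quantization with one scale per group of weights.
    # Assumes the number of weights is divisible by group_size.
    flat = weights.astype(np.float32).reshape(-1, group_size)
    scales = np.abs(flat).max(axis=1, keepdims=True) / 7.0  # map each group to the signed 4-bit range [-8, 7]
    scales[scales == 0] = 1.0                               # avoid division by zero for all-zero groups
    q = np.clip(np.round(flat / scales), -8, 7)             # 4-bit integer codes (what would be stored)
    return (q * scales).reshape(weights.shape)              # dequantized weights used at inference time

w = np.random.randn(256, 128).astype(np.float32)
w_q = quantize_rtn_4bit(w)
print("mean absolute quantization error:", float(np.abs(w - w_q).mean()))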
The Esperanto model is a language model trained specifically on the constructed language
Esperanto. It is designed to understand and generate text in Esperanto, utilizing
advanced natural language processing techniques. The model is typically fine-tuned for
specific tasks such as text generation, classification, or translation within the context of
Esperanto language data. By training on Esperanto text corpora and incorporating linguistic
features specific to Esperanto, such as its regular grammar and vocabulary, the model
becomes proficient in processing and generating Esperanto text, contributing to various
NLP applications within the Esperanto-speaking community.
Fine Tuning
In fine-tuning, the parameters of a pretrained model are adjusted for a specific task while the knowledge and representations learned during the initial pretraining phase are retained. This process allows the model to learn task-specific patterns and features from the new dataset, improving its performance on the target task; here the Transformers library is used for it. By leveraging pretrained models trained on vast amounts
of text data, researchers can significantly reduce the computational resources required to
train models from scratch while achieving state-of-the-art performance. The project
showcases three different frameworks—PyTorch, TensorFlow with Keras, and native
PyTorch—and provides step-by-step guidance on preparing datasets, fine-tuning models,
and evaluating their performance. Fine-tuning pretrained models enables researchers to
adapt them to specific tasks or domains, making them more efficient and effective for real-
world applications.
3. Install Dependencies:
Ensure that you have all the necessary dependencies installed for both GPTQ-for-LLaMa
and text-generation-webui. The instructions to install dependencies are:
2. Install Required Python Packages: Run the following command to install the
required Python packages specified in the requirements.txt file:
➢ pip install -r requirements.txt
5. Launch the User Interface (UI): Run the following command to start the UI:
➢ cd text-generation-webui
➢ python server.py --model vicuna-7B-GPTQ-4bit-128g --wbits 4 --groupsize 128
6. Interact with the UI: Once the server is running, you can access the UI by opening a web browser and navigating to the address printed in the terminal (typically a local Gradio address such as http://localhost:7860). Use the UI to input text prompts and generate responses using the quantized Vicuna 7B model.
Model Files: Two model files are provided, one of which is vicuna-7B-GPTQ-4bit-
128g.safetensors, representing the quantized Vicuna 7B model in a newer safetensors
format with improved file security.
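As a small illustration of how such a safetensors checkpoint can be inspected, the following sketch lists the tensors it contains (it assumes the safetensors and torch packages are installed and that the file has been downloaded into the current directory):

from safetensors import safe_open

# Open the quantized checkpoint and list the stored tensors without loading the whole file at once
with safe_open("vicuna-7B-GPTQ-4bit-128g.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        print(name, tuple(tensor.shape), tensor.dtype)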
Triton and CUDA Branches: Depending on the operating system and requirements, users
can choose to use either the Triton or CUDA branch of GPTQ-for-LLaMa.
RESULTS
Figure 3.1 The chats with the Vicuna model quantized using GPTQ-for-LLaMa
Figure 3.2 The GPTQ model's capacity to provide long answers
Figure 3.4 The model could also analyse conversations between two or more people
The aim of this project is to train a language model specifically for Esperanto, a constructed
language designed to be easy to learn. This model, named EsperBERTo, will be trained
from scratch using a dataset of Esperanto text.
Step 1: Dataset Collection: Gather a large corpus of text written in Esperanto from various sources, including news articles, literature, and Wikipedia. Concatenate multiple datasets to create a comprehensive training corpus.
Step 2: Tokenization: Before training the model, the text needs to be converted into a format the model can understand. A technique called byte-level Byte-pair encoding (BPE) is used to tokenize the text into smaller units called tokens. The tokenizers library is used to train the tokenizer with a vocabulary size of 52,000 and special tokens similar to RoBERTa, as in the snippet below.
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# Collect the Esperanto text files (adjust the path to the corpus location)
paths = [str(x) for x in Path("./eo_data/").glob("**/*.txt")]
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()
# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
# Save the vocabulary and merges files to disk
tokenizer.save_model(".", "esperberto")
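Once trained, the tokenizer can be reloaded from the two files it saved (esperberto-vocab.json and esperberto-merges.txt) and used to encode Esperanto text, for example:

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer("./esperberto-vocab.json", "./esperberto-merges.txt")
# "Mi estas studento." = "I am a student."
encoded = tokenizer.encode("Mi estas studento.")
print(encoded.tokens)
print(encoded.ids)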
Step 3: Language Model Training: Implement a subclass of Dataset to load data from the tokenized text files. Train the language model using the run_language_modeling.py script from the transformers library, using a RoBERTa-like model architecture and training on a Masked Language Modeling (MLM) task, as sketched below.
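The equivalent training step expressed with the Trainer API is sketched below; the file paths, tokenizer files, and hyperparameters are illustrative (the project itself used the run_language_modeling.py script with comparable settings):

from transformers import (RobertaConfig, RobertaTokenizerFast, RobertaForMaskedLM,
                          LineByLineTextDataset, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# A small RoBERTa-like configuration (roughly 84M parameters) with the 52,000-token vocabulary
config = RobertaConfig(vocab_size=52_000, max_position_embeddings=514,
                       num_attention_heads=12, num_hidden_layers=6, type_vocab_size=1)
tokenizer = RobertaTokenizerFast(vocab_file="./esperberto-vocab.json",
                                 merges_file="./esperberto-merges.txt",
                                 model_max_length=512)
model = RobertaForMaskedLM(config=config)

# One Esperanto sentence per line; the collator applies dynamic masking for the MLM objective
dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="./oscar.eo.txt", block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(output_dir="./EsperBERTo", num_train_epochs=1,
                                  per_device_train_batch_size=64, save_steps=10_000)
trainer = Trainer(model=model, args=training_args,
                  data_collator=data_collator, train_dataset=dataset)
trainer.train()
trainer.save_model("./EsperBERTo")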
Step 4: Model Evaluation: Use the trained model to fill in masked words in sentences and check the quality of its predictions, for example with the fill-mask pipeline shown below.
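A minimal check with the fill-mask pipeline, assuming the trained model and tokenizer files have been saved under the illustrative ./EsperBERTo directory used above:

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="./EsperBERTo", tokenizer="./EsperBERTo")
# "La suno <mask>." = "The sun <mask>." -- the top predictions should be plausible Esperanto words
for prediction in fill_mask("La suno <mask>."):
    print(prediction["token_str"], prediction["score"])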
Step 5: Fine-Tuning for Part-of-Speech Tagging: Fine-tune the pretrained model on annotated Esperanto POS tags in the CoNLL-2003 format.
Step 6: Sharing the Model: Upload the trained model to the Hugging Face model hub for sharing with the community.
• Python Programming: You'll use Python to execute the code snippets provided
in the repository and interact with various libraries and frameworks.
• Training Language Models: You'll learn how to train a language model from
scratch using frameworks like Transformers and tokenizers. This includes
understanding the concepts of tokenization, model architecture, hyperparameter
tuning, and training pipelines.
• Data Preprocessing: Preprocessing text data involves tasks like cleaning,
tokenization, and formatting. You'll gain experience in preparing datasets for
training language models.
• Model Evaluation: You'll evaluate the trained models using metrics like loss
values, performance on masked token prediction tasks, and downstream task
performance (e.g., part-of-speech tagging).
• Hyperparameter Tuning: Experimenting with different sets of hyperparameters
allows you to understand their impact on model performance and training dynamics.
• Tensorboard Usage: Monitoring training progress and visualizing model
performance using Tensorboard helps in gaining insights into the training process.
• Version Control: You'll learn how to use Git for version control, including
cloning repositories and managing branches.
• Installation and Dependency Management: You'll gain experience in installing
and managing dependencies for Python packages and libraries required for running
GPTQ-for-LLaMa and text-generation-webui.
• Model Management: You'll learn how to manage model files and directories,
including linking models to the text-generation-webui repository.
• Command-Line Interface: You'll use the command line to execute commands
for launching the text-generation web UI and specifying model configurations.
• Problem-Solving: You may encounter challenges during the installation or setup
process, requiring problem-solving skills to troubleshoot and resolve issues.
• Understanding Model Formats: You'll gain an understanding of model file
formats like safetensors and how they are used in GPTQ-for-LLaMa.
CUDA Installation (for the CUDA branch): When using the CUDA branch of GPTQ-for-LLaMa, setting up CUDA and ensuring compatibility with the GPU was complex.
Issues: Compatibility issues arose between CUDA versions and GPU drivers, as well as dependencies on specific CUDA versions.
Solution: I followed the installation instructions provided in the repository and ensured that I had the correct CUDA version and compatible GPU drivers installed.
The commands used to clone GPTQ-for-LLaMa, clone text-generation-webui, and install GPTQ into the UI are outlined below. On Windows the Triton branch of GPTQ-for-LLaMa cannot be used, so we used the CUDA branch:
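A typical sequence, assuming the usual repository locations and that the CUDA toolkit is already installed (the exact commands used during the internship may have differed slightly):
➢ git clone https://github.com/oobabooga/text-generation-webui
➢ mkdir text-generation-webui/repositories
➢ cd text-generation-webui/repositories
➢ git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b cuda
➢ cd GPTQ-for-LLaMa
➢ python setup_cuda.py install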
Chapter 4
CONCLUSION
GPTQ-for-LLaMa: This repository focuses on quantizing language models using the
GPTQ framework. It provides tools and utilities for quantization, enabling users to
convert large language models into more efficient versions suitable for deployment on
resource-constrained devices. The repository includes detailed documentation and
examples for quantizing models, along with instructions for integration into text-
generation-webui.
text-generation-webui: This repository hosts a user interface for text generation, allowing
users to interact with language models in a web-based environment. It provides a platform
for deploying and utilizing quantized language models generated using the GPTQ-for-
LLaMa framework. The repository includes features for model management, input/output
customization, and real-time text generation.
EsperBERTo: This repository demonstrates the process of training a language model from
scratch for the Esperanto language. It outlines the steps involved in dataset selection,
tokenizer training, language model training, and fine-tuning for downstream tasks such as
Part-of-Speech tagging. The repository includes code snippets, configuration files, and
instructions for training an Esperanto-specific language model using the Hugging Face
transformers library.
OSCAR Corpus: This repository contains the Esperanto portion of the OSCAR corpus
from INRIA. The OSCAR corpus is a large multilingual dataset obtained from Common
Crawl dumps of the web. The Esperanto subset of this corpus serves as a valuable
resource for training language models and conducting NLP research in the Esperanto
language.
Each of these repositories plays a crucial role in the process of language model
development, training, and deployment, contributing to advancements in natural language
processing and facilitating research in linguistic diversity and accessibility.
REFERENCES
1. NeuroFlares. www.neuroflares.com
2. The Vicuna Team, "Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality", LMSYS Org, 30 March 2023. https://lmsys.org/blog/2023-03-30-vicuna/
3. "Introducing LLaMA: A foundational, 65-billion-parameter large language model". Meta AI, 24 February 2023.
4. Vincent, James (7 November 2019). "OpenAI has published the text-generating AI it said was too dangerous to share". The Verge. Archived from the original on 11 June 2020. Retrieved 19 December 2020.
5. Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (1 September 2014). "Neural Machine Translation by Jointly Learning to Align and Translate".
6. Vincent, James (14 February 2019). "OpenAI's new multitalented AI writes, translates, and slanders". The Verge. Archived from the original on 18 December 2020. Retrieved 19 December 2020.
7. Chaumond, Julien, "How to train a new language model from scratch using Transformers and Tokenizers", Hugging Face blog, 14 February 2020. https://huggingface.co/blog/how-to-train
8. TheBloke, vicuna-7B-v0-GPTQ. https://huggingface.co/TheBloke/vicuna-7B-v0-GPTQ
9. "Fine-tune a pretrained model", Hugging Face Transformers documentation. https://huggingface.co/docs/transformers/training