
GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo


Yuvanesh Anand ([email protected])
Zach Nussbaum ([email protected])
Brandon Duderstadt ([email protected])
Benjamin Schmidt ([email protected])
Andriy Mulyar ([email protected])

Abstract

This preliminary technical report describes the development of GPT4All, a chatbot trained over a massive curated corpus of assistant interactions including word problems, story descriptions, multi-turn dialogue, and code. We openly release the collected data, data curation procedure, training code, and final model weights to promote open research and reproducibility. Additionally, we release quantized 4-bit versions of the model, allowing virtually anyone to run the model on CPU.

1 Data Collection and Curation

Figure 1: TSNE visualization of the candidate training data (Red: Stackoverflow, Orange: chip2, Blue: P3). The large blue balls (e.g. indicated by the red arrow) are highly homogeneous prompt-response pairs.

We collected roughly one million prompt-response pairs using the GPT-3.5-Turbo OpenAI API between March 20, 2023 and March 26, 2023. To do this, we first gathered a diverse sample of questions/prompts by leveraging three publicly available datasets:

• The unified chip2 subset of LAION OIG.

• Coding questions with a random sub-sample of Stackoverflow Questions.

• Instruction-tuning with a sub-sample of Bigscience/P3.
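To make the collection step concrete, below is a minimal sketch of querying GPT-3.5-Turbo for a batch of prompts. It assumes the openai Python client as it existed in early 2023 (the ChatCompletion interface) and an API key in the environment; the example prompts and output file are placeholders, not the authors' actual pipeline.

```python
# Hedged sketch of the prompt-response collection step (not the authors' code).
# Assumes the early-2023 openai Python client and OPENAI_API_KEY set in the environment.
import json
import openai

# Placeholder prompts; the paper draws prompts from chip2, Stackoverflow, and P3.
prompts = [
    "Write a short story about a robot learning to paint.",
    "How do I reverse a linked list in Python?",
]

with open("prompt_response_pairs.jsonl", "w") as f:
    for prompt in prompts:
        try:
            resp = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}],
            )
            response = resp["choices"][0]["message"]["content"]
        except Exception:
            response = None  # failed generations are dropped later during curation
        f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
```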
We chose to dedicate substantial attention to data preparation and curation based on commentary in the Stanford Alpaca project (Taori et al., 2023). Upon collection of the initial dataset of prompt-generation pairs, we loaded data into Atlas for data curation and cleaning. With Atlas, we removed all examples where GPT-3.5-Turbo failed to respond to prompts and produced malformed output. This reduced our total number of examples to 806,199 high-quality prompt-generation pairs. Next, we decided to remove the entire Bigscience/P3 subset from the final training dataset due to its very low output diversity; P3 contains many homogeneous prompts which produce short and homogeneous responses from GPT-3.5-Turbo. This exclusion produces a final subset containing 437,605 prompt-generation pairs, which is visualized in Figure 2. You can interactively explore the dataset at each stage of cleaning at the following links:

• Cleaned with P3

• Cleaned without P3 (Final Training Dataset)
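The cleaning described above amounts to a filter over the collected pairs. The authors performed this step in Atlas; the sketch below is a generic pandas illustration of the same criteria, with assumed column names (response, source) rather than the actual data schema.

```python
# Illustrative filtering sketch; column names are assumptions, and the paper
# performed this curation inside Atlas rather than with pandas.
import pandas as pd

pairs = pd.read_json("prompt_response_pairs.jsonl", lines=True)  # ~1M collected pairs

# Drop examples where GPT-3.5-Turbo failed to respond or produced malformed output
# (806,199 pairs remained at this stage in the report).
cleaned = pairs[pairs["response"].notna() & (pairs["response"].str.strip() != "")]

# Drop the Bigscience/P3 subset due to its low output diversity
# (437,605 prompt-generation pairs remained).
final = cleaned[cleaned["source"] != "p3"]

final.to_json("gpt4all_training_data.jsonl", orient="records", lines=True)
```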
Figure 2: The final training data was curated to ensure a diverse distribution of prompt topics and model responses. (a) TSNE visualization of the final training data, colored by extracted topic. (b) Zoomed-in view of Figure 2a; the region displayed contains generations related to personal health and wellness.

2 Model Training

We train several models finetuned from an instance of LLaMA 7B (Touvron et al., 2023). The model associated with our initial public release is trained with LoRA (Hu et al., 2021) on the 437,605 post-processed examples for four epochs. Detailed model hyper-parameters and training code can be found in the associated repository and model training log.
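As a rough sketch of this setup, the snippet below attaches LoRA adapters to a LLaMA 7B checkpoint using the Hugging Face peft library. The checkpoint path and the LoRA hyper-parameters (rank, alpha, target modules) are illustrative assumptions, not values taken from the authors' training log; consult the released repository for the actual configuration.

```python
# Hedged sketch of a LoRA finetuning setup (not the authors' configuration).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

base = "path/to/llama-7b"  # placeholder path to converted LLaMA 7B weights
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, as in Hu et al. (2021)
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights are updated

# The wrapped model is then trained as a standard causal LM on the 437,605
# curated prompt-generation pairs for four epochs (e.g. with transformers.Trainer).
```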

2.1 Reproducibility

We release all data (including unused P3 generations), training code, and model weights for the community to build upon. Please check the Git repository for the most up-to-date data, training details and checkpoints.

2.2 Costs
We were able to produce these models with about four days' work, $800 in GPU costs (rented from Lambda Labs and Paperspace) including several failed trains, and $500 in OpenAI API spend. Our released model, gpt4all-lora, can be trained in about eight hours on a Lambda Labs DGX A100 8x 80GB for a total cost of $100.

Figure 3: Model Perplexities. Lower is better. Our models achieve stochastically lower ground truth perplexities than alpaca-lora.

3 Evaluation

We perform a preliminary evaluation of our model using the human evaluation data from the Self-Instruct paper (Wang et al., 2022). We report the ground truth perplexity of our model against what is, to our knowledge, the best openly available alpaca-lora model, provided by user chainyo on huggingface. We find that all models have very large perplexities on a small number of tasks, and report perplexities clipped to a maximum of 100. Models finetuned on this collected dataset exhibit much lower perplexity in the Self-Instruct evaluation compared to Alpaca. This evaluation is in no way exhaustive and further evaluation work remains. We welcome the reader to run the model locally on CPU (see Github for files) and get a qualitative sense of what it can do.
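To make the metric concrete, here is a minimal sketch of the ground-truth perplexity of a reference response given its prompt under a causal language model, with per-example values clipped at 100 as described above. The model path is a placeholder and this is not the authors' evaluation harness.

```python
# Hedged sketch of ground-truth perplexity for one prompt-response pair
# (not the authors' evaluation code; the model path is a placeholder).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/gpt4all-lora"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
model.eval()

def ground_truth_perplexity(prompt: str, reference: str, clip: float = 100.0) -> float:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + reference, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # score only the reference tokens
    with torch.no_grad():
        loss = model(input_ids=full_ids, labels=labels).loss  # mean NLL over reference tokens
    return min(math.exp(loss.item()), clip)  # clip very large perplexities, as in the report

print(ground_truth_perplexity("Name three primary colors.", " Red, yellow, and blue."))
```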
4 Use Considerations

The authors release data and training details in hopes that it will accelerate open LLM research, particularly in the domains of alignment and interpretability. GPT4All model weights and data are intended and licensed only for research purposes and any commercial use is prohibited. GPT4All is based on LLaMA, which has a non-commercial license. The assistant data is gathered from OpenAI's GPT-3.5-Turbo, whose terms of use prohibit developing models that compete commercially with OpenAI.

References
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-Instruct: Aligning language model with self generated instructions.
