2023 GPT4All Technical Report
1 Data Collection and Curation

We collected roughly one million prompt-response pairs using the GPT-3.5-Turbo OpenAI API between March 20, 2023 and March 26, 2023. To do this, we first gathered a diverse sample of questions/prompts by leveraging three publicly available datasets (a sketch of the collection step follows the list):

• The unified chip2 subset of LAION OIG.

• Coding questions with a random sub-sample of Stackoverflow Questions.

• Instruction-tuning with a sub-sample of Bigscience/P3.
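The report does not include collection code in this section; as a rough, non-authoritative illustration of the step described above, the sketch below queries GPT-3.5-Turbo for each gathered prompt and stores the prompt-response pairs. It assumes the pre-1.0 openai Python client that was current in March 2023; the output path and example prompt are placeholders.

# Sketch only (not the authors' pipeline): query GPT-3.5-Turbo for each
# gathered prompt and append prompt-response pairs to a JSONL file.
# Assumes the pre-1.0 `openai` Python client; the file path is a placeholder.
import json
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def collect_pairs(prompts, out_path="pairs.jsonl"):
    with open(out_path, "a", encoding="utf-8") as f:
        for prompt in prompts:
            try:
                resp = openai.ChatCompletion.create(
                    model="gpt-3.5-turbo",
                    messages=[{"role": "user", "content": prompt}],
                )
                answer = resp["choices"][0]["message"]["content"]
            except Exception:
                continue  # failed or malformed generations are dropped during curation
            f.write(json.dumps({"prompt": prompt, "response": answer}) + "\n")

# Toy usage; the real prompts come from the chip2, Stackoverflow, and P3 sub-samples.
collect_pairs(["How do I reverse a list in Python?"])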
Figure 1: TSNE visualization of the candidate training data (Red: Stackoverflow, Orange: chip2, Blue: P3). The large blue balls (e.g. indicated by the red arrow) are highly homogeneous prompt-response pairs.

We removed the entire P3 subset from the final training dataset due to its low output diversity; P3 contains many homogeneous prompts which produce short and homogeneous responses from GPT-3.5-Turbo. This exclusion produces a final subset containing 437,605 prompt-generation pairs, which is visualized in Figure 2. You can interactively explore the dataset at each stage of cleaning at the following links:
Figure 2: The final training data was curated to ensure a diverse distribution of prompt topics and model responses.
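Figures 1 and 2 are TSNE-style projections of the candidate and curated data. The sketch below shows one way such a projection could be produced with scikit-learn; the sentence-transformers embedding model and plotting details are illustrative assumptions, not the report's actual curation tooling.

# Sketch of a TSNE projection over prompt embeddings, in the spirit of Figures 1-2.
# The embedding model ("all-MiniLM-L6-v2") is an assumption used for illustration.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer

def plot_tsne(prompts, sources):
    """prompts: list of strings; sources: matching dataset labels (e.g. 'chip2')."""
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(prompts)
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)
    for source in sorted(set(sources)):
        idx = [i for i, s in enumerate(sources) if s == source]
        plt.scatter(coords[idx, 0], coords[idx, 1], s=4, label=source)
    plt.legend()
    plt.title("Candidate training data (TSNE)")
    plt.show()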
2.1 Reproducibility

We release all data (including unused P3 generations), training code, and model weights for the community to build upon. Please check the Git repository for the most up-to-date data, training details, and checkpoints.
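As the model name gpt4all-lora suggests, the released weights are a LoRA-style adapter (Hu et al., 2021) on top of a base causal language model. The following sketch shows one way such a checkpoint might be loaded with Hugging Face transformers and peft; the path strings are placeholders, so consult the Git repository for the actual artifact names.

# Sketch of loading a released LoRA adapter onto a base causal LM with peft.
# Both identifiers below are placeholders; see the Git repository for the real ones.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "path/to/base-model"          # placeholder: base model weights
ADAPTER = "path/to/gpt4all-lora-adapter"   # placeholder: released LoRA weights

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
model = PeftModel.from_pretrained(base, ADAPTER)  # attach the LoRA weights

prompt = "Explain what a LoRA adapter is."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))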
2.2 Costs

We were able to produce these models with about four days of work, $800 in GPU costs (rented from Lambda Labs and Paperspace), including several failed trains, and $500 in OpenAI API spend. Our released model, gpt4all-lora, can be trained in about eight hours on a Lambda Labs DGX A100 8x 80GB for a total cost of $100.

Figure 3: Model Perplexities. Lower is better. Our models achieve stochastically lower ground truth perplexities than alpaca-lora.
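Figure 3 compares ground-truth perplexities, where lower is better. Perplexity is conventionally computed as the exponential of the mean per-token cross-entropy on held-out target text; the sketch below illustrates that computation for a Hugging Face causal LM (the model/tokenizer objects and evaluation texts are assumed inputs, not the report's evaluation harness).

# Sketch of ground-truth perplexity: exp(mean per-token negative log-likelihood).
# `model`/`tokenizer` are any Hugging Face causal LM pair; `texts` is held-out data.
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, texts):
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt")
        # With labels == input_ids, the model returns the mean cross-entropy as .loss
        out = model(input_ids=enc["input_ids"], labels=enc["input_ids"])
        n_tokens = enc["input_ids"].numel() - 1  # loss is averaged over shifted targets
        total_nll += out.loss.item() * n_tokens
        total_tokens += n_tokens
    return math.exp(total_nll / total_tokens)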
References
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-Instruct: Aligning language model with self generated instructions.