0% found this document useful (0 votes)

108 views5 pages

Open AI Cookbook

This document outlines a notebook for preprocessing and analyzing a chat dataset intended for fine-tuning a chat model, specifically gpt-3.5-turbo. It includes steps for data loading, format validation, token counting, and cost estimation for fine-tuning. The analysis identifies potential issues in the dataset and provides statistical insights into message and token counts, ultimately estimating the training costs based on the dataset's token usage.

Uploaded by

officialvasquez

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as TXT, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

108 views5 pages

Open AI Cookbook

Uploaded by

officialvasquez

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as TXT, PDF, TXT or read online on Scribd

You are on page 1/ 5

Toggle theme

Data preparation and analysis for chat model fine-tuning

Michael WuVerified

Simón FishmanVerified
Michael Wu(OpenAI), Simón Fishman(OpenAI)
Aug 22, 2023
This notebook serves as a tool to preprocess and analyze the chat dataset used for
fine-tuning a chat model. It checks for format errors, provides basic statistics,
and estimates token counts for fine-tuning costs. The method shown here corresponds
to the current fine-tuning method for gpt-3.5-turbo. See legacy fine-tuning for
models like babbage-002 and davinci-002.

import json
import tiktoken # for token counting
import numpy as np
from collections import defaultdict

Data loading
We first load the chat dataset from an example JSONL file.

data_path = "data/toy_chat_fine_tuning.jsonl"

# Load the dataset

with open(data_path, 'r', encoding='utf-8') as f:
dataset = [json.loads(line) for line in f]

# Initial dataset stats

print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
print(message)

Num examples: 5
First example:
{'role': 'system', 'content': 'You are a happy assistant that puts a positive spin
on everything.'}
{'role': 'user', 'content': 'I fell off my bike today.'}
{'role': 'assistant', 'content': "It's great that you're getting exercise
outdoors!"}
Format validation
We can perform a variety of error checks to validate that each conversation in the
dataset adheres to the format expected by the fine-tuning API. Errors are
categorized based on their nature for easier debugging.

Data Type Check: Checks whether each entry in the dataset is a dictionary (dict).
Error type: data_type.
Presence of Message List: Checks if a messages list is present in each entry. Error
type: missing_messages_list.
Message Keys Check: Validates that each message in the messages list contains the
keys role and content. Error type: message_missing_key.
Unrecognized Keys in Messages: Logs if a message has keys other than role, content,
weight, function_call, and name. Error type: message_unrecognized_key.
Role Validation: Ensures the role is one of "system", "user", or "assistant". Error
type: unrecognized_role.
Content Validation: Verifies that content has textual data and is a string. Error
type: missing_content.
Assistant Message Presence: Checks that each conversation has at least one message
from the assistant. Error type: example_missing_assistant_message.
The code below performs these checks, and outputs counts for each type of error
found are printed. This is useful for debugging and ensuring the dataset is ready
for the next steps.

# Format error checks

format_errors = defaultdict(int)

for ex in dataset:
if not isinstance(ex, dict):
format_errors["data_type"] += 1
continue

messages = ex.get("messages", None)

if not messages:
format_errors["missing_messages_list"] += 1
continue

for message in messages:

if "role" not in message or "content" not in message:
format_errors["message_missing_key"] += 1

if any(k not in ("role", "content", "name", "function_call", "weight") for

k in message):
format_errors["message_unrecognized_key"] += 1

if message.get("role", None) not in ("system", "user", "assistant",

"function"):
format_errors["unrecognized_role"] += 1

content = message.get("content", None)

function_call = message.get("function_call", None)

if (not content and not function_call) or not isinstance(content, str):

format_errors["missing_content"] += 1

if not any(message.get("role", None) == "assistant" for message in messages):

format_errors["example_missing_assistant_message"] += 1

if format_errors:
print("Found errors:")
for k, v in format_errors.items():
print(f"{k}: {v}")
else:
print("No errors found")

No errors found
Token Counting Utilities
Lets define a few helpful utilities to be used in the rest of the notebook.

encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from
https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_wi
th_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
num_tokens = 0
for message in messages:
num_tokens += tokens_per_message
for key, value in message.items():
num_tokens += len(encoding.encode(value))
if key == "name":
num_tokens += tokens_per_name
num_tokens += 3
return num_tokens

def num_assistant_tokens_from_messages(messages):
num_tokens = 0
for message in messages:
if message["role"] == "assistant":
num_tokens += len(encoding.encode(message["content"]))
return num_tokens

def print_distribution(values, name):

print(f"\n#### Distribution of {name}:")
print(f"min / max: {min(values)}, {max(values)}")
print(f"mean / median: {np.mean(values)}, {np.median(values)}")
print(f"p5 / p95: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")

Data Warnings and Token Counts

With some lightweight analysis we can identify potential issues in the dataset,
like missing messages, and provide statistical insights into message and token
counts.

Missing System/User Messages: Counts the number of conversations missing a "system"

or "user" message. Such messages are critical for defining the assistant's behavior
and initiating the conversation.
Number of Messages Per Example: Summarizes the distribution of the number of
messages in each conversation, providing insight into dialogue complexity.
Total Tokens Per Example: Calculates and summarizes the distribution of the total
number of tokens in each conversation. Important for understanding fine-tuning
costs.
Tokens in Assistant's Messages: Calculates the number of tokens in the assistant's
messages per conversation and summarizes this distribution. Useful for
understanding the assistant's verbosity.
Token Limit Warnings: Checks if any examples exceed the maximum token limit (16,385
tokens), as such examples will be truncated during fine-tuning, potentially
resulting in data loss.
# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
messages = ex["messages"]
if not any(message["role"] == "system" for message in messages):
n_missing_system += 1
if not any(message["role"] == "user" for message in messages):
n_missing_user += 1
n_messages.append(len(messages))
convo_lens.append(num_tokens_from_messages(messages))
assistant_message_lens.append(num_assistant_tokens_from_messages(messages))

print("Num examples missing system message:", n_missing_system)

print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 16385 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 16,385 token limit, they will be
truncated during fine-tuning")

Num examples missing system message: 1

Num examples missing user message: 1

#### Distribution of num_messages_per_example:

min / max: 2, 9
mean / median: 3.8, 3.0
p5 / p95: 2.0, 6.6000000000000005

#### Distribution of num_total_tokens_per_example:

min / max: 26, 8032
mean / median: 1648.4, 45.0
p5 / p95: 26.8, 4863.6

#### Distribution of num_assistant_tokens_per_example:

min / max: 4, 8000
mean / median: 1610.2, 10.0
p5 / p95: 6.0, 4811.200000000001

0 examples may be over the 16,385 token limit, they will be truncated during fine-
tuning
Cost Estimation
In this final section, we estimate the total number of tokens that will be used for
fine-tuning, which allows us to approximate the cost. It is worth noting that the
duration of the fine-tuning jobs will also increase with the token count.

# Pricing and default n_epochs estimate

MAX_TOKENS_PER_EXAMPLE = 16385

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in

convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for
during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset}
tokens")

Dataset has ~4306 tokens that will be charged for during training
By default, you'll train for 20 epochs on this dataset
By default, you'll be charged for ~86120 tokens

Ace the Trading Systems Developer Interview (C++ Edition) : Insider's Guide to Top Tech Jobs in Finance
From Everand
Ace the Trading Systems Developer Interview (C++ Edition) : Insider's Guide to Top Tech Jobs in Finance
Dennis Thompson Sr
5/5 (1)
Lisp Interpreter in Rust
From Everand
Lisp Interpreter in Rust
Vishal Patil
1/5 (1)
Programming with MATLAB: Taken From the Book "MATLAB for Beginners: A Gentle Approach"
From Everand
Programming with MATLAB: Taken From the Book "MATLAB for Beginners: A Gentle Approach"
Peter Kattan
4.5/5 (3)
Ftejerezbrochure PDF
100% (1)
Ftejerezbrochure PDF
28 pages
Secured Party Creditor ID Card Application: Right Thumb Print
100% (5)
Secured Party Creditor ID Card Application: Right Thumb Print
1 page
Python: Advanced Guide to Programming Code with Python: Python Computer Programming, #4
From Everand
Python: Advanced Guide to Programming Code with Python: Python Computer Programming, #4
Charlie Masterson
No ratings yet
Python: Advanced Guide to Programming Code with Python
From Everand
Python: Advanced Guide to Programming Code with Python
Charlie Masterson
No ratings yet
Python for Data Science: Data Science Mastery by Nikhil Khan, #1
From Everand
Python for Data Science: Data Science Mastery by Nikhil Khan, #1
Nikhil Khan
No ratings yet
FineTuning Process Using OpenAI 1703440516
No ratings yet
FineTuning Process Using OpenAI 1703440516
14 pages
Python For Beginners
From Everand
Python For Beginners
Célio Azevedo
No ratings yet
Coding for beginners The basic syntax and structure of coding
From Everand
Coding for beginners The basic syntax and structure of coding
Diamond Moore
No ratings yet
Easy Programming for Everyone
From Everand
Easy Programming for Everyone
Umar Asghar
No ratings yet
Test Cases
No ratings yet
Test Cases
3 pages
Algorithms and Data Structures: An Easy Guide to Programming Skills
From Everand
Algorithms and Data Structures: An Easy Guide to Programming Skills
Rigdon Jonathan
No ratings yet
Simplifying Data Science With Python
From Everand
Simplifying Data Science With Python
Billy David millican
No ratings yet
Python Advanced Programming: The Guide to Learn Python Programming. Reference with Exercises and Samples About Dynamical Programming, Multithreading, Multiprocessing, Debugging, Testing and More
From Everand
Python Advanced Programming: The Guide to Learn Python Programming. Reference with Exercises and Samples About Dynamical Programming, Multithreading, Multiprocessing, Debugging, Testing and More
Marcus Richards
No ratings yet
Python for Absolute Beginners: Learn to Code Fast!
From Everand
Python for Absolute Beginners: Learn to Code Fast!
Ibnul Jaif Farabi
No ratings yet
Learn C++
From Everand
Learn C++
Durgesh
4.5/5 (9)
C Programming Language
From Everand
C Programming Language
Younish Pathan
No ratings yet
Profound Python Data Science
From Everand
Profound Python Data Science
Onder Teker
No ratings yet
Collection of Raspberry Pi Projects
From Everand
Collection of Raspberry Pi Projects
Guillermo Perez Guillen
5/5 (1)
Python Programming Concepts
From Everand
Python Programming Concepts
MRB
No ratings yet
A Beginner's guide to Python
From Everand
A Beginner's guide to Python
Steven Mcananey
No ratings yet
C in 30 Pages
From Everand
C in 30 Pages
U.Q. Magnusson
4.5/5 (2)
50 Java Concepts Every Developer Should Know
From Everand
50 Java Concepts Every Developer Should Know
Hernando Abella
No ratings yet
Autogen OpenAi Class
No ratings yet
Autogen OpenAi Class
12 pages
Composing Software: An Exploration of Functional Programming and Object Composition in JavaScript
From Everand
Composing Software: An Exploration of Functional Programming and Object Composition in JavaScript
Eric Elliott
No ratings yet
Dive Into Sea of C
From Everand
Dive Into Sea of C
M Ashok
No ratings yet
Name
No ratings yet
Name
27 pages
Introduction to PHP, Part 2, Second Edition
From Everand
Introduction to PHP, Part 2, Second Edition
Adam Majczak
No ratings yet
6 - Performance Results Via Simulation
No ratings yet
6 - Performance Results Via Simulation
3 pages
Phase 5
No ratings yet
Phase 5
9 pages
C++ Functions and tutorial
From Everand
C++ Functions and tutorial
Nino Paiotta
No ratings yet
Coding In C Decoded: Decoded, #1
From Everand
Coding In C Decoded: Decoded, #1
D Brown
No ratings yet
Perl One-Liners: 130 Programs That Get Things Done
From Everand
Perl One-Liners: 130 Programs That Get Things Done
Peteris Krumins
4/5 (3)
State 2
No ratings yet
State 2
3 pages
Coding Interview Questions and Answers
From Everand
Coding Interview Questions and Answers
Chinmoy Mukherjee
No ratings yet
Gd Script
From Everand
Gd Script
Marijo Trkulja
No ratings yet
Code
No ratings yet
Code
13 pages
Python Reference: An Alphabetical Guide
From Everand
Python Reference: An Alphabetical Guide
Jo Foster
No ratings yet
Mastering Node.js Web Development: Go on a comprehensive journey from the fundamentals to advanced web development with Node.js
From Everand
Mastering Node.js Web Development: Go on a comprehensive journey from the fundamentals to advanced web development with Node.js
Adam Freeman
No ratings yet
Exploring The ChatGPT API With Python
No ratings yet
Exploring The ChatGPT API With Python
11 pages
Whatsapp - Analyzer
No ratings yet
Whatsapp - Analyzer
8 pages
MCS-011: Problem Solving and Programming
From Everand
MCS-011: Problem Solving and Programming
Dr. DK Sukhani
No ratings yet
ChatGPT For Coders Prompts
No ratings yet
ChatGPT For Coders Prompts
3 pages
Introduction to Algorithms
From Everand
Introduction to Algorithms
S VASIST
No ratings yet
Naive Bayes
No ratings yet
Naive Bayes
11 pages
OpenAI Compatible Server - VLLM
No ratings yet
OpenAI Compatible Server - VLLM
27 pages
50 Python Concepts Every Developer Should Know
From Everand
50 Python Concepts Every Developer Should Know
Hernando Abella
No ratings yet
Programming with Python
From Everand
Programming with Python
Enrique Vicente
No ratings yet
#1 Book on Python Programming
From Everand
#1 Book on Python Programming
Minhaj
No ratings yet
AIT TASKS2 Merged
No ratings yet
AIT TASKS2 Merged
24 pages
2 5341404471006070032
No ratings yet
2 5341404471006070032
2 pages
Quick Python Guide
From Everand
Quick Python Guide
Coder1
No ratings yet
Microsoft Visual Basic Interview Questions: Microsoft VB Certification Review
From Everand
Microsoft Visual Basic Interview Questions: Microsoft VB Certification Review
Equity Press
No ratings yet
More on C# in Front Office
From Everand
More on C# in Front Office
Xing Zhou
No ratings yet
Quiz 2
No ratings yet
Quiz 2
11 pages
Advanced C++ Interview Questions You'll Most Likely Be Asked
From Everand
Advanced C++ Interview Questions You'll Most Likely Be Asked
Vibrant Publishers
No ratings yet
Additional Tips
No ratings yet
Additional Tips
1 page
Python Code Explanation
No ratings yet
Python Code Explanation
4 pages
Inspiring Powershell Articles
From Everand
Inspiring Powershell Articles
Murat Yildirimoglu
No ratings yet
Python For Data Science
From Everand
Python For Data Science
Kevin Clark
No ratings yet
OpenAI Developer Platform Guide
No ratings yet
OpenAI Developer Platform Guide
26 pages
ChatGPT Retrieval Plugin
No ratings yet
ChatGPT Retrieval Plugin
14 pages
Deploying To Render
No ratings yet
Deploying To Render
2 pages
OpenAI TypeScript and JavaScript API Library
No ratings yet
OpenAI TypeScript and JavaScript API Library
9 pages
Analysis of Strength Characteristics of GGBS PDF
No ratings yet
Analysis of Strength Characteristics of GGBS PDF
3 pages
Af115fx X-Ride Clutch
No ratings yet
Af115fx X-Ride Clutch
2 pages
Adjust Axial Bently Nevada Probes
100% (3)
Adjust Axial Bently Nevada Probes
3 pages
B.Tech - 4-1 - R15 May 2019
No ratings yet
B.Tech - 4-1 - R15 May 2019
11 pages
Business Ethics
No ratings yet
Business Ethics
13 pages
Operating Instruction ZETATOP SM160
No ratings yet
Operating Instruction ZETATOP SM160
38 pages
Framwork and Challenges of HRM
No ratings yet
Framwork and Challenges of HRM
5 pages
Seismic Behavior and Design of Composite Steel Plate Shear Walls PDF
No ratings yet
Seismic Behavior and Design of Composite Steel Plate Shear Walls PDF
73 pages
Mechanical Pending Job List
No ratings yet
Mechanical Pending Job List
24 pages
Firewall Nat (Dokter Squid - Com)
No ratings yet
Firewall Nat (Dokter Squid - Com)
4 pages
1/22 Churchill Ave Maidstone VIC 3012, Mobile 0404577371
No ratings yet
1/22 Churchill Ave Maidstone VIC 3012, Mobile 0404577371
2 pages
Chapter 1 - Lec
No ratings yet
Chapter 1 - Lec
49 pages
Certifying Translations
No ratings yet
Certifying Translations
4 pages
GATE Architecture Solved 2011
No ratings yet
GATE Architecture Solved 2011
12 pages
9.9 - 15 HP 2006
No ratings yet
9.9 - 15 HP 2006
28 pages
ATA 08 Leveling & Weighing (Rev 1)
No ratings yet
ATA 08 Leveling & Weighing (Rev 1)
10 pages
Paybooks Employee Self Service
No ratings yet
Paybooks Employee Self Service
19 pages
Er308 PDF
No ratings yet
Er308 PDF
1 page
Business English Alibi Game Worksheet PDF
No ratings yet
Business English Alibi Game Worksheet PDF
3 pages
Google Python Class Day 1 Part 1 - 6
No ratings yet
Google Python Class Day 1 Part 1 - 6
33 pages
Data Consistency Errors
No ratings yet
Data Consistency Errors
9 pages
Example Danfoss Selection Card
No ratings yet
Example Danfoss Selection Card
2 pages
Handbook1 - Final Timber Structures
100% (2)
Handbook1 - Final Timber Structures
254 pages
Kohler 400REOZD Detroit Diesel Series 60 Engine Spec Sheet
No ratings yet
Kohler 400REOZD Detroit Diesel Series 60 Engine Spec Sheet
4 pages
Compare Contrast Paper
No ratings yet
Compare Contrast Paper
3 pages
Study Theme 1 - Chapter 1 - Hello Data
No ratings yet
Study Theme 1 - Chapter 1 - Hello Data
23 pages
VNT Brochure New
No ratings yet
VNT Brochure New
5 pages
MFL70340102 - 05 - S - 190326+smart TV Guide (WebOS 4.0) ENG+RS232 Guide ENG
No ratings yet
MFL70340102 - 05 - S - 190326+smart TV Guide (WebOS 4.0) ENG+RS232 Guide ENG
32 pages

Open AI Cookbook

Uploaded by

Open AI Cookbook

Uploaded by

Toggle theme

Data preparation and analysis for chat model fine-tuning

# Load the dataset

# Initial dataset stats

# Format error checks

messages = ex.get("messages", None)

for message in messages:

if any(k not in ("role", "content", "name", "function_call", "weight") for

if message.get("role", None) not in ("system", "user", "assistant",

content = message.get("content", None)

if (not content and not function_call) or not isinstance(content, str):

if not any(message.get("role", None) == "assistant" for message in messages):

def print_distribution(values, name):

Data Warnings and Token Counts

Missing System/User Messages: Counts the number of conversations missing a "system"

print("Num examples missing system message:", n_missing_system)

Num examples missing system message: 1

#### Distribution of num_messages_per_example:

#### Distribution of num_total_tokens_per_example:

#### Distribution of num_assistant_tokens_per_example:

# Pricing and default n_epochs estimate

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in

You might also like