OpenAI Cookbook
Michael Wu (OpenAI), Simón Fishman (OpenAI)
Aug 22, 2023
This notebook serves as a tool to preprocess and analyze the chat dataset used for
fine-tuning a chat model. It checks for format errors, provides basic statistics,
and estimates token counts for fine-tuning costs. The method shown here corresponds
to the current fine-tuning method for gpt-3.5-turbo. See legacy fine-tuning for
models like babbage-002 and davinci-002.
import json
import tiktoken # for token counting
import numpy as np
from collections import defaultdict
Data loading
We first load the chat dataset from an example JSONL file.
data_path = "data/toy_chat_fine_tuning.jsonl"
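# The loading step itself is elided above; a minimal sketch, assuming one
# JSON object per line in the JSONL file, that produces the stats printed below:
with open(data_path, "r", encoding="utf-8") as f:
    dataset = [json.loads(line) for line in f]

# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)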
Num examples: 5
First example:
{'role': 'system', 'content': 'You are a happy assistant that puts a positive spin on everything.'}
{'role': 'user', 'content': 'I fell off my bike today.'}
{'role': 'assistant', 'content': "It's great that you're getting exercise outdoors!"}
Format validation
We can perform a variety of error checks to validate that each conversation in the
dataset adheres to the format expected by the fine-tuning API. Errors are
categorized based on their nature for easier debugging.
Data Type Check: Checks whether each entry in the dataset is a dictionary (dict).
Error type: data_type.
Presence of Message List: Checks if a messages list is present in each entry. Error
type: missing_messages_list.
Message Keys Check: Validates that each message in the messages list contains the
keys role and content. Error type: message_missing_key.
Unrecognized Keys in Messages: Logs if a message has keys other than role, content,
weight, function_call, and name. Error type: message_unrecognized_key.
Role Validation: Ensures the role is one of "system", "user", or "assistant". Error
type: unrecognized_role.
Content Validation: Verifies that content has textual data and is a string. Error
type: missing_content.
Assistant Message Presence: Checks that each conversation has at least one message
from the assistant. Error type: example_missing_assistant_message.
The code below performs these checks and prints a count for each type of error found. This is useful for debugging and for ensuring the dataset is ready for the next steps.
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue
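    # The checks below are a sketch reconstructing the remaining error types
    # listed above; the counter keys mirror those error categories.
    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue

    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1

        if any(k not in ("role", "content", "name", "function_call", "weight") for k in message):
            format_errors["message_unrecognized_key"] += 1

        if message.get("role", None) not in ("system", "user", "assistant"):
            format_errors["unrecognized_role"] += 1

        content = message.get("content", None)
        if not content or not isinstance(content, str):
            format_errors["missing_content"] += 1

    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1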
if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")
No errors found
Token Counting Utilities
Let's define a few helpful utilities to be used in the rest of the notebook.
encoding = tiktoken.get_encoding("cl100k_base")
# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens
def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens
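As a quick illustrative check (not part of the original output), these utilities can be applied to a single conversation:

# Illustrative usage on the first example
example_messages = dataset[0]["messages"]
print("Total tokens:", num_tokens_from_messages(example_messages))
print("Assistant tokens:", num_assistant_tokens_from_messages(example_messages))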
# Warnings and token counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))
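# Reporting (sketch): the prints that produce the warning below are elided;
# the 16,385 cap is assumed from the token limit named in the output, and
# only the final warning's output is reproduced below.
print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
n_too_long = sum(l > 16385 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 16,385 token limit, they will be truncated during fine-tuning")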
0 examples may be over the 16,385 token limit, they will be truncated during fine-tuning
Cost Estimation
In this final section, we estimate the total number of tokens that will be used for
fine-tuning, which allows us to approximate the cost. It is worth noting that the
duration of the fine-tuning jobs will also increase with the token count.
TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)
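# Billing estimate (sketch): sum each conversation's tokens, capped at an
# assumed per-example maximum matching the 16,385 token limit above; this
# produces the three lines printed below.
MAX_TOKENS_PER_EXAMPLE = 16385
n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")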
Dataset has ~4306 tokens that will be charged for during training
By default, you'll train for 20 epochs on this dataset
By default, you'll be charged for ~86120 tokens