Fine-tuning
Learn how to customize a model for your application.
Introduction
Fine-tuning lets you get more out of the models available through the API by providing:
Higher quality results than prompting
Ability to train on more examples than can fit in a prompt
Token savings due to shorter prompts
Lower latency requests
OpenAI's text generation models have been pre-trained on a vast amount of text. To use the
models effectively, we include instructions and sometimes several examples in a prompt.
Using demonstrations to show how to perform a task is often called "few-shot learning."
Fine-tuning improves on few-shot learning by training on many more examples than can fit in
the prompt, letting you achieve better results on a wide number of tasks. Once a model has
been fine-tuned, you won't need to provide as many examples in the prompt. This
saves costs and enables lower-latency requests.
Visit our pricing page to learn more about how fine-tuned model training and usage are
billed.
Fine-tuning for GPT-4 is in an experimental access program - eligible users can request
access in the fine-tuning UI when creating a new fine-tuning job.
Fine-tuning is currently available for the following models: gpt-3.5-turbo-0125
(recommended), gpt-3.5-turbo-1106, gpt-3.5-turbo-0613, babbage-002, davinci-002,
and gpt-4-0613 (experimental).
You can also fine-tune a fine-tuned model which is useful if you acquire additional data and
don't want to repeat the previous training steps.
We expect gpt-3.5-turbo to be the right model for most users in terms of results and ease
of use.
When to use fine-tuning
There are many tasks at which our models may not initially appear to perform well, but results can be improved with the right prompts, so fine-tuning may not be necessary.
Iterating over prompts and other tactics has a much faster feedback loop than iterating with fine-tuning, which requires creating datasets and running training jobs.
In cases where fine-tuning is still necessary, initial prompt engineering work is not wasted. We typically see the best results when using a good prompt in the fine-tuning data (or when combining prompt chaining / tool use with fine-tuning).
Our prompt engineering guide provides a background on some of the most effective
strategies and tactics for getting better performance without fine-tuning. You may find it
helpful to iterate quickly on prompts in our playground.
One high-level way to think about these cases is when it’s easier to "show, not tell". In the
sections to come, we will explore how to set up data for fine-tuning and various examples
where fine-tuning improves the performance over the baseline model.
Another scenario where fine-tuning is effective is reducing cost and/or latency by replacing GPT-4 or by using shorter prompts, without sacrificing quality. If you can achieve good results with GPT-4, you can often reach similar quality with a fine-tuned gpt-3.5-turbo model by fine-tuning on the GPT-4 completions, possibly with a shortened instruction prompt.
Preparing your dataset
Each example in the dataset should be a conversation in the same format as our Chat Completions API, specifically a list of messages where each message has a role, content, and optional name. At least some of the training examples should directly target cases where the prompted model is not behaving as desired, and the provided assistant messages in the data should be the ideal responses you want the model to provide.
Example format
In this example, our goal is to create a chatbot that occasionally gives sarcastic responses. Here are three training examples (conversations) we could create for such a dataset:
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also
sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role":
"assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also
sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role":
"assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of
him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also
sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"},
{"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like
that really matters."}]}
The conversational chat format is required to fine-tune gpt-3.5-turbo. For babbage-002 and davinci-002, you can follow the prompt-completion pair format shown below.
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
Multi-turn chat examples
Examples in the chat format can have multiple messages with the assistant role. The default behavior during fine-tuning is to train on all assistant messages within a single example. To skip fine-tuning on specific assistant messages, a weight key can be added to disable fine-tuning on those messages, allowing you to control which assistant messages are learned. The allowed values for weight are currently 0 or 1. Some examples using weight for the chat format are below.
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also
sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role":
"assistant", "content": "Paris", "weight": 0}, {"role": "user", "content": "Can you
be more sarcastic?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't
know that already.", "weight": 1}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also
sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role":
"assistant", "content": "William Shakespeare", "weight": 0}, {"role": "user",
"content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Oh, just
some guy named William Shakespeare. Ever heard of him?", "weight": 1}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also
sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"},
{"role": "assistant", "content": "384,400 kilometers", "weight": 0}, {"role": "user",
"content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Around
384,400 kilometers. Give or take a few, like that really matters.", "weight": 1}]}
Crafting prompts
We generally recommend taking the set of instructions and prompts that you found worked
best for the model prior to fine-tuning, and including them in every training example. This
should let you reach the best and most general results, especially if you have relatively few
(e.g. under a hundred) training examples.
If you would like to shorten the instructions or prompts that are repeated in every example to
save costs, keep in mind that the model will likely behave as if those instructions were
included, and it may be hard to get the model to ignore those "baked-in" instructions at
inference time.
It may take more training examples to arrive at good results, as the model has to learn
entirely through demonstration and without guided instructions.
Example count recommendations
To fine-tune a model, you are required to provide at least 10 examples. We typically see clear improvements from fine-tuning on 50 to 100 training examples with gpt-3.5-turbo, but the right number varies greatly based on the exact use case.
We recommend starting with 50 well-crafted demonstrations and seeing if the model shows
signs of improvement after fine-tuning. In some cases that may be sufficient, but even if the
model is not yet production quality, clear improvements are a good sign that providing more
data will continue to improve the model. No improvement suggests that you may need to
rethink how to set up the task for the model or restructure the data before scaling beyond a
limited example set.
Train and test splits
After collecting the initial dataset, we recommend splitting it into a training and test portion.
When submitting a fine-tuning job with both training and test files, we will provide statistics on
both during the course of training. These statistics will be your initial signal of how much the
model is improving. Additionally, constructing a test set early on will be useful in making sure
you are able to evaluate the model after training, by generating samples on the test set.
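As a rough sketch, assuming your examples live in a mydata.jsonl file (the filenames here are illustrative), a simple 80/20 split could look like:
python
import json
import random

# Load every training example from the JSONL dataset
with open("mydata.jsonl") as f:
    examples = [json.loads(line) for line in f]

# Shuffle with a fixed seed so the split is reproducible
random.seed(42)
random.shuffle(examples)
split = int(0.8 * len(examples))

# Write the training and test portions back out as JSONL
for name, subset in [("train.jsonl", examples[:split]), ("test.jsonl", examples[split:])]:
    with open(name, "w") as out:
        for ex in subset:
            out.write(json.dumps(ex) + "\n")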
Token limits
Token limits depend on the model you select. For gpt-3.5-turbo-0125, the maximum context length is 16,385 tokens, so each training example is also limited to 16,385 tokens. For gpt-3.5-turbo-0613, each training example is limited to 4,096 tokens. Examples longer than the default will be truncated to the maximum context length, which removes tokens from the end of the training example(s). To be sure that your entire training example fits in context, consider checking that the total token counts in the message contents are under the limit.
You can compute token counts using our counting tokens notebook from the OpenAI
cookbook.
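For a quick check without the notebook, here is a minimal sketch using the tiktoken library. It counts message contents only; the exact accounting adds a few tokens of per-message formatting overhead, so treat the result as a lower bound:
python
import json
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
MAX_TOKENS = 16385  # context limit for gpt-3.5-turbo-0125

with open("mydata.jsonl") as f:
    for i, line in enumerate(f, start=1):
        example = json.loads(line)
        # Sum tokens across all message contents in the example
        n_tokens = sum(len(enc.encode(m["content"])) for m in example["messages"])
        if n_tokens > MAX_TOKENS:
            print(f"example {i} may be truncated: ~{n_tokens} tokens")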
Estimate costs
Please refer to the pricing page for details on cost per 1k input and output tokens (we do not
charge for tokens that are part of the validation data). To estimate the costs for a specific
fine-tuning job, use the following formula:
base cost per 1k tokens * number of tokens in the input file * number of epochs trained
For a training file with 100,000 tokens trained over 3 epochs, the expected cost would be
~$2.40 USD.
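As a sanity check of that formula, here is the arithmetic behind the ~$2.40 example. The $0.008 per 1K training tokens rate is inferred from the example itself; check the pricing page for current rates:
python
# Cost = base cost per 1K tokens * (tokens in input file / 1000) * epochs
base_cost_per_1k = 0.008   # USD per 1K training tokens (inferred, may change)
tokens_in_file = 100_000
n_epochs = 3

estimated_cost = base_cost_per_1k * (tokens_in_file / 1000) * n_epochs
print(f"~${estimated_cost:.2f} USD")  # ~$2.40 USD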
Once you have compiled a dataset and before you create a fine-tuning job, it is important to
check the data formatting. To do this, we created a simple Python script which you can use
to find potential errors, review token counts, and estimate the cost of a fine-tuning job.
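The full script lives in the OpenAI cookbook; as a minimal sketch of the kind of checks it performs, you could validate the basic structure of each line yourself:
python
import json

with open("mydata.jsonl") as f:
    for i, line in enumerate(f, start=1):
        # Each line must be a standalone JSON object
        try:
            ex = json.loads(line)
        except json.JSONDecodeError:
            print(f"line {i}: invalid JSON")
            continue
        # Each object needs a non-empty "messages" list
        messages = ex.get("messages")
        if not isinstance(messages, list) or not messages:
            print(f"line {i}: missing 'messages' list")
            continue
        # Every message needs at least a role and content
        for m in messages:
            if "role" not in m or "content" not in m:
                print(f"line {i}: message missing role/content")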
Upload a training file
Once you have the data validated, the file needs to be uploaded using the Files API in order to be used with a fine-tuning job:
python
from openai import OpenAI
client = OpenAI()

client.files.create(
    file=open("mydata.jsonl", "rb"),
    purpose="fine-tune"
)
After you upload the file, it may take some time to process. While the file is processing, you can still create a fine-tuning job, but it will not start until the file processing has completed. The maximum file upload size is 1 GB, though we do not suggest fine-tuning with that amount of data, since you are unlikely to need nearly that much to see improvements.
Create a fine-tuned model
After the file has been uploaded, the next step is to create a fine-tuning job:
python
from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="gpt-3.5-turbo"
)
In this example, model is the name of the model you want to fine-tune (gpt-3.5-turbo,
babbage-002, davinci-002, or an existing fine-tuned model) and training_file is the file
ID that was returned when the training file was uploaded to the OpenAI API. You can
customize your fine-tuned model's name using the suffix parameter.
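For example, a job that sets suffix might look like the sketch below (the suffix value is illustrative):
python
from openai import OpenAI
client = OpenAI()

job = client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="gpt-3.5-turbo",
    suffix="my-experiment",  # becomes part of the fine-tuned model's name
)
print(job.id)  # the job ID, used to monitor or cancel the job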
After you've started a fine-tuning job, it may take some time to complete. Your job may be
queued behind other jobs in our system, and training a model can take minutes or hours
depending on the model and dataset size. After the model training is completed, the user
who created the fine-tuning job will receive an email confirmation.
In addition to creating a fine-tuning job, you can also list existing jobs, retrieve the status of a
job, or cancel a job.
python
from openai import OpenAI
client = OpenAI()

# List 10 fine-tuning jobs
client.fine_tuning.jobs.list(limit=10)

# Retrieve the state of a fine-tuning job
client.fine_tuning.jobs.retrieve("ftjob-abc123")

# Cancel a job
client.fine_tuning.jobs.cancel("ftjob-abc123")

# List up to 10 events from a fine-tuning job
client.fine_tuning.jobs.list_events(fine_tuning_job_id="ftjob-abc123", limit=10)

# Delete a fine-tuned model (must be an owner of the org the model was created in)
client.models.delete("ft:gpt-3.5-turbo:acemeco:suffix:abc123")
After your job is completed, the model should be available right away for inference use. In
some cases, it may take several minutes for your model to become ready to handle
requests. If requests to your model time out or the model name cannot be found, it is likely
because your model is still being loaded. If this happens, try again in a few minutes.
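A hedged sketch of that retry advice, polling until the model answers (the model name is a placeholder, and the error classes are the ones exported by the openai Python package):
python
import time
from openai import OpenAI, NotFoundError, APITimeoutError

client = OpenAI()
MODEL = "ft:gpt-3.5-turbo:my-org:custom_suffix:id"  # placeholder model name

for attempt in range(10):
    try:
        # A tiny request just to confirm the model is loaded
        client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,
        )
        break  # the model responded, so it is ready for real traffic
    except (NotFoundError, APITimeoutError):
        time.sleep(30)  # still loading; wait and try again

Once the model is ready, you can call it like any other chat model: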
python
from openai import OpenAI
client = OpenAI()

completion = client.chat.completions.create(
    model="ft:gpt-3.5-turbo:my-org:custom_suffix:id",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)
print(completion.choices[0].message)
You can start making requests by passing the model name as shown above and in our GPT
guide.
Use a checkpointed model
In addition to the final fine-tuned model, checkpoints are saved at the end of training epochs. To use a checkpointed model:
1. Wait until a job succeeds, which you can verify by querying the status of a job.
2. Query the checkpoints endpoint with your fine-tuning job ID to access a list of model
checkpoints for the fine-tuning job.
For each checkpoint object, you will see the fine_tuned_model_checkpoint field populated
with the name of the model checkpoint. You may now use this model just like you would with
the final fine-tuned model.
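As a sketch, listing checkpoints with the Python SDK might look like the following (this assumes the fine_tuning.jobs.checkpoints.list method available in recent versions of the openai package):
python
from openai import OpenAI
client = OpenAI()

# List the saved checkpoints for a finished fine-tuning job
checkpoints = client.fine_tuning.jobs.checkpoints.list("ftjob-abc123")
for ckpt in checkpoints.data:
    print(ckpt.step_number, ckpt.fine_tuned_model_checkpoint)

Each checkpoint object has the following form: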
{
  "object": "fine_tuning.job.checkpoint",
  "id": "ftckpt_zc4Q7MP6XxulcVzj4MZdwsAB",
  "created_at": 1519129973,
  "fine_tuned_model_checkpoint": "ft:gpt-3.5-turbo-0125:my-org:custom-suffix:96olL566:ckpt-step-2000",
  "metrics": {
    "full_valid_loss": 0.134,
    "full_valid_mean_token_accuracy": 0.874
  },
  "fine_tuning_job_id": "ftjob-abc123",
  "step_number": 2000
}
step_number: The step at which the checkpoint was created (the number of steps per epoch is the number of examples in the training set divided by the batch size)
metrics: An object containing the metrics for your fine-tuning job at the step when the checkpoint was created.
Currently, only the checkpoints for the last 3 epochs of the job are saved and available for
use. We plan to release more complex and flexible checkpointing strategies in the near
future.
Analyzing your fine-tuned model
We provide the following training metrics computed over the course of training:
training loss
training token accuracy
valid loss
valid token accuracy
Valid loss and valid token accuracy are computed in two different ways: on a small batch of the data during each step, and on the full valid split at the end of each epoch. The full valid loss and full valid token accuracy metrics are the most accurate metrics tracking the overall performance of your model. These statistics are meant to provide a sanity check that training went smoothly (loss should decrease, token accuracy should increase). While an active fine-tuning job is running, you can view an event object which contains some useful metrics:
{
  "object": "fine_tuning.job.event",
  "id": "ftevent-abc-123",
  "created_at": 1693582679,
  "level": "info",
  "message": "Step 300/300: training loss=0.15, validation loss=0.27, full validation loss=0.40",
  "data": {
    "step": 300,
    "train_loss": 0.14991648495197296,
    "valid_loss": 0.26569826706596045,
    "total_steps": 300,
    "full_valid_loss": 0.4032616495084362,
    "train_mean_token_accuracy": 0.9444444179534912,
    "valid_mean_token_accuracy": 0.9565217391304348,
    "full_valid_mean_token_accuracy": 0.9089635854341737
  },
  "type": "metrics"
}
After a fine-tuning job has finished, you can also see metrics around how the training process went by querying a fine-tuning job, extracting a file ID from the result_files, and then retrieving that file's content. Each results CSV file has the following columns: step, train_loss, train_accuracy, valid_loss, and valid_mean_token_accuracy.
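As a sketch, pulling the results CSV with the Python SDK might look like this (the job ID is a placeholder):
python
from openai import OpenAI
client = OpenAI()

# Retrieve the finished job and grab the first result file ID
job = client.fine_tuning.jobs.retrieve("ftjob-abc123")
result_file_id = job.result_files[0]

# Download the raw CSV content
content = client.files.content(result_file_id)
print(content.text[:500])  # first few rows of the metrics CSV

The first few rows of a results file look like this: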
step,train_loss,train_accuracy,valid_loss,valid_mean_token_accuracy
1,1.52347,0.0,,
2,0.57719,0.0,,
3,3.63525,0.0,,
4,1.72257,0.0,,
5,1.52379,0.0,,
While metrics can be helpful, evaluating samples from the fine-tuned model provides the most relevant sense of model quality. We recommend generating samples from both the base model and the fine-tuned model on a test set, and comparing the samples side by side. The test set should ideally include the full distribution of inputs that you might send to the model in a production use case. If manual evaluation is too time-consuming, consider using our Evals library to automate future evaluations.
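A minimal sketch of that side-by-side comparison, reusing the sarcastic-chatbot examples from earlier (the fine-tuned model name is a placeholder):
python
from openai import OpenAI
client = OpenAI()

test_prompts = ["What's the capital of France?", "How far is the Moon from Earth?"]
models = ["gpt-3.5-turbo", "ft:gpt-3.5-turbo:my-org:custom_suffix:id"]

# Generate a completion from each model for every test prompt
for model in models:
    for prompt in test_prompts:
        r = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
                {"role": "user", "content": prompt},
            ],
        )
        print(f"{model} | {prompt} -> {r.choices[0].message.content}")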
Iterating on data quality
Look at the agreement / consistency in the training examples. If multiple people created the training data, it's likely that model performance will be limited by the level of agreement / consistency between people. For instance, in a text extraction task, if people only agreed on 70% of extracted snippets, the model would likely not be able to do better than this.
Make sure all of your training examples are in the same format, as expected for inference.
Iterating on data quantity
Once you're satisfied with the quality and distribution of the examples, you can consider scaling up the number of training examples. This tends to help the model learn the task better, especially around possible "edge cases". We expect a similar amount of improvement every time you double the number of training examples. You can loosely estimate the expected quality gain from increasing the training data size by:
fine-tuning on your current dataset
fine-tuning on half of your current dataset
observing the quality gap between the two
In general, if you have to make a trade-off, a smaller amount of high-quality data is generally more effective than a larger amount of low-quality data.
Iterating on hyperparameters
We allow you to specify the following hyperparameters:
epochs
learning rate multiplier
batch size
We recommend initially training without specifying any of these, allowing us to pick a default for you based on dataset size, then adjusting if you observe the following:
If the model does not follow the training data as much as expected, increase the number of epochs by 1 or 2. This is more common for tasks for which there is a single ideal completion (or a small set of ideal completions which are similar). Some examples include classification, entity extraction, or structured parsing. These are often tasks for which you can compute a final accuracy metric against a reference answer.
If the model becomes less diverse than expected, decrease the number of epochs by 1 or 2. This is more common for tasks for which there are a wide range of possible good completions.
If the model does not appear to be converging, increase the learning rate multiplier.
You can set the hyperparameters as shown below:
python
from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="gpt-3.5-turbo",
    hyperparameters={
        "n_epochs": 2
    }
)
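A sketch that sets all three tunable hyperparameters at once; the specific values are illustrative, not recommendations:
python
from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="gpt-3.5-turbo",
    hyperparameters={
        "n_epochs": 2,                  # number of passes over the dataset
        "learning_rate_multiplier": 2,  # scales the default learning rate
        "batch_size": 4,                # examples per training step
    },
)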
Fine-tuning examples
Now that we have explored the basics of the fine-tuning API, let's look at going through the fine-tuning lifecycle for a few different use cases, such as structured output and function calling.
Fine-tuning Integrations
OpenAI provides the ability for you to integrate your fine-tuning jobs with third parties via our integration framework. Integrations generally allow you to track job state, status, metrics, hyperparameters, and other job-related information in a third-party system. You can also use integrations to trigger actions in a third-party system based on job state changes. Currently, the only supported integration is with Weights and Biases, but more are coming soon.
Setting up the integration requires two steps:
1. Provide authentication credentials for your Weights and Biases account to OpenAI
2. Configure the W&B integration when creating new fine-tuning jobs
When creating a new fine-tuning job, you can enable the W&B integration by including a new
"wandb" integration under the integrations field in the job creation request. This integration
allows you to specify the W&B Project that you wish the newly created W&B Run to show up
under.
Here's an example of how to enable the W&B integration when creating a new fine-tuning
job:
curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo-0125",
    "training_file": "file-ABC123",
    "validation_file": "file-DEF456",
    "integrations": [
      {
        "type": "wandb",
        "wandb": {
          "project": "custom-wandb-project",
          "tags": ["project:tag", "lineage"]
        }
      }
    ]
  }' https://api.openai.com/v1/fine_tuning/jobs
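The same request via the Python SDK might look like the following sketch (the integrations parameter mirrors the JSON body above):
python
from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.create(
    training_file="file-ABC123",
    validation_file="file-DEF456",
    model="gpt-3.5-turbo-0125",
    integrations=[
        {
            "type": "wandb",
            "wandb": {
                "project": "custom-wandb-project",
                "tags": ["project:tag", "lineage"],
            },
        }
    ],
)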
By default, the Run ID and Run display name are the ID of your fine-tuning job (e.g. ftjob-
abc123). You can customize the display name of the run by including a "name" field in the
wandb object. You can also include a "tags" field in the wandb object to add tags to the W&B
Run (tags must be <= 64 character strings and there is a maximum of 50 tags).
Sometimes it is convenient to explicitly set the W&B Entity to be associated with the run. You
can do this by including an "entity" field in the wandb object. If you do not include an
"entity" field, the W&B entity will default to the default W&B entity associated with the API
key you registered previously.
The full specification for the integration can be found in our fine-tuning job creation
documentation.
Once you've created a fine-tuning job with the W&B integration enabled, you can view the
job in W&B by navigating to the W&B project you specified in the job creation request. Your
run should be located at the URL: https://wandb.ai/<WANDB-ENTITY>/<WANDB-
PROJECT>/runs/ftjob-ABCDEF.
You should see a new run with the name and tags you specified in the job creation request.
The Run Config will contain relevant job metadata, such as the base model, training files, and hyperparameters used. Likewise, OpenAI will set some default tags on the run to make it easier for you to search and filter; these tags are prefixed with "openai/" and include identifiers such as the fine-tuning job ID and base model.
(Figure: an example W&B run generated from an OpenAI fine-tuning job.)
Metrics for each step of the fine-tuning job will be logged to the W&B run. These metrics are the same metrics provided in the fine-tuning job event object and the same metrics you can view via the OpenAI fine-tuning dashboard. You can use W&B's visualization tools to track the progress of your fine-tuning job and compare it to other fine-tuning jobs you've run.
FAQ
When should I use fine-tuning vs embeddings / retrieval augmented generation?
Embeddings with retrieval is best suited for cases when you need to have a large database of documents with relevant context and information.
By default OpenAI’s models are trained to be helpful generalist assistants. Fine-tuning can
be used to make a model which is narrowly focused, and exhibits specific ingrained behavior
patterns. Retrieval strategies can be used to make new information available to a model by
providing it with relevant context before generating its response. Retrieval strategies are not
an alternative to fine-tuning and can in fact be complementary to it.
You can explore the differences between these options further in our Developer Day talk.
Can I fine-tune GPT-4 or GPT-3.5-Turbo-16k?
GPT-4 fine-tuning is in experimental access and eligible developers can request access via the fine-tuning UI. Fine-tuning GPT-4o and GPT-4 Turbo is not currently available. gpt-3.5-turbo-1106 and gpt-3.5-turbo-0125 support up to 16K context examples.
How can I tell whether my fine-tuned model is better than the base model?
We recommend generating samples from both the base model and the fine-tuned model on a test set of chat conversations, and comparing the samples side by side. For more comprehensive evaluations, consider using the OpenAI evals framework to create an eval specific to your use case.
Can I continue fine-tuning a model that has already been fine-tuned?
Yes, you can pass the name of a fine-tuned model into the model parameter when creating a fine-tuning job. This will start a new fine-tuning job using the fine-tuned model as the starting point.
Does the new fine-tuning endpoint still work with Weights & Biases for
tracking metrics?
Yes. As described in the Fine-tuning Integrations section above, you can enable a Weights & Biases integration when creating a fine-tuning job.
How do rate limits work on fine-tuned models?
Please refer to our rate limit guide for the most up-to-date information on the limits.
A fine-tuned model pulls from the same shared rate limit as the model it is based on. For example, if you use half your TPM rate limit in a given time period with the standard gpt-3.5-turbo model, any model(s) you fine-tuned from gpt-3.5-turbo would only have the remaining half of the TPM rate limit accessible, since the capacity is shared across all models of the same type.
Put another way, having fine-tuned models does not give you more capacity to use our models from a total throughput perspective.
What should I know about migrating to the updated /v1/fine_tuning/jobs API?
For users migrating from /v1/fine-tunes to the updated /v1/fine_tuning/jobs API and newer models, the main difference you can expect is the updated API. The legacy prompt-completion pair data format has been retained for the updated babbage-002 and davinci-002 models to ensure a smooth transition. The new models will support fine-tuning with 4k token context and have a knowledge cutoff of September 2021.
For most tasks, you should expect to get better performance from gpt-3.5-turbo than from the GPT base models.