
Fine-tuning

Learn how to customize a model for your application.

Introduction
This guide is intended for users of the new OpenAI fine-tuning API. If you are a
legacy fine-tuning user, please refer to our legacy fine-tuning guide.
Fine-tuning lets you get more out of the models available through the API by
providing:

1. Higher quality results than prompting
2. Ability to train on more examples than can fit in a prompt
3. Token savings due to shorter prompts
4. Lower latency requests

GPT models have been pre-trained on a vast amount of text. To use the models
effectively, we include instructions and sometimes several examples in a
prompt. Using demonstrations to show how to perform a task is often called
"few-shot learning."

Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting you achieve better results on a wide range of
tasks. Once a model has been fine-tuned, you won't need to provide as
many examples in the prompt. This saves costs and enables lower-latency
requests.

At a high level, fine-tuning involves the following steps:

1. Prepare and upload training data
2. Train a new fine-tuned model
3. Use your fine-tuned model

Visit our pricing page to learn more about how fine-tuned model training and
usage are billed.

What models can be fine-tuned?


We are working on enabling fine-tuning for GPT-4 and expect this feature to be
available later this year.
Fine-tuning is currently available for the following models:

 gpt-3.5-turbo-0613 (recommended)
 babbage-002
 davinci-002

We expect gpt-3.5-turbo to be the right model for most users in terms of results and ease of use, unless you are migrating a legacy fine-tuned model.

When to use fine-tuning


Fine-tuning GPT models can make them better for specific applications, but it
requires a careful investment of time and effort. We recommend first attempting
to get good results with prompt engineering, prompt chaining (breaking complex
tasks into multiple prompts), and function calling, with the key reasons being:

 There are many tasks at which our models may not initially appear to perform well, but with better prompting we can achieve much better results and potentially avoid the need to fine-tune
 Iterating over prompts and other tactics has a much faster feedback loop
than iterating with fine-tuning, which requires creating datasets and
running training jobs
 In cases where fine-tuning is still necessary, initial prompt engineering
work is not wasted - we typically see best results when using a good
prompt in the fine-tuning data (or combining prompt chaining / tool use
with fine-tuning)

Our GPT best practices guide provides a background on some of the most effective strategies and tactics for getting better performance without fine-tuning. You may find it helpful to iterate quickly on prompts in our playground.

Common use cases


Some common use cases where fine-tuning can improve results:

 Setting the style, tone, format, or other qualitative aspects
 Improving reliability at producing a desired output
 Correcting failures to follow complex prompts
 Handling many edge cases in specific ways
 Performing a new skill or task that’s hard to articulate in a prompt

One high-level way to think about these cases is when it’s easier to "show, not
tell". In the sections to come, we will explore how to set up data for fine-tuning
and various examples where fine-tuning improves the performance over the
baseline model.

Another scenario where fine-tuning is effective is in reducing costs and/or latency, by replacing GPT-4 or by utilizing shorter prompts, without sacrificing quality. If you can achieve good results with GPT-4, you can often reach similar quality with a fine-tuned gpt-3.5-turbo model by fine-tuning on the GPT-4 completions, possibly with a shortened instruction prompt.

Preparing your dataset


Once you have determined that fine-tuning is the right solution (i.e. you’ve
optimized your prompt as far as it can take you and identified problems that the
model still has), you’ll need to prepare data for training the model. You should
create a diverse set of demonstration conversations that are similar to the
conversations you will ask the model to respond to at inference time in
production.

Each example in the dataset should be a conversation in the same format as our Chat completions API, specifically a list of messages where each message has a role, content, and optional name. At least some of the training examples should directly target cases where the prompted model is not behaving as desired, and the provided assistant messages in the data should be the ideal responses you want the model to provide.

Example format
In this example, our goal is to create a chatbot that occasionally gives sarcastic responses. Here are three training examples (conversations) we could create for a dataset:

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role":
"user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone
doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role":
"user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy
named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role":
"user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400
kilometers. Give or take a few, like that really matters."}]}

We do not currently support function calling examples but are working to enable this.
The conversational chat format is required to fine-tune gpt-3.5-turbo. For babbage-002 and davinci-002, you can follow the prompt-completion pair format used for legacy fine-tuning, as shown below.
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}

Crafting prompts
We generally recommend taking the set of instructions and prompts that you
found worked best for the model prior to fine-tuning, and including them in every
training example. This should let you reach the best and most general results,
especially if you have relatively few (e.g. under a hundred) training examples.

If you would like to shorten the instructions or prompts that are repeated in
every example to save costs, keep in mind that the model will likely behave as if
those instructions were included, and it may be hard to get the model to ignore
those "baked-in" instructions at inference time.

It may take more training examples to arrive at good results, as the model has
to learn entirely through demonstration and without guided instructions.

Example count recommendations


To fine-tune a model, you are required to provide at least 10 examples. We
typically see clear improvements from fine-tuning on 50 to 100 training
examples with gpt-3.5-turbo but the right number varies greatly based on the
exact use case.
We recommend starting with 50 well-crafted demonstrations and seeing if the
model shows signs of improvement after fine-tuning. In some cases that may be
sufficient, but even if the model is not yet production quality, clear
improvements are a good sign that providing more data will continue to improve
the model. No improvement suggests that you may need to rethink how to set
up the task for the model or restructure the data before scaling beyond a limited
example set.

Train and test splits


After collecting the initial dataset, we recommend splitting it into a training and
test portion. When submitting a fine-tuning job with both training and test files,
we will provide statistics on both during the course of training. These statistics
will be your initial signal of how much the model is improving. Additionally,
constructing a test set early on will be useful in making sure you are able to
evaluate the model after training, by generating samples on the test set.
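As a rough illustration, here is a minimal sketch of such a split; the 80/20 ratio is an assumption, and the validation_file parameter name follows the fine-tuning jobs API:

import random

import openai

# Read one JSON object per line from the full dataset.
with open("mydata.jsonl") as f:
    examples = [line for line in f if line.strip()]

# Shuffle and split; an 80/20 ratio is assumed here, not required.
random.seed(42)
random.shuffle(examples)
split = int(0.8 * len(examples))

with open("train.jsonl", "w") as f:
    f.writelines(examples[:split])
with open("test.jsonl", "w") as f:
    f.writelines(examples[split:])

# Upload both files; the test set is passed as the validation file so that
# statistics are reported on it during training.
train_file = openai.File.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
test_file = openai.File.create(file=open("test.jsonl", "rb"), purpose="fine-tune")

openai.FineTuningJob.create(
    training_file=train_file.id,
    validation_file=test_file.id,
    model="gpt-3.5-turbo",
)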
Token limits
Each training example is limited to 4096 tokens. Examples longer than this will
be truncated to the first 4096 tokens when training. To be sure that your entire
training example fits in context, consider checking that the total token counts in
the message contents are under 4,000. Each file is currently limited to 50 MB.

You can compute token counts using our counting tokens notebook from the
OpenAI cookbook.
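If you prefer a quick programmatic check, a minimal sketch using the tiktoken library follows; the per-message token overheads are simplified assumptions, and the cookbook notebook remains the authoritative reference:

import json

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def num_tokens(example, tokens_per_message=3, tokens_per_reply=3):
    # Approximate per-message overhead plus the tokens of each field value.
    total = tokens_per_reply
    for message in example["messages"]:
        total += tokens_per_message
        for value in message.values():
            total += len(encoding.encode(value))
    return total

with open("mydata.jsonl") as f:
    counts = [num_tokens(json.loads(line)) for line in f]

print(f"max: {max(counts)} tokens, total: {sum(counts)} tokens")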

Estimate costs
In order to estimate the cost of a fine-tuning job, please refer to the pricing
page for details on the cost per 1k tokens. To estimate the costs for a specific
fine-tuning job, use the following formula:

base cost per 1K tokens × (number of tokens in the input file ÷ 1,000) × number of epochs trained

For a training file with 100,000 tokens trained over 3 epochs, the expected cost would be ~$2.40.
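As a worked example of the formula (the $0.008 per 1K training tokens rate is an assumption for illustration; check the pricing page for current rates):

base_cost_per_1k = 0.008  # USD per 1K training tokens (assumed rate)
tokens_in_file = 100_000
n_epochs = 3

# base cost per 1K tokens x (tokens / 1,000) x epochs
estimated_cost = base_cost_per_1k * (tokens_in_file / 1000) * n_epochs
print(f"~${estimated_cost:.2f}")  # ~$2.40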

Check data formatting


Once you have compiled a dataset and before you create a fine-tuning job, it is
important to check the data formatting. To do this, we created a simple Python
script which you can use to find potential errors, review token counts, and
estimate the cost of a fine-tuning job.

Data formatting script
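The cookbook script is more thorough, but a minimal sketch of this kind of check might look like the following (the specific checks shown are a small assumed subset):

import json
from collections import Counter

format_errors = Counter()

with open("mydata.jsonl") as f:
    for line in f:
        try:
            example = json.loads(line)
        except json.JSONDecodeError:
            format_errors["invalid_json"] += 1
            continue
        messages = example.get("messages")
        if not isinstance(messages, list) or not messages:
            format_errors["missing_messages_list"] += 1
            continue
        for message in messages:
            if message.get("role") not in ("system", "user", "assistant"):
                format_errors["unrecognized_role"] += 1
            if not isinstance(message.get("content"), str):
                format_errors["missing_or_invalid_content"] += 1
        if not any(m.get("role") == "assistant" for m in messages):
            format_errors["missing_assistant_message"] += 1

print(format_errors or "no errors found")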


Once you have the data validated, the file needs to be uploaded in order to be used with a fine-tuning job:

import openai

openai.File.create(
  file=open("mydata.jsonl", "rb"),
  purpose='fine-tune'
)

Create a fine-tuned model


After ensuring you have the right amount and structure for your dataset, and
have uploaded the file, the next step is to create a fine-tuning job.

Start your fine-tuning job using the OpenAI SDK:

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

openai.FineTuningJob.create(training_file="file-abc123", model="gpt-3.5-turbo")

model is the name of the model you're starting from ( gpt-3.5-turbo, babbage-002,


or davinci-002). You can customize your fine-tuned model's name using the suffix
parameter.
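For example, a minimal sketch of passing a suffix (the resulting model name, such as ft:gpt-3.5-turbo:my-org:custom_suffix:id, is illustrative):

openai.FineTuningJob.create(
  training_file="file-abc123",
  model="gpt-3.5-turbo",
  suffix="custom_suffix",  # appears in the fine-tuned model's name
)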
After you've started a fine-tuning job, it may take some time to complete. Your
job may be queued behind other jobs in our system, and training a model can
take minutes or hours depending on the model and dataset size. After the
model training is completed, the user who created the fine-tuning job will
receive an email confirmation.

In addition to creating a fine-tuning job, you can also list existing jobs, retrieve
the status of a job, or cancel a job.

# List 10 fine-tuning jobs
openai.FineTuningJob.list(limit=10)

# Retrieve the state of a fine-tune
openai.FineTuningJob.retrieve("ft-abc123")

# Cancel a job
openai.FineTuningJob.cancel("ft-abc123")

# List up to 10 events from a fine-tuning job
openai.FineTuningJob.list_events(id="ft-abc123", limit=10)

# Delete a fine-tuned model (must be an owner of the org the model was created in)
openai.Model.delete("ft-abc123")

Use a fine-tuned model


When a job has succeeded, you will see the fine_tuned_model field populated with the name of the model when you retrieve the job details. You may now specify this model as a parameter in the Chat completions API (for gpt-3.5-turbo) or the legacy Completions API (for babbage-002 and davinci-002), and make requests to it using the Playground.
After your job is completed, the model should be available right away for
inference use. In some cases, it may take several minutes for your model to
become ready to handle requests. If requests to your model time out or the
model name cannot be found, it is likely because your model is still being
loaded. If this happens, try again in a few minutes.

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

completion = openai.ChatCompletion.create(
  model="ft:gpt-3.5-turbo:my-org:custom_suffix:id",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ]
)

print(completion.choices[0].message)

You can start making requests by passing the model name as shown above
and in our GPT guide.
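If you want to handle the model-still-loading case described above programmatically, a minimal retry sketch could look like this (catching the broad openai.error.OpenAIError base class is an assumption; in practice you may want to narrow it to the errors you actually observe):

import time

import openai

def chat_with_retry(model, messages, attempts=5, wait_seconds=30):
    # Retry while a freshly fine-tuned model may still be loading.
    for attempt in range(attempts):
        try:
            return openai.ChatCompletion.create(model=model, messages=messages)
        except openai.error.OpenAIError:
            if attempt == attempts - 1:
                raise
            time.sleep(wait_seconds)

completion = chat_with_retry(
  model="ft:gpt-3.5-turbo:my-org:custom_suffix:id",
  messages=[{"role": "user", "content": "Hello!"}],
)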

Analyzing your fine-tuned model


We provide the following training metrics computed over the course of training:
training loss, training token accuracy, test loss, and test token accuracy. These
statistics are meant to provide a sanity check that training went smoothly (loss
should decrease, token accuracy should increase).

However, we think that evaluating samples from the fine-tuned model provides the most relevant sense of model quality. We recommend generating samples from both the base model and the fine-tuned model on a test set, and comparing the samples side by side. The test set should ideally include the full distribution of inputs that you might send to the model for inference. If manual evaluation is too time-consuming, consider using our Evals library to see how GPT-4 can be used to perform evaluations.

Iterating on data quality


If the results from a fine-tuning job are not as good as you expected, consider
the following ways to adjust the training dataset:

 Collect examples to target remaining issues
o If the model still isn’t good at certain aspects, add training examples that directly show the model how to do these aspects correctly
 Scrutinize existing examples for issues
o If your model has grammar, logic, or style issues, check if your
data has any of the same issues. For instance, if the model now
says "I will schedule this meeting for you" (when it shouldn’t), see
if existing examples teach the model to say it can do new things
that it can’t do
 Consider the balance and diversity of data
o If 60% of the assistant responses in the data say "I cannot answer this", but at inference time only 5% of responses should say that, you will likely get an overabundance of refusals
 Make sure your training examples contain all of the information needed
for the response
o If we want the model to compliment a user based on their
personal traits and a training example includes assistant
compliments for traits not found in the preceding conversation, the
model may learn to hallucinate information
 Look at the agreement / consistency in the training examples
o If multiple people created the training data, it’s likely that model
performance will be limited by the level of agreement / consistency
between people. For instance, in a text extraction task, if people
only agreed on 70% of extracted snippets, the model would likely
not be able to do better than this
 Make sure all of your training examples are in the same format expected at inference time

Iterating on data quantity


Once you’re satisfied with the quality and distribution of the examples, you can
consider scaling up the number of training examples. This tends to help the
model learn the task better, especially around possible "edge cases". We
expect a similar amount of improvement every time you double the number of
training examples. You can loosely estimate the expected quality gain from
increasing the training data size by:

 Fine-tuning on your current dataset
 Fine-tuning on half of your current dataset
 Observing the quality gap between the two

If you have to make a trade-off, a smaller amount of high-quality data is generally more effective than a larger amount of low-quality data.
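A minimal sketch of the halving experiment described above (the file names are assumptions; compare the two resulting models on your test set afterwards):

import random

import openai

with open("train.jsonl") as f:
    examples = [line for line in f if line.strip()]

# Take a random half of the dataset, seeded for reproducibility.
random.seed(0)
half = random.sample(examples, k=len(examples) // 2)
with open("train_half.jsonl", "w") as f:
    f.writelines(half)

# Launch one fine-tuning job per dataset size.
for path in ("train.jsonl", "train_half.jsonl"):
    uploaded = openai.File.create(file=open(path, "rb"), purpose="fine-tune")
    openai.FineTuningJob.create(training_file=uploaded.id, model="gpt-3.5-turbo")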

Iterating on hyperparameters
We allow you to specify the number of epochs to fine-tune a model for. We
recommend initially training without specifying the number of epochs, allowing
us to pick a default for you based on dataset size, then adjusting if you observe
the following:

 If the model does not follow the training data as much as expected, increase the number of epochs by 1 or 2
o This is more common for tasks for which there is a single ideal
completion (or a small set of ideal completions which are similar).
Some examples include classification, entity extraction, or
structured parsing. These are often tasks for which you can
compute a final accuracy metric against a reference answer.
 If the model becomes less diverse than expected, decrease the number of epochs by 1 or 2
o This is more common for tasks for which there are a wide range of
possible good completions
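When you do want to override the default, a sketch of specifying the epoch count (the hyperparameters field and n_epochs name follow the fine-tuning jobs API reference):

openai.FineTuningJob.create(
  training_file="file-abc123",
  model="gpt-3.5-turbo",
  hyperparameters={"n_epochs": 4},  # e.g. increase by 1-2 if underfitting
)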

Fine-tuning examples
Now that we have explored the basics of the fine-tuning API, let’s walk through the fine-tuning lifecycle for a few different use cases.

Style and tone


In this example, we will explore how to build a fine-tuned model that follows specific style and tone guidance beyond what is possible with prompting alone.

To begin, we create a sample set of messages showing the style and tone the model should emulate, which in this case is a sarcastic chatbot persona.

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role":
"user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone
doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role":
"user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy
named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role":
"user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400
kilometers. Give or take a few, like that really matters."}]}

If you want to follow along and create a fine-tuned model yourself, you will need
at least 10 examples.

After getting the data that will potentially improve the model, the next step is to
check if the data meets all the formatting requirements.

Now that we have the data formatted and validated, the final training step is to
kick off a job to create the fine-tuned model. You can do this via the OpenAI CLI
or one of our SDKs as shown below:
openai.File.create(file=open("marv.jsonl", "rb"), purpose='fine-tune')

openai.FineTuningJob.create(training_file="file-abc123", model="gpt-3.5-turbo")

Once the training job is done, you will be able to use your fine-tuned model.

Structured output
Another type of use case that works really well with fine-tuning is getting the model to provide structured information, in this case about sports headlines:

{"messages": [{"role": "system", "content": "Given a sports headline, provide the following fields in a
JSON dict, where applicable: "player" (full name)", "team", "sport", and "gender".},{"role": "user",
"content": "Sources: Colts grant RB Taylor OK to seek trade"},
{"role": "assistant", "content": "{"player": "Jonathan Taylor", "team": "Colts", "sport": "football",
"gender": "male" }"},]}
{"messages": [{"role": "system", "content": "Given a sports headline, provide the following fields in a
JSON dict, where applicable: "player" (full name)", "team", "sport", and "gender".},{"role": "user",
"content": "OSU 'split down middle' on starting QB battle"},
{"role": "assistant", "content": "{"player": null, "team": "OSU", "sport": "football", "gender": null }"},]}

If you want to follow along and create a fine-tuned model yourself, you will need
at least 10 examples.

After getting the data that will potentially improve the model, the next step is to
check if the data meets all the formatting requirements.

Now that we have the data formatted and validated, the final training step is to
kick off a job to create the fine-tuned model. You can do this via the OpenAI CLI
or one of our SDKs as shown below:

openai.File.create(file=open("sports-context.jsonl", "rb"), purpose='fine-tune')

openai.FineTuningJob.create(training_file="file-abc123", model="gpt-3.5-turbo")
Once the training job is done, you will be able to use your fine-tuned model and
make a request that looks like the following:

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

completion = openai.ChatCompletion.create(
  model="ft:gpt-3.5-turbo:my-org:custom_suffix:id",
  messages=[
    {"role": "system", "content": "Given a sports headline, provide the following fields in a JSON dict, where applicable: player (full name), team, sport, and gender"},
    {"role": "user", "content": "Richardson wins 100m at worlds to cap comeback"}
  ]
)

print(completion.choices[0].message)

Based on the formatted training data, the response should look like the
following:

{"player": "Sha'Carri Richardson", "team": null", "sport": "track and field", "gender": "female"}


Migration of legacy models


For users migrating from /v1/fine-tunes to the updated /v1/fine_tuning/jobs API and newer models, the main difference you can expect is the updated API. The legacy prompt-completion pair data format has been retained for the updated babbage-002 and davinci-002 models to ensure a smooth transition. The new models support fine-tuning with a 4k token context and have a knowledge cutoff of September 2021.
For most tasks, you should expect to get better performance from gpt-3.5-
turbo than from the GPT base models.

FAQ

When should I use fine-tuning vs embeddings with retrieval?
Embeddings with retrieval is best suited for cases when you need to have a
large database of documents with relevant context and information.

By default, OpenAI’s models are trained to be helpful generalist assistants. Fine-tuning can be used to make a model which is narrowly focused, and exhibits specific ingrained behavior patterns. Retrieval strategies can be used to make new information available to a model by providing it with relevant context before generating its response. Retrieval strategies are not an alternative to fine-tuning and can in fact be complementary to it.

When can I fine-tune GPT-4 or GPT-3.5-Turbo-16k?


We plan to release support for fine-tuning both of these models later this year.

How do I know if my fine-tuned model is actually better than the base model?
We recommend generating samples from both the base model and the fine-
tuned model on a test set of chat conversations, and comparing the samples
side by side. For more comprehensive evaluations, consider using the OpenAI
evals framework to create an eval specific to your use case.

Can I continue fine-tuning a model that has already been fine-tuned?
No, we do not currently support continuing the fine-tuning process once a job
has finished. We plan to support this in the near future.

How can I estimate the cost of fine-tuning a model?


Please refer to the estimate cost section above.

Does the new fine-tuning endpoint still work with Weights & Biases for tracking metrics?
No, we do not currently support this integration but are working to enable it in
the near future.

How many fine-tuning jobs can I have running at once?


Please refer to our rate limit guide for the most up to date information on the
limits.
