
GPT-4.1 Prompting Guide

Noah MacCallum (OpenAI), Julian Lee (OpenAI)
Apr 14, 2025

The GPT-4.1 family of models represents a significant step forward from GPT-4o in capabilities across coding, instruction following, and long context. In this prompting guide, we collate a series of important prompting tips derived from extensive internal testing to help developers fully leverage the improved abilities of this new model family.

Many typical best practices still apply to GPT-4.1, such as providing context examples, making instructions as specific and clear as possible, and inducing planning via prompting to maximize model intelligence. However, we expect that getting the most out of this model will require some prompt migration. GPT-4.1 is trained to follow instructions more closely and more literally than its predecessors, which tended to more liberally infer intent from user and system prompts. This also means, however, that GPT-4.1 is highly steerable and responsive to well-specified prompts - if model behavior is different from what you expect, a single sentence firmly and unequivocally clarifying your desired behavior is almost always sufficient to steer the model on course.

Please read on for prompt examples you can use as a reference, and remember that while this guidance is widely applicable, no advice is one-size-fits-all. AI engineering is inherently an empirical discipline, and large language models are inherently nondeterministic; in addition to following this guide, we advise building informative evals and iterating often to ensure your prompt engineering changes are yielding benefits for your use case.

1. Agentic Workflows

GPT-4.1 is a great place to build agentic workflows. In model training we emphasized providing a diverse range of agentic problem-solving trajectories, and our agentic harness for the model achieves state-of-the-art performance for non-reasoning models on SWE-bench Verified, solving 55% of problems.

System Prompt Reminders

In order to fully utilize the agentic capabilities of GPT-4.1, we recommend including three key types of reminders in all agent prompts. The following prompts are optimized specifically for the agentic coding workflow, but can be easily modified for general agentic use cases.

1. Persistence: this ensures the model understands it is entering a multi-message turn, and prevents it from prematurely yielding control back to the user. Our example is the following:

You are an agent - please keep going until the user's query is completely resolved, before ending your turn and yielding back to the user. Only terminate your turn when you are sure that the problem is solved.

2. Tool-calling: this encourages the model to make full use of its tools, and reduces its likelihood of hallucinating or guessing an answer. Our example is the following:

If you are not sure about file content or codebase structure pertaining to the user's request, use your tools to read files and gather the relevant information: do NOT guess or make up an answer.

3. Planning [optional]: if desired, this ensures the model explicitly plans and reflects upon each tool call in text, instead of completing the task by chaining together a series of only tool calls. Our example is the following:

You MUST plan extensively before each function call, and reflect extensively on the outcomes of the previous function calls. DO NOT do this entire process by making function calls only, as this can impair your ability to solve the problem and think insightfully.
GPT-4.1 is trained to respond very closely to both user instructions and system prompts in the agentic setting. The model adhered closely to these three simple instructions and increased our internal SWE-bench Verified score by close to 20% - so we highly encourage starting any agent prompt with clear reminders covering the three categories listed above. As a whole, we find that these three instructions transform the model from a chatbot-like state into a much more "eager" agent, driving the interaction forward autonomously and independently.

Tool Calls

Compared to previous models, GPT-4.1 has undergone more training on effectively utilizing tools passed as arguments in an OpenAI API request. We encourage developers to exclusively use the tools field to pass tools, rather than manually injecting tool descriptions into your prompt and writing a separate parser for tool calls, as some have reported doing in the past. This is the best way to minimize errors and ensure the model remains in distribution during tool-calling trajectories - in our own experiments, we observed a 2% increase in SWE-bench Verified pass rate when using API-parsed tool descriptions versus manually injecting the schemas into the system prompt.

Developers should name tools clearly to indicate their purpose and add a clear, detailed description in the "description" field of the tool. Similarly, for each tool param, lean on good naming and descriptions to ensure appropriate usage. If your tool is particularly complicated and you'd like to provide examples of tool usage, we recommend that you create an # Examples section in your system prompt and place the examples there, rather than adding them into the "description" field, which should remain thorough but relatively concise. Providing examples can be helpful to indicate when to use tools, whether to include user text alongside tool calls, and what parameters are appropriate for different inputs. Remember that you can use "Generate Anything" in the Prompt Playground to get a good starting point for your new tool definitions.
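To make this concrete, here is a minimal sketch of passing a tool through the tools field of a Responses API request. The get_weather tool, its parameters, and the user message are placeholder examples of ours, not part of the original guide:

from openai import OpenAI

client = OpenAI()

# Hypothetical tool: clear name, detailed description, and well-described parameters.
get_weather = {
    "type": "function",
    "name": "get_weather",
    "description": "Look up current weather conditions for a city. Use this whenever the user asks about the weather.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {
                "type": "string",
                "description": "City name, e.g. 'Paris'.",
            }
        },
        "required": ["city"],
        "additionalProperties": False,
    },
}

# Pass the schema via the tools field rather than pasting it into the prompt text.
response = client.responses.create(
    model="gpt-4.1-2025-04-14",
    instructions="You are a helpful assistant. If you are unsure, use your tools rather than guessing.",
    tools=[get_weather],
    input="What's the weather like in Paris right now?",
)
print(response.output)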
Prompting-Induced Planning & Chain-of-Thought

As mentioned already, developers can optionally prompt agents built with GPT-4.1 to plan and reflect between tool calls, instead of silently calling tools in an unbroken sequence. GPT-4.1 is not a reasoning model - meaning that it does not produce an internal chain of thought before answering - but in the prompt, a developer can induce the model to produce an explicit, step-by-step plan by using any variant of the Planning prompt component shown above. This can be thought of as the model "thinking out loud." In our experimentation with the SWE-bench Verified agentic task, inducing explicit planning increased the pass rate by 4%.

Sample Prompt: SWE-bench Verified

Below, we share the agentic prompt that we used to achieve our highest score on SWE-bench Verified, which features detailed instructions about workflow and problem-solving strategy. This general pattern can be used for any agentic task.

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ.get(
        "OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"
    )
)

SYS_PROMPT_SWEBENCH = """
You will be tasked to fix an issue from an open-source repository.

Your thinking should be thorough and so it's fine if it's very long. You can think step by step before and after each action you decide to take.

You MUST iterate and keep going until the problem is solved.

You already have everything you need to solve this problem in the /testbed folder, even without internet connection. I want you to fully solve this autonomously before coming back to me.

Only terminate your turn when you are sure that the problem is solved. Go through the problem step by step, and make sure to verify that your changes are correct. NEVER end your turn without having solved the problem.

THE PROBLEM CAN DEFINITELY BE SOLVED WITHOUT THE INTERNET.

Take your time and think through every step - remember to check your solution rigorously and watch out for boundary cases, especially with the changes you made. Your solution must be perfect. If not, continue working on it. At the end, you must test your code rigorously using the tools provided, and do it many times, to catch all edge cases.

You MUST plan extensively before each function call, and reflect extensively on the outcomes of the previous function calls. DO NOT do this entire process by making function calls only, as this can impair your ability to solve the problem and think insightfully.

# Workflow

## High-Level Problem Solving Strategy

1. Understand the problem deeply. Carefully read the issue and think critically about what is required.
2. Investigate the codebase. Explore relevant files, search for key functions, and gather context.
3. Develop a clear, step-by-step plan. Break down the fix into manageable, incremental steps.
4. Implement the fix incrementally. Make small, testable code changes.
5. Debug as needed. Use debugging techniques to isolate and resolve issues.
6. Test frequently. Run tests after each change to verify correctness.
7. Iterate until the root cause is fixed and all tests pass.
8. Reflect and validate comprehensively. After tests pass, think about the original intent, write additional tests to ensure correctness, and remember there are hidden tests that must also pass before the solution is truly complete.

Refer to the detailed sections below for more information on each step.

## 1. Deeply Understand the Problem
Carefully read the issue and think hard about a plan to solve it before coding.

## 2. Codebase Investigation
- Explore relevant files and directories.
- Search for key functions, classes, or variables related to the issue.
- Read and understand relevant code snippets.
- Identify the root cause of the problem.
- Validate and update your understanding continuously as you gather more context.

## 3. Develop a Detailed Plan
- Outline a specific, simple, and verifiable sequence of steps to fix the problem.
- Break down the fix into small, incremental changes.

## 4. Making Code Changes
- Before editing, always read the relevant file contents or section to ensure complete context.
- If a patch is not applied correctly, attempt to reapply it.
- Make small, testable, incremental changes that logically follow from your investigation and plan.

## 5. Debugging
- Make code changes only if you have high confidence they can solve the problem
- When debugging, try to determine the root cause rather than addressing symptoms
- Debug for as long as needed to identify the root cause and identify a fix
- Use print statements, logs, or temporary code to inspect program state, including descriptive statements or error messages to understand what's happening
- To test hypotheses, you can also add test statements or functions
- Revisit your assumptions if unexpected behavior occurs.

## 6. Testing
- Run tests frequently using `!python3 run_tests.py` (or equivalent).
- After each change, verify correctness by running relevant tests.
- If tests fail, analyze failures and revise your patch.
- Write additional tests if needed to capture important behaviors or edge cases.
- Ensure all tests pass before finalizing.

## 7. Final Verification
- Confirm the root cause is fixed.
- Review your solution for logic correctness and robustness.
- Iterate until you are extremely confident the fix is complete and all tests pass.

## 8. Final Reflection and Additional Testing
- Reflect carefully on the original intent of the user and the problem statement.
- Think about potential edge cases or scenarios that may not be covered by existing tests.
- Write additional tests that would need to pass to fully validate the correctness of your solution.
- Run these new tests and ensure they all pass.
- Be aware that there are additional hidden tests that must also pass for the solution to be successful.
- Do not assume the task is complete just because the visible tests pass; continue refining until you are extremely confident.
"""
- [old_code]
PYTHON_TOOL_DESCRIPTION = """This function is used to execute Python code or terminal commands in a stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 60.0 seconds. Internet access for this session is disabled. Do not make external web requests or API calls as they will fail.

In addition, for the purposes of this task, you can call this function with an `apply_patch` command as input. `apply_patch` effectively allows you to execute a diff/patch against a file, but the format of the diff specification is unique to this task, so pay careful attention to these instructions. To use the `apply_patch` command, you should pass a message of the following structure as "input":

%%bash
apply_patch <<"EOF"
*** Begin Patch
[YOUR_PATCH]
*** End Patch
EOF

Where [YOUR_PATCH] is the actual content of your patch, specified in the following V4A diff format.

*** [ACTION] File: [path/to/file] -> ACTION can be one of Add, Update, or Delete.
For each snippet of code that needs to be changed, repeat the following:
[context_before] -> See below for further instructions on context.
- [old_code] -> Precede the old code with a minus sign.
+ [new_code] -> Precede the new, replacement code with a plus sign.
[context_after] -> See below for further instructions on context.

For instructions on [context_before] and [context_after]:
- By default, show 3 lines of code immediately above and 3 lines immediately below each change. If a change is within 3 lines of a previous change, do NOT duplicate the first change's [context_after] lines in the second change's [context_before] lines.
- If 3 lines of context is insufficient to uniquely identify the snippet of code within the file, use the @@ operator to indicate the class or function to which the snippet belongs. For instance, we might have:
@@ class BaseClass
[3 lines of pre-context]
- [old_code]
+ [new_code]
[3 lines of post-context]

- If a code block is repeated so many times in a class or function such that even a single @@ statement is not sufficient to uniquely identify the snippet, you can use multiple `@@` statements to jump to the right context. For instance:

@@ class BaseClass
@@ def method():
[3 lines of pre-context]
- [old_code]
+ [new_code]
[3 lines of post-context]

Note, then, that we do not use line numbers in this diff format, as the context is enough to uniquely identify code. An example of a message that you might pass as "input" to this function, in order to apply a patch, is shown below.

%%bash
apply_patch <<"EOF"
*** Begin Patch
*** Update File: pygorithm/searching/binary_search.py
@@ class BaseClass
@@     def search():
-        pass
+        raise NotImplementedError()

@@ class Subclass
@@     def search():
-        pass
+        raise NotImplementedError()

*** End Patch
EOF

File references can only be relative, NEVER ABSOLUTE. After the apply_patch command is run, python will always say "Done!", regardless of whether the patch was successfully applied or not. However, you can determine if there are issues and errors by looking at any warnings or logging lines printed within the output of the tool call.
"""
[3 lines of pre-context]
python_bash_patch_tool = {
    "type": "function",
    "name": "python",
    "description": PYTHON_TOOL_DESCRIPTION,
    "parameters": {
        "type": "object",
        "strict": True,
        "properties": {
            "input": {
                "type": "string",
                "description": " The Python code, terminal command (prefaced by exclamation mark), or apply_patch command that you wish to execute.",
            }
        },
        "required": ["input"],
    },
}

# Additional harness setup:
# - Add your repo to /testbed
# - Add your issue to the first user message
# - Note: Even though we used a single tool for python, bash, and apply_patch, we generally recommend defining more granular tools that each focus on a single function

response = client.responses.create(
    instructions=SYS_PROMPT_SWEBENCH,
    model="gpt-4.1-2025-04-14",
    tools=[python_bash_patch_tool],
    input=f"Please answer the following question:\nBug: Typerror..."
)

response.to_dict()["output"]

[{'id': 'msg_67fe92df26ac819182ffafce9ff4e4fc07c7e06242e51f8b',
  'content': [{'annotations': [],
    'text': "Thank you for the report, but “Typerror” is too vague for me to start debugging right
    'type': 'output_text'}],
  'role': 'assistant',
  'status': 'completed',
  'type': 'message'},
 {'arguments': '{"input":"!ls -l /testbed"}',
  'call_id': 'call_frnxyJgKi5TsBem0nR9Zuzdw',
  'name': 'python',
  'type': 'function_call',
  'id': 'fc_67fe92e3da7081918fc18d5c96dddc1c07c7e06242e51f8b',
  'status': 'completed'}]
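For context, an agentic harness would then execute each returned function call and send the result back as a function_call_output item before asking the model to continue. The loop below is a rough sketch of ours under that assumption, not part of the original sample; run_python_in_testbed is a hypothetical executor you would have to implement:

import json

def run_python_in_testbed(tool_input: str) -> str:
    """Hypothetical executor: run the model's code/command against the /testbed environment."""
    raise NotImplementedError

def run_agent(task: str, max_turns: int = 20):
    response = client.responses.create(
        instructions=SYS_PROMPT_SWEBENCH,
        model="gpt-4.1-2025-04-14",
        tools=[python_bash_patch_tool],
        input=task,
    )
    for _ in range(max_turns):
        calls = [item for item in response.output if item.type == "function_call"]
        if not calls:
            return response  # the model ended its turn without requesting a tool
        tool_outputs = []
        for call in calls:
            result = run_python_in_testbed(json.loads(call.arguments)["input"])
            tool_outputs.append(
                {"type": "function_call_output", "call_id": call.call_id, "output": result}
            )
        # Feed tool results back and continue the same conversation.
        response = client.responses.create(
            instructions=SYS_PROMPT_SWEBENCH,
            model="gpt-4.1-2025-04-14",
            tools=[python_bash_patch_tool],
            previous_response_id=response.id,
            input=tool_outputs,
        )
    return response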

2. Long context

GPT-4.1 has a performant 1M token input context window, and is useful for a variety of long context tasks, including structured document parsing, re-ranking, selecting relevant information while ignoring irrelevant context, and performing multi-hop reasoning using context.

Optimal Context Size

We observe very good performance on needle-in-a-haystack evaluations up to our full 1M token context, and we've observed very strong performance at complex tasks with a mix of both relevant and irrelevant code and other documents. However, long context performance can degrade as more items are required to be retrieved, or when performing complex reasoning that requires knowledge of the state of the entire context (like performing a graph search, for example).

Tuning Context Reliance

Consider the mix of external vs. internal world knowledge that might be required to answer your question. Sometimes it's important for the model to use some of its own knowledge to connect concepts or make logical jumps, while in others it's desirable to only use provided context.

# Instructions
// for internal knowledge
- Only use the documents in the provided External Context to answer the User Query. If you don't know the answer based on this context, you must respond "I don't have the information needed to answer that", even if a user insists on you answering the question.
// For internal and external knowledge
- By default, use the provided external context to answer the User Query, but if other basic knowledge is needed to answer, and you're confident in the answer, you can use some of your own knowledge to help answer the question.

Prompt Organization

Especially in long context usage, placement of instructions and context can impact performance. If you have long context in your prompt, ideally place your instructions at both the beginning and end of the provided context, as we found this to perform better than only above or below. If you'd prefer to only have your instructions once, then above the provided context works better than below.
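As a small illustrative sketch (the instruction text, function name, and documents argument are placeholders of ours, not from the guide), the same instruction block can be repeated on both sides of the long context:

INSTRUCTIONS = (
    "# Instructions\n"
    "- Answer only from the documents in the provided External Context.\n"
    "- Cite the ID of every document you rely on.\n"
)

def build_long_context_input(documents: list[str]) -> str:
    # Repeat the instructions before AND after the context, per the guidance above.
    context = "\n".join(documents)
    return f"{INSTRUCTIONS}\n# External Context\n{context}\n\n{INSTRUCTIONS}"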
3. Chain of Thought

As mentioned above, GPT-4.1 is not a reasoning model, but prompting the model to think step by step (called "chain of thought") can be an effective way for a model to break down problems into more manageable pieces, solve them, and improve overall output quality, with the tradeoff of higher cost and latency associated with using more output tokens. The model has been trained to perform well at agentic reasoning about and real-world problem solving, so it shouldn't require much prompting to perform well.

We recommend starting with this basic chain-of-thought instruction at the end of your prompt:

...

First, think carefully step by step about what documents are needed to answer the query. Then, print out the TITLE and ID of each document. Then, format the IDs into a list.

From there, you should improve your chain-of-thought (CoT) prompt by auditing failures in your particular examples and evals, and addressing systematic planning and reasoning errors with more explicit instructions. In the unconstrained CoT prompt, there may be variance in the strategies it tries, and if you observe an approach that works well, you can codify that strategy in your prompt. Generally speaking, errors tend to occur from misunderstanding user intent, insufficient context gathering or analysis, or insufficient or incorrect step by step thinking, so watch out for these and try to address them with more opinionated instructions.

Here is an example prompt instructing the model to focus more methodically on analyzing user intent and considering relevant context before proceeding to answer.

# Reasoning Strategy
1. Query Analysis: Break down and analyze the query until you're confident about what it might be asking. Consider the provided context to help clarify any ambiguous or confusing information.
2. Context Analysis: Carefully select and analyze a large set of potentially relevant documents. Optimize for recall - it's fine if some are irrelevant, but the correct documents must be in this list, otherwise your final answer will be wrong. Analysis steps for each document:
	a. Analysis: An analysis of how it may or may not be relevant to answering the query.
	b. Relevance rating: [high, medium, low, none]
3. Synthesis: summarize which documents are most relevant and why, including all documents with a relevance rating of medium or higher.

# User Question
{user_question}

# External Context
{external_context}

First, think carefully step by step about what documents are needed to answer the query, closely adhering to the provided Reasoning Strategy. Then, print out the TITLE and ID of each document. Then, format the IDs into a list.
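As a minimal sketch of wiring this up (the constant name, sample question, and document line are illustrative assumptions, not from the guide), the {user_question} and {external_context} placeholders can be filled with str.format before the prompt is sent:

# REASONING_STRATEGY_PROMPT holds the full prompt shown above, verbatim.
REASONING_STRATEGY_PROMPT = (
    "# Reasoning Strategy\n"
    "...\n"  # the numbered strategy steps shown above
    "# User Question\n{user_question}\n\n"
    "# External Context\n{external_context}\n\n"
    "First, think carefully step by step about what documents are needed to answer the query, "
    "closely adhering to the provided Reasoning Strategy."
)

developer_prompt = REASONING_STRATEGY_PROMPT.format(
    user_question="Which plan includes international roaming?",       # illustrative only
    external_context="ID: 1 | TITLE: Roaming Policy | CONTENT: ...",  # your formatted documents
)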
4. Instruction Following

GPT-4.1 exhibits outstanding instruction-following performance, which developers can leverage to precisely shape and control the outputs for their particular use cases. Developers often extensively prompt for agentic reasoning steps, response tone and voice, tool calling information, output formatting, topics to avoid, and more. However, since the model follows instructions more literally, developers may need to include explicit specification around what to do or not to do. Furthermore, existing prompts optimized for other models may not immediately work with this model, because existing instructions are followed more closely and implicit rules are no longer being as strongly inferred.

Recommended Workflow

Here is our recommended workflow for developing and debugging instructions in prompts:

1. Start with an overall "Response Rules" or "Instructions" section with high-level guidance and bullet points.
2. If you'd like to change a more specific behavior, add a section to specify more details for that category, like # Sample Phrases.
3. If there are specific steps you'd like the model to follow in its workflow, add an ordered list and instruct the model to follow these steps.
4. If behavior still isn't working as expected:
   1. Check for conflicting, underspecified, or wrong instructions and examples. If there are conflicting instructions, GPT-4.1 tends to follow the one closer to the end of the prompt.
   2. Add examples that demonstrate desired behavior; ensure that any important behavior demonstrated in your examples is also cited in your rules.
   3. It's generally not necessary to use all-caps or other incentives like bribes or tips. We recommend starting without these, and only reaching for them if necessary for your particular prompt. Note that if your existing prompts include these techniques, it could cause GPT-4.1 to pay attention to them too strictly.

Note that using your preferred AI-powered IDE can be very helpful for iterating on prompts, including checking for consistency or conflicts, adding examples, or making cohesive updates like adding an instruction and updating other instructions to demonstrate that instruction.

Common Failure Modes

These failure modes are not unique to GPT-4.1, but we share them here for general awareness and ease of debugging.

- Instructing a model to always follow a specific behavior can occasionally induce adverse effects. For instance, if told "you must call a tool before responding to the user," models may hallucinate tool inputs or call the tool with null values if they do not have enough information. Adding "if you don't have enough information to call the tool, ask the user for the information you need" should mitigate this.
- When provided sample phrases, models can use those quotes verbatim and start to sound repetitive to users. Ensure you instruct the model to vary them as necessary.
- Without specific instructions, some models can be eager to provide additional prose to explain their decisions, or output more formatting in responses than may be desired. Provide instructions and potentially examples to help mitigate.

Example Prompt: Customer Service

This demonstrates best practices for a fictional customer service agent. Observe the diversity of rules, the specificity, the use of additional sections for greater detail, and an example to demonstrate precise behavior that incorporates all prior rules.

Try running the following notebook cell - you should see both a user message and tool call, and the user message should start with a greeting, then echo back their answer, then mention they're about to call a tool. Try changing the instructions to shape the model behavior, or trying other user messages, to test instruction following performance.

SYS_PROMPT_CUSTOMER_SERVICE = """You are a helpful customer service agent working for NewTelco, helping a user efficiently fulfill their request while adhering closely to provided guidelines.

# Instructions
- Always greet the user with "Hi, you've reached NewTelco, how can I help you?"
- Only provide information about this company, its policies, its products, or the customer's account, and only if it is based on information provided in context. Do not answer questions outside this scope.
- Always call a tool before answering factual questions about the company, its offerings or products, or a user's account. Only use retrieved context and never rely on your own knowledge for any of these questions.
    - However, if you don't have enough information to properly call the tool, ask the user for the information you need.
- Escalate to a human if the user requests.
- Do not discuss prohibited topics (politics, religion, controversial current events, medical, legal, or financial advice, personal conversations, internal company operations, and criticism of any people or company).
- Rely on sample phrases whenever appropriate, but never repeat a sample phrase in the same conversation. Feel free to vary the sample phrases to avoid sounding repetitive and make them more appropriate for the user.
- Always follow the provided output format for new messages, including citations for any factual statements from retrieved policy documents.
- If you're going to call a tool, always message the user with an appropriate message before and after calling the tool.
- Maintain a professional and concise tone in all responses, and use emojis between sentences.
- If you've resolved the user's request, ask if there's anything else you can help with.

# Precise Response Steps (for each response)
1. If necessary, call tools to fulfill the user's desired action. Always message the user before and after calling a tool to remain transparent.
2. In your response to the user
    a. Use active listening and echo back what you heard the user ask for.
    b. Respond appropriately given the above guidelines.

# Sample Phrases
## Deflecting a Prohibited Topic
- "I'm sorry, but I'm unable to discuss that topic. Is there something else I can help you with?"
- "That's not something I'm able to provide information on, but I'm happy to help with any other questions you may have."

## Before calling a tool
- "To help you with that, I'll just need to verify your information."
- "Let me check that for you—one moment, please."
- "I'll retrieve the latest details for you now."

## After calling a tool
- "Okay, here's what I found: [response]"
- "So here's what I found: [response]"

# Output Format
- Always include your final response to the user.
- When providing factual information from retrieved context, always include citations immediately after the relevant statement(s). Use the following citation format:
    - For a single source: [NAME](ID)
    - For multiple sources: [NAME](ID), [NAME](ID)

# Example
## User
Can you tell me about your family plan options?

## Assistant Response 1
### Message
"Hi, you've reached NewTelco, how can I help you? 😊 🎉 \n\nYou'd like to know about our family plan options. 🤝 Let me check that for you—one moment, please. 🚀"

### Tool Calls
lookup_policy_document(topic="family plan options")

// After tool call, the assistant would follow up with:

## Assistant Response 2 (after tool call)
### Message
"Okay, here's what I found: 🎉 Our family plan allows up to 5 lines with shared data and a 10% discount for each additional line [Family Plan Policy](ID-010). 📱 Is there anything else I can help you with?"
"""

get_policy_doc = {
    "type": "function",
    "name": "lookup_policy_document",
    "description": "Tool to look up internal documents and policies by topic or keyword.",
    "parameters": {
        "strict": True,
        "type": "object",
        "properties": {
            "topic": {
                "type": "string",
                "description": "The topic or keyword to search for in company policies or documents.",
            },
        },
        "required": ["topic"],
        "additionalProperties": False,
    },
}

get_user_acct = {
    "type": "function",
    "name": "get_user_account_info",
    "description": "Tool to get user account information",
    "parameters": {
        "strict": True,
        "type": "object",
        "properties": {
            "phone_number": {
                "type": "string",
                "description": "Formatted as '(xxx) xxx-xxxx'",
            },
        },
        "required": ["phone_number"],
        "additionalProperties": False,
    },
}

response = client.responses.create(
    instructions=SYS_PROMPT_CUSTOMER_SERVICE,
    model="gpt-4.1-2025-04-14",
    tools=[get_policy_doc, get_user_acct],
    input="How much will it cost for international service? I'm traveling to France.",
    # input="Why was my last bill so high?"
)

response.to_dict()["output"]

[{'id': 'msg_67fe92d431548191b7ca6cd604b4784b06efc5beb16b3c5e',
  'content': [{'annotations': [],
    'text': "Hi, you've reached NewTelco, how can I help you? 🌍 ✈️ \n\nYou'd like to know the cost
    'type': 'output_text'}],
  'role': 'assistant',
  'status': 'completed',
  'type': 'message'},
 {'arguments': '{"topic":"international service cost France"}',
  'call_id': 'call_cF63DLeyhNhwfdyME3ZHd0yo',
  'name': 'lookup_policy_document',
  'type': 'function_call',
  'id': 'fc_67fe92d5d6888191b6cd7cf57f707e4606efc5beb16b3c5e',
  'status': 'completed'}]
5. General Advice

Prompt Structure

For reference, here is a good starting point for structuring your prompts.

# Role and Objective

# Instructions

## Sub-categories for more detailed instructions

# Reasoning Steps

# Output Format

# Examples
## Example 1

# Context

# Final instructions and prompt to think step by step

Add or remove sections to suit your needs, and experiment to determine what's optimal for your usage.

Delimiters

Here are some general guidelines for selecting the best delimiters for your prompt. Please refer to the Long Context section for special considerations for that context type.

1. Markdown: We recommend starting here, and using markdown titles for major sections and subsections (including deeper hierarchy, to H4+). Use inline backticks or backtick blocks to precisely wrap code, and standard numbered or bulleted lists as needed.
2. XML: These also perform well, and we have improved adherence to information in XML with this model. XML is convenient to precisely wrap a section including start and end, add metadata to the tags for additional context, and enable nesting. Here is an example of using XML tags to nest examples in an example section, with inputs and outputs for each:

<examples>
<example1 type="Abbreviate">
<input>San Francisco</input>
<output>- SF</output>
</example1>
</examples>

3. JSON is highly structured and well understood by the model, particularly in coding contexts. However, it can be more verbose, and require character escaping that can add overhead.

Guidance specifically for adding a large number of documents or files to input context:

- XML performed well in our long context testing.
  Example: <doc id=1 title="The Fox">The quick brown fox jumps over the lazy dog</doc>
- This format, proposed by Lee et al. (ref), also performed well in our long context testing.
  Example: ID: 1 | TITLE: The Fox | CONTENT: The quick brown fox jumps over the lazy dog
- JSON performed particularly poorly.
  Example: [{"id": 1, "title": "The Fox", "content": "The quick brown fox jumped over the lazy dog"}]

The model is trained to robustly understand structure in a variety of formats. Generally, use your judgement and think about what will provide clear information and "stand out" to the model. For example, if you're retrieving documents that contain lots of XML, an XML-based delimiter will likely be less effective.
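As an illustrative helper (the function and field names are ours, not from the guide), documents can be rendered into the pipe-delimited format above before being added to input context:

def format_docs_for_context(docs: list[dict]) -> str:
    """Render documents as 'ID: ... | TITLE: ... | CONTENT: ...' lines for long-context input."""
    return "\n".join(
        f"ID: {doc['id']} | TITLE: {doc['title']} | CONTENT: {doc['content']}"
        for doc in docs
    )

# Example:
# format_docs_for_context([{"id": 1, "title": "The Fox", "content": "The quick brown fox jumps over the lazy dog"}])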
Caveats

- In some isolated cases we have observed the model being resistant to producing very long, repetitive outputs, for example, analyzing hundreds of items one by one. If this is necessary for your use case, instruct the model strongly to output this information in full, and consider breaking down the problem or using a more concise approach.
- We have seen some rare instances of parallel tool calls being incorrect. We advise testing this, and considering setting the parallel_tool_calls param to false if you're seeing issues.
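For example, here is a sketch of disabling parallel tool calls on a Responses API request while you investigate; the model snapshot, tools, and input are placeholders reused from the earlier customer service example:

response = client.responses.create(
    model="gpt-4.1-2025-04-14",
    instructions=SYS_PROMPT_CUSTOMER_SERVICE,
    tools=[get_policy_doc, get_user_acct],
    parallel_tool_calls=False,  # allow at most one tool call per model turn
    input="How much will it cost for international service? I'm traveling to France.",
)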

Appendix: Generating and Applying File Diffs

Developers have provided us feedback that accurate and well-formed diff generation is a critical capability to power coding-related tasks. To this end, the GPT-4.1 family features substantially improved diff capabilities relative to previous GPT models. Moreover, while GPT-4.1 has strong performance generating diffs of any format given clear instructions and examples, we open-source here one recommended diff format, on which the model has been extensively trained. We hope that, in particular for developers just starting out, this will take much of the guesswork out of creating diffs yourself.

Apply Patch

See the example below for a prompt that applies our recommended tool call correctly.

APPLY_PATCH_TOOL_DESC = """This is a custom utility that makes it more convenient to add, remove, move, or edit code files. `apply_patch` effectively allows you to execute a diff/patch against a file, but the format of the diff specification is unique to this task, so pay careful attention to these instructions. To use the `apply_patch` command, you should pass a message of the following structure as "input":

%%bash
apply_patch <<"EOF"
*** Begin Patch
[YOUR_PATCH]
*** End Patch
EOF

Where [YOUR_PATCH] is the actual content of your patch, specified in the following V4A diff format.

*** [ACTION] File: [path/to/file] -> ACTION can be one of Add, Update, or Delete.
For each snippet of code that needs to be changed, repeat the following:
[context_before] -> See below for further instructions on context.
- [old_code] -> Precede the old code with a minus sign.
+ [new_code] -> Precede the new, replacement code with a plus sign.
[context_after] -> See below for further instructions on context.

For instructions on [context_before] and [context_after]:
- By default, show 3 lines of code immediately above and 3 lines immediately below each change. If a change is within 3 lines of a previous change, do NOT duplicate the first change's [context_after] lines in the second change's [context_before] lines.
- If 3 lines of context is insufficient to uniquely identify the snippet of code within the file, use the @@ operator to indicate the class or function to which the snippet belongs. For instance, we might have:
@@ class BaseClass
[3 lines of pre-context]
- [old_code]
+ [new_code]
[3 lines of post-context]

- If a code block is repeated so many times in a class or function such that even a single @@ statement is not sufficient to uniquely identify the snippet, you can use multiple `@@` statements to jump to the right context. For instance:

@@ class BaseClass
@@ def method():
[3 lines of pre-context]
- [old_code]
+ [new_code]
[3 lines of post-context]

Note, then, that we do not use line numbers in this diff format, as the context is enough to uniquely identify code. An example of a message that you might pass as "input" to this function, in order to apply a patch, is shown below.

%%bash
apply_patch <<"EOF"
*** Begin Patch
*** Update File: pygorithm/searching/binary_search.py
@@ class BaseClass
@@     def search():
-        pass
+        raise NotImplementedError()

@@ class Subclass
@@     def search():
-        pass
+        raise NotImplementedError()

*** End Patch
EOF

File references can only be relative, NEVER ABSOLUTE. After the apply_patch command is run, python will always say "Done!", regardless of whether the patch was successfully applied or not. However, you can determine if there are issues and errors by looking at any warnings or logging lines printed within the output of the tool call.
"""

APPLY_PATCH_TOOL = {
    "name": "apply_patch",
    "description": APPLY_PATCH_TOOL_DESC,
    "parameters": {
        "type": "object",
        "properties": {
            "input": {
                "type": "string",
                "description": " The apply_patch command that you wish to execute.",
            }
        },
        "required": ["input"],
    },
}
Reference Implementation: apply_patch.py

Here's a reference implementation of the apply_patch tool that we used as part of model training. You'll need to make this an executable and available as `apply_patch` from the shell where the model will execute commands:
#!/usr/bin/env python3

"""
A self-contained **pure-Python 3.9+** utility for applying human-readable
“pseudo-diff” patch files to a collection of text files.
"""

from __future__ import annotations

import pathlib
from dataclasses import dataclass, field
from enum import Enum
from typing import (
    Callable,
    Dict,
    List,
    Optional,
    Tuple,
    Union,
)


# --------------------------------------------------------------------------- #
#  Domain objects
# --------------------------------------------------------------------------- #
class ActionType(str, Enum):
    ADD = "add"
    DELETE = "delete"
    UPDATE = "update"


@dataclass
class FileChange:
    type: ActionType
    old_content: Optional[str] = None
    new_content: Optional[str] = None
    move_path: Optional[str] = None


@dataclass
class Commit:
    changes: Dict[str, FileChange] = field(default_factory=dict)


# --------------------------------------------------------------------------- #
#  Exceptions
# --------------------------------------------------------------------------- #
class DiffError(ValueError):
    """Any problem detected while parsing or applying a patch."""


# --------------------------------------------------------------------------- #
#  Helper dataclasses used while parsing patches
# --------------------------------------------------------------------------- #
@dataclass
class Chunk:
    orig_index: int = -1
    del_lines: List[str] = field(default_factory=list)
    ins_lines: List[str] = field(default_factory=list)


@dataclass
class PatchAction:
    type: ActionType
    new_file: Optional[str] = None
    chunks: List[Chunk] = field(default_factory=list)
    move_path: Optional[str] = None


@dataclass
class Patch:
    actions: Dict[str, PatchAction] = field(default_factory=dict)


# --------------------------------------------------------------------------- #
#  Patch text parser
# --------------------------------------------------------------------------- #
@dataclass
class Parser:
    current_files: Dict[str, str]
    lines: List[str]
    index: int = 0
    patch: Patch = field(default_factory=Patch)
    fuzz: int = 0

    # ------------- low-level helpers -------------------------------------- #
    def _cur_line(self) -> str:
        if self.index >= len(self.lines):
            raise DiffError("Unexpected end of input while parsing patch")
        return self.lines[self.index]

    @staticmethod
    def _norm(line: str) -> str:
        """Strip CR so comparisons work for both LF and CRLF input."""
        return line.rstrip("\r")

    # ------------- scanning convenience ----------------------------------- #
    def is_done(self, prefixes: Optional[Tuple[str, ...]] = None) -> bool:
        if self.index >= len(self.lines):
            return True
        if (
            prefixes
            and len(prefixes) > 0
            and self._norm(self._cur_line()).startswith(prefixes)
        ):
            return True
        return False

    def startswith(self, prefix: Union[str, Tuple[str, ...]]) -> bool:
        return self._norm(self._cur_line()).startswith(prefix)

    def read_str(self, prefix: str) -> str:
        """
        Consume the current line if it starts with *prefix* and return the text
        **after** the prefix.  Raises if prefix is empty.
        """
        if prefix == "":
            raise ValueError("read_str() requires a non-empty prefix")
        if self._norm(self._cur_line()).startswith(prefix):
            text = self._cur_line()[len(prefix) :]
            self.index += 1
            return text
        return ""

    def read_line(self) -> str:
        """Return the current raw line and advance."""
        line = self._cur_line()
        self.index += 1
        return line

    # ------------- public entry point -------------------------------------- #
    def parse(self) -> None:
        while not self.is_done(("*** End Patch",)):
            # ---------- UPDATE ---------- #
            path = self.read_str("*** Update File: ")
            if path:
                if path in self.patch.actions:
                    raise DiffError(f"Duplicate update for file: {path}")
                move_to = self.read_str("*** Move to: ")
                if path not in self.current_files:
                    raise DiffError(f"Update File Error - missing file: {path}")
                text = self.current_files[path]
                action = self._parse_update_file(text)
                action.move_path = move_to or None
                self.patch.actions[path] = action
                continue

            # ---------- DELETE ---------- #
            path = self.read_str("*** Delete File: ")
            if path:
                if path in self.patch.actions:
                    raise DiffError(f"Duplicate delete for file: {path}")
                if path not in self.current_files:
                    raise DiffError(f"Delete File Error - missing file: {path}")
                self.patch.actions[path] = PatchAction(type=ActionType.DELETE)
                continue

            # ---------- ADD ---------- #
            path = self.read_str("*** Add File: ")
            if path:
                if path in self.patch.actions:
                    raise DiffError(f"Duplicate add for file: {path}")
                if path in self.current_files:
                    raise DiffError(f"Add File Error - file already exists: {path}")
                self.patch.actions[path] = self._parse_add_file()
                continue

            raise DiffError(f"Unknown line while parsing: {self._cur_line()}")

        if not self.startswith("*** End Patch"):
            raise DiffError("Missing *** End Patch sentinel")
        self.index += 1  # consume sentinel

    # ------------- section parsers ---------------------------------------- #
    def _parse_update_file(self, text: str) -> PatchAction:
        action = PatchAction(type=ActionType.UPDATE)
        lines = text.split("\n")
        index = 0
        while not self.is_done(
            (
                "*** End Patch",
                "*** Update File:",
                "*** Delete File:",
                "*** Add File:",
                "*** End of File",
            )
        ):
            def_str = self.read_str("@@ ")
            section_str = ""
            if not def_str and self._norm(self._cur_line()) == "@@":
                section_str = self.read_line()

            if not (def_str or section_str or index == 0):
                raise DiffError(f"Invalid line in update section:\n{self._cur_line()}")

            if def_str.strip():
                found = False
                if def_str not in lines[:index]:
                    for i, s in enumerate(lines[index:], index):
                        if s == def_str:
                            index = i + 1
                            found = True
                            break
                if not found and def_str.strip() not in [
                    s.strip() for s in lines[:index]
                ]:
                    for i, s in enumerate(lines[index:], index):
                        if s.strip() == def_str.strip():
                            index = i + 1
                            self.fuzz += 1
                            found = True
                            break

            next_ctx, chunks, end_idx, eof = peek_next_section(self.lines, self.index)
            new_index, fuzz = find_context(lines, next_ctx, index, eof)
            if new_index == -1:
                ctx_txt = "\n".join(next_ctx)
                raise DiffError(
                    f"Invalid {'EOF ' if eof else ''}context at {index}:\n{ctx_txt}"
                )
            self.fuzz += fuzz
            for ch in chunks:
                ch.orig_index += new_index
                action.chunks.append(ch)
            index = new_index + len(next_ctx)
            self.index = end_idx
        return action

    def _parse_add_file(self) -> PatchAction:
        lines: List[str] = []
        while not self.is_done(
            ("*** End Patch", "*** Update File:", "*** Delete File:", "*** Add File:")
        ):
            s = self.read_line()
            if not s.startswith("+"):
                raise DiffError(f"Invalid Add File line (missing '+'): {s}")
            lines.append(s[1:])  # strip leading '+'
        return PatchAction(type=ActionType.ADD, new_file="\n".join(lines))


# --------------------------------------------------------------------------- #
#  Helper functions
# --------------------------------------------------------------------------- #
def find_context_core(
    lines: List[str], context: List[str], start: int
) -> Tuple[int, int]:
    if not context:
        return start, 0

    for i in range(start, len(lines)):
        if lines[i : i + len(context)] == context:
            return i, 0
    for i in range(start, len(lines)):
        if [s.rstrip() for s in lines[i : i + len(context)]] == [
            s.rstrip() for s in context
        ]:
            return i, 1
    for i in range(start, len(lines)):
        if [s.strip() for s in lines[i : i + len(context)]] == [
            s.strip() for s in context
        ]:
            return i, 100
    return -1, 0


def find_context(
    lines: List[str], context: List[str], start: int, eof: bool
) -> Tuple[int, int]:
    if eof:
        new_index, fuzz = find_context_core(lines, context, len(lines) - len(context))
        if new_index != -1:
            return new_index, fuzz
        new_index, fuzz = find_context_core(lines, context, start)
        return new_index, fuzz + 10_000
    return find_context_core(lines, context, start)


def peek_next_section(
    lines: List[str], index: int
) -> Tuple[List[str], List[Chunk], int, bool]:
    old: List[str] = []
    del_lines: List[str] = []
    ins_lines: List[str] = []
    chunks: List[Chunk] = []
    mode = "keep"
    orig_index = index

    while index < len(lines):
        s = lines[index]
        if s.startswith(
            (
                "@@",
                "*** End Patch",
                "*** Update File:",
                "*** Delete File:",
                "*** Add File:",
                "*** End of File",
            )
        ):
            break
        if s == "***":
            break
        if s.startswith("***"):
            raise DiffError(f"Invalid Line: {s}")
        index += 1

        last_mode = mode
        if s == "":
            s = " "
        if s[0] == "+":
            mode = "add"
        elif s[0] == "-":
            mode = "delete"
        elif s[0] == " ":
            mode = "keep"
        else:
            raise DiffError(f"Invalid Line: {s}")
        s = s[1:]

        if mode == "keep" and last_mode != mode:
            if ins_lines or del_lines:
                chunks.append(
                    Chunk(
                        orig_index=len(old) - len(del_lines),
                        del_lines=del_lines,
                        ins_lines=ins_lines,
                    )
                )
            del_lines, ins_lines = [], []

        if mode == "delete":
            del_lines.append(s)
            old.append(s)
        elif mode == "add":
            ins_lines.append(s)
        elif mode == "keep":
            old.append(s)

    if ins_lines or del_lines:
        chunks.append(
            Chunk(
                orig_index=len(old) - len(del_lines),
                del_lines=del_lines,
                ins_lines=ins_lines,
            )
        )

    if index < len(lines) and lines[index] == "*** End of File":
        index += 1
        return old, chunks, index, True

    if index == orig_index:
        raise DiffError("Nothing in this section")
    return old, chunks, index, False


# --------------------------------------------------------------------------- #
#  Patch → Commit and Commit application
# --------------------------------------------------------------------------- #
def _get_updated_file(text: str, action: PatchAction, path: str) -> str:
    if action.type is not ActionType.UPDATE:
        raise DiffError("_get_updated_file called with non-update action")
    orig_lines = text.split("\n")
    dest_lines: List[str] = []
    orig_index = 0

    for chunk in action.chunks:
        if chunk.orig_index > len(orig_lines):
            raise DiffError(
                f"{path}: chunk.orig_index {chunk.orig_index} exceeds file length"
            )
        if orig_index > chunk.orig_index:
            raise DiffError(
                f"{path}: overlapping chunks at {orig_index} > {chunk.orig_index}"
            )

        dest_lines.extend(orig_lines[orig_index : chunk.orig_index])
        orig_index = chunk.orig_index

        dest_lines.extend(chunk.ins_lines)
        orig_index += len(chunk.del_lines)

    dest_lines.extend(orig_lines[orig_index:])
    return "\n".join(dest_lines)


def patch_to_commit(patch: Patch, orig: Dict[str, str]) -> Commit:
    commit = Commit()
    for path, action in patch.actions.items():
        if action.type is ActionType.DELETE:
            commit.changes[path] = FileChange(
                type=ActionType.DELETE, old_content=orig[path]
            )
        elif action.type is ActionType.ADD:
            if action.new_file is None:
                raise DiffError("ADD action without file content")
            commit.changes[path] = FileChange(
                type=ActionType.ADD, new_content=action.new_file
            )
        elif action.type is ActionType.UPDATE:
            new_content = _get_updated_file(orig[path], action, path)
            commit.changes[path] = FileChange(
                type=ActionType.UPDATE,
                old_content=orig[path],
                new_content=new_content,
                move_path=action.move_path,
            )
    return commit


# --------------------------------------------------------------------------- #
#  User-facing helpers
# --------------------------------------------------------------------------- #
def text_to_patch(text: str, orig: Dict[str, str]) -> Tuple[Patch, int]:
    lines = text.splitlines()  # preserves blank lines, no strip()
    if (
        len(lines) < 2
        or not Parser._norm(lines[0]).startswith("*** Begin Patch")
        or Parser._norm(lines[-1]) != "*** End Patch"
    ):
        raise DiffError("Invalid patch text - missing sentinels")

    parser = Parser(current_files=orig, lines=lines, index=1)
    parser.parse()
    return parser.patch, parser.fuzz


def identify_files_needed(text: str) -> List[str]:
    lines = text.splitlines()
    return [
        line[len("*** Update File: ") :]
        for line in lines
        if line.startswith("*** Update File: ")
    ] + [
        line[len("*** Delete File: ") :]
        for line in lines
        if line.startswith("*** Delete File: ")
    ]


def identify_files_added(text: str) -> List[str]:
    lines = text.splitlines()
    return [
        line[len("*** Add File: ") :]
        for line in lines
        if line.startswith("*** Add File: ")
    ]


# --------------------------------------------------------------------------- #
#  File-system helpers
# --------------------------------------------------------------------------- #
def load_files(paths: List[str], open_fn: Callable[[str], str]) -> Dict[str, str]:
    return {path: open_fn(path) for path in paths}


def apply_commit(
    commit: Commit,
    write_fn: Callable[[str, str], None],
    remove_fn: Callable[[str], None],
) -> None:
    for path, change in commit.changes.items():
        if change.type is ActionType.DELETE:
            remove_fn(path)
        elif change.type is ActionType.ADD:
            if change.new_content is None:
                raise DiffError(f"ADD change for {path} has no content")
            write_fn(path, change.new_content)
        elif change.type is ActionType.UPDATE:
            if change.new_content is None:
                raise DiffError(f"UPDATE change for {path} has no new content")
            target = change.move_path or path
            write_fn(target, change.new_content)
            if change.move_path:
                remove_fn(path)


def process_patch(
    text: str,
    open_fn: Callable[[str], str],
    write_fn: Callable[[str, str], None],
    remove_fn: Callable[[str], None],
) -> str:
    if not text.startswith("*** Begin Patch"):
        raise DiffError("Patch text must start with *** Begin Patch")
    paths = identify_files_needed(text)
    orig = load_files(paths, open_fn)
    patch, _fuzz = text_to_patch(text, orig)
    commit = patch_to_commit(patch, orig)
    apply_commit(commit, write_fn, remove_fn)
    return "Done!"


# --------------------------------------------------------------------------- #
#  Default FS helpers
# --------------------------------------------------------------------------- #
def open_file(path: str) -> str:
    with open(path, "rt", encoding="utf-8") as fh:
        return fh.read()


def write_file(path: str, content: str) -> None:
    target = pathlib.Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("wt", encoding="utf-8") as fh:
        fh.write(content)


def remove_file(path: str) -> None:
    pathlib.Path(path).unlink(missing_ok=True)


# --------------------------------------------------------------------------- #
#  CLI entry-point
# --------------------------------------------------------------------------- #
def main() -> None:
    import sys

    patch_text = sys.stdin.read()
    if not patch_text:
        print("Please pass patch text through stdin", file=sys.stderr)
        return
    try:
        result = process_patch(patch_text, open_file, write_file, remove_file)
    except DiffError as exc:
        print(exc, file=sys.stderr)
        return
    print(result)


if __name__ == "__main__":
    main()
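As a usage sketch (assuming you have saved the script above as an executable named apply_patch on the PATH of the shell where the model runs commands), a harness can pipe a patch into it and return the output to the model:

import subprocess

def run_apply_patch(patch_text: str) -> str:
    """Pipe a '*** Begin Patch ... *** End Patch' block into the apply_patch executable."""
    result = subprocess.run(
        ["apply_patch"],  # assumes the reference script is installed as an executable named apply_patch
        input=patch_text,
        capture_output=True,
        text=True,
    )
    # process_patch() prints "Done!" on success; DiffError messages go to stderr.
    return (result.stdout + result.stderr).strip()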
Other Effective Diff Formats

If you want to try using a different diff format, we found in testing that the SEARCH/REPLACE diff format used in Aider's polyglot benchmark, as well as a pseudo-XML format with no internal escaping, both had high success rates.

These diff formats share two key aspects: (1) they do not use line numbers, and (2) they provide both the exact code to be replaced, and the exact code with which to replace it, with clear delimiters between the two.

SEARCH_REPLACE_DIFF_EXAMPLE = """
path/to/file.py
```
>>>>>>> SEARCH
def search():
    pass
=======
def search():
    raise NotImplementedError()
<<<<<<< REPLACE
"""

PSEUDO_XML_DIFF_EXAMPLE = """
<edit>
<file>
path/to/file.py
</file>
<old_code>
def search():
    pass
</old_code>
<new_code>
def search():
    raise NotImplementedError()
</new_code>
</edit>
"""
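As a rough sketch of the harness side (ours, not part of the guide), a SEARCH/REPLACE block like the one above can be applied with an exact, whole-block string replacement:

def apply_search_replace(file_text: str, search_block: str, replace_block: str) -> str:
    """Apply one SEARCH/REPLACE edit by exact match; fail loudly if the search text is absent."""
    if search_block not in file_text:
        raise ValueError("SEARCH block not found verbatim in file")
    # Replace only the first occurrence so repeated code elsewhere is left untouched.
    return file_text.replace(search_block, replace_block, 1)

# Example with the blocks from SEARCH_REPLACE_DIFF_EXAMPLE above:
original = "def search():\n    pass\n"
updated = apply_search_replace(
    original,
    "def search():\n    pass",
    "def search():\n    raise NotImplementedError()",
)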
