
ChatGPT Mastery:

Prompt Engineering
___

Alfonso Fuertes Fuentes


CHAPTER 1: INTRODUCTION TO PROMPT ENGINEERING

CHAPTER 2: CRAFTING CLEAR AND EFFECTIVE INSTRUCTIONS

CHAPTER 3: STRATEGIES FOR ENHANCED PROMPT ENGINEERING

CHAPTER 4: GIVING THE MODEL TIME TO "THINK"

CHAPTER 5: UTILIZING EXTERNAL TOOLS

CHAPTER 6: SYSTEMATIC TESTING AND EVALUATION

CHAPTER 7: STRATEGIES FOR HANDLING LONG DOCUMENTS AND CONVERSATIONS

CHAPTER 8: GUIDING THE MODEL'S REASONING PROCESS

CHAPTER 9: LEVERAGING EXTERNAL TOOLS

CHAPTER 10: EVALUATING AND TESTING MODEL PERFORMANCE

QUIZ

ESSAY QUESTIONS

GLOSSARY OF KEY TERMS

FAQ

Chapter 1: Introduction to Prompt
Engineering
Prompt engineering is a critical skill for effectively using
large language models (LLMs) like GPT-4. It involves
designing and refining input prompts to elicit the
desired outputs from these AI systems. The quality of
the prompt directly impacts the quality of the output,
making prompt engineering both an art and a science.

Imagine interacting with an LLM as a conversation with


a highly skilled but literal-minded assistant. This
assistant possesses a vast knowledge base and
exceptional language abilities but lacks the intuition
and contextual understanding of a human. To get the
best results, you need to provide this assistant
with clear, specific, and well-structured
instructions. This is where prompt engineering comes
into play.

Prompt engineering enables users to communicate


their intentions effectively to LLMs and guide them
towards the desired output. The process involves
understanding the model's capabilities and limitations
while carefully crafting prompts that minimize ambiguity
and maximize clarity.

Here are some key reasons why prompt engineering is


crucial:

• LLMs can't read your mind: They rely solely


on the information provided in the prompt to
generate responses.

• Ambiguity leads to unpredictable outputs:


Vague or poorly structured prompts can result
in irrelevant, nonsensical, or even fabricated
answers.

• Effective prompts unlock LLM potential: A
well-crafted prompt can guide the model to
produce creative, accurate, and insightful
outputs, making it a powerful tool for various
applications.

The applications of prompt engineering are vast


and continue to expand as LLMs evolve. Some
common use cases include:

• Content Creation: Generating creative stories,


articles, marketing copy, and other written
content.

• Translation: Translating text between multiple


languages accurately and efficiently.

• Code Generation and Debugging: Writing,


debugging, and explaining code in various
programming languages.

• Question Answering and Research:


Answering questions, retrieving specific
information, and summarizing complex
documents.

• Dialogue Systems and Chatbots: Building


engaging and interactive conversational agents
for various purposes.

This book will provide a comprehensive guide to


mastering the art of prompt engineering. From basic
principles and strategies to advanced techniques and
real-world examples, you'll learn how to communicate
effectively with LLMs and harness their power to
achieve your desired outcomes.

Chapter 2: Crafting Clear and Effective
Instructions

Crafting clear and effective instructions is the


foundation of successful prompt engineering. As
discussed in the previous chapter, LLMs rely heavily on
the clarity and specificity of the prompt to generate
relevant and accurate responses. Ambiguity in
instructions can lead to unpredictable and often
undesirable results.

To ensure that the model understands your intent,


it's crucial to follow these key principles when
writing instructions:

• Be Specific: Avoid vague language and


provide precise details about the desired
output. For example, instead of asking the
model to "write about climate change," specify
the type of content, target audience, and
desired tone. A more specific instruction could
be: "Write a 500-word informative article for a
general audience about the impact of climate
change on coastal communities."

• Use Actionable Verbs: Clearly state what you


want the model to do using action verbs like
"write," "summarize," "translate," or "generate."
For instance, instead of "information about the
American Revolution," instruct the model to
"Summarize the key events of the American
Revolution in three paragraphs."

• Define the Output Format: If you have specific


formatting requirements, clearly state them in
the prompt. For instance, you can ask the
model to "Generate a list of five bullet points
outlining the benefits of solar energy" or "Write
a haiku about the beauty of nature."
• Set Expectations for Length and Style:
Guide the model by specifying the desired
length of the output, whether it's in words,
sentences, paragraphs, or bullet points.
Additionally, you can specify the desired style,
such as formal, informal, technical, or creative.

• Iterate and Refine: Don't expect to get the


perfect prompt on the first try. Experiment with
different phrasings, levels of detail, and
formatting to observe how the model responds.
Refine your instructions based on the outputs
you receive to improve the model's
performance.

By following these principles and iteratively refining your prompts, you can effectively communicate your intentions to LLMs and consistently achieve the desired outcomes.
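
To make these principles concrete, here is a minimal sketch of how a vague request can be tightened into a specific, well-structured prompt and sent to a chat model. It assumes the openai Python SDK (v1-style client) with an API key in the environment; the model name and prompts are illustrative.

# Minimal sketch: a vague request versus a specific, well-structured one.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Too vague: the model must guess the audience, length, format, and tone.
vague_prompt = "Write about climate change."

# Specific: states the task, length, audience, topic focus, tone, and structure.
specific_prompt = (
    "Write a 500-word informative article for a general audience about the "
    "impact of climate change on coastal communities. Use a neutral, factual tone "
    "and structure the article as an introduction, three body paragraphs, and a "
    "short conclusion."
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": specific_prompt}],
)
print(response.choices[0].message.content)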

Chapter 3: Strategies for Enhanced
Prompt Engineering

This chapter expands on the six key strategies outlined in the sources for achieving better outcomes when working with large language models.

• Write Clear Instructions: Emphasize the


importance of providing specific details in
prompts to guide the model effectively. Include
examples like those in the sources, contrasting
vague prompts with more detailed ones that
yield better results. You can further elaborate
on:

o Tactic: Ask the Model to Adopt a


Persona: Explain how instructing the
model to respond in a specific persona
can influence the style and tone of the
output.

o Tactic: Use Delimiters: Discuss the


use of delimiters like triple quotes, tags,
or section titles to clearly separate
different parts of the input, particularly
when dealing with complex tasks.

• Provide Reference Text: Explain how


providing relevant information to the model can
improve accuracy and reduce the chances of
fabricated answers. You can explore tactics
like:

o Tactic: Instruct the Model to Answer


Using a Reference Text: Show how to
direct the model to use the provided text
as the sole basis for its response.

o Tactic: Instruct the Model to Answer With Citations From a Reference Text: Explain how to prompt the model to cite specific passages from the provided text to support its answers, increasing transparency and verifiability. A prompt sketch combining delimiters and citations appears at the end of this chapter.

• Split Complex Tasks into Simpler Subtasks:


Discuss how breaking down intricate tasks into
a series of manageable steps can reduce errors
and improve overall performance. Explain
tactics like:

o Tactic: Use Intent Classification:


Demonstrate how to categorize user
queries to provide the most relevant
instructions for each specific type of
request.

o Tactic: For Dialogue Applications


That Require Very Long
Conversations, Summarize or Filter
Previous Dialogue: Explain methods
for handling long conversations,
including summarizing previous turns or
selectively filtering relevant information.

o Tactic: Summarize Long Documents


Piecewise and Construct a Full
Summary Recursively: Describe
techniques for summarizing extensive
documents by breaking them down into
smaller sections and creating
summaries hierarchically.
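
The following sketch makes two of the tactics above concrete: triple-quote delimiters that separate a reference document from the instructions, and a request for citations from that document. The system and user strings are illustrative, and the resulting message list can be sent to any Chat Completions-style endpoint.

# Minimal sketch: delimiters plus citation-backed answers from a reference text.
reference_text = "..."  # the document the model should answer from (placeholder)

system_message = (
    "You will be provided with a document delimited by triple quotes and a question. "
    "Answer the question using only the provided document, and cite the passage(s) "
    'you used. If the document does not contain the answer, reply: "Insufficient information."'
)

user_message = f'"""{reference_text}"""\n\nQuestion: What are the key findings of this report?'

messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_message},
]
# `messages` can now be passed to a chat model, e.g. client.chat.completions.create(...).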

Chapter 4: Giving the Model Time to
"Think"

Given the previous chapters on prompt engineering,


we can now explore a crucial aspect of improving the
performance of large language models: giving them
adequate time to process information and reason. This
chapter will focus on the strategy of giving the model
time to "think" to enhance accuracy, especially when
dealing with complex tasks.

This concept is analogous to human problem-solving.


When faced with a challenging question, we don't
always have an immediate answer. We need time to
analyze the information, consider different
perspectives, and work through potential solutions.
Similarly, large language models can benefit from a
structured approach that allows them to "think" before
providing a final answer.

This chapter will discuss two primary tactics outlined in


the sources for implementing this strategy:

• Instruct the Model to Work Out Its Own


Solution Before Rushing to a Conclusion:
This tactic encourages the model to generate
its own solution to a problem before evaluating
other solutions. This can help identify errors
that might be missed if the model simply
focuses on assessing the correctness of a
given answer. The sources illustrate this
concept with an example of a math problem. By
first prompting the model to solve the problem
independently, it can then effectively compare
its solution to a student's solution and
accurately identify any discrepancies.

• Use Inner Monologue or a Sequence of Queries to Hide the Model's Reasoning Process: This tactic involves structuring the
model's output to separate the reasoning
process from the final answer. This is
particularly useful in scenarios where revealing
the model's thought process might be
detrimental, such as in educational applications
where students should be encouraged to solve
problems independently.

o Inner Monologue: The model can be


instructed to enclose its reasoning steps
within specific delimiters, like triple
quotes. This allows for easy parsing of
the output, enabling the removal of the
reasoning steps before presenting the
final answer to the user.

o Sequence of Queries: Alternatively, the


task can be divided into a series of
queries. The initial queries focus on
guiding the model's reasoning process,
with their outputs hidden from the user.
The final query then utilizes the model's
analysis to generate the final answer.

By implementing these tactics, you can enhance the model's ability to provide well-reasoned and accurate responses, particularly when dealing with complex or sensitive tasks.
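
As a rough illustration, the sketch below shows one way the "work out your own solution first" tactic might be phrased as a system message. The math problem and the deliberately incorrect student solution are invented placeholders.

# Minimal sketch: ask the model to solve the problem itself before judging the student.
system_message = (
    "First work out your own solution to the problem. Then compare your solution to "
    "the student's solution and evaluate whether the student's solution is correct. "
    "Do not decide whether the student's solution is correct until you have solved "
    "the problem yourself."
)

user_message = (
    "Problem statement: A store sells pencils at 3 for $1. How much do 12 pencils cost?\n\n"
    "Student's solution: 12 pencils cost $3."  # deliberately wrong; 4 groups of 3 cost $4
)

messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_message},
]
# The intent is that the model first derives the correct total ($4) and only then
# flags the discrepancy in the student's answer.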

Chapter 5: Utilizing External Tools

In this chapter, we will explore the strategy of using


external tools to enhance the capabilities of large
language models. As powerful as these models are,
they have limitations. They might not excel at tasks like
complex calculations or accessing real-time
information. By integrating external tools, we can
overcome these limitations and create more robust and
versatile applications.

The sources present three key tactics for incorporating


external tools:

• Use Embeddings-Based Search to


Implement Efficient Knowledge Retrieval:
This tactic addresses the limitation of fixed
context windows in language models. When
dealing with large amounts of information or
dynamic data, it's crucial to retrieve relevant
information efficiently.

o Embeddings: These are vector


representations of text that capture
semantic meaning. Similar texts have
embeddings that are closer together in
vector space.

o Knowledge Retrieval: By embedding


both a user's query and chunks of text
from a knowledge base, we can use
efficient vector search algorithms to find
the most relevant information. This
retrieved knowledge can then be
provided as context to the language
model, enabling it to generate more
informed and accurate responses. The
sources mention OpenAI Cookbook as a
resource for example implementations.

• Use Code Execution to Perform More
Accurate Calculations or Call External APIs:
Language models are not inherently reliable for
performing precise mathematical calculations or
executing code. To address this, we can
instruct the model to generate code for specific
tasks and then execute that code using a
dedicated engine.

o Calculations: By enclosing code within


delimiters like triple backticks, we can
signal to the model that this section
should be executed as code. The model
can then generate code for
mathematical operations, and the output
of this code can be fed back into the
model for further processing.

o External APIs: This tactic extends code


execution to interact with external
systems and services. By providing the
model with documentation and
examples of API usage, it can learn to
generate code that makes API calls,
retrieving real-time information or
performing actions in external systems.

o Safety Precautions: The sources


emphasize the importance of
sandboxed code execution
environments when executing code
generated by a model. This helps
mitigate potential risks associated with
running untrusted code.

• Give the Model Access to Specific


Functions: This tactic involves providing the
model with predefined functions that it can call.
The model learns to generate function arguments based on the provided function
schemas, and these arguments are used to
execute the functions. The output from these
function calls is then fed back into the model.
This approach, recommended by the sources,
streamlines the integration of external
functionality into language model applications.
The sources again point to the OpenAI
Cookbook and introductory text generation
guides for more information and examples.

By strategically employing these tactics, you can significantly expand the capabilities of large language models, allowing them to tackle a wider range of tasks and generate more accurate and informed responses.
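
As a rough illustration of the code-execution tactic, the sketch below extracts a code block that the model was asked to wrap in triple backticks so it can be handed to a separate execution engine. The sample model output is invented, and any real model-generated code should only ever run inside a sandboxed environment, as noted above.

# Minimal sketch: pull model-generated code out of triple-backtick delimiters.
import re

model_output = (
    "The total can be computed as follows:\n"
    "```python\n"
    "total = sum(i * i for i in range(1, 11))\n"
    "print(total)\n"
    "```"
)

match = re.search(r"```python\n(.*?)```", model_output, re.DOTALL)
if match:
    generated_code = match.group(1)
    # In a real application this string would be handed to a sandboxed interpreter;
    # it should never be exec()'d directly in the host process.
    print(generated_code)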

Chapter 6: Systematic Testing and
Evaluation

This chapter will focus on the importance of


systematically testing and evaluating changes made
to prompts or the overall system design when working
with large language models. While intuition and small-
scale testing can provide some insights, a more
rigorous approach is essential to ensure that
modifications lead to genuine improvements in
performance.

The sources highlight the significance of using


evaluation procedures, also known as "evals," to
objectively assess the impact of changes. A well-
designed evaluation process should be:

• Representative of Real-World Usage: The


evaluation should use test cases that reflect the
diversity and complexity of the tasks the model
will encounter in practical applications. While
this might not always be fully achievable,
striving for representativeness is crucial to
avoid overfitting to specific examples or
scenarios.

• Statistically Robust: The evaluation should include a sufficient number of test cases to provide statistically significant results. The sources offer a table that suggests the minimum sample size needed to detect differences in performance with a 95% confidence level. For instance, to reliably detect a 10% difference in performance, the evaluation should include at least 100 test cases.
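
A minimal evaluation harness along these lines might look like the sketch below. The ask_model helper and the test cases are illustrative placeholders; real evals usually need larger suites and fuzzier comparisons than an exact string match.

# Minimal sketch: score a prompt change against gold-standard answers.
test_cases = [
    {"prompt": "What is the capital of France? Answer with one word.", "gold": "Paris"},
    {"prompt": "What is 7 * 8? Answer with the number only.", "gold": "56"},
    # ... ideally 100+ cases to reliably detect a ~10% difference, per the guidance above
]

def ask_model(prompt):
    # Placeholder: call your model here (e.g. the Chat Completions API) and return its text.
    raise NotImplementedError

def run_eval():
    correct = 0
    for case in test_cases:
        if ask_model(case["prompt"]).strip() == case["gold"]:
            correct += 1
    return correct / len(test_cases)  # accuracy over the whole test suite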

The evaluation of model outputs can be conducted through various methods, including:

• Computer-Based Evaluation: This approach
utilizes computers to automatically assess
model outputs based on predetermined criteria.
It's particularly effective for tasks with objective
answers, like multiple-choice questions or
factual recall. Computers can also be used to
evaluate outputs based on subjective criteria,
such as fluency or coherence, by employing
model-based queries.

• Human Evaluation: Human judgment is often


necessary for evaluating aspects of model
output that require subjective interpretation,
such as creativity, humor, or persuasive writing.
While human evaluation can be more time-
consuming and potentially less consistent, it
remains essential for assessing qualities that
are difficult to capture through automated
metrics.

• Hybrid Evaluation: This approach combines


computer-based and human evaluation
methods to leverage the strengths of both. For
instance, a model's factual accuracy might be
assessed automatically, while its writing quality
could be evaluated by human judges.

The sources provide examples of how to design model-based evaluations. In one example, a model is tasked with evaluating whether a given text contains specific factual information. The model is provided with a set of facts and instructed to:

1. Restate each fact.

2. Find the closest citation from the text for each fact.

3. Determine whether someone unfamiliar with the topic could infer the fact from the citation.

4. Indicate with a "yes" or "no" whether the citation effectively supports the fact.

The evaluation then counts the number of "yes" responses, providing a quantitative measure of the text's factual accuracy.
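
Scoring such an eval can be as simple as counting the "yes" lines in the grading model's output, as in the small sketch below; the grader output shown is only an invented illustration of the expected format.

# Minimal sketch: count how many facts the grading model marked "yes".
grader_output = """Fact 1: restated fact ... citation ... yes
Fact 2: restated fact ... citation ... no
Fact 3: restated fact ... citation ... yes"""

yes_count = sum(
    1 for line in grader_output.splitlines() if line.strip().lower().endswith("yes")
)
print(f"{yes_count} of 3 facts supported")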

Another model-based evaluation example focuses on assessing the type of overlap between a submitted answer and an expert answer. The model is instructed to:

1. Reason step-by-step to determine the relationship between the two answers: disjoint, equal, subset, superset, or overlapping.

2. Reason step-by-step to determine whether the submitted answer contradicts the expert answer.

3. Output a JSON object indicating the type of overlap and whether a contradiction exists.

These examples illustrate how models can be used to


automate the evaluation process, particularly for tasks
where objective criteria can be defined. The sources
recommend experimenting with different model-based
evaluation approaches to determine their effectiveness
for specific use cases.

Additionally, the sources mention OpenAI Evals, an


open-source framework that provides tools for creating
automated evaluations. For those interested in
exploring this further, researching OpenAI Evals would
provide more insights into building robust evaluation
procedures.

Chapter 7: Strategies for Handling Long
Documents and Conversations

This chapter focuses on addressing the inherent


limitations of language models when dealing with
extensive amounts of text, specifically in the context of
long documents and extended conversations. The fixed
context window size of language models poses a
challenge when processing information that exceeds
this limit.

The sources present two key tactics for handling these


scenarios:

• Summarizing or Filtering Previous Dialogue


for Long Conversations: In conversational
applications where the interaction extends over
multiple turns, it becomes crucial to manage the
growing amount of text within the context
window.

o Summarization: One approach is to


periodically summarize previous
conversation turns, condensing the
information into a more concise
representation. Once the context
window reaches a predefined threshold,
a query can be initiated to summarize a
portion of the conversation history. This
summary can then replace the original
text, freeing up space within the context
window.

o Background Summarization: The


summarization process can also occur
asynchronously in the background,
continuously summarizing the
conversation as it progresses. This
ensures that the context window
remains manageable without
interrupting the flow of the conversation.

o Filtering: Another approach is to filter


previous conversation turns, retaining
only the most relevant information
based on the current query or context.
This can be achieved using techniques
like embeddings-based search to
identify the most semantically similar
previous turns.

• Summarizing Long Documents Piecewise:


When dealing with documents that exceed the
model's context window, it's impossible to
summarize the entire text in a single pass.

o Piecewise Summarization: The


sources suggest breaking down the
document into smaller sections and
summarizing each section individually.
These section summaries can then be
concatenated and summarized, creating
summaries of summaries.

o Recursive Summarization: This


process can be repeated recursively,
progressively summarizing higher-level
summaries until a final, concise
summary of the entire document is
generated.

o Running Summary: To enhance


coherence and maintain context, the
sources recommend incorporating a
running summary of preceding sections
while summarizing a particular section.
This helps preserve the overall flow of
information and ensures that later sections are interpreted within the
context of earlier content.

The effectiveness of recursive summarization techniques for summarizing books using variants of GPT-3 has been previously researched by OpenAI.
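
A piecewise, running-summary approach might be sketched as follows. The summarize helper stands in for a chat-model call, and chunking by a fixed character count is a simplification of token-aware chunking.

# Minimal sketch: summarize a long document piecewise while carrying a running summary.
def summarize(text, instruction):
    # Placeholder: send `instruction` plus `text` to a chat model and return its summary.
    raise NotImplementedError

def summarize_document(document, chunk_size=8000):
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    running_summary = ""
    for chunk in chunks:
        instruction = (
            "Running summary of the document so far:\n"
            f"{running_summary}\n\n"
            "Summarize the next section so it reads as a continuation of that summary."
        )
        section_summary = summarize(chunk, instruction)
        # Fold the new section into the running summary (a recursive summary of summaries).
        running_summary = summarize(
            running_summary + "\n" + section_summary,
            "Condense these notes into a single concise summary.",
        )
    return running_summary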

Chapter 8: Guiding the Model's
Reasoning Process

This chapter explores strategies for guiding the


reasoning process of large language models,
prompting them to approach problem-solving in a more
deliberate and structured manner. This can lead to
more accurate and reliable results, especially for tasks
that involve logical reasoning or complex calculations.

The sources outline two main tactics for achieving this:

• Instructing the Model to Work Out Its Own


Solution Before Rushing to a Conclusion:
Instead of directly asking the model for an
answer, we can guide it to first generate its own
solution step-by-step. This encourages the
model to think through the problem
independently, reducing the risk of errors or
biases introduced by a user's potentially
incorrect input.

o Example: In a scenario where a student


is asked to solve a math problem,
instead of simply evaluating the
student's solution, the model can be
instructed to:

1. Solve the problem independently. This step ensures that the model has a clear understanding of the correct solution.

2. Compare its solution to the student's solution. This allows the model to identify any discrepancies or errors in the student's reasoning.

3. Evaluate the student's solution. Based on the comparison, the model can provide feedback on the student's approach.

o Benefits: This tactic helps the model to


avoid being influenced by potentially
incorrect student solutions and
encourages a more thorough analysis of
the problem.

• Using Inner Monologue or a Sequence of


Queries to Hide the Model's Reasoning
Process: In certain situations, it might be
undesirable to reveal the model's entire
reasoning process to the user.

o Inner Monologue: This tactic involves


instructing the model to structure its
output in a way that separates the
internal reasoning steps from the final
answer. For instance, the model could
be instructed to enclose its internal
workings within specific delimiters (e.g.,
triple quotes). This allows the output to
be parsed, extracting only the relevant
information for the user while concealing
the internal reasoning.

o Sequence of Queries: An alternative


approach is to break down the problem
into a series of queries, where the
output of each query (except the final
one) is hidden from the user. This allows
the model to reason through the
problem step-by-step without revealing
its internal process.

§ Example: Consider the same
math problem scenario. We can
use a sequence of queries:

1. Present the problem statement to the model. This prompts the model to generate its solution without being influenced by the student's attempt.

2. Provide the model with both its solution and the student's solution. This allows the model to compare and evaluate the student's work.

3. Instruct the model to provide feedback to the student, drawing on its analysis in the previous step.

o Benefits: These tactics are particularly


useful in educational settings where
providing the full solution might hinder a
student's learning process. They allow
the model to act as a tutor, guiding
students toward the correct answer
without simply giving it away.

These strategies, when implemented effectively, empower language models to act not just as information providers, but also as insightful problem solvers. They enhance the model's ability to reason, analyze, and provide more nuanced and helpful responses, making them valuable tools in a wider range of applications.
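
The sketch below illustrates how an application might strip a triple-quoted inner monologue from the model's output before showing the answer to a student. The sample output string is invented for the example.

# Minimal sketch: hide the delimited reasoning, show only the student-facing hint.
model_output = (
    '"""\n'
    "The student counted 3 one-dollar groups of pencils instead of 4, so the total is $1 too low.\n"
    '"""\n'
    "Hint: Recount how many groups of 3 pencils fit into 12 pencils."
)

# Everything after the final closing delimiter is what the user actually sees.
visible_part = model_output.split('"""')[-1].strip()
print(visible_part)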

Additionally, the sources introduce a tactic for
prompting the model to review its previous work,
especially when dealing with tasks like information
extraction from large documents:

• Ask the Model if It Missed Anything on Previous Passes: Language models, due to
their limited context windows, might
prematurely stop processing information,
potentially missing relevant content. This tactic
involves prompting the model to re-examine the
source material to ensure it has extracted all
pertinent information. This can be implemented
by asking the model a follow-up question, such
as "Are there any more relevant excerpts?"
after an initial response. This encourages the
model to review its work and potentially uncover
previously overlooked information.
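
In practice, this follow-up can be implemented by appending the model's first answer and an extra user turn to the same message list, as in the sketch below; ask_chat is an illustrative placeholder for the actual API call.

# Minimal sketch: ask the model to re-check its previous extraction pass.
def ask_chat(messages):
    # Placeholder: send the full message list to a chat model and return its reply text.
    raise NotImplementedError

messages = [
    {"role": "system", "content": "Extract every excerpt relevant to the question from the document."},
    {"role": "user", "content": "<document and question go here>"},
]
first_pass = ask_chat(messages)

messages.append({"role": "assistant", "content": first_pass})
messages.append({"role": "user", "content": "Are there any more relevant excerpts? Only add ones you missed."})
second_pass = ask_chat(messages)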

Chapter 9: Leveraging External Tools

This chapter focuses on expanding the capabilities of


large language models by integrating them with
external tools. This approach allows models to access
and utilize information and functionalities that are
beyond their inherent capabilities, leading to more
powerful and versatile applications.

The sources highlight two primary strategies for


utilizing external tools:

• Embeddings-Based Search for Efficient


Knowledge Retrieval: One limitation of
language models is their fixed context window,
which restricts the amount of information they
can process at once. To overcome this, we can
use embeddings to dynamically retrieve
relevant information from external sources.

o Embeddings: Text embeddings are


vector representations of text strings
that capture semantic relationships.
Similar text strings have embeddings
that are closer together in the vector
space.

o Knowledge Retrieval: We can use


embeddings to implement efficient
knowledge retrieval by following these
steps:

1. Chunk and Embed: Divide a large text corpus into smaller chunks and embed each chunk.

2. Embed the Query: Embed the user's query.

3. Vector Search: Perform a vector search to find the embedded chunks from the corpus that are closest to the query embedding. This retrieves the most semantically relevant information.

4. Provide Context: Provide the retrieved information as context to the language model, allowing it to generate more informed responses.

o Benefits: This approach allows models


to access and utilize vast amounts of
external knowledge, overcoming the
limitations of their context window. It
enables them to provide more accurate,
comprehensive, and up-to-date
answers.

o Examples: The sources provide references to the OpenAI Cookbook for specific implementation details and examples of how to use knowledge retrieval effectively. A minimal retrieval sketch also appears at the end of this list of strategies.

• Code Execution for Accurate Calculations


and API Calls: Language models are not
inherently reliable for performing complex
calculations or interacting with external
systems. To address this, we can instruct them
to write and execute code instead of
attempting these tasks directly.

o Code Generation: The model can be


instructed to generate code within
specific delimiters (e.g., triple backticks).
The code can then be extracted and
executed by a suitable interpreter (e.g.,
a Python interpreter).

o Calculation: This technique allows


models to perform accurate arithmetic
and other calculations that would be
unreliable if performed directly by the
model.

o API Interaction: Models can also be


taught to interact with external APIs by
providing them with API documentation
and code examples. This enables them
to access and utilize a wide range of
external services and functionalities.

o Benefits: Integrating code execution


empowers language models to perform
tasks that are beyond their inherent
capabilities, making them more versatile
and powerful tools.

o Safety Precautions: The sources emphasize the importance of taking safety precautions when executing code generated by a language model. They recommend using sandboxed code execution environments to limit potential harm from untrusted code.
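
To make the retrieval steps listed earlier in this chapter concrete, here is a minimal sketch using the OpenAI embeddings endpoint and a brute-force cosine-similarity search. The model name, example chunks, and query are illustrative; production systems typically use a vector database rather than a Python loop.

# Minimal sketch: embed chunks and a query, then retrieve the closest chunk as context.
import math
from openai import OpenAI

client = OpenAI()

def embed(texts):
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in response.data]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

chunks = [
    "Solar panels convert sunlight into electricity.",
    "Wind turbines generate power from moving air.",
]
chunk_embeddings = embed(chunks)

query = "How do solar panels work?"
query_embedding = embed([query])[0]

best_chunk = max(zip(chunks, chunk_embeddings), key=lambda pair: cosine(query_embedding, pair[1]))[0]
# `best_chunk` would now be prepended to the prompt as retrieved context for the chat model.
print(best_chunk)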

In addition to the above, the sources mention function


calling as a recommended way of using OpenAI
models to interact with external functions:

• Function Calling: The Chat Completions API


allows developers to provide descriptions of
external functions. The model can then
generate arguments for these functions, which
are returned by the API in JSON format. These
arguments can be used to execute the functions, and the output can be fed back into
the model for further processing.

The sources provide references to the introductory text generation guide and the OpenAI Cookbook for more information on function calling and its implementation.
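
A minimal function-calling sketch with the Chat Completions API might look like the following. The get_current_weather schema, model name, and question are illustrative, and exact parameter names can vary across SDK versions.

# Minimal sketch: let the model generate JSON arguments for a described function.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": "What's the weather in Madrid?"}],
    tools=tools,
)

# A production application would first check that the model actually requested a tool call.
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
print(tool_call.function.name, args)
# The application now runs its real weather lookup with `args` and feeds the result back
# to the model as a follow-up message so it can produce the final answer.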

Chapter 10: Evaluating and Testing
Model Performance

This chapter shifts the focus to the crucial aspect of


evaluating and systematically testing the
performance of large language models, particularly
when making changes or improvements to prompts or
system designs. It emphasizes the importance of
moving beyond anecdotal evidence and adopting a
more rigorous and data-driven approach to assess the
impact of modifications.

The sources present the following key strategy and


tactics for evaluating model performance:

• Strategy: Test Changes Systematically:


When introducing changes to prompts or
system designs, it's essential to evaluate their
impact systematically rather than relying on
isolated examples. This involves establishing
comprehensive test suites, also known as
"evals," to measure the effects of modifications
on a representative set of inputs. This approach
ensures that changes lead to genuine
improvements rather than simply appearing
beneficial in a few specific cases while
potentially degrading performance on a broader
scale.

o Tactic: Evaluate Model Outputs with


Reference to Gold-Standard
Answers: The core principle of
evaluation involves comparing model
outputs to gold-standard answers,
which are pre-determined correct or
ideal responses. The sources highlight
various methods for conducting such
evaluations, including:

§ Human Evaluation: Involves
human judges assessing the
quality and accuracy of model
outputs against gold-standard
answers. This approach is
particularly valuable for
subjective or nuanced tasks
where human judgment is
crucial.

§ Computer-Based Evaluation:
Uses automated metrics and
algorithms to compare model
outputs to gold-standard
answers. This is suitable for
tasks with objective criteria and
single correct answers.

§ Model-Based Evaluation:
Employs another language
model to evaluate the outputs of
the model being tested. This can
be useful when there's a range
of acceptable outputs, and a
model can effectively judge their
quality.

§ OpenAI Evals: The sources


mention OpenAI Evals, an open-
source software framework that
provides tools for creating
automated evaluations.

o Evaluating for Factual Accuracy: The


sources provide an example of a system
message designed to evaluate a
model's ability to incorporate specific
known facts into its answer. This
involves:

1. Defining Required Facts: Listing the facts that should be present in the model's answer.

2. Restating and Citing: Instructing the evaluating model to restate each required fact and provide a citation from the candidate answer.

3. Judging Clarity: Assessing whether the citation is clear and understandable for someone unfamiliar with the topic.

4. Counting "Yes" Answers: Determining how many facts were successfully incorporated by counting the number of "yes" responses.

o Evaluating Overlap and


Contradiction: Another example
demonstrates a model-based evaluation
that analyzes the type of overlap
between a submitted answer and an
expert answer. It also checks for any
contradictions between the two
answers. This method involves:

1. Categorizing Overlap: The evaluating model determines whether the submitted answer is disjoint, equal, a subset, a superset, or overlapping with the expert answer.

2. Identifying Contradictions: The model checks if the submitted answer contradicts any part of the expert answer.

3. Outputting Results in JSON: The evaluation results are provided in a structured JSON format. A small parsing sketch appears later in this chapter.

o Importance of Representative Test


Cases: The sources stress the
importance of using a large and
representative set of test cases for evaluation. They provide guidelines for
determining the necessary sample size
based on the desired statistical power
and the magnitude of the difference
being measured.

o The OpenAI Cookbook: The sources


recommend referring to the OpenAI
Cookbook for more examples and
inspiration for developing effective
evaluation strategies.
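
A harness for the overlap-and-contradiction eval described above might parse the grading model's JSON as in the sketch below; the grader output string is an invented example of the expected format, not real model output.

# Minimal sketch: pull the JSON verdict out of a model-based grader's response.
import json

grader_output = (
    "Step-by-step reasoning: the submitted answer covers every point in the expert "
    "answer plus extra detail, and nothing conflicts.\n"
    '{"type_of_overlap": "superset", "contradiction": false}'
)

# Take the JSON object from the end of the grader's response.
result = json.loads(grader_output[grader_output.rindex("{"):])

print(result["type_of_overlap"])   # -> "superset"
print(result["contradiction"])     # -> False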

By adopting systematic testing and evaluation methods, developers can ensure that changes made to prompts or system designs genuinely enhance model performance and contribute to the development of more reliable and trustworthy AI systems.

Quiz

Instructions: Answer the following questions in 2-3


sentences each.

1. Why is it crucial to write clear and detailed


instructions when interacting with language
models?

2. Explain the concept of "persona" in the context


of prompt engineering and provide an example.

3. What are delimiters and how can they be used


to enhance the clarity of prompts?

4. Describe the strategy of splitting complex tasks


into simpler subtasks and provide a scenario
where this approach would be beneficial.

5. Why is it sometimes necessary to give the


model time to "think," and what tactics can be
employed to achieve this?

6. How can external tools such as code execution


engines be integrated into prompt engineering
to improve model performance?

7. Explain the purpose and importance of


systematic testing when refining prompts.

8. What are embeddings, and how can they be


utilized in knowledge retrieval for language
models?

9. Describe the concept of "inner monologue" as a


tactic for hiding the model's reasoning process
from the user.

10. Provide an example of a situation where it
would be advantageous to ask the model if it
missed anything on previous passes.

Answer Key

1. Clear and detailed instructions are crucial


because language models cannot read minds.
The more specific and unambiguous the
prompt, the better the model understands the
desired outcome and can deliver more relevant
results.

2. "Persona" refers to instructing the model to


adopt a specific character, tone, or style in its
responses. For instance, you can ask the model
to act as a helpful tutor, a sarcastic comedian,
or a formal news reporter.

3. Delimiters are symbols or phrases used to


separate different sections within a prompt.
Triple quotes, XML tags, or section headings
can clearly indicate distinct parts of the input,
helping the model differentiate instructions from
the content.

4. Splitting complex tasks into smaller,


manageable steps allows the model to focus on
one aspect at a time, reducing errors and
improving overall performance. This is
beneficial for tasks like summarizing long
documents or executing multi-step instructions.

5. Giving the model time to "think" helps it avoid


rushing to conclusions and encourages more
thoughtful reasoning. Tactics like prompting for
a "chain of thought" or using inner monologue
can encourage deliberate processing.

6. External tools can augment the capabilities of
language models. Code execution engines can
handle mathematical calculations or execute
API calls, providing accurate results for tasks
that models struggle with independently.

7. Systematic testing involves evaluating model


outputs against a set of predefined criteria to
assess the effectiveness of prompt
modifications. This ensures that changes lead
to consistent improvements in performance
across various scenarios.

8. Embeddings are numerical representations of


text that capture semantic meaning. They can
be used for efficient knowledge retrieval by
identifying relevant information from a large
database based on the similarity between the
query and stored text chunks.

9. "Inner monologue" involves prompting the


model to perform intermediate reasoning steps
but enclose them within delimiters. This allows
the developer to parse the output and present
only the final answer or relevant information to
the user, hiding the internal thought process.

10. When extracting information from lengthy


documents or generating lists, the model might
not capture all relevant details in a single pass.
Asking if it missed anything prompts the model
to re-evaluate the input and potentially identify
additional relevant information.

Essay Questions

1. Discuss the ethical considerations associated


with prompt engineering, particularly in contexts
where the model's outputs might be used for
decision-making or influencing human behavior.

2. Analyze the potential benefits and limitations of


using language models for educational
purposes. How can prompt engineering be
leveraged to create effective tutoring or learning
experiences?

3. Compare and contrast different strategies for


improving the factual accuracy and reliability of
language model outputs. How can prompt
engineering be used to mitigate the risks of
generating misleading or fabricated
information?

4. Explore the future of prompt engineering as


language models continue to evolve. What new
challenges and opportunities might arise, and
how could prompt engineering techniques
adapt to address them?

5. Critically evaluate the role of human creativity


and intuition in prompt engineering. To what
extent can prompt engineering be considered
an art as well as a science, and how does
human ingenuity influence the effectiveness of
prompts?

Glossary of Key Terms

Chain of thought: A prompting technique that


encourages the model to explicitly articulate its
reasoning process before arriving at a final answer.

Code execution engine: An external tool that can


execute code, enabling the model to perform
calculations or interact with APIs.

Delimiters: Symbols or phrases used to clearly


separate different sections within a prompt, improving
clarity and disambiguation.

Embeddings: Numerical representations of text that


capture semantic meaning, used for tasks like
knowledge retrieval and similarity comparisons.

Eval (evaluation procedure): A systematic process


for assessing the performance of a language model,
typically involving a set of test cases and predefined
criteria.

Few-shot prompting: A technique that provides the


model with a few examples of the desired output style
or behavior.

Inner monologue: A tactic where the model performs


reasoning steps within delimiters, allowing the
developer to filter the output before presenting it to the
user.

Intent classification: The process of categorizing user


queries based on their underlying intent or purpose.

Knowledge retrieval: The process of retrieving


relevant information from a database or external
sources based on a given query.

Persona: Instructing the model to adopt a specific
character, tone, or style in its responses.

Prompt engineering: The art and science of crafting


effective prompts to elicit desired responses from
language models.

Reference text: Providing the model with external


information or documents to use as a source for
answering questions or completing tasks.

Subtasks: Breaking down complex tasks into smaller,


more manageable steps for the model to process
sequentially.

Systematic testing: Evaluating model outputs against


predefined criteria to measure the impact of prompt
modifications and ensure consistent improvements.

FAQ

1. How can I get better results from large language


models like GPT-4?

Large language models (LLMs) can be incredibly


powerful, but they need clear instructions to perform at
their best. Here are a few key strategies:

• Write clear and detailed instructions: Avoid


ambiguity. Specify exactly what you want,
including the desired format, length, and level of
detail.

• Provide reference text: If relevant, offer


context and background information to guide
the model's responses.

• Break down complex tasks: Decompose


larger tasks into smaller, more manageable
subtasks to reduce error rates.

• Give the model time to “think”: Encourage


step-by-step reasoning or “inner monologue” to
help the model arrive at a well-reasoned
answer.

• Utilize external tools: Integrate tools like


embeddings-based search for knowledge
retrieval or code execution engines for
calculations and API calls.

• Test changes systematically: Evaluate the


impact of prompt modifications through a robust
testing framework to ensure overall
performance improvement.

2. My model keeps giving me irrelevant or made-up


answers. How can I fix this?

LLMs can sometimes hallucinate information,
especially when dealing with obscure topics. Here's
how to improve accuracy:

• Provide specific details in your prompt: The


more context you give, the less the model has
to guess.

• Offer reference text: Giving the model relevant


information to draw from helps reduce the
chance of fabricating answers.

• Use embeddings-based search: This can


help retrieve pertinent information from external
sources to ground the model's responses.

3. How can I control the length and format of the


model's output?

You can influence the output by:

• Specifying the desired length: Request a


specific number of words, sentences,
paragraphs, or bullet points.

• Demonstrating the desired format: Show the


model examples of the output style you prefer.

• Using delimiters: Clearly separate different


sections of the input with markers like triple
quotes or XML tags.

4. Can I make the model adopt a specific persona


or writing style?

Yes, you can guide the model's tone and style by:

• Asking the model to adopt a persona:


Provide instructions like "You are a helpful
assistant who always uses humor" or "You are
a technical expert writing a concise report."

• Providing examples of the desired style:
Show the model samples of the tone and
language you want it to emulate.

5. How do I handle very long conversations or


documents that exceed the model’s context
window?

• Summarize or filter previous dialogue:


Condense past interactions to free up space in
the context window.

• Summarize long documents piecewise:


Break down lengthy text into smaller sections,
summarize each part, and then combine the
summaries recursively.

6. Can I prevent the model from revealing its


reasoning process in certain situations?

Yes, you can use techniques like:

• Inner monologue: Instruct the model to


enclose its reasoning steps within specific
delimiters, allowing you to parse and hide them
before presenting the output to the user.

• Sequence of queries: Break down the task


into multiple queries, hiding the output of
intermediate steps from the end user.

7. How can I utilize external tools to enhance the


model's capabilities?

• Embeddings-based search: Retrieve relevant


information from external knowledge bases to
enrich the model's input.

• Code execution: Allow the model to execute


code for calculations, data manipulation, or API
calls.
• Function calling: Provide descriptions of
external functions that the model can call,
enabling interaction with various tools and
services.

8. How can I test and evaluate whether changes to


my prompts are actually improving the model’s
performance?

• Define a comprehensive test suite: Create a


diverse set of test cases that reflect real-world
usage.

• Evaluate model outputs against gold-


standard answers: Use human evaluation or
model-based evaluation methods to assess the
quality and accuracy of the responses.

• Use OpenAI Evals or other evaluation


frameworks: Leverage available tools for
automating and streamlining the evaluation
process.

