Prompting - Unleashing The Potential of Prompt Engineering in Large Language Models
Research Center for Mathematics, Beijing Normal University, No.18,
Jingfeng Road, Zhuhai 519087, Guangdong, China.
Abstract
This comprehensive review delves into the pivotal role of prompt engineering in
unleashing the capabilities of Large Language Models (LLMs). The development
of Artificial Intelligence (AI), from its inception in the 1950s to the emer-
gence of advanced neural networks and deep learning architectures, has made a
breakthrough in LLMs, with models such as GPT-4o and Claude-3, and in Vision-
Language Models (VLMs), with models such as CLIP and ALIGN. Prompt
engineering, the process of structuring inputs, has emerged as a crucial
technique for maximizing the utility and accuracy of these models. This paper
explores both foundational and advanced methodologies of prompt engineering,
including techniques such as self-consistency, chain-of-thought, and generated
knowledge, which significantly enhance model performance. Additionally, it
examines prompting methods for VLMs through innovative approaches such as
Context Optimization (CoOp), Conditional Context Optimization (CoCoOp),
and Multimodal Prompt Learning (MaPLe). Critical to this discussion is the
aspect of AI security, particularly adversarial attacks that exploit vulnerabili-
ties in prompt engineering. Strategies to mitigate these risks and enhance model
robustness are thoroughly reviewed. The evaluation of prompt methods is also
addressed, through both subjective and objective metrics, ensuring a robust
analysis of their efficacy. This review also reflects the essential role of prompt
engineering in advancing AI capabilities, providing a structured framework for
future research and application.
1 Introduction
In recent years, a significant milestone in artificial intelligence research has been
the progression of natural language processing capabilities, primarily attributed to
Large Language Models (LLMs). Many popular models, rooted in the transformer
architecture [1], undergo training on extensive datasets derived from web-based text.
Central to their design is a self-supervised learning objective, which focuses on pre-
dicting subsequent words in incomplete sentences. Such models are often referred to as Artificial
Intelligence-Generated Content (AIGC) models, and their ability to generate coherent and
contextually relevant responses is a result of this training process, where they learn
to associate words and phrases with their typical contexts.
LLMs operate by encoding the input text into a high-dimensional vector space,
where semantic relationships between words and phrases are preserved. The model
then decodes this representation to generate a response, guided by the learned sta-
tistical patterns [2]. The quality of the response can be influenced by various factors,
including the prompt provided to the model, the model’s hyperparameters, and the
diversity of the training data.
These models, including LLMs such as the GPT series [3, 4] by OpenAI, along
with many others (e.g. Gemini [5, 6] and Gemini (BARD) [7] by Google, Claude series
by Anthropic [8, 9], and Llama series open-source model from Meta [10, 11]), have
revolutionized tasks ranging from information extraction to the creation of engaging
content [12]. In parallel, the development of multimodal large models (MMLMs) has
introduced the ability to process and generate not just text, but also images, audio,
and other forms of data, showcasing their flexibility and effectiveness. These mod-
els integrate multiple data modalities into a single framework, demonstrating strong
capabilities in tasks such as image description and visual question answering (VQA).
Early MMLMs include the DALL-E series [13–15], which can generate images from
textual descriptions, and CLIP, which can understand and relate text and image data
in a unified manner [16, 17]. More powerful models such as GPT-4o by OpenAI [18]
and Claude 3.5 Sonnet by Anthropic [8, 9] excel in multimodal tasks involving text
generation and understanding, integrating natural language processing with various
forms of data to perform diverse and complex tasks. While numerous advanced mod-
els are currently capable of processing audio, the majority of accessible Application
Programming Interfaces (APIs) remain focused on text and vision modalities. With
the gradual introduction of audio APIs, a broad expansion of research in this modal-
ity can be expected [19]. The evolution of LLMs reflects significant strides in AI
research, characterized by increasing model complexity, enhanced training method-
ologies, and broader application potentials. These advancements underline the critical
role of prompt engineering in maximizing the utility and accuracy of these models,
ensuring that they can effectively cater to diverse and dynamic user needs. While this
survey is mainly focused on prompt engineering for LLMs, the inclusion of vision-
language models (VLMs) offers a broader perspective, revealing the potential and
challenges of prompt engineering in handling multimodal data. By integrating research
from both types of models, we can gain a deeper understanding of the applications of
prompt engineering and provide valuable insights for future research and practice.
In real applications, the prompt is the input of the model, and prompt engineering
can result in significant output differences [20]. Modifying both the structure (e.g.,
altering length, arrangement of instances) and the content (e.g., phrasing, choice of
illustrations, directives) of the prompt can exert a notable influence on the model’s
behavior [21, 22].
Prompt engineering refers to the systematic design and optimization of input
prompts to guide the responses of LLMs, ensuring accuracy, relevance, and coher-
ence in the generated output. This process is crucial in harnessing the full potential
of these models, making them more accessible and applicable across diverse domains.
Over time, prompt engineering has evolved from an empirical practice into a well-
structured research domain. As illustrated in Figure 1, the historical progression of
prompt engineering showcases significant milestones from the early days of structured
inputs in the 1950s to advanced methodologies such as chain-of-thought prompting
[23] and self-consistency prompting [24] developed in recent years. This review will
primarily focus on techniques emerging from the period of rapid development after
2017.
Foundations of AI: Initial developments in AI depended on structured, rule-based inputs, wherein the accuracy and
pertinence of these inputs directly impacted system performance. While this did not constitute prompt engineering
in the contemporary sense, it underscored the critical importance of formulating well-defined queries for AI systems.
2010: The deployment of deep neural networks in handling more sophisticated tasks involving unstructured text and image data
highlighted the importance of intelligent input configuration. This period saw the nascent development of what would later be
recognized as prompt engineering, aiming to refine how data inputs could more effectively guide neural network responses.
2018: The emergence of models like BERT and GPT extended the use of prompts beyond specific tasks to a broad range of general
applications. This shift turned prompt engineering into an essential competency for leveraging the full potential of these advanced
models.
2020: With the release of GPT-3, the capacity for generating contextually appropriate and nuanced responses based solely on prompts,
without requiring additional training, emphasized the critical importance of meticulous prompt design in achieving desired outcomes.
The remainder of this paper is organized as follows. Section 2 reviews foundational prompt
engineering methods, and Section 3 presents advanced methodologies that
guide models in generating high-quality content. Section 4 discusses methodologies
specific to VLMs, including Context Optimization (CoOp), Conditional Context Opti-
mization (CoCoOp), and Multimodal Prompt Learning (MaPLe), which enhance the
performance of VLMs [29]. Section 5 assesses the efficacy of various prompt methods
through both subjective and objective evaluations, ensuring a robust analysis of their
effectiveness. Section 6 briefly explores the applications of prompt engineering across
diverse fields such as education, content creation, computer programming, and reason-
ing tasks, highlighting its broad impact. Section 7 addresses the security implications
of prompt engineering, identifying common vulnerabilities in LLMs and reviewing
strategies to enhance security such as adversarial training. Finally, Section 8 explores
prospective methodologies, emphasizing the importance of understanding AI model
structures and the potential of AI agents in advancing AI-generated content tools.
This structured framework provides a comprehensive overview of the pivotal role of prompt
engineering in advancing AI capabilities and guiding future research and applications.
Model input
I want to understand some cutting-edge aspects of technology.
Model output
Here are some of the key areas:
2.4 Role-prompting
Role-prompting is another fundamental method in prompt engineering. It involves
giving the model a specific role to play, such as a helpful assistant or a knowledgeable
expert [40]. This method can be particularly effective in guiding the model’s responses
and ensuring that they align with the desired output. For instance, if the model is
prompted to act as a historian, it is more likely to provide a detailed and contextually
accurate response when asked about a historical event [41]. Another brief example is
shown in Figure 4.
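In practice, role-prompting is typically implemented by placing the role description in a system message (or simply prepending it to the user prompt). The sketch below assumes the OpenAI Python SDK with a valid API key and an illustrative model name; any chat-style API can be substituted.

```python
# Minimal role-prompting sketch. This assumes the OpenAI Python SDK and a valid API
# key; the model name is illustrative, and any chat-style API works the same way.
from openai import OpenAI

client = OpenAI()

role = "You are a historian specializing in nineteenth-century European history."
question = "Why did the Congress of Vienna shape the decades that followed?"

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[
        {"role": "system", "content": role},    # the role assignment
        {"role": "user", "content": question},  # the actual task
    ],
)
print(response.choices[0].message.content)
```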
Model input
I want to understand the cutting edge of technology, specifically related to artificial
intelligence and machine learning. Recently, artificial intelligence has become
extremely popular, especially the Large Language Models which have amazed me.
Please analyze the possible reasons behind the rapid advancement of this
technology from at least three perspectives.
Model output
Absolutely, the rise of AI and machine learning, especially with regard to Large
Language Models like GPT-4, can be seen as a convergence of several key factors.
I’ll analyze this from three perspectives: technological advancements, availability of
data, and socio-economic demand.
1. Technological advancements: The growth of AI and machine learning models has
been facilitated by several technological innovations. For starters, we’ve seen incredi-
ble improvements in computational power, particularly in Graphics Processing Units
(GPUs) and now even more specialized AI chips. These advances have made it ...
2. Availability of data: ...
3. Socio-economic demand: ...
A role prompt
You are an expert in artificial intelligence specializing in Large Language Models...
Model output
Here are five basic methods to quickly get started with Large Language Models:
1. Experiment with web-based interfaces:
...
Standard Prompt
Model input
Q: Yes or no: would a pear sink in water?
Model output
A: Yes, a pear would typically float in water, not sink. The reason for this is that pears, like many fruits, are composed primarily of water and have a density that is less than that of water. Objects that are less dense than water will float when placed in water. ×

One-shot Prompt
Model input
Q: Yes or no: is it common to see frost during some college commencements?
A: The answer is yes.
Q: Yes or no: would a pear sink in water?
Model output
A: The answer is No. ✓
Fig. 5 Comparison of standard prompt and one-shot prompt.
The study in [48] investigated the intricacies of how large generative language models, such as GPT-3,
respond to prompts. One of its significant findings is that zero-shot
prompts can, in certain scenarios, outperform few-shot prompts. This suggests that
the role of few-shot examples might not be as much about teaching the model a new
task (meta-learning) but rather guiding it to recall a task it has already learned. This
insight is crucial as it challenges the conventional wisdom that more examples always
lead to better performance [3]. In the context of one-shot or few-shot prompting, it is
essential to understand that while examples can guide the model, they do not always
enhance its performance. Sometimes, a well-crafted zero-shot prompt can be more
effective than providing multiple examples [48].
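To make the distinction concrete, the sketch below assembles a zero-shot and a one-shot variant of the pear question from Figure 5; call_llm is a hypothetical placeholder standing in for any text-completion API.

```python
# Zero-shot vs. one-shot prompt construction (call_llm is a hypothetical helper
# standing in for any text-completion API).
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM API here")

question = "Q: Yes or no: would a pear sink in water?\nA:"

# Zero-shot: the question alone.
zero_shot_prompt = question

# One-shot: a single worked example precedes the question and fixes the answer format.
example = (
    "Q: Yes or no: is it common to see frost during some college commencements?\n"
    "A: The answer is yes.\n"
)
one_shot_prompt = example + question

for name, prompt in [("zero-shot", zero_shot_prompt), ("one-shot", one_shot_prompt)]:
    print(f"--- {name} ---\n{prompt}\n")
    # answer = call_llm(prompt)  # uncomment once call_llm is wired to a real model
```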
3 Advanced methodologies
The foundational methods from the previous section can help us produce satisfactory
outputs. However, experiments indicate that when using LLMs for complex tasks
such as analysis or reasoning, the accuracy of the model’s outputs still has room
for improvement. In this section, advanced techniques of prompt engineering will be
introduced to guide the model in generating more specific, accurate, and high-quality
content.
3.1 Chain-of-thought
The concept of “Chain-of-Thought” (CoT) prompting [23] in LLMs is a relatively new
development, which has been shown to significantly improve the accuracy of LLMs on
various logical reasoning tasks [53–55]. CoT prompting involves providing intermedi-
ate reasoning steps to guide the model’s responses, which can be facilitated through
simple prompts such as “Let’s think step by step” or through a series of manual
demonstrations, each composed of a question and a reasoning chain that leads to an
answer [56, 57]. It also provides a clear structure for the model’s reasoning process,
making it easier for users to understand how the model arrived at its conclusions.
[58] illustrates the application of CoT prompting to medical reasoning, show-
ing that it can effectively elicit valid intermediate reasoning steps from LLMs. [59]
introduces the concept of Self-Education via Chain-of-Thought Reasoning (SECToR),
and argues that, in the spirit of reinforcement learning, LLMs can successfully teach
themselves new skills by chain-of-thought reasoning. In another study, [60] used CoT
prompting to train verifiers to solve math word problems, demonstrating the tech-
nique’s potential in educational applications. [61] proposed a multimodal version of
CoT, called Multimodal-CoT, to handle more complex, multimodal tasks beyond sim-
ple text-based tasks, such as visual tasks, further expanding the potential applications
of CoT. Furthermore, many works are building upon the CoT framework, for instance,
[62] introduces Automate-CoT, an automated approach for augmenting and selecting
rationale chains to enhance the reasoning capabilities of LLMs, thereby reducing the
reliance on manually crafted CoT prompts.
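In code, a zero-shot CoT prompt amounts to appending a reasoning trigger to the question, while a few-shot variant prepends handcrafted question-rationale-answer demonstrations. The sketch below uses a hypothetical call_llm helper, and the demonstration is invented purely for illustration.

```python
# Chain-of-thought prompting sketch (call_llm is a hypothetical LLM-API helper).
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM API here")

question = ("Imagine an infinitely wide entrance, which is more likely to pass "
            "through it, a military tank or a car?")

# Zero-shot CoT: append a simple reasoning trigger to the question.
zero_shot_cot = question + " Let's think step by step."

# Few-shot CoT: prepend a manual demonstration with an explicit reasoning chain.
demonstration = (
    "Q: A farmer has 15 sheep, buys 8 more, then sells 5. How many sheep are left?\n"
    "A: Start with 15 sheep. Buying 8 gives 15 + 8 = 23. Selling 5 gives 23 - 5 = 18. "
    "The answer is 18.\n\n"
)
few_shot_cot = demonstration + "Q: " + question + "\nA:"

# answer = call_llm(zero_shot_cot)  # the reply should now contain intermediate steps
```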
Model input
Imagine an infinitely wide entrance, which is more likely to pass through it, a
military tank or a car?
Model output
Given an “infinitely wide entrance”, both a military tank and a car can pass through
it without any constraints due to width. Therefore, neither is more likely than the
other based on width alone. Both can pass through with equal likelihood.
However, if other factors such as ground conditions, accessibility, purpose, or autho-
rization come into play, the likelihood might change. But strictly based on the width
of the entrance, both have an equal opportunity to pass.
Model input
Imagine an infinitely wide entrance, which is more likely to pass through it, a
military tank or a car? Let’s think step by step.
Model output
Given an infinitely wide entrance, both a military tank and a car could pass through
without any issues related to the width. However, there are still several factors that
could influence which is more likely to pass through:
1. Ground surface: tanks are designed to handle rough terrains that cars may struggle
with. If the ground leading to and from the entrance is too rugged or steep for a car,
the tank would be more likely to pass through.
2. Weight restrictions: if the entrance has a load limit due to the materials used in
its construction, the car, being lighter, could be more likely to pass through. The
average car weighs around 1.5 tons, while a military tank can weigh anywhere from
60 to 70 tons.
3. Height clearance: if the entrance is infinitely wide but not infinitely tall, the height
of the vehicle matters. The average height of a car is around 1.5 meters, while tanks
can be around 2.4 meters. If the entrance is shorter than around 2.4 meters, a car
would be more likely to pass through.
4. Regulations and permissions:
...
In conclusion, without specific information about the conditions at the entrance, we
would assume that a car is more likely to pass through, simply because cars are more
common, more likely to meet potential height and weight restrictions and face fewer
regulatory barriers.
3.2 Self-consistency
In the assessment of InstructGPT [64] and GPT-3 [3] on a new synthetic QA dataset
called Proof and Ontology-Generated Question-Answering (PrOntoQA) [65, 66], it
was observed that although the largest model exhibited capability in reasoning tasks, it encountered challenges in proof planning and in selecting the appropriate proof step among multiple options, which introduced uncertainty into its accuracy [65]. Self-consistency is one method that addresses this situation: it is
an advanced prompting technique that aims to ensure the model’s responses are con-
sistent with each other [23, 24]. This method greatly increases the odds of obtaining
highly accurate results. The principle of self-consistency in language models posits
that for a complex reasoning problem, there can be multiple reasoning paths leading
to the correct answer. In this approach, a language model generates a diverse set of
reasoning paths for the same problem. The most accurate and consistent answer is
then determined by evaluating and marginalizing across these varied paths, ensuring
that the final answer reflects the convergence of multiple lines of thought.
The self-consistency method contains three steps. Firstly, prompt a language model
using CoT prompting, then replace the “greedy decode” (1-Best) [32, 67] in CoT
prompting by sampling from the language model’s decoder to generate a diverse set
of reasoning paths, and finally, marginalize out the reasoning paths and aggregate by
choosing the most consistent answer in the final answer set.
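These three steps translate into a simple sampling-and-voting loop: sample several CoT completions at a nonzero temperature, parse the final answer from each, and return the most frequent one. The following sketch assumes a hypothetical call_llm helper and a task-specific extract_answer parser.

```python
# Self-consistency sketch: sample diverse reasoning paths, then majority-vote the
# final answers. call_llm and extract_answer are hypothetical helpers.
from collections import Counter

def call_llm(prompt: str, temperature: float) -> str:
    raise NotImplementedError("plug in an LLM API here")

def extract_answer(completion: str) -> str:
    # Task-specific parsing, e.g. take the text after "The answer is".
    return completion.rsplit("The answer is", 1)[-1].strip(" .")

def self_consistent_answer(question: str, n_samples: int = 10) -> str:
    prompt = question + " Let's think step by step."
    answers = []
    for _ in range(n_samples):
        # Temperature sampling replaces greedy decoding so that each path can differ.
        completion = call_llm(prompt, temperature=0.7)
        answers.append(extract_answer(completion))
    # Marginalize over reasoning paths: keep the most frequent final answer.
    return Counter(answers).most_common(1)[0][0]
```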
It is noteworthy that self-consistency can be harmoniously integrated with most
sampling algorithms, including but not limited to, temperature sampling [49, 50],
top-k sampling [32, 68, 69], and nucleus sampling [34]. Nevertheless, such an operation
may necessitate the invocation of the model’s API to fine-tune these hyperparameters.
In light of this, an alternative approach could be to allow the model to generate
results employing diverse reasoning paths, and then generate a diverse set of candidate
reasoning paths. The response demonstrating the highest degree of consistency across
the various reasoning trajectories is then more inclined to represent the accurate
solution [70].
[2, 71] have shown that self-consistency enhances outcomes in arithmetic, com-
monsense, and symbolic reasoning tasks. Furthermore, in practice, self-consistency
can be combined with other techniques to further enhance the model’s performance.
[72] found that combining self-consistency with a discriminator-guided multi-step
reasoning approach significantly improved the model’s reasoning capabilities.
Model input
Generate two key analyses related to detailed size data on military tanks and cars,
and then generate three key influencing factors regarding whether an object can
pass through an infinitely wide entrance.
Model output
Detailed size analysis: military tanks vs. cars
1. Dimensions and weight: - Military tanks: typically, main battle tanks (like the M1
Abrams or T-90) have widths ranging from 10 to 13 feet, lengths from 25 to 35 feet
(including the main gun), and weights from 60 to 70 tons. - Cars: most consumer cars
have a width of about 6 to 8 feet, a length of 12 to 20 feet, and weigh between 1 to 3
tons. Some large SUVs and trucks can exceed these figures but are still substantially
smaller and lighter than tanks.
2. Maneuverability & design: - Military tanks: tanks are designed to traverse rough
terrains and have tracks that distribute weight and prevent the vehicle from getting
easily stuck in muddy or uneven ground. However, their sheer weight and size limit
their maneuverability in urban or congested settings. - Cars: designed for paved
roads, cars are generally more agile and can navigate tight spaces more efficiently
than tanks. Their design emphasizes aerodynamics and comfort, unlike tanks that
prioritize armor and firepower.
Three key influencing factors on whether an object can pass through an infinitely
wide entrance:
1. Height of the object: the height is critical. An infinitely wide entrance does not
give information about its height. If the object is taller than the entrance, it won’t
pass through.
2. Structural integrity: depending on the material and design of an object, squeezing or
forcing it through an entrance could compromise its structural integrity. For instance,
certain tanks have reactive armor or sensitive equipment that may be damaged if
scraped or bumped forcefully.
3. Ground clearance and maneuverability: ...
Least-to-most prompting [74] rests on the key principles of problem decomposition and sequential problem solving, enabling the model to manage and solve complex tasks more effectively.
Upon rigorous experimentation in domains including symbolic manipulation, com-
positional generalization, and mathematical reasoning, [74] substantiate that the
least-to-most prompting paradigm exhibits the capacity to generalize across challenges
of greater complexity than those initially presented in the prompts. [75] introduced
Program Aided Language models (PAL), using the LLMs to read natural language
problems and generate programs as the intermediate reasoning steps. By using least-to-most prompting, PAL shows improvements on GSM8K [60] and SVAMP [76], which are benchmarks of complex mathematical word problems for LLMs.
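A minimal rendering of least-to-most prompting is a two-stage script: a decomposition prompt that elicits a numbered list of sub-questions ordered from easiest to hardest, followed by sequential solving prompts in which each answered sub-question is appended to the context of the next. The sketch below is schematic and relies on a hypothetical call_llm helper.

```python
# Least-to-most prompting sketch (call_llm is a hypothetical LLM-API helper).
import re

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM API here")

def least_to_most(question: str) -> str:
    # Stage 1: ask the model to decompose the problem into simpler sub-questions.
    decomposition = call_llm(
        "Break the following problem into a numbered list of simpler sub-questions, "
        f"ordered from easiest to hardest:\n{question}"
    )
    sub_questions = [m.group(1).strip()
                     for m in re.finditer(r"^\s*\d+\.\s*(.+)$", decomposition, re.M)]

    # Stage 2: solve the sub-questions sequentially, feeding earlier answers forward.
    context = f"Problem: {question}\n"
    answer = ""
    for sub_q in sub_questions:
        answer = call_llm(context + f"Sub-question: {sub_q}\nAnswer:")
        context += f"Sub-question: {sub_q}\nAnswer: {answer}\n"
    return answer  # answer to the final (hardest) sub-question
```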
Model input
"""
Detailed size analysis: military tanks vs. cars
1. Dimensions and weight:
- Military tanks: typically, main battle tanks (like the M1 Abrams or T-90) have
widths ranging from 10 to 13 feet, lengths from 25 to 35 feet (including the main
gun), and weights from 60 to 70 tons.
- Cars: most consumer cars have a width of about 6 to 8 feet, a length of 12 to 20
feet, and weigh between 1 to 3 tons. Some large SUVs and trucks can exceed these
figures but are still substantially smaller and lighter than tanks.
2. Maneuverability & design...
"""
Use the above information to answer the following question: Imagine an infinitely
wide entrance, which is more likely to pass through it, a military tank or a car?
Model output
Given the information provided about the dimensions, design, and the factors influ-
encing whether an object can pass through an infinitely wide entrance, we can infer
the following:
1. Height of the object: the height is not specified for either tanks or cars. How-
ever, given that tanks are built for rough terrains and often have a turret and other
equipment on top, they generally have a taller profile compared to most cars. If the
entrance had a restrictive height, cars would generally have an advantage here.
2. Structural integrity: ...
3. Ground clearance and maneuverability:
...
Conclusion: If the entrance height is not restrictive and the surface is rough, a mili-
tary tank would be more likely to pass through it without issues. On the other hand,
if the entrance is low-height or has tight squeezes that could compromise the struc-
tural integrity of larger, robust vehicles, a car might be at an advantage.
However, given no height constraints and only the factor of width (since it is “infinitely
wide”), both a military tank and a car would easily pass through it. In such a scenario,
the distinguishing factor would be the terrain of the entrance. If it is rough or uneven,
the tank has the advantage. If it is smooth, both would pass with relative ease.
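The two examples above follow the generated knowledge pattern: the model is first prompted to produce relevant background facts, which are then quoted back as context for the actual question. A compact sketch of this two-stage pipeline, again over a hypothetical call_llm helper, is given below.

```python
# Generated-knowledge prompting sketch (call_llm is a hypothetical LLM-API helper).
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM API here")

def answer_with_generated_knowledge(question: str) -> str:
    # Stage 1: ask the model for background knowledge relevant to the question.
    knowledge = call_llm(
        "Generate key facts and analyses that would help answer the question below, "
        f"without answering it yet.\nQuestion: {question}"
    )
    # Stage 2: quote the generated knowledge back as context for the real question.
    final_prompt = (
        f'"""\n{knowledge}\n"""\n'
        f"Use the above information to answer the following question: {question}"
    )
    return call_llm(final_prompt)
```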
The Tree-of-Thoughts (ToT) formulation extends CoT by letting the model explore and evaluate multiple reasoning branches rather than a single chain. [77] demonstrates that this formulation is more versatile and can handle challenging tasks where standard prompts might fall short. Another study [78] further
emphasizes the potential of this technique in enhancing the performance of LLMs by
structuring their thought processes.
[7] introduces the “tree-of-thought prompting”, an approach that assimilates the
foundational principles of the ToT frameworks and transforms them into a stream-
lined prompting methodology. This technique enables LLMs to assess intermediate
cognitive constructs within a singular prompt. An exemplar ToT prompt is delineated
in Figure 11.
Problem: If a train travels 60 kilometers in 1 hour, how far will it travel in 2.5 hours?
Step 1: Decompose the problem.
Subproblem 1: Determine the speed of the train.
Subproblem 2: Calculate the distance traveled in 2.5 hours using the speed.
Step 2: Solve sequentially.
Model input (Subproblem 1): The train travels 60 kilometers in 1 hour. What is the speed of the train in kilometers per hour?
Solution 1: 60 kilometers per hour.
Model input (Subproblem 2): If the train travels at 60 kilometers per hour, how far will it travel in 2.5 hours?
Solution 2: Using the speed from Subproblem 1, the model calculates that 60 km/h multiplied by 2.5 hours equals 150 kilometers.
Step 3: Aggregate the solutions.
Final answer: The train will travel 150 kilometers in 2.5 hours.
In the Graph-of-Thought (GoT) framework, individual insights can subsequently be interlinked based on their pertinence and interdependencies, culminating in a detailed graph. This constructed graph permits diverse
traversal methods, ensuring the final solution is both precise and comprehensive,
encompassing various dimensions of the challenge.
The efficacy of the GoT framework is anchored in its adaptability and the pro-
found insights it can yield, particularly for intricate issues necessitating multifaceted
resolutions. Nonetheless, it is imperative to recognize that while GoT facilitates a sys-
tematic approach to problem-solving, it also necessitates a profound comprehension
of the subject matter and meticulous prompt design to realize optimal outcomes [80].
Key Components
Decomposer LLM
Prompting Program
1. Created by Decomposer LLM.
2. Consists of a series of sub-queries and associated sub-task functions.
Sub-task Handlers
1. Specialized modules or functions that handle specific types of sub-tasks.
2. Examples include functions for text splitting, extracting specific information from text, or concatenating strings.
Controller
1. Task scheduling: Determines the execution sequence of tasks.
2. Data transfer and coordination: Transfer of data between sub-task processors. Coordinate the flow of data
between the Decomposer and the sub-task handlers.
3. Status management: Tracks the execution status of a task and determines when to stop or continue the execution.
The prompting program P can be represented as
P = {(f_1, Q_1, A_1), . . . , (f_k, Q_k, A_k)},
where A_k is the final answer predicted by P, and Q_i is a sub-query directed to the sub-task function f_i ∈ F. A high-level imperative controller manages the execution of P, passing inputs and outputs between the decomposer and sub-task handlers until the final output is obtained.
To teach the decomposer LLM, in-context examples are used. These examples
demonstrate the decomposition of complex queries into simpler sub-queries. Each
example E_j takes the form
E_j = (Q_j, (Q_{j,1}, A_{j,1}), . . . , (Q_{j,k_j}, A_{j,k_j})),
where A_{j,k_j} = A_j is the final answer for Q_j, and (Q_{j,1}, . . . , Q_{j,k_j}) represents the decomposition of Q_j. Each sub-task function f is operationalized through sub-task handlers, which can be additional LLM prompts or symbolic or learned functions [81].
An illustration of the process flow is shown in Figure 13.
Process Flow
Initial Query
A complex task is submitted as an initial query Q.
Decomposition
The Decomposer LLM generates the first sub-query Q_1 and determines the appropriate sub-task handler f_1.
Final Output
1. The Controller checks for a stopping condition in the prompting program.
2. Once the stopping condition is met, the Controller returns the final result.
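Conceptually, the controller is a loop that repeatedly asks the decomposer for the next (sub-task function, sub-query) pair, dispatches the sub-query to the corresponding handler, and stops once the decomposer signals completion. The sketch below is a simplified rendering of this flow; the decomposer, the handlers, and call_llm are hypothetical stubs rather than the implementation of [81].

```python
# Simplified DECOMP-style controller loop. The decomposer, the sub-task handlers and
# call_llm are hypothetical stubs standing in for prompted LLMs or symbolic functions.
from typing import Callable, Dict, List, Tuple

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM API here")

# Sub-task handlers F: each maps a sub-query to an answer.
SUB_TASK_HANDLERS: Dict[str, Callable[[str], str]] = {
    "split": lambda q: " ".join(list(q)),   # e.g. split a word into letters
    "merge": lambda q: q.replace(" ", ""),  # e.g. concatenate fragments
    "llm":   lambda q: call_llm(q),         # fall back to a prompted LLM
}

def decomposer(question: str, history: List[Tuple[str, str, str]]) -> Tuple[str, str]:
    """Return the next (sub-task function name, sub-query) pair, or ("stop", final answer).
    In DECOMP this is itself an LLM prompted with in-context decomposition examples."""
    raise NotImplementedError

def controller(question: str) -> str:
    history: List[Tuple[str, str, str]] = []  # the growing program P = [(f_i, Q_i, A_i)]
    while True:
        f_name, sub_query = decomposer(question, history)
        if f_name == "stop":    # stopping condition reached
            return sub_query    # the decomposer returns the final answer A_k
        answer = SUB_TASK_HANDLERS[f_name](sub_query)
        history.append((f_name, sub_query, answer))
```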
The DECOMP approach has several advantages. First, its modularity allows each
sub-task handler to be independently optimized, debugged, and upgraded, which
facilitates systematic performance improvements and easier integration of new meth-
ods or models. Second, DECOMP can incorporate error-correcting sub-task handlers,
improving the overall accuracy and reliability of the system. Third, the approach
allows for diverse decomposition structures, including hierarchical and recursive
decompositions, which are particularly useful for handling complex and large-scale
problems. Finally, sub-task handlers can be shared across different tasks, enhancing
the efficiency of the problem-solving process.
DECOMP and Least-to-Most Prompting [74] both decompose complex tasks
to enhance large language models’ reasoning abilities, but DECOMP distinguishes
itself through its flexible, modular approach. Unlike Least-to-Most Prompting’s lin-
ear progression from easy to hard sub-questions, DECOMP allows for non-linear and
recursive decomposition, with dedicated sub-task handlers that can be independently
optimized and replaced. This modularity not only enhances flexibility and reusabil-
ity across tasks but also introduces potential error-correcting mechanisms, making
DECOMP more robust and adaptable to complex, multi-step reasoning tasks. While
DECOMP has demonstrated superior performance in specific domains, such as sym-
bolic reasoning and multi-step question answering, its advantages over Least-to-Most
Prompting may vary depending on the nature of the task [81].
In case studies, DECOMP demonstrated superior performance in various sce-
narios. For instance, in the k-th letter concatenation task, DECOMP outperformed
CoT prompting by effectively teaching the sub-task of extracting the k-th letter
through further decomposition. In list reversal, DECOMP showed better length gen-
eralization compared to CoT by recursively decomposing the task into reversing
smaller sub-lists, achieving higher accuracy for longer input sequences. In long-context
question answering (QA), DECOMP allowed for handling more examples than fea-
sible with CoT prompting, leading to improved performance. In open-domain QA,
incorporating symbolic retrieval APIs within the DECOMP framework enhanced per-
formance on multi-hop QA datasets compared to CoT prompting. Additionally, in
Math QA, DECOMP improved accuracy by post-processing CoT prompts to fix
frequent formatting errors, resulting in significant performance gains [81].
By leveraging the modular, flexible, and systematic approach of DECOMP, com-
plex tasks can be effectively decomposed and solved, showcasing its superiority over
traditional CoT prompting and other contemporary methods.
Fig. 14 Illustration of the whole process. (1) Uncertainty Estimation. (2) Collection, Ranking and Selection. (3) Annotation (by human). (4) Inference.
The active prompt method offers several key benefits, including efficient task adap-
tation and significant performance improvements across various reasoning domains.
This approach aligns with the broader trend towards more interactive and adaptive
AI systems, emphasizing the importance of responsive design in prompt engineering.
Its ability to reduce human engineering efforts by focusing on the most uncertain and
informative questions makes it an important tool for advancing LLM capabilities. This
method not only enhances the quality of task-specific prompts but also maximizes
the use of human expertise, paving the way for more sophisticated and accurate AI
systems [82].
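The uncertainty-estimation step at the heart of active prompting can be approximated by sampling several answers per candidate question and measuring their disagreement; the most uncertain questions are then forwarded to human annotators for rationale chains. The following simplified sketch assumes a hypothetical call_llm helper and uses answer disagreement as the uncertainty metric.

```python
# Active-prompt-style uncertainty estimation (simplified): questions whose sampled
# answers disagree the most are selected for human CoT annotation.
from typing import List

def call_llm(prompt: str, temperature: float) -> str:
    raise NotImplementedError("plug in an LLM API here")

def disagreement(question: str, k: int = 5) -> float:
    """Uncertainty as the fraction of distinct final answers among k sampled completions."""
    completions = [call_llm(question + " Let's think step by step.", temperature=0.7)
                   for _ in range(k)]
    finals = [c.rsplit("The answer is", 1)[-1].strip(" .") for c in completions]
    return len(set(finals)) / k

def select_for_annotation(pool: List[str], n: int = 8) -> List[str]:
    # Rank unlabeled questions by uncertainty; humans then write rationale chains
    # (CoT exemplars) only for the top-n most uncertain ones.
    ranked = sorted(pool, key=disagreement, reverse=True)
    return ranked[:n]
```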
Prompt patterns are analogous to design patterns in software engineering, providing structured and documented approaches to enhance the output and interaction quality of LLMs. The framework for documenting these prompt patterns includes a detailed structure that ensures their adaptability across different domains [87].
To systematically categorize these prompt patterns, the authors have divided them
into five primary categories: Input Semantics, Output Customization, Error Identifica-
tion, Prompt Improvement, and Interaction. This classification helps in organizing the
patterns based on their functional roles and the specific problems they address. Within
this framework, the research introduces a comprehensive catalog of 16 distinct prompt
patterns. Each pattern is meticulously documented with the following components:
name and classification, intent and context, motivation, structure and key ideas, exam-
ple implementation, and practical consequences. The prompt patterns cover a wide
range of functionalities. For instance, the Input Semantics category includes patterns
such as Meta Language Creation, which helps in defining custom input languages for
LLMs. The Output Customization category features patterns such as Output Auto-
mater and Visualization Generator, which tailor the generated outputs to specific
formats or visualizations. Error Identification patterns such as Fact Check List ensure
the accuracy of generated content by highlighting critical facts for verification. Prompt
Improvement patterns, including Question Refinement and Alternative Approaches,
enhance the quality of interactions by refining questions and suggesting multiple ways
to achieve a goal. Lastly, Interaction patterns such as Flipped Interaction and Game
Play facilitate dynamic and engaging user-LLM interactions [87].
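One way to make such a catalog operational in code is to store each documented pattern as a small record (name, classification, intent, template) and instantiate templates on demand. The sketch below is purely illustrative; the template wordings are hypothetical and do not reproduce the catalog in [87].

```python
# Illustrative prompt-pattern catalog entries (hypothetical templates, not the
# exact wording documented in the cited catalog).
from dataclasses import dataclass

@dataclass
class PromptPattern:
    name: str
    classification: str  # e.g. Input Semantics, Output Customization, Interaction
    intent: str
    template: str

    def render(self, **kwargs: str) -> str:
        return self.template.format(**kwargs)

CATALOG = {
    "fact_check_list": PromptPattern(
        name="Fact Check List",
        classification="Error Identification",
        intent="Surface the facts the answer depends on so they can be verified.",
        template="{task}\nAfter answering, list the key facts your answer relies on "
                 "so that they can be independently checked.",
    ),
    "flipped_interaction": PromptPattern(
        name="Flipped Interaction",
        classification="Interaction",
        intent="Let the model drive the dialogue by asking questions first.",
        template="I want to achieve the following goal: {goal}. Ask me questions one "
                 "at a time until you have enough information, then propose a plan.",
    ),
}

prompt = CATALOG["flipped_interaction"].render(goal="deploy a small web service")
```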
The methodology also emphasizes the combinatory use of these patterns to tackle
more complex prompt engineering tasks. By providing detailed examples and practical
implementations, the research demonstrates how multiple prompt patterns can be
integrated to create sophisticated and efficient prompting strategies. This structured
approach not only improves the effectiveness of LLMs in various applications but also
contributes to the broader understanding and advancement of prompt engineering as
a field [87].
Research supports the effectiveness of prompt pattern catalogs. [87] outlines that
the development and use of a prompt pattern catalog can improve the effectiveness
and efficiency of prompt engineering with LLMs. [89] explores how predefined struc-
tured prompt patterns can enhance user interaction and improve model outputs in
conversational AI. [88] investigates the application of prompt engineering patterns in
enterprise settings, demonstrating their utility in optimizing model performance across
various tasks. Additionally, [90] highlights the benefits of using predefined structured
prompt patterns in software development, demonstrating significant improvements in
code quality, requirements elicitation, and refactoring efficiency.
Prompt optimization techniques range from manual refinement to customized optimization for specific models. The choice of method depends on task requirements, model complexity, and available resources.
[92] introduces Model-Adaptive Prompt Optimization (MAPO), a novel approach designed to fine-tune prompts for individual LLMs, thereby maximizing their effectiveness across various downstream tasks. MAPO addresses the inherent variability in how different LLMs respond to the same prompt by introducing a two-phase optimization process. The first phase involves establishing a warm-up dataset, where candidate prompts are generated and evaluated for their suitability to each LLM. This is followed by a combination of Supervised
Fine-Tuning (SFT) and Reinforcement Learning (RL), particularly employing tech-
niques such as Proximal Policy Optimization (PPO) and Ranking Responses from
Model Feedback (RRMF). This joint learning approach refines the prompts, ensuring
they align with the specific preferences of each LLM.
Empirical studies demonstrate that MAPO significantly improves performance in
tasks such as question-answering, classification, and text generation when compared to
conventional task-specific prompt optimization methods. By shifting the focus from a
one-size-fits-all approach to a more nuanced, model-specific strategy, MAPO enhances
the robustness and generalization of LLMs, making it a powerful tool in the prompt
engineering toolkit [92].
3.10.4 PromptAgent
The PromptAgent method suggests framing prompt optimization as a strategic plan-
ning problem. At the core of this method is the use of Monte Carlo Tree Search
(MCTS), a principled planning algorithm that strategically navigates the vast space
of expert-level prompts. Unlike conventional methods that generate prompts through
local variations, PromptAgent employs a trial-and-error mechanism, inspired by
human problem-solving strategies. This approach allows the model to iteratively
refine prompts based on error feedback, simulating future rewards and prioritizing
high-reward paths [94]. Complementary resources can also aid prompt optimization; for instance, PromptSource [95] collects over 2,000 open-source prompts for roughly 170 datasets through dataset exploration, prompt writing, and documentation.
PromptAgent’s effectiveness has been demonstrated across a diverse set of tasks,
spanning general NLP challenges and domain-specific applications such as biomedical
text processing. By autonomously generating prompts that incorporate domain-
specific knowledge and detailed task instructions, PromptAgent consistently outper-
forms both human-designed prompts and other automated optimization methods
[94], highlighting the importance of integrating strategic planning and self-reflection
capabilities into prompt optimization frameworks.
3.10.6 GPTs (plugins)
Before ending this discussion on prompt optimization techniques, we need to mention
the use of external prompt engineering assistants that have been developed recently
and exhibit promising potential. Unlike the methods introduced previously, these
instruments can help us polish the prompt directly. They are adept at analyzing user inputs and producing pertinent outputs within a self-defined context, thereby amplifying the efficacy of prompts. Some of the plugins pro-
vided by the OpenAI GPT store are good examples of such tools [98]. Some popular
GPT store apps that specialize in generating or optimizing prompts are shown in
Figure 15.
Fig. 15 Examples of GPT apps that specialize in generating or optimizing prompts [99].
3.11 Retrieval augmentation
Another direction of prompt engineering research aims to reduce hallucinations.
When using AIGC tools such as GPT-4, it is common to face a problem called “hal-
lucinations”, which refer to the presence of unreal or inaccurate information in the
model’s generated output [26, 107]. While these outputs may be grammatically cor-
rect, they can be inconsistent with facts or lack real-world data support. Hallucinations
arise because the model may not have found sufficient evidence in its training data to
support its responses, or it may overly generalize certain patterns when attempting
to generate fluent and coherent output [108].
An approach to reduce hallucinations and enhance the effectiveness of prompts is
the so-called retrieval augmentation technique, which aims at incorporating up-to-date
external knowledge into the model’s input [109, 110]. It is emerging as an AI frame-
work for retrieving facts from external sources. [111] examines the augmentation of
context retrieval through the incorporation of external information. It proposes a
sophisticated operation: the direct concatenation of pertinent information obtained
from an external source to the prompt, which is subsequently treated as founda-
tional knowledge for input into the expansive language model. Additionally, the
paper introduces auto-regressive techniques for both retrieval and decoding, facilitat-
ing a more nuanced approach to information retrieval and fusion. [111] demonstrates
that in-context retrieval-augmented language models, when constructed upon read-
ily available general-purpose retrievers, yield significant LLM enhancements across a
variety of model dimensions and diverse corpora. In another research, [112] showed
that GPT-3 can reduce hallucinations by studying various implementations of the
retrieval augmentation concept, such as Retrieval Augmented Generation (RAG)
[113], Fusion-in-Decoder (FiD) [114], Seq2seq [115–117] and others. [118] developed
the Chain-of-Verification (CoVe) approach to reduce hallucinations, based on letting
the LLM deliberate on its own responses before self-correcting them. They suspect
that extending this approach with retrieval augmentation would likely bring further
gains. UNIWEB [119] converts knowledge-intensive tasks into a unified text-to-text framework, treating the web as a general source of knowledge.
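In its simplest in-context form, retrieval augmentation retrieves the top-k passages for a query and concatenates them to the prompt as grounding knowledge. The sketch below relies on hypothetical retrieve and call_llm helpers; a production system would substitute a real sparse or dense retriever and an LLM API.

```python
# Minimal in-context retrieval-augmentation sketch. retrieve() and call_llm() are
# hypothetical helpers standing in for a real retriever and an LLM API.
from typing import List

def retrieve(query: str, k: int = 3) -> List[str]:
    """Return the k most relevant passages from an external corpus
    (e.g. BM25 or a dense retriever in a real system)."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM API here")

def answer_with_retrieval(question: str) -> str:
    passages = retrieve(question)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the passages below; say so if they are "
        "insufficient.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return call_llm(prompt)
```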
For example, a prompt designed using ART might guide an LLM to first outline the
steps required to solve a complex mathematical problem and then use a calculator tool
to perform the necessary calculations. This combination of reasoning and tool usage
ensures that the model’s outputs are both logically coherent and computationally
accurate.
[121] have demonstrated that ART can help models navigate complex problem
spaces more effectively by breaking down tasks into manageable steps and utiliz-
ing appropriate tools at each stage. For instance, the integration of ART in natural
language processing tasks has shown promising results in areas such as automated
customer service, where models need to access and process information dynamically
[122].
Moreover, ART’s approach aligns with ongoing efforts to develop more robust and
versatile AI systems capable of handling real-world tasks that demand a combina-
tion of cognitive and computational skills. [123] explores advanced ART techniques to
achieve better accuracy and reliability in AI applications. These findings underscore
the importance of ART in enhancing the functionality and performance of LLMs,
making them more adept at handling a broader range of tasks, in particular techni-
cal problem-solving tasks that require specific and precise outputs such as financial
calculations or data analysis.
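The interleaving of reasoning steps with tool calls can be approximated by a small dispatch loop in which the model emits either a reasoning step, a tool request, or a final answer, and tool results are appended back to the context. The sketch below is a simplified, hypothetical rendering of this pattern with a single calculator tool; it is not the ART implementation of [121].

```python
# Simplified reasoning-plus-tool-use loop in the spirit of ART (not the original
# implementation). call_llm is a hypothetical helper; only a calculator tool is wired in.
import ast
import operator

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM API here")

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_calc(expr: str) -> float:
    """Evaluate a basic arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

def solve(task: str, max_steps: int = 8) -> str:
    context = (f"Task: {task}\nThink step by step. To compute something, write a line "
               "CALC: <expression>. Write FINAL: <answer> when done.\n")
    for _ in range(max_steps):
        step = call_llm(context)
        context += step + "\n"
        if step.startswith("CALC:"):
            result = safe_calc(step[len("CALC:"):].strip())
            context += f"RESULT: {result}\n"  # feed the tool output back to the model
        elif step.startswith("FINAL:"):
            return step[len("FINAL:"):].strip()
    return "no answer within step budget"
```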
Model Input
Problem to solve: You have lost a key in your house and need to find it.
Detailed Description: You just came back from outside and placed your bag on the entryway table. You remember the last time you used the
key was to unlock the door. The room has an entryway table, a kitchen, a living room, a bedroom, and a bathroom. You need to find the key.
Model Output
Output A: Obs 2: Found the key on the kitchen table. Result: ... Analysis: although the result is correct, the actions are blind and rely on luck to find the key.
Output B: Thought 2: The key is not in the entryway, next step is to check the kitchen. Act 2: Check the kitchen. Obs 2: Found the key on the kitchen table. Result: ... Analysis: combines reasoning and action, maximizes the probability and efficiency of finding the key.
This section discusses the importance of VLMs and their applications in multimodal tasks to provide a more comprehensive perspective.
VLMs are based on the transformer architecture, and are trained on extensive
datasets to learn complex semantic relationships. However, unlike early unimodal
models, VLMs process both textual and visual information, enabling them to establish
connections between image understanding and text generation. As can be expected,
this multimodal integration makes VLMs particularly effective at handling complex
tasks that involve both images and text.
To seamlessly integrate and interpret these diverse data types, VLMs require
sophisticated prompt designs that ensure contextual coherence and accuracy [131,
132]. Challenges such as data alignment, modality integration, and context preserva-
tion are addressed through advanced techniques such as Context Optimization (CoOp,
subsection 4.3) and Multimodal Prompt Learning (MaPLe, subsection 4.5). These
advanced prompt engineering techniques enhance the ability of VLMs to generate
nuanced and contextually rich outputs, thereby facilitating their effective utilization
in various applications and enabling them to tackle more complex tasks [131].
highlighting the model’s ability to generalize across different domains. Additionally,
[133] presented a method for adapting CLIP to few-shot classification tasks without
additional training, emphasizing practical benefits in real-world applications.
Regarding CoOp’s effectiveness, [135] showed that models using CoOp signifi-
cantly outperform traditional models in tasks such as image recognition and VQA.
Additionally, [138] highlighted the benefits of ensembling context optimization, which
further enhances the model’s performance by combining multiple context vectors. This
approach has been shown to improve the robustness and generalization of VLMs in
real-world applications [139].
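At its core, CoOp replaces a hand-written template such as "a photo of a <class>" with a small set of learnable context vectors that are prepended to each class-name embedding and optimized while the VLM's encoders remain frozen. The PyTorch-style sketch below is schematic: the text encoder and the class-token embeddings are placeholders standing in for components of a pretrained VLM such as CLIP, not actual library calls.

```python
# Schematic CoOp-style learnable context (PyTorch). The text encoder, image features
# and class-name token embeddings are placeholders for a frozen pretrained VLM.
import torch
import torch.nn as nn

class CoOpPrompts(nn.Module):
    def __init__(self, n_classes: int, n_ctx: int = 16, dim: int = 512):
        super().__init__()
        # Learnable context vectors shared across classes (the only trained parameters).
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Placeholder for frozen class-name token embeddings, one per class.
        self.register_buffer("class_tokens", torch.randn(n_classes, 1, dim))

    def forward(self) -> torch.Tensor:
        n_classes = self.class_tokens.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)
        # [context vectors | class token] -> one prompt sequence per class.
        return torch.cat([ctx, self.class_tokens], dim=1)

def class_logits(image_feat, prompt_seqs, text_encoder, scale: float = 100.0):
    # Encode each prompt sequence with the frozen text encoder, then score by
    # cosine similarity with the image feature, as in CLIP-style classification.
    text_feat = text_encoder(prompt_seqs)                        # (n_classes, dim)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    return scale * image_feat @ text_feat.t()
```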
Handling of multi-modal data: CoOp (Context Optimization) is less effective, as it focuses on language prompts; CoCoOp (Conditional Context Optimization) is effective, as it integrates both image and language prompts; MaPLe (Multi-modal Prompt Learning) is highly effective, as it integrates both vision and language prompts.
Computational complexity: CoOp is moderate; CoCoOp is increased, due to dynamic prompt generation; MaPLe is high, due to multi-modal and multi-level prompt learning.
Advantages: CoOp simplifies prompt engineering and performs well on seen classes; CoCoOp's dynamic prompts enhance generalization to unseen classes and perform well across tasks and datasets; MaPLe's multi-modal prompt learning and coupling enhance model collaboration and generalization.
Disadvantages: CoOp's static prompts perform poorly on unseen classes, with limited generalization and less adaptivity to different tasks and datasets; CoCoOp has increased computational complexity, potentially requiring more computational resources; MaPLe's implementation is more complex and may require more computational resources and training time.
Fig. 17 Comparison between MaPLe and the traditional methods (CoOp and CoCoOp).
Subjective evaluations rely on human judges to assess the quality of the generated content. Objective evaluations, also known as automatic evaluation methods, use algorithms to score the quality of text generated by LLMs or test on various benchmarks to quantitatively measure the efficacy of prompt methods.
Subjective evaluation and objective evaluation methods each have their advantages
and disadvantages. Subjective evaluation is more in line with human intuition, but it
is also more expensive and time-consuming [146]. Objective evaluation is less expensive and quicker than subjective evaluation. For instance, despite numerous studies highlighting the limited correlation between BLEU and human assessments, the popularity of such automatic metrics has remained unaltered [147, 148]. The
best way to evaluate the quality of LLM output depends on the specific application
[149].
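A minimal objective evaluation consists of running competing prompt templates over a labeled benchmark and comparing a quantitative score such as exact-match accuracy. The sketch below assumes a hypothetical call_llm helper and a small list of (question, answer) pairs; the two templates are illustrative.

```python
# Minimal objective evaluation harness: compare prompt templates by exact-match
# accuracy on a labeled benchmark. call_llm is a hypothetical LLM-API helper.
from typing import Dict, List, Tuple

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM API here")

TEMPLATES = {
    "standard": "{q}\nAnswer with a single word or number.",
    "cot":      "{q}\nLet's think step by step, then give the final answer "
                "on a new line starting with 'Answer:'.",
}

def normalize(text: str) -> str:
    return text.rsplit("Answer:", 1)[-1].strip().lower().rstrip(".")

def evaluate(benchmark: List[Tuple[str, str]]) -> Dict[str, float]:
    scores = {}
    for name, template in TEMPLATES.items():
        correct = 0
        for question, gold in benchmark:
            prediction = normalize(call_llm(template.format(q=question)))
            correct += int(prediction == gold.lower())
        scores[name] = correct / len(benchmark)
    return scores  # e.g. {"standard": 0.62, "cot": 0.78}
```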
A broad set of benchmarks is typically required to assess a model's abilities, because no single model can perform best in all kinds of tasks [161].
Some benchmarks focus on testing methods in capturing essential information from long content. AG's News [189] is a subset of the larger AG's Corpus, built by compiling the titles and description fields of articles from the corpus's different categories. By pairing varied task instructions with the corresponding text, SentiEval [190] reduces the sensitivity to prompt design when evaluating different LLMs. CR [191], consisting of sentences mined from customer reviews, and MR [192], movie-review snippet sentiment on a five-star scale, are benchmarks that instruct models to classify the sentiment of text. "Less Likely Brainstorming" [193] is a benchmark that asks the model to generate outputs that humans consider relevant but less likely to happen. Subj [192] is a benchmark on the subjectivity of sentences from movie reviews and plot summaries. SAlient Long-Tail Translation Error Detection (SALTED) [194] focuses on identifying errors in translations, emphasizing linguistic proficiency and attention to detail. These evaluations highlight the models' ability to understand and process text, making accurate predictions based on the content. The Coin Flip dataset [23] assesses symbolic reasoning by asking the model whether a coin is still heads up after people either flip or do not flip it.
Multimodal Tasks
Multimodal tasks are designed to evaluate an MMLM's ability to process and integrate
information from multiple sources, such as text and images. RefCOCO, RefCOCO+
[195] and RefCOCOg [196] provide referring expressions for objects in images, testing
models’ ability to link descriptions with visual content. These evaluations are crucial
for developing models capable of cross-modal understanding and interaction, essential
for applications like visual question answering and image captioning.
ReAct in four domains, which are “explicit plan”, “iterative prompting”, “does not rely on external tools” and “long documents”. [152] combines human and automatic
evaluations to assess whether the method aligns with human reasoning. [77] compares
CoT with ToT by human-rated “creative writing” task.
Other studies experiment mainly on certain models or tasks and employ disparate
evaluation metrics, restricting comparability across methods [96, 227]. Nevertheless,
recent research proposed a general evaluation framework called InstructEval [228] that
enables a comprehensive assessment of prompting techniques across multiple mod-
els and tasks. InstructEval reached the following conclusions: in few-shot settings,
omitting prompts or using generic task-agnostic prompts tends to outperform other
methods, with prompts having little impact on performance; in zero-shot settings,
expert-written task-specific prompts can significantly boost performance, with auto-
mated prompts not outperforming simple baselines; the performance of automated
prompt generation methods is inconsistent, varying across different models and task
types, displaying a lack of generalization.
In education, LLMs guided by well-designed prompts can analyze a vast amount of assessment data, thus providing valuable insights into learning patterns and informing educators about areas that require attention or improvement [232, 233].
on various reasoning benchmarks, including GSM8K and others. All these works show
that in the application of reasoning tasks, properly customized prompts can obtain
better results from the model.
7 LLMs security
Prompt engineering is the process of designing and refining the inputs (prompts)
given to LLMs to elicit desired and accurate responses. This technique is crucial
not only for optimizing model performance but also for enhancing security. By care-
fully crafting prompts, researchers and developers can identify and help to mitigate
vulnerabilities in LLMs. Effective prompt engineering can expose weaknesses that
might be exploited through adversarial attacks, data poisoning, or other malicious
activities [246]. Conversely, poorly designed prompts can inadvertently reveal or intro-
duce security vulnerabilities in the model [246], which could then be exploited by
malicious actors, leading to issues such as the disclosure of sensitive information or
susceptibility to adversarial attacks.
Thus, prompt engineering serves as both a tool for improving LLM functional-
ity and a critical component of their security framework. The proactive, open, and
in-depth efforts of researchers in identifying and mitigating vulnerabilities through
prompt engineering are essential for maintaining the integrity and safety of LLMs in
diverse applications [246].
This is particularly true in critical sectors such as healthcare, finance, and cyber-
security, where prompt attacks against LLMs could lead to significant breaches of
sensitive information or disrupt essential services [247]. For example, adversarial
attacks can manipulate model outputs to spread harmful or misleading information
[3], while data poisoning during training can corrupt the model’s learning process,
leading to unreliable outputs. In healthcare, compromised models could lead to incor-
rect diagnoses and treatment plans, endangering patient lives. Similarly, in finance,
compromised models could result in significant financial losses and undermine trust
in automated financial services [248].
Consequently, there is a critical need for continuous and in-depth research in
prompt engineering security to fully realize its benefits and address emerging chal-
lenges. A deeper understanding of attack methods and their mechanisms in relation
to prompt engineering is essential for both large model developers and users to better
defend against these threats. In this section, we will explore some mainstream attack
methods related to prompt engineering and also discuss how to defend against them.
The potential for adversarial attacks is particularly concerning in applications such
as automated customer service or legal document analysis, where the integrity and
accuracy of responses are critical [252]. One example of adversarial attacks in image
recognition is illustrated in Figure 19 [253].
The implications of data poisoning are far-reaching, affecting sectors that rely on
accurate data analysis and generation, such as healthcare, finance, and legal services
[259]. Thus, integrating robust prompt engineering practices is crucial for preventing
the inadvertent inclusion of poisoned data and safeguarding the reliability of LLMs.
Fig. 20 An illustration of three scenarios in the backdoor attack. (1) a clean model receiving clean
inputs and producing normal outputs; (2) a contaminated model receiving clean inputs but still
producing normal outputs; and (3) the same model producing harmful or incorrect outputs when an
implanted backdoor trigger is present.
Backdoor threats not only pose immediate security risks but also can erode trust in
AI systems, emphasizing the need for transparency in AI model development. Imple-
menting rigorous testing frameworks to detect and eliminate hidden vulnerabilities is
essential before deploying models in real-world scenarios.
By crafting strategically designed prompts, attackers can systematically interact with
the target model, gradually reconstructing its internal mechanics or sensitive data.
This process, known as “query-based extraction”, allows the attacker to build a
surrogate model that mimics the target model’s responses. This gradual model recon-
struction approach relies on the ability to generate diverse and informative prompts
that cover a wide range of inputs the model might encounter [283], and can be par-
ticularly effective when the target model is a black-box system, where the attacker
has no access to the internal architecture but can observe the outputs generated in
response to the inputs [284].
This stealing process highlights the vulnerabilities inherent in LLMs when exposed
to malicious prompt manipulations, potentially resulting in intellectual property theft,
erosion of competitive advantages, and the unethical deployment of cloned models in
unauthorized contexts [285–287].
One notable example of a model stealing attack is the extraction of the projec-
tion matrix from OpenAI’s language models. Researchers demonstrated how, through
a series of carefully crafted prompts, they could extract significant portions of the
model’s architecture and parameters, effectively creating a replica of the original
model [288]. Another incident involved adversaries using prompt engineering tech-
niques to replicate commercial LLMs used in customer service, resulting in substantial
intellectual property theft and financial losses for the companies involved [289].
Numerous studies have explored model stealing of LLMs. For instance, [290]
proposes a novel prompt stealing attack against LLMs by introducing a two-
stage approach involving parameter extraction and prompt reconstruction, effectively
demonstrating the vulnerability of LLMs to reverse engineering of prompts based on
their generated responses. Furthermore, [291] introduces PRSA, a novel framework
for prompt stealing attacks against LLMs, which effectively infers the intent of tar-
get prompts and generates functionally equivalent surrogate prompts, highlighting
the significant risks of prompt leakage in both non-interactive and interactive prompt
services.
The effectiveness of these attacks underscores the need for robust defenses. Pro-
posed countermeasures include limiting the number of queries a single user can make,
implementing anomaly detection to identify suspicious querying patterns, and using
defensive perturbations to mislead potential attackers [292].
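A minimal sketch of the first two countermeasures, assuming a per-user query budget and a
crude lexical-diversity heuristic (the threshold values and the similarity measure below are
illustrative choices, not recommended settings), could look as follows:

# Hedged sketch of two countermeasures: a per-user query budget and a simple
# anomaly heuristic that flags users whose recent prompts are unusually diverse,
# a pattern associated with extraction attempts. All thresholds are assumptions.
from collections import defaultdict, deque
from difflib import SequenceMatcher

QUERY_BUDGET = 500          # maximum queries allowed per user (assumed)
DIVERSITY_THRESHOLD = 0.2   # mean pairwise similarity below this looks suspicious

query_log = defaultdict(lambda: deque(maxlen=QUERY_BUDGET))

def allow_query(user_id: str, prompt: str) -> bool:
    history = query_log[user_id]
    if len(history) >= QUERY_BUDGET:
        return False                     # hard rate limit reached
    history.append(prompt)
    if len(history) >= 20:               # only score once enough history exists
        recent = list(history)[-20:]
        sims = [
            SequenceMatcher(None, a, b).ratio()
            for i, a in enumerate(recent) for b in recent[i + 1:]
        ]
        if sum(sims) / len(sims) < DIVERSITY_THRESHOLD:
            return False                 # prompts too dissimilar: flag the user
    return True

Defensive perturbations, the third countermeasure, would instead act on the responses, for
example by adding small output variations that degrade the fidelity of any surrogate trained
on them.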
generate adversarial examples that expose and address vulnerabilities in the model
[255].
To conclude this section on security, while prompt engineering can greatly enhance
the capabilities of LLMs, it can also introduce significant risks if not managed prop-
erly. The current race to release new and improved LLM functionalities cannot
disregard the critical need for secure and robust design practices to combat adversar-
ial prompts. As LLMs become further embedded in critical applications, advancing
secure prompt engineering practices is essential to safeguard against misuse, minimize
security vulnerabilities, and ensure safe deployment [3, 295].
8 Prospective methodologies
Beyond the advanced methodologies discussed in Section 3, several key developments
on the horizon promise to substantially advance prompt engineering capabilities. This
brief section discusses some noteworthy trajectories, which could shape the future of
prompt engineering.
8.2 Agent for AIGC tools
The concept of AI agents has emerged as a promising trajectory in AI research [304].
In this brief subsection, we explore the relationship between agents and prompt engi-
neering and project how agents might influence the future trajectory of AI-generated
content (AIGC) tools. By definition, an AI agent comprises large models, memory,
active planning, and tool use. AI agents are capable of remembering and understand-
ing a vast array of information, actively planning and strategizing, and effectively
using various tools to generate optimal solutions within complex problem spaces [305].
The evolution of AI agents can be delineated into five distinct phases: models,
prompt templates, chains, agents, and multi-agents, each carrying its own implications
for prompt engineering. Foundational models, exemplified by architectures such as
GPT-4, underpin the entire practice of prompt engineering.
In particular, prompt templates offer an effective way of applying prompt engineer-
ing in practice [23]. By using these templates, one can create standardized prompts
to guide large models, making the generated output more aligned with the desired
outcome. The usage of prompt templates is a crucial step towards enabling AI agents
to better understand and execute user instructions.
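As a minimal illustration, a prompt template can be as simple as a string with named slots
that fix the role, task, context, and output format across calls; the template wording and
field names below are assumptions for this sketch rather than parts of any particular
framework:

# Minimal prompt-template sketch: a standardized structure with named slots keeps
# the instruction, context, and output format consistent across model calls.
# Template wording and field names are illustrative assumptions.
from string import Template

REVIEW_TEMPLATE = Template(
    "You are a $role.\n"
    "Task: $task\n"
    "Context:\n$context\n"
    "Answer in at most $max_sentences sentences."
)

prompt = REVIEW_TEMPLATE.substitute(
    role="careful technical reviewer",
    task="Summarize the main security risks described below.",
    context="Backdoor triggers can silently flip a poisoned model's behavior.",
    max_sentences=3,
)
print(prompt)  # the filled-in prompt would then be sent to the model

Because the slots are explicit, the same template can be reused, versioned, and evaluated
systematically, which is what makes templates a practical bridge between ad hoc prompting
and agent-level automation.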
AI agents amalgamate these methodologies and tools into an adaptive framework.
Capable of autonomously adjusting their behaviors and strategies, they aim to
optimize both efficiency and precision in task execution. A salient challenge for
prompt engineering therefore emerges: devising prompts that reliably steer AI agents
toward self-regulation [22].
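A highly simplified sketch of this pattern, assuming a hypothetical call_llm function and a
toy tool registry, combines the four ingredients named above (a large model, memory,
planning, and tool use) in a single loop:

# Hedged sketch of the agent pattern: a loop that combines a large model, a
# memory of prior steps, a planning prompt, and a small tool registry.
# call_llm and the tool set are hypothetical placeholders.
def calculator(expression: str) -> str:
    return str(eval(expression, {"__builtins__": {}}))   # toy tool, demo only

TOOLS = {"calculator": calculator}

def run_agent(call_llm, goal: str, max_steps: int = 5) -> str:
    memory: list[str] = []                                # record of past steps
    for _ in range(max_steps):
        plan_prompt = (
            f"Goal: {goal}\n"
            "Previous steps:\n" + "\n".join(memory) + "\n"
            "Reply with either 'TOOL <name> <input>' or 'FINAL <answer>'."
        )
        decision = call_llm(plan_prompt)                  # model plans next action
        if decision.startswith("FINAL"):
            return decision.removeprefix("FINAL").strip()
        _, name, tool_input = decision.split(" ", 2)      # parse tool invocation
        observation = TOOLS[name](tool_input)
        memory.append(f"{decision} -> {observation}")     # remember the outcome
    return "No answer within the step budget."

The prompt-engineering challenge noted above shows up directly in plan_prompt: the quality
of the loop depends almost entirely on how clearly the prompt constrains the agent's
planning and output format.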
9 Conclusion
In conclusion, prompt engineering has established itself as an essential technique
for optimizing the performance of LLMs. By employing foundational methods such
as clear instructions and role-prompting, alongside advanced methodologies such as
chain-of-thought and self-consistency, the capabilities of LLMs can be significantly
enhanced. For VLMs, innovative strategies such as CoOp and MaPLe ensure effective
integration and optimization of visual and textual data. The efficacy of these methods
can be rigorously assessed through both subjective and objective evaluations, confirm-
ing their impact across diverse applications, including education, content creation, and
programming. Additionally, prompt engineering has a crucial role to play in fortify-
ing LLM security, identifying vulnerabilities, and mitigating risks through adversarial
training. Looking ahead, future advancements could focus on a deeper understanding
of model structures and the development of AI agents, further elevating the sophis-
tication and capability of AI systems. This comprehensive review underscores the
transformative potential of prompt engineering in advancing AI capabilities, providing
a structured framework for future research and applications.
10 Acknowledgement
The authors would like to acknowledge the support from the Interdisciplinary Intelli-
gence Super Computer Center of Beijing Normal University at Zhuhai. This work was
funded by the Natural Science Foundation of China (12271047); Guangdong Provin-
cial Key Laboratory of Interdisciplinary Research and Application for Data Science,
BNU-HKBU United International College (2022B1212010006); UIC research grant
(R0400001-22; UICR0400008-21; R72021114; UICR0400036-21CTL; UICR04202405-
21, UICR0700041-22); Guangdong College Enhancement and Innovation Program
(2021ZDZX1046).
References
[1] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al.
Attention is all you need. In: Proceedings of the 31st International Conference
on Neural Information Processing Systems. NIPS’17; 2017. p. 6000–6010.
[2] Bender EM, Gebru T, McMillan-Major A, Shmitchell S. On the dangers of
stochastic parrots: can language models be too big? In: Proceedings of the
2021 ACM Conference on Fairness, Accountability, and Transparency; 2021. p.
610–623.
[7] Hulbert D.: Tree of knowledge: ToK aka Tree of Knowledge dataset for large
language models LLM. Accessed: 2023-8-15. https://github.com/dave1010/
tree-of-thought-prompting.
[16] Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning
transferable visual models from natural language supervision. In: International
Conference on Machine Learning; 2021. p. 8748–8763.
[19] Moore O.: Announcing GPT-4o in the API! Accessed: 2024-05-22. Available
from: https://community.openai.com/t/announcing-gpt-4o-in-the-api/744700.
[24] Wang X, Wei J, Schuurmans D, Le QV, Chi EH, Narang S, et al. Self-
consistency improves chain of thought reasoning in language models. In:
Eleventh International Conference on Learning Representations; 2023.
[28] Yong G, Jeon K, Gil D, Lee G. Prompt engineering for zero-shot and few-shot
defect detection and classification using a visual-language pretrained model.
Computer-Aided Civil and Infrastructure Engineering. 2022;38(11):1536–1554.
[30] Christiano PF, Leike J, Brown T, Martic M, Legg S, Amodei D. Deep rein-
forcement learning from human preferences. Advances in neural information
processing systems. 2017;30.
[34] Holtzman A, Buys J, Du L, Forbes M, Choi Y. The curious case of neural text
degeneration. In: Ninth International Conference on Learning Representations;
2020.
[35] Welleck S, Kulikov I, Roller S, Dinan E, Cho K, Weston J. Neural text generation
with unlikelihood training; 2019. ArXiv:1908.04319.
[37] Song Y, Shi S, Li J. Joint learning embeddings for Chinese words and their
components via ladder structured networks. In: Proceedings of the Twenty-
Seventh International Joint Conference on Artificial Intelligence (IJCAI-18);
2018. p. 4375–4381.
[41] Van Buren D. Guided scenarios with simulated expert personae: a remarkable
strategy to perform cognitive work; 2023. ArXiv:2306.03104.
[48] Liu J, Gardner M, Cohen SB, Lapata M. Multi-step inference for reasoning over
paragraphs; 2020. ArXiv:2004.02995.
[49] Ackley DH, Hinton GE, Sejnowski TJ. A learning algorithm for Boltzmann
machines. Cognitive Science. 1985;9(1):147–169.
[50] Ficler J, Goldberg Y. Controlling linguistic style aspects in neural language
generation. In: Proceedings of the Workshop on Stylistic Variation; 2017. p.
94–104.
[58] Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, et al. Large language
models perform diagnostic reasoning. In: Eleventh International Conference on
Learning Representations; 2022.
[61] Huang S, Dong L, Wang W, Hao Y, Singhal S, Ma S, et al. Language is not all
you need: aligning perception with language models; 2023. ArXiv:2302.14045.
[66] Tafjord O, Dalvi B, Clark P. ProofWriter: generating implications, proofs, and
abductive statements over natural language. In: Findings of the Association for
Computational Linguistics: ACL-IJCNLP 2021; 2021. p. 3621–3634.
[68] Fan A, Lewis M, Dauphin Y. Hierarchical neural story generation. In: Pro-
ceedings of the 56th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers); 2018. p. 889–898.
[71] Shum K, Diao S, Zhang T. Automatic prompt augmentation and selection with
chain-of-thought from labeled data; 2023. ArXiv:2302.12822.
[75] Gao L, Madaan A, Zhou S, Alon U, Liu P, Yang Y, et al. Pal: program-aided
language models. In: International Conference on Machine Learning. PMLR;
2023. p. 10764–10799.
[76] Patel A, Bhattamishra S, Goyal N. Are NLP Models really able to Solve
Simple Math Word Problems? In: Proceedings of the 2021 Conference of
the North American Chapter of the Association for Computational Linguis-
tics: Human Language Technologies. Online: Association for Computational
Linguistics; 2021. p. 2080–2094.
[77] Yao S, Yu D, Zhao J, Shafran I, Griffiths TL, Cao Y, et al. Tree of thoughts:
deliberate problem solving with large language models; 2023. ArXiv:2305.10601.
[82] Diao S, Wang P, Lin Y, Zhang T. Active prompting with chain-of-thought for
large language models; 2024. ArXiv:2302.12246.
[83] Li XL, Liang P. Prefix-tuning: optimizing continuous prompts for generation. In:
Zong C, Xia F, Li W, Navigli R, editors. Proceedings of the 59th Annual Meeting
of the Association for Computational Linguistics and the 11th International
Joint Conference on Natural Language Processing (Volume 1: Long Papers).
Online: Association for Computational Linguistics; 2021. p. 4582–4597.
[84] Sahoo P, Singh AK, Saha S, Jain V, Mondal S, Chadha A. A systematic survey
of prompt engineering in large language models: techniques and applications;
2024. ArXiv:2402.07927.
[89] Mondal S, Bappon SD, Roy CK. Enhancing user interaction in ChatGPT:
characterizing and consolidating multiple prompts for issue resolution; 2024.
ArXiv:2402.04568.
[92] Chen Y, Wen Z, Fan G, Chen Z, Wu W, Liu D, et al. MAPO: boosting large
language model performance with model-adaptive prompt optimization. In:
Bouamor H, Pino J, Bali K, editors. Findings of the Association for Compu-
tational Linguistics: EMNLP 2023. Singapore: Association for Computational
Linguistics; 2023. p. 3279–3304.
[93] Cheng J, Liu X, Zheng K, Ke P, Wang H, Dong Y, et al. Black-box prompt opti-
mization: aligning large language models without model training. In: Ku LW,
Martins A, Srikumar V, editors. Proceedings of the 62nd Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers). Bangkok,
Thailand: Association for Computational Linguistics; 2024. p. 3201–3219.
[95] Bach SH, Sanh V, Yong ZX, Webson A, Raffel C, Nayak NV, et al. Prompt-
source: an integrated development environment and repository for natural
language prompts; 2022. ArXiv:2202.01279.
[96] Deng M, Wang J, Hsieh CP, Wang Y, Guo H, Shu T, et al. RLPrompt: optimiz-
ing discrete text prompts with reinforcement learning. In: Proceedings of the
2022 Conference on Empirical Methods in Natural Language Processing; 2022.
p. 3369–3391.
[97] Awal R, Zhang L, Agrawal A. Investigating prompting techniques for zero- and
few-shot visual question answering; 2024. ArXiv:2306.09996.
[99] OpenAI.: GPTs: introducing the latest in conversational AI. Accessed: 2024-05-
22. Available from: https://openai.com/index/introducing-gpts/.
[100] Bisson S.: Microsoft build 2023: Microsoft extends its copilots with open stan-
dard plugins. Accessed: 2023-05-25. https://www.techrepublic.com/article/
microsoft-extends-copilot-with-open-standard-plugins/.
[103] whatplugin.ai.: Prompt enhancer & ChatGPT plugins for AI development tools
like prompt enhancer. Accessed: 2023-09-14. https://www.whatplugin.ai/
plugins/prompt-enhancer.
[105] ChatGPT for Search Engines.: Prompt perfect plugin for ChatGPT. Accessed:
2023-10-15. https://chatonai.org/prompt-perfect-chatgpt-plugin.
[110] Jiang Z, Xu FF, Gao L, Sun Z, Liu Q, Dwivedi-Yu J, et al. Active retrieval
augmented generation; 2023. ArXiv:2305.06983.
Information Processing Systems. 2020;33:9459–9474.
[114] Izacard G, Grave E. Leveraging passage retrieval with generative models for
open domain question answering; 2020. ArXiv:2007.01282.
[119] Li J, Tang T, Zhao WX, Wang J, Nie JY, Wen JR. The web can be your oyster
for improving Large Language Models; 2023. ArXiv:2305.10998.
[126] Roberts A.: How to ReAct to simple AI agents. Available from: https://arize.
com/blog-course/react-agent-llm/.
[127] Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick CL, et al. VQA: visual
question answering. In: Proceedings of the IEEE International Conference on
Computer Vision (ICCV); 2015. p. 2425–2433.
[128] Wu Q, Teney D, Wang P, Shen C, Dick A, van den Hengel A. Visual question
answering: a survey of methods and datasets. Computer Vision and Image
Understanding. 2017;163:21–40. Language in Vision.
[129] Wang P, Wu Q, Shen C, Dick A, van den Hengel A. FVQA: fact-based visual
question answering. IEEE Transactions on Pattern Analysis and Machine
Intelligence. 2018;40(10):2413–2427.
[135] Zhou K, Yang J, Loy CC, Liu Z. Learning to prompt for vision-language models.
International Journal of Computer Vision. 2022;130(9):2337–2348.
[136] Jia C, Yang Y, Xia Y, Chen YT, Parekh Z, Pham H, et al. Scaling up visual
and vision-language representation learning with noisy text supervision. In:
Meila M, Zhang T, editors. Proceedings of the 38th International Conference
on Machine Learning. vol. 139 of Proceedings of Machine Learning Research.
PMLR; 2021. p. 4904–4916.
[140] Zhou K, Yang J, Loy CC, Liu Z. Conditional prompt learning for vision-language
models; 2022. ArXiv:2203.05557.
[142] Khattak MU, Wasim ST, Naseer M, Khan S, Yang MH, Khan FS. Self-
regulating prompts: foundational model adaptation without forgetting; 2023.
ArXiv:2307.06948.
[143] Khattak MU, Rasheed H, Maaz M, Khan S, Khan FS. MaPLe: multi-modal
prompt learning; 2023. ArXiv:2210.03117.
[145] Shen L, Tan W, Zheng B, Khashabi D. Flatness-aware prompt selection
improves accuracy and sample efficiency; 2023. ArXiv:2305.10713.
[152] Wang R, Wang H, Mi F, Chen Y, Xue B, Wong KF, et al. Enhancing Large
Language Models Against Inductive Instructions with Dual-critique Prompting;
2023. ArXiv:2305.13733.
[154] Papineni K, Roukos S, Ward T, Zhu WJ. BLEU: a method for automatic
evaluation of machine translation. In: Proceedings of the 40th Annual Meeting
on Association for Computational Linguistics; 2002. p. 311–318.
[158] Sai AB, Mohankumar AK, Khapra MM. A survey of evaluation metrics used
for NLG systems. ACM Computing Surveys (CSUR). 2022;55(2):1–39.
[159] Srivastava A, Rastogi A, Rao A, Shoeb AAM, Abid A, Fisch A, et al. Beyond
the imitation game: quantifying and extrapolating the capabilities of language
models; 2023. ArXiv:2206.04615.
[160] Srivastava A, Rastogi A, Rao A, Shoeb AAM, Abid A, Fisch A, et al. Beyond
the imitation game: quantifying and extrapolating the capabilities of language
models; 2022. ArXiv:2206.04615.
[161] Chang Y, Wang X, Wang J, Wu Y, Yang L, Zhu K, et al. A survey on evalu-
ation of large language models. ACM Transactions on Intelligent Systems and
Technology. 2024;15(3):1–45.
[162] Wang Y, Liu X, Shi S. Deep neural solver for math word problems. In: Proceed-
ings of the 2017 conference on empirical methods in natural language processing;
2017. p. 845–854.
[164] Shi S, Wang Y, Lin CY, Liu X, Rui Y. Automatically solving number word prob-
lems by semantic parsing and reasoning. In: Proceedings of the 2015 conference
on empirical methods in natural language processing; 2015. p. 1132–1142.
[166] Roy S, Roth D. Unit dependency graph and its application to arithmetic
word problem solving. In: Proceedings of the AAAI conference on artificial
intelligence. vol. 31; 2017.
[168] Miao SY, Liang CC, Su KY. A diverse corpus for evaluating and developing
English math word problem solvers; 2021. ArXiv:2106.15772.
[177] Feng S, Shi W, Bai Y, Balachandran V, He T, Tsvetkov Y. Knowledge card:
filling LLMs’ knowledge gaps with plug-in specialized Language Models; 2023.
ArXiv:2305.09955.
[178] Kočiskỳ T, Schwarz J, Blunsom P, Dyer C, Hermann KM, Melis G, et al. The
narrativeqa reading comprehension challenge. Transactions of the Association
for Computational Linguistics. 2018;6:317–328.
[179] Pang RY, Parrish A, Joshi N, Nangia N, Phang J, Chen A, et al. QuALITY:
question answering with long input texts, yes! In: North American Chapter of
the Association for Computational Linguistics; 2021. p. 5336–5358.
[182] Speer R, Chin J, Havasi C. Conceptnet 5.5: An open multilingual graph of gen-
eral knowledge. In: Proceedings of the AAAI conference on artificial intelligence.
vol. 31; 2017.
[186] Voorhees EM, Tice DM. Building a question answering test collection. In: Pro-
ceedings of the 23rd annual international ACM SIGIR conference on Research
and development in information retrieval; 2000. p. 200–207.
[190] Zhang W, Deng Y, Liu B, Pan SJ, Bing L. Sentiment analysis in the era of
large language models: a reality check; 2023. ArXiv:2305.15005.
[192] Pang B, Lee L. Seeing stars: exploiting class relationships for sentiment
categorization with respect to rating scales; 2005. ArXiv:cs/0506075.
[193] Tang L, Peng Y, Wang Y, Ding Y, Durrett G, Rousseau JF. Less likely
brainstorming: Using language models to generate alternative hypotheses.
In: Proceedings of the conference. Association for Computational Linguistics.
Meeting. vol. 2023. NIH Public Access; 2023. p. 12532.
[195] Yu L, Poirson P, Yang S, Berg AC, Berg TL. Modeling context in referring
expressions. In: Computer Vision–ECCV 2016: 14th European Conference,
Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14.
Springer; 2016. p. 69–85.
[199] Lin YT, Chen YN. LLM-eval: unified multi-dimensional automatic eval-
uation for open-domain conversations with large language models; 2023.
ArXiv:2305.13711.
[200] Dehghani M, Tay Y, Gritsenko AA, Zhao Z, Houlsby N, Diaz F, et al. The
benchmark lottery; 2021. ArXiv:2107.07002.
[206] Zhou Y, Muresanu AI, Han Z, Paster K, Pitis S, Chan H, et al. Large Lan-
guage Models are human-level prompt engineers. In: Eleventh International
Conference on Learning Representations; 2022.
[208] Chevalier A, Wettig A, Ajith A, Chen D. Adapting language models to compress
contexts; 2023. ArXiv:2305.14788.
[212] Izacard G, Grave E. Leveraging passage retrieval with generative models for
open domain question answering; 2020. ArXiv:2007.01282.
[213] Guo Q, Wang R, Guo J, Li B, Song K, Tan X, et al. Connecting large language
models with evolutionary algorithms yields powerful prompt optimizers; 2023.
ArXiv:2309.08532.
[218] Jiang H, Wu Q, Lin CY, Yang Y, Qiu L. Llmlingua: Compressing prompts for
accelerated inference of large language models; 2023. ArXiv:2310.05736.
[223] Sun S, Liu Y, Wang S, Zhu C, Iyyer M. Pearl: Prompting large language models
to plan and execute actions over long documents; 2023. ArXiv:2305.14564.
[225] Press O, Zhang M, Min S, Schmidt L, Smith NA, Lewis M. Measuring and nar-
rowing the compositionality gap in language models; 2022. ArXiv:2210.03350.
[226] Schick T, Dwivedi-Yu J, Dessı̀ R, Raileanu R, Lomeli M, Hambro E, et al.
Toolformer: Language models can teach themselves to use tools. Advances in
Neural Information Processing Systems. 2024;36.
[227] Zhou Y, Muresanu AI, Han Z, Paster K, Pitis S, Chan H, et al. Large Language
Models Are Human-Level Prompt Engineers; 2023. ArXiv:2211.01910.
[230] Xie Q, Dai Z, Hovy E, Luong MT, Le QV. Unsupervised data augmentation
for consistency training. In: Advances in neural information processing systems.
vol. 33; 2020. p. 6256–6268.
[231] Ariely M, Nazaretsky T, Alexandron G. Machine learning and Hebrew NLP for
automated assessment of open-ended questions in biology. International Journal
of Artificial Intelligence in Education. 2023;33(1):1–34.
[232] Nilsson F, Tuvstedt J. GPT-4 as an automatic grader: the accuracy of grades set
by GPT-4 on introductory programming assignments [Bachelor Thesis]. KTH
Royal Institute of Technology; 2023.
[234] Yang K, Tian Y, Peng N, Klein D. Re3: generating longer stories with recur-
sive reprompting and revision. In: Goldberg Y, Kozareva Z, Zhang Y, editors.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language
Processing. Abu Dhabi, United Arab Emirates: Association for Computational
Linguistics; 2022. p. 4393–4479.
[235] Yang K, Klein D, Peng N, Tian Y. DOC: improving long story coherence with
detailed outline control. In: Rogers A, Boyd-Graber J, Okazaki N, editors.
Proceedings of the 61st Annual Meeting of the Association for Computa-
tional Linguistics (Volume 1: Long Papers). Toronto, Canada: Association for
Computational Linguistics; 2023. p. 3378–3465.
[236] Yang K, Klein D. FUDGE: controlled text generation with future discrim-
inators. In: Toutanova K, Rumshisky A, Zettlemoyer L, Hakkani-Tur D,
Beltagy I, Bethard S, et al., editors. Proceedings of the 2021 Conference of
the North American Chapter of the Association for Computational Linguis-
tics: Human Language Technologies. Online: Association for Computational
Linguistics; 2021. p. 3511–3535.
[237] Elgohary A, Hosseini S, Awadallah AH. Speak to your parser: interactive text-
to-SQL with natural language feedback. In: Annual Meeting of the Association
for Computational Linguistics; 2020. p. 2065–2077.
Learning; 2023. p. 31693–31715.
[242] Roy S, Roth D. Solving general arithmetic word problems. In: Proceedings of
the 2015 Conference on Empirical Methods in Natural Language Processing;
2015. p. 1743–1752.
[243] Li Y, Lin Z, Zhang S, Fu Q, Chen B, Lou JG, et al. Making language models
better reasoners with step-aware verifier. In: Proceedings of the 61st Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers). Toronto, Canada: Association for Computational Linguistics; 2023. p.
5315–5333.
[244] Ding B, Qin C, Liu L, Bing L, Joty S, Li B. Is GPT-3 a good data annotator?;
2022. ArXiv:2212.10450.
[245] Yoo KM, Park D, Kang J, Lee SW, Park W. GPT3Mix: leveraging large-scale
language models for text augmentation. In: Findings of the Association for
Computational Linguistics: EMNLP 2021; 2021. p. 2225–2239.
[248] Rawat P.: AI at risk: OWASP top 10 critical vulnerabilities for large lan-
guage models (LLMs). Available from: https://www.infosectrain.com/blog/
ai-at-risk-owasp-top-10-critical-vulnerabilities-for-large-language-models-llms/.
[249] Perez F, Ribeiro I. Ignore previous prompt: attack techniques for language
models; 2022. ArXiv:2211.09527.
[250] Ren K, Zheng T, Qin Z, Liu X. Adversarial attacks and defenses in deep learning.
Engineering. 2020;6(3):346–360.
[256] Shayegani E, Mamun MAA, Fu Y, Zaree P, Dong Y, Abu-Ghazaleh N. Survey
of vulnerabilities in large language models revealed by adversarial attacks; 2023.
ArXiv:2310.10844.
[259] Steinhardt J, Koh PW, Liang P. Certified defenses for data poisoning attacks;
2017. ArXiv:1706.03691.
[263] Holland R, Pal S, Pan L, Zhang LY. Backdoor attacks and generative model fair-
ness: current trends and future research directions. In: 2024 16th International
Conference on COMmunication Systems & NETworkS; 2024. p. 31–36.
[265] Ooi YX. Evaluation of backdoor attacks and defenses to deep neural networks
[Master’s thesis]. Nanyang Technological University; 2024.
[269] Zhao S, Wen J, Luu A, Zhao J, Fu J. Prompt as triggers for backdoor attack:
examining the vulnerability in language models. In: Bouamor H, Pino J, Bali
K, editors. Proceedings of the 2023 Conference on Empirical Methods in Natu-
ral Language Processing. Singapore: Association for Computational Linguistics;
2023. p. 12303–12317.
[272] Kosch T, Feger S. Risk or chance? Large language models and reproducibility
in human-computer interaction research; 2024. ArXiv:2404.15782.
[276] Wang H, Li H, Huang M, Sha L. From noise to clarity: unraveling the adversarial
suffix of large language model attacks via translation of text embeddings; 2024.
ArXiv:2402.16006.
[277] Gao A. Prompt engineering for large language models; 2023. SSRN:4504303.
[279] Schulhoff S, Pinto J, Khan A, Bouchard LF, Si C, Anati S, et al. Ignore this
title and HackAPrompt: exposing systemic vulnerabilities of LLMs through a
global prompt hacking competition. In: Bouamor H, Pino J, Bali K, editors.
Proceedings of the 2023 Conference on Empirical Methods in Natural Lan-
guage Processing. Singapore: Association for Computational Linguistics; 2023.
p. 4945–4977.
[281] Karande C.: OWASP LLM prompt hacking. Available from: https://owasp.org/
www-project-llm-prompt-hacking/.
[284] Papernot N, McDaniel P, Sinha A, Wellman MP. SoK: security and privacy in
machine learning. In: 2018 IEEE European Symposium on Security and Privacy
(EuroS&P); 2018. p. 399–414.
[285] Tramèr F, Zhang F, Juels A, Reiter MK, Ristenpart T. Stealing machine learn-
ing models via prediction APIs. In: Proceedings of the 25th USENIX Conference
on Security Symposium. SEC’16. USA: USENIX Association; 2016. p. 601–618.
[288] Naseh A, Krishna K, Iyyer M, Houmansadr A. Stealing the decoding algorithms
of language models. In: Proceedings of the 2023 ACM SIGSAC Conference
on Computer and Communications Security. CCS ’23. New York, NY, USA:
Association for Computing Machinery; 2023. p. 1835–1849.
[289] Carlini N, Paleka D, Dvijotham KD, Steinke T, Hayase J, Cooper AF, et al.
Stealing part of a production language model; 2024. ArXiv:2403.06634.
[290] Sha Z, Zhang Y. Prompt stealing attacks against large language models; 2024.
ArXiv:2402.12959.
[294] Bai T, Luo J, Zhao J, Wen B, Wang Q. Recent advances in adversarial train-
ing for adversarial robustness. In: Proceedings of the Thirtieth International
Joint Conference on Artificial Intelligence. International Joint Conferences on
Artificial Intelligence Organization; 2021. p. 4312–4321. Survey Track.
[302] Amann J, Blasimme A, Vayena E, Frey D, Madai VI. Explainability for arti-
ficial intelligence in healthcare: a multidisciplinary perspective. BMC Medical
Informatics and Decision Making. 2020;20(1):1–9.
[303] Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in health and medicine. Nature
Medicine. 2022;28(1):31–38.
[304] Öztürk D. What does artificial intelligence mean for organizations? A systematic
review of organization studies research and a way forward. The Impact of
Artificial Intelligence on Governance, Economics and Finance, Volume I. 2021;p.
265–289.
[305] Seeamber R, Badea C. If our aim is to build morality into an artificial agent,
how might we begin to go about doing so? IEEE Intelligent Systems. 2023.