Large Language Models Are Human-Level Prompt Engineers
Master's reading report by Adimi Alaa Dania & Rezkellah Fatma-Zohra
Authors Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis,
Harris Chan, and Jimmy Ba.
Reference Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis, S., Chan, H., & Ba, J. (2023). Large Language Models are Human-Level Prompt Engineers. In International Conference on Learning Representations (ICLR).
Summary
Context Large language models (LLMs) have shown remarkable performance on a wide range of natural language processing tasks. However, how well they perform a given task depends heavily on the natural language instruction (prompt) used to steer them, and effective prompts have so far been handcrafted by humans.
Problem Writing and validating effective instructions requires tedious human effort and trial and error, which limits the ability of LLMs to serve as general-purpose task solvers.
Objective The paper aims to demonstrate that LLMs are capable of human-level prompt engineering, which would allow them to perform a wide range of general-purpose computing tasks by conditioning on automatically generated natural language instructions.
Solution The authors propose Automatic Prompt Engineer (APE), which automates prompt engineering by framing instruction generation as a black-box optimization problem and solving it with an LLM-guided search: an LLM proposes candidate instructions from input-output demonstrations, and the candidates are scored and filtered by how well they steer the model on held-out examples. Instruction quality is measured with the instruction induction task, which tests how well an LLM follows a natural language instruction. The authors also analyze how different factors, such as the length and diversity of the candidate prompt set, affect the quality of the generated prompts.
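Below is a minimal Python sketch of this propose-then-select loop, not the authors' released implementation. The llm callable (prompt in, completion text out) and the score function are hypothetical stand-ins for whichever model endpoint and instruction-quality metric are used.

```python
# Minimal sketch of LLM-guided black-box prompt search, in the spirit of the
# paper's approach. `llm` is a hypothetical callable (prompt -> completion text);
# `score` is any metric that rates how well an instruction steers the model.

from typing import Callable, List, Tuple

def propose_instructions(llm: Callable[[str], str],
                         demos: List[Tuple[str, str]],
                         n_candidates: int = 8) -> List[str]:
    """Ask the LLM to guess the instruction that explains the demonstrations."""
    demo_block = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)
    meta_prompt = (
        "I gave a friend an instruction. Based on the instruction they produced "
        "the following input-output pairs:\n\n"
        f"{demo_block}\n\nThe instruction was:"
    )
    # Repeated sampling from the same meta-prompt yields a diverse candidate pool.
    return [llm(meta_prompt).strip() for _ in range(n_candidates)]

def select_best_instruction(candidates: List[str],
                            score: Callable[[str], float]) -> str:
    """Black-box selection step: keep the highest-scoring candidate."""
    return max(candidates, key=score)
```

In the paper the search can also iterate, resampling new candidates that are semantically similar to the current best ones; the sketch above shows only a single propose-and-select round.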
Tests To evaluate the effectiveness of the proposed method, the authors use the BIG-Bench Instruction Induction (BBII) dataset, a clean and tractable subset of 21 BIG-Bench tasks, each with a clear, human-written instruction that applies to all examples in the task. The selected tasks cover many facets of language understanding, including emotional understanding, context-free question answering, reading comprehension, summarization, algorithms, and various reasoning tasks (e.g., arithmetic, commonsense, symbolic, and other logical reasoning).
The authors use text-davinci-002 via the OpenAI API to generate the prompts and evaluate their quality on the instruction induction task. The gold annotations from Honovich et al. (2022), which were manually verified for correctness, serve as the reference answers.
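As an illustration, a candidate instruction can be scored against such gold annotations with a simple exact-match metric, as in the hedged sketch below; the paper's per-task evaluation metrics differ in detail. The llm callable is the same hypothetical stand-in used in the earlier sketch.

```python
# Illustrative instruction scoring by exact match against gold outputs.
# Lowercase/whitespace normalization is an assumption made for this sketch;
# the paper's evaluation relies on task-specific metrics.

from typing import Callable, List, Tuple

def execution_accuracy(llm: Callable[[str], str],
                       instruction: str,
                       eval_set: List[Tuple[str, str]]) -> float:
    """Run the instruction on held-out inputs and compare to gold answers."""
    hits = 0
    for x, gold in eval_set:
        pred = llm(f"Instruction: {instruction}\n\nInput: {x}\nOutput:")
        hits += int(pred.strip().lower() == gold.strip().lower())
    return hits / len(eval_set)
```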
The results show that the proposed method outperforms prior LLM baselines and achieves performance comparable to human-generated instructions. The analysis of the factors mentioned above further shows that longer and more diverse candidate prompt sets lead to better-performing instructions.
Conclusion The paper demonstrates that LLMs are capable of human-level prompt engineering, which has significant implications for natural language processing. The proposed method has the potential to enable LLMs to perform a wide range of general-purpose computing tasks by conditioning on natural language instructions with minimal human input.