Lecture 7
NLP Technical Development over the Past 10 Years
Prompt Engineering / In-context Learning
- Prompting with LLMs
- GPT-4, GPT-3, ChatGPT, DALL-E 2
Objective Engineering
- Pre-training and fine-tuning
- BERT, GPT-2, T5
Architecture Engineering
- Neural nets, e.g., LSTM, CNN, GRU
- Features from Word2Vec, GloVe
Feature Engineering
- Hand-crafted features
- SVM/CRF training
What is Prompting?
❑ Very large language models seem to perform some kind of learning without gradient steps, simply from examples you provide within their contexts.
Decoders: GPT-2, GPT-3, LaMDA
Encoders: BERT, RoBERTa
Encoder-Decoders: vanilla Transformer, T5, BART
Traditional vs Prompt formulation
• Classification P(Y | X) → Generation P(X)
• Labels are not Y anymore, but a part of X
Traditional vs Prompt formulation
We have reformulated the task! We also should re-define the "ground truth labels".
Traditional vs Prompt formulation
Basic Prompting
Zero-shot Prompting (Decoders)
❑ Simply feed the task text to the model and ask for results.
Text: i'll bet the video game is a lot more fun than the film.
Sentiment:
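As a concrete illustration, here is a minimal zero-shot prompting sketch using the Hugging Face transformers text-generation pipeline. GPT-2 is used only as a small stand-in; unlike the very large LMs discussed here, it will not reliably produce the correct label, so treat this as a sketch of the mechanics.

```python
# Minimal zero-shot prompting sketch (GPT-2 as a small stand-in model).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = ("Text: i'll bet the video game is a lot more fun than the film.\n"
          "Sentiment:")
out = generator(prompt, max_new_tokens=2, do_sample=False)
print(out[0]["generated_text"][len(prompt):])   # the model's continuation
```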
Few-shot Prompting (Decoders)
Text: despite all evidence to the contrary, this clunker has somehow managed to pose as an actual feature movie, the kind that charges full admission and gets hyped on tv and purports to amuse small children and ostensible adults.
Sentiment: negative
Text: i'll bet the video game is a lot more fun than the film.
Sentiment:
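A small sketch of how such a few-shot prompt can be assembled programmatically from labeled demonstrations; the "Text: ... Sentiment: ..." format mirrors the slide, but it is only one of many possible formats.

```python
# Assemble a few-shot prompt from (text, label) demonstrations, then pass it to
# the same text-generation pipeline as in the zero-shot sketch above.
demos = [
    ("despite all evidence to the contrary, this clunker has somehow managed to "
     "pose as an actual feature movie, the kind that charges full admission and "
     "gets hyped on tv and purports to amuse small children and ostensible adults.",
     "negative"),
]
test_text = "i'll bet the video game is a lot more fun than the film."

prompt = ""
for text, label in demos:                       # in-context demonstrations
    prompt += f"Text: {text}\nSentiment: {label}\n\n"
prompt += f"Text: {test_text}\nSentiment:"      # the query to be completed
print(prompt)
```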
Prompt-based Training Strategies
❑ How many training samples are necessary to learn the task?
o Zero-shot: without any explicit training of the LM for the downstream task
o Few-shot: few training samples (e.g., 1-100) of downstream tasks
o Full-data: lots of training samples (e.g., 10K) of downstream tasks
✔ Typical finetuning or supervised training
Few-shot Prompting (Zhao et al. 2021)
❑ Several biases:
o Majority label bias: arises when the distribution of labels among the in-context examples is unbalanced;
o Recency bias: the tendency to repeat the label that appears at the end of the prompt;
o Common token bias: the tendency to produce common tokens more often than rare tokens.
❑ Many studies have looked into how to construct in-context examples to maximize performance.
o The choice of prompt format, examples, and their order can lead to dramatically different performance, from near random guess to near SoTA.
o How to make in-context learning more reliable and deterministic?
Tips for Example Selection
❑ Choose examples that are semantically similar to the test example using k-NN retrieval in the embedding space (Liu et al. 2021); see the sketch after this list.
❑ To select a diverse and representative set of examples, different sampling methods have been studied:
o Graph-based similarity search (Su et al. 2022),
o Contrastive learning (Rubin et al. 2022),
o Q-learning (Zhang et al. 2022), and
o Active learning (Diao et al. 2023).
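A minimal sketch of similarity-based example selection in the spirit of Liu et al. (2021): embed a pool of candidate demonstrations and the test input, then keep the nearest neighbors. The encoder name and the pool contents are illustrative assumptions, not the paper's setup.

```python
# Pick the k demonstrations most similar to the test input in embedding space.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
pool = ["the plot was predictable but the acting saved it",
        "a joyless two hours i will never get back",
        "easily the best animated feature of the year"]
test_text = "i'll bet the video game is a lot more fun than the film."

pool_emb = encoder.encode(pool, normalize_embeddings=True)
test_emb = encoder.encode([test_text], normalize_embeddings=True)[0]
scores = pool_emb @ test_emb                 # cosine similarity (unit vectors)
top_k = np.argsort(-scores)[:2]              # indices of the 2 closest examples
print([pool[i] for i in top_k])
```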
Tips for Example Ordering
❑ Keep the selection of examples diverse, relevant to the test sample, and in random order, to avoid majority label bias and recency bias.
❑ Increasing model size or including more training examples does not reduce the variance among different permutations of in-context examples.
❑ When the validation set is limited, consider choosing an order such that the model does not produce extremely unbalanced predictions and is not overconfident about its predictions (Lu et al. 2022).
Prompt Search
Traditional vs Prompt formulation
How to define a suitable prompt template?
Format of prompts
❑ Cloze Prompt (Encoders)
o A prompt with a slot [z] to fill in the middle of the text, e.g., "I love this movie. Overall it was a [z] movie."
o Encoder models trained with the MLM objective, e.g., BERT, LAMA, TemplateNER
❑ Prefix Prompt (Decoders)
o A prompt where the input text comes entirely before the slot [z], e.g., "I love this movie. Overall this movie is [z]"
o Decoder models trained with the LM objective, e.g., GPT-3, Prefix-tuning, Prompt-tuning
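A small sketch contrasting the two prompt formats above with Hugging Face pipelines (the model choices are illustrative): a cloze prompt filled by a masked-LM encoder versus a prefix prompt completed by a causal-LM decoder.

```python
from transformers import pipeline

# Cloze prompt: a slot in the middle of the text, filled by a masked LM (BERT).
cloze = pipeline("fill-mask", model="bert-base-uncased")
print(cloze("I love this movie. Overall it was a [MASK] movie.")[:3])

# Prefix prompt: the input comes entirely before the slot; a causal LM (GPT-2)
# simply continues the text.
prefix = pipeline("text-generation", model="gpt2")
print(prefix("I love this movie. Overall this movie is", max_new_tokens=1))
```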
Design of Prompt Templates
❑ Hand-crafted
o Configure the manual template based on the characteristics of the task
❑ Automated search
o Search in discrete space, e.g., AdvTrigger, AutoPrompt
o Search in continuous space, e.g., Prefix-tuning, Prompt-tuning
Templates (Jiang et al. 2019)
o Mined from questions/answers: middle-word and dependency-based templates.
Gradient-based Search (Prompt = Trigger Tokens): AutoPrompt (Shin et al., 2020; code)
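A rough sketch of one gradient-guided candidate-scoring step in the spirit of AutoPrompt / HotFlip. The template, the single trigger position, and the verbalizer token "terrible" are illustrative assumptions, not the paper's exact setup; AutoPrompt additionally re-evaluates the top-k candidates before committing to a swap.

```python
# One first-order trigger-search step: score every vocabulary word as a
# replacement for the current trigger token, using the gradient of the loss
# with respect to the trigger embedding.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
embedding_matrix = model.get_input_embeddings().weight             # [V, d]

text = "i'll bet the video game is a lot more fun than the film."
label_id = tok.convert_tokens_to_ids("terrible")                   # gold verbalizer token
# Template: <text> [T] [MASK] .   -- "bad" is the current guess for trigger [T].
ids = tok(f"{text} bad {tok.mask_token} .", return_tensors="pt")["input_ids"]
trigger_pos = (ids[0] == tok.convert_tokens_to_ids("bad")).nonzero().item()
mask_pos = (ids[0] == tok.mask_token_id).nonzero().item()

inputs_embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)
logits = model(inputs_embeds=inputs_embeds).logits
loss = -torch.log_softmax(logits[0, mask_pos], dim=-1)[label_id]   # NLL of the label
loss.backward()

# First-order estimate of how each word, swapped into the trigger slot, would
# change the loss; keep the candidates expected to lower it the most.
grad = inputs_embeds.grad[0, trigger_pos]                           # [d]
scores = embedding_matrix @ grad                                    # [V]
candidates = torch.topk(-scores, k=10).indices
print(tok.convert_ids_to_tokens(candidates.tolist()))
```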
Universal Triggers (Wallace et al., 2019)
❑ Trigger tokens for adversarial attacks on existing off-the-shelf NLP systems.
❑ Universal Trigger: an input-agnostic sequence of tokens that triggers a model to produce a specific prediction when concatenated to any input from a dataset.
E.g.,
• SNLI (89.95% -> 0.55%)
• SQuAD (72% of "why" questions answered "to kill American people")
• GPT-2 to spew racist output even on non-racial contexts
Prefix/Prompt Tuning
❑ Expressive power: optimize the embeddings of a prompt, instead of the words
❑ "Prompt Tuning" optimizes only the embedding layer, while "Prefix Tuning" optimizes prefixes at all layers
(Li and Liang 2021; Lester et al. 2021)
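A minimal prompt-tuning sketch in PyTorch, assuming a GPT-2 backbone, 20 soft tokens, and a causal-LM loss on the verbalized label; this is an illustrative simplification, not Lester et al.'s exact recipe.

```python
# Prompt tuning: freeze the LM, learn only a small matrix of "soft prompt"
# embeddings that is prepended to the input embeddings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
for p in model.parameters():                  # freeze the language model
    p.requires_grad = False

n_soft, d = 20, model.config.n_embd
soft_prompt = torch.nn.Parameter(torch.randn(n_soft, d) * 0.02)   # only trainable weights
optim = torch.optim.Adam([soft_prompt], lr=1e-3)

def step(text):
    ids = tok(text, return_tensors="pt")["input_ids"]
    tok_embeds = model.get_input_embeddings()(ids)                  # [1, T, d]
    inputs = torch.cat([soft_prompt.unsqueeze(0), tok_embeds], 1)   # prepend soft tokens
    # Labels: ignore the soft-prompt positions, predict the real tokens.
    labels = torch.cat([torch.full((1, n_soft), -100, dtype=torch.long), ids], 1)
    return model(inputs_embeds=inputs, labels=labels).loss

optim.zero_grad()
loss = step("i'll bet the video game is a lot more fun than the film. Sentiment: negative")
loss.backward()
optim.step()
```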
Advanced Topics
Issues of few-shot prompting
❑ The purpose of presenting few-shot examples is to explain our intent to the model (i.e., describe the task instruction to the model in the form of demonstrations).
❑ But few-shot prompting can be expensive in terms of token usage and restricts the input length due to the limited context length.
Instruction tuning/prompting
❑ An instructed LM (e.g., InstructGPT, natural instruction) finetunes a pretrained model with high-quality tuples of (task instruction, input, ground truth output) to make the LM better understand user intention and follow instructions.
Without instruction:
Text: i'll bet the video game is a lot more fun than the film.
Sentiment:
With instruction:
Please label the sentiment towards the movie of the given movie review. The sentiment label should be "positive" or "negative".
Text: i'll bet the video game is a lot more fun than the film.
Sentiment:
For example, to produce education materials for kids:
Describe what is quantum physics to a 6-year-old.
And for safe content:
... in language that is safe for work.
Chain-of-thought (CoT) prompting (Wei et al. 2022)
Few-shot CoT prompting
Prompt the model with a few demonstrations, each containing manually written (or model-generated) high-quality reasoning chains.
Question: Tom and Elizabeth have a competition to climb a hill. Elizabeth takes 30 minutes to climb the hill. Tom takes four times as long as Elizabeth does to climb the hill. How many hours does it take Tom to climb up the hill?
Answer: It takes Tom 30*4 = <<30*4=120>>120 minutes to climb the hill. It takes Tom 120/60 = <<120/60=2>>2 hours to climb the hill. So the answer is 2.
===
Question: Jack is a soccer player. He needs to buy two pairs of socks and a pair of soccer shoes. Each pair of socks cost $9.50, and the shoes cost $92. Jack has $40. How much more money does Jack need?
Answer: The total cost of two pairs of socks is $9.50 x 2 = $<<9.5*2=19>>19. The total cost of the socks and the shoes is $19 + $92 = $<<19+92=111>>111. Jack need $111 - $40 = $<<111-40=71>>71 more. So the answer is 71.
===
Question: Marty has 100 centimeters of ribbon that he must cut into 4 equal parts. Each of the cut parts must be divided into 5 equal parts. How long will each final cut be?
Answer:
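(For reference, the reasoning the model is expected to complete here: 100 / 4 = 25 centimeters per part, and 25 / 5 = 5 centimeters per final piece, so the answer is 5.)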
Zero-shot CoT prompting
Use a natural language statement like "Let's think step by step" to explicitly encourage the model to first generate reasoning chains, and then prompt with "Therefore, the answer is" to produce answers (Kojima et al. 2022).
• Similar statements: "Let's work this out in a step by step way to be sure we have the right answer" (Zhou et al. 2022).
• ... and many follow-up works.
Question: Marty has 100 centimeters of ribbon that he must cut into 4 equal parts. Each of the cut parts must be divided into 5 equal parts. How long will each final cut be?
Meta-cognition of LLMs
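A minimal sketch of the two-stage zero-shot-CoT procedure described above; `llm` is a hypothetical stand-in for any text-completion call (e.g., an API client), not a specific library function.

```python
# Two-stage zero-shot CoT (Kojima et al. 2022 style): first elicit the
# reasoning chain, then extract the final answer from it.
def zero_shot_cot(question, llm):
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = llm(reasoning_prompt)                       # stage 1: reasoning chain
    answer_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    return llm(answer_prompt)                               # stage 2: final answer

question = ("Marty has 100 centimeters of ribbon that he must cut into 4 equal "
            "parts. Each of the cut parts must be divided into 5 equal parts. "
            "How long will each final cut be?")
# answer = zero_shot_cot(question, llm=my_llm)   # plug in your own model call
```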
More Advanced Prompting Techniques
Self-consistency sampling (Wang et al. 2022a)
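Self-consistency samples multiple reasoning chains for the same CoT prompt (using temperature sampling) and takes a majority vote over the final answers. A minimal sketch, assuming hypothetical `llm(prompt, temperature)` and task-specific `extract_answer` helpers:

```python
# Self-consistency: sample several CoT completions and majority-vote the answers.
from collections import Counter

def self_consistency(prompt, llm, extract_answer, n_samples=10):
    answers = []
    for _ in range(n_samples):
        completion = llm(prompt, temperature=0.7)   # diverse reasoning paths
        answers.append(extract_answer(completion))  # task-specific answer parsing
    return Counter(answers).most_common(1)[0][0]    # majority-vote answer
```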
Generate knowledge before making a prediction: how helpful is this for tasks such as commonsense reasoning?
Tree of Thoughts (Yao et al. 2023)
❑ In Tree of Thoughts, "thoughts" are coherent language sequences that serve as intermediate steps toward solving a problem.
❑ The LM's ability to generate and evaluate thoughts is then combined with search algorithms (e.g., BFS, DFS) to enable systematic exploration of thoughts with lookahead and backtracking.
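A minimal sketch of the breadth-first-search variant of this idea; `propose` and `score` are hypothetical stand-ins for the LLM calls that generate candidate next thoughts and evaluate partial solutions, not Yao et al.'s exact prompts.

```python
# Tree-of-Thoughts-style BFS: expand each partial solution with proposed
# thoughts, score the candidates, and keep only the most promising ones.
def tree_of_thought_bfs(problem, propose, score, breadth=3, depth=3):
    frontier = [[]]                                   # each state is a list of thoughts
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for thought in propose(problem, state):   # LLM proposes next steps
                candidates.append(state + [thought])
        # Keep only the highest-scoring partial solutions (lookahead via scoring).
        candidates.sort(key=lambda s: score(problem, s), reverse=True)
        frontier = candidates[:breadth]
    return frontier[0] if frontier else []            # best chain of thoughts found
```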
Multi-modal CoT (Zhang et al. 2023)
Advanced Chain-of-thought Prompting
❑ Chain-of-thought prompting
❑ Self-consistency
❑ Tree-of-Thoughts
❑ Multimodal Chain-of-Thought
❑ Automatic Chain-of-Thought
❑ Program-of-Thoughts
❑ Graph-of-Thoughts
❑ Algorithm-of-Thoughts
❑ Skeleton-of-Thought
❑ ...
Pointers to other tricks
❑ Self-Taught Reasoner (Zelikman et al. 2022; Fu et al. 2023)
❑ Complexity-based consistency (Fu et al. 2023; Shum et al. 2023)
❑ Explanation-augmented prompting (Ye & Durrett 2022)
❑ Self-Ask (Press et al. 2022)
❑ Interleaving Retrieval CoT (Trivedi et al. 2022)
❑ ReAct (Reason + Act) (Yao et al. 2023)
❑ Automatic Prompt Engineer (Zhou et al. 2022)
o APS (Augment-Prune-Select) (Shum et al. 2023)
o Clustering-based generation (Zhang et al. 2023)
Risks & Misuses
We need to watch out for …
❑ Factually wrong generations (i.e., "hallucinations")
❑ Biases and unethical generations
❑ Generations that violate privacy & intellectual property
❑ Other problems..? (HW5!)
Factually wrong generations
Adversarial prompting
❑ Prompt Injection
❑ Prompt Leaking
❑ Jailbreaking
Prompt injection
Prompt injection tricks LLMs into behaving in an undesired or irregular manner.
Prompt leakage
❑ A form of prompt injection, characterized by attacks aimed at divulging details from prompts,
o potentially exposing confidential or proprietary information that was not meant for public disclosure.
Collection of leaked prompts of GPTs: https://github.com/linexjlin/GPTs?tab=readme-ov-file
"Bing Chat spills its secrets via prompt injection attack"
Jailbreaking
❑ LLMs are safeguarded from responding to unethical commands.
o However, their resistance can be circumvented if the request is cleverly framed within a context.
source: https://www.reddit.com/r/ChatGPT/comments/10tevu1/new_jailbreak_proudly_unveiling_the_tried_and/?rdt=60884
Many-shot jailbreaking
https://www.anthropic.com/research/many-shot-jailbreaking
Other resources
❑ OpenAI Cookbook has many in-depth examples of how to use LLMs efficiently.
❑ LangChain, a library for combining language models with other components to build applications.
❑ The Prompt Engineering Guide repo contains a pretty comprehensive collection of educational materials on prompt engineering.
❑ learnprompting.org
❑ PromptPerfect
❑ Semantic Kernel
❑ https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/