
Lecture 7: Prompting in Large Language Models

Dr. Mohamed Taher Alrefaie


Emergent behavior from scaling laws:
A quantum jump in performance once models exceed roughly 100B parameters

Jeff Dean https://ai.googleblog.com/2023/01/google-research-2022-beyond-language.html


2
GPT-3, in-context learning, and VERY large language models

❑ So far, we’ve interacted with pretrained models in two ways:
o Sample from the distributions they define
o Fine-tune them on a task we care about, and then take their predictions
❑ Emergent behavior: Very large language models seem to perform some kind of learning without gradient steps, simply from examples you provide within their contexts.
o GPT-3 is the canonical example of this. The largest T5 model had 11 billion parameters; GPT-3 has 175 billion parameters.

3
NLP Technical Development for the past 10 years
Prompt Engineering / In-context Learning
- Prompting with LLMs
- GPT-4, GPT-3, ChatGPT, DALL-E 2
Objective Engineering
- Pre-training and fine-tuning
- BERT, GPT-2, T5
Architecture Engineering
- Neural nets, e.g., LSTM, CNN, GRU
- Features from Word2Vec, GloVe
Feature Engineering
- Hand-crafted features
- SVM/CRF training

(Timeline figure: stages spanning roughly 2013 to the present)

4
What is Prompting?
❑ Very large language models seem to perform some kind of learning without gradient steps, simply from examples you provide within their contexts.
❑ Encouraging a pre-trained model to make particular predictions by providing a "prompt" specifying the task to be done.
5
Pretrained model choice

Decoders: GPT-2, GPT-3, LaMDA
Encoders: BERT, RoBERTa
Encoder-Decoders: Vanilla Transformer, T5, BART

6
Encoders

7
Traditional vs Prompt formulation

8
Labels are not Y anymore, but a part of X:
• Classification P(Y | X) → Generation P(X)
9
Traditional vs Prompt formulation

We have reformulated the task! We also should re-define the "ground truth labels".

10
Traditional vs Prompt formulation

11
Basic Prompting

12
Zero-shot Prompting (Decoders)

❑ Simply feed the task text to the model and ask for results.

Text: i'll bet the video game is a lot more fun than
the film. Sentiment:

13
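To make the zero-shot setup concrete, here is a minimal sketch in Python. It assumes the OpenAI Python client and an API key in the environment; the model name is purely illustrative, and any other completion endpoint or local model could be swapped in.

```python
# Minimal zero-shot prompting sketch. Assumes `pip install openai` and that
# OPENAI_API_KEY is set; the model name is illustrative, not prescribed by the slides.
from openai import OpenAI

client = OpenAI()

prompt = (
    "Text: i'll bet the video game is a lot more fun than the film.\n"
    "Sentiment:"
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",                     # any chat-capable model works here
    messages=[{"role": "user", "content": prompt}],
    temperature=0,                           # deterministic label prediction
    max_tokens=3,
)
print(resp.choices[0].message.content.strip())   # e.g., "positive"
```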
Few-shot Prompting (Decoders)

❑ Presents a set of demonstrations (both input and output) on the target task. As the model first sees good examples, it can better understand human intention and the criteria for what kinds of answers are wanted (a prompt-construction sketch follows the example below).
Text: (lawrence bounces) all over the stage, dancing, running, sweating,
mopping his face and generally displaying the wacky talent that brought him
fame in the first place. Sentiment: positive

Text: despite all evidence to the contrary, this clunker has somehow managed
to pose as an actual feature movie, the kind that charges full admission and
gets hyped on tv and purports to amuse small children and ostensible adults.
Sentiment: negative

Text: i'll bet the video game is a lot more fun than the film. Sentiment:

14
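A small sketch of how such a few-shot prompt can be assembled programmatically from labelled demonstrations. The `demos` list and the commented-out `complete(...)` call are placeholders for your own data and model call, not part of the lecture's materials.

```python
# Assemble a few-shot sentiment prompt from (text, label) demonstrations.
# `demos` and `complete` are placeholders; any LLM completion call can be used.
demos = [
    ("(lawrence bounces) all over the stage, dancing, running, sweating ...", "positive"),
    ("despite all evidence to the contrary, this clunker has somehow managed ...", "negative"),
]
query = "i'll bet the video game is a lot more fun than the film."

prompt = "".join(f"Text: {text}\nSentiment: {label}\n\n" for text, label in demos)
prompt += f"Text: {query}\nSentiment:"

# prediction = complete(prompt)   # expected to return "negative" here
print(prompt)
```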
Prompt-based Training Strategies
❑ How many training samples are necessary to learn the
task?
o Zero-shot: without any explicit training of the LM for the
downstream task
o Few-shot: few training samples (e.g., 1-100) of downstream
tasks
o Full-data: lots of training samples (e.g., 10K) of downstream
tasks
✔ Typical finetuning or supervised training

15
Few-shot Prompting (Zhao et al., 2021)
❑ Several biases
o Majority label bias: if the distribution of labels among the examples is unbalanced;
o Recency bias: the tendency of the model to repeat the label that appears at the end;
o Common token bias: the tendency to produce common tokens more often than rare tokens.
❑ Many studies looked into how to construct in-context examples to maximize performance
o The choice of prompt format, examples, and their order can lead to dramatically different performance, from near random guess to near SoTA.
o How to make in-context learning more reliable and deterministic?
16
Tips for Example Selection
❑ Choose examples that are semantically similar to the test
example using k-NN clustering in the embedding space (Liu
et al., 2021)
❑ To select a diverse and representative set of examples,
different sampling methods have been studied.
o Graph-based similarity search (Su et al. (2022)),
o Contrastive learning (Rubin et al. (2022)) ,
o Q-learning (Zhang et al. 2022), and
o Active learning (Diao et al. (2023))

17
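A minimal sketch of the semantic-similarity selection idea (Liu et al., 2021): embed the candidate demonstrations and the test input, then keep the k nearest neighbours. It assumes the sentence-transformers library; the embedding model name is just one common choice, not one the lecture prescribes.

```python
# Select the k demonstrations most similar to the test example in embedding space.
# Assumes `pip install sentence-transformers numpy`; the model name is one common choice.
import numpy as np
from sentence_transformers import SentenceTransformer

def select_examples(candidates, test_text, k=4):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    cand_emb = model.encode(candidates, normalize_embeddings=True)   # (N, D)
    test_emb = model.encode([test_text], normalize_embeddings=True)  # (1, D)
    sims = cand_emb @ test_emb.T                                     # cosine similarity
    top = np.argsort(-sims.ravel())[:k]                              # k nearest neighbours
    return [candidates[i] for i in top]
```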
Tips for Example Ordering
❑ Keep the selection of examples diverse, relevant to the test sample
and in random order to avoid majority label bias and recency
bias.
❑ Increasing model sizes or including more training examples does not
reduce variance among different permutations of in-context
examples.
❑ When the validation set is limited, consider choosing the order such that the model does not produce extremely unbalanced predictions and is not overconfident in its predictions (Lu et al., 2022).

18
Prompt Search

19
Traditional vs Prompt formulation
How to define a suitable prompt template?

20
Format of prompts

❑ Cloze Prompt (Encoders)
o A prompt with a slot [z] to fill in the middle of the text, e.g., "I love this movie. Overall it was a [z] movie."
o Encoder models trained with the MLM objective,
o e.g., BERT, LAMA, TemplateNER

❑ Prefix Prompt (Decoders)
o A prompt where the input text comes entirely before the slot [z], e.g., "I love this movie. Overall this movie is [z]"
o Decoder models trained with the LM objective,
o e.g., GPT-3, Prefix-tuning, Prompt-tuning
21
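Both prompt formats can be tried directly with off-the-shelf models. A small sketch using Hugging Face pipelines; the model names are illustrative defaults, not ones mandated by the lecture.

```python
# Cloze prompt with an encoder (masked LM) vs. prefix prompt with a decoder (causal LM).
# Assumes `pip install transformers torch`; model names are illustrative.
from transformers import pipeline

# Cloze: the slot sits in the middle of the text and is filled by a BERT-style model.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("I love this movie. Overall it was a [MASK] movie.")[0]["token_str"])

# Prefix: the input comes entirely before the slot and is continued by a GPT-style model.
gen = pipeline("text-generation", model="gpt2")
print(gen("I love this movie. Overall this movie is", max_new_tokens=3)[0]["generated_text"])
```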
Design of Prompt Templates
❑ Hand-crafted
o Configure the manual template based on the characteristics of
the task

❑ Automated search
o Search in discrete space, e.g., AdvTrigger, AutoPrompt
o Search in continuous space, e.g., Prefix-tuning, Prompt-tuning

22
Prompt Mining (Prompt = Template) (Jiang et al., 2019)

Mine prompts given a set of question/answer pairs:
o Middle-word
o Dependency-based

23
Prompt Paraphrasing (Prompt = Template) (Jiang et al., 2019)

Paraphrase an existing prompt to get other candidates, e.g., back-translation with beam search.

24
Gradient-based Search (Prompt = Trigger Tokens)
AutoPrompt (Shin et al., 2020; code)

25
Gradient-based Search (Prompt = Trigger Tokens)
AutoPrompt (Shin et al., 2020; code)

Still well below fine-tuned RoBERTa models, but a huge improvement over manual prompting.

26
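To make the gradient-based idea concrete, here is a heavily simplified sketch of the candidate-selection step in the spirit of AutoPrompt/HotFlip: score every vocabulary token by a first-order estimate of how much swapping it into a trigger position would reduce the task loss. This is not the released AutoPrompt code; `label_loss_fn` is a hypothetical callable that runs a forward pass on input embeddings and returns the loss.

```python
# First-order candidate scoring for one trigger position, in the spirit of
# AutoPrompt (Shin et al., 2020) / HotFlip. Conceptual sketch only.
import torch

def top_candidate_tokens(model, input_ids, trigger_pos, label_loss_fn, k=10):
    emb_layer = model.get_input_embeddings()                   # (vocab_size, dim)
    inputs_embeds = emb_layer(input_ids).detach().requires_grad_(True)
    loss = label_loss_fn(inputs_embeds)                        # forward pass on embeddings
    loss.backward()
    grad = inputs_embeds.grad[0, trigger_pos]                  # gradient at the trigger slot
    # Estimated loss change for swapping in each vocab token: larger = bigger decrease.
    scores = emb_layer.weight.detach() @ (-grad)
    return torch.topk(scores, k).indices                       # candidate replacement tokens
```

In the full algorithm, each candidate is then re-evaluated with an actual forward pass, the best replacement is kept, and the process repeats over trigger positions.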
Universal Trigger (Wallace et al., 2019)

Trigger tokens for adversarial attacks on existing off-the-shelf NLP systems.

Universal Trigger: input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset.

E.g.,
• SNLI (89.95% -> 0.55%)
• SQuAD (72% of "why" questions answered "to kill American people")
• GPT-2 made to spew racist output on non-racial contexts.
27
Universal Trigger (Wallace et al., 2019)

Example: positive movie reviews
❑ Follow the gradient of the classifier's p(neg) toward the target adversarial label (negative) to choose trigger words that push the input to be classified as negative.
❑ The search finds trigger words that make positive reviews be classified as negative.
28
Sub-optimal and sensitive discrete/hard prompts
❑ Discrete/hard prompts
o natural language instructions/task descriptions
❑ Problems
o require domain expertise/understanding of the model’s inner
workings
o performance still lags far behind SotA model tuning results
o sub-optimal and sensitive
✔ prompts that humans consider reasonable are not necessarily effective for language models (Liu et al., 2021)
✔ pre-trained language models are sensitive to the choice of prompts (
Zhao et al., 2021)
29
Prefix/Prompt Tuning (Li and Liang, 2021; Lester et al., 2021)
❑ Expressive power: optimize the embeddings of a prompt, instead of the words
❑ "Prompt Tuning" optimizes only the embedding layer, while "Prefix Tuning" optimizes a prefix at all layers

30
Prefix/Prompt Tuning (Li and Liang, 2021; Lester et al., 2021)

31
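A minimal sketch of soft prompt tuning in PyTorch: a small matrix of learnable prompt embeddings is prepended to the frozen model's input embeddings and trained with the task loss. The wrapper class is illustrative rather than any library's actual API (libraries such as PEFT ship ready-made versions), and in practice the attention mask would also need to be extended by `n_prompt_tokens`.

```python
# Soft "prompt tuning" sketch in the spirit of Lester et al. (2021): only the prompt
# embeddings are trainable; the pretrained LM stays frozen. Assumes a Hugging
# Face-style model that accepts `inputs_embeds`.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, base_model, n_prompt_tokens=20):
        super().__init__()
        self.base = base_model
        for p in self.base.parameters():          # freeze the pretrained LM
            p.requires_grad = False
        dim = self.base.get_input_embeddings().embedding_dim
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, dim) * 0.02)

    def forward(self, input_ids, **kw):
        tok_emb = self.base.get_input_embeddings()(input_ids)          # (B, T, D)
        prompt = self.prompt.unsqueeze(0).expand(tok_emb.size(0), -1, -1)
        inputs_embeds = torch.cat([prompt, tok_emb], dim=1)            # prepend soft prompt
        return self.base(inputs_embeds=inputs_embeds, **kw)
```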
Advanced Topics

32
Issues of few-shot prompting
❑ The purpose of presenting few-shot examples is to explain our intent to the model (describe the task instruction to the model in the form of demonstrations).
❑ But few-shot prompting can be expensive in terms of token usage and restricts the input length due to the limited context length.

❑ Why not just give the instruction directly?

33
Instruction tuning/prompting
❑ Instructed LMs (e.g., InstructGPT, natural instructions) finetune a pretrained model with high-quality tuples of (task instruction, input, ground truth output) to make the LM better understand user intentions and follow instructions.

❑ Improves the model's alignment with human intention and greatly reduces the cost of communication.

34
Text: i'll bet the video game is a lot more fun than
the film. Sentiment:

35
Please label the sentiment towards the movie of the
given movie review. The sentiment label should be
"positive" or "negative".
Text: i'll bet the video game is a lot more fun than
the film. Sentiment:
For example, to produce educational materials for kids:
Describe what is quantum physics to a 6-year-old.
And for safe content:
... in language that is safe for work.

36
Chain-of-thought (CoT) prompting (Wei et al., 2022)

(Figure slides 37-41: worked chain-of-thought examples from Wei et al., 2022)
41
Few-shot CoT prompting
The idea is to prompt the model with a few demonstrations, each containing a manually written (or model-generated) high-quality reasoning chain.
Question: Tom and Elizabeth have a competition to climb a hill. Elizabeth takes 30 minutes to climb the
hill. Tom takes four times as long as Elizabeth does to climb the hill. How many hours does it take Tom to
climb up the hill? Answer: It takes Tom 30*4 = <<30*4=120>>120 minutes to climb the hill. It takes Tom
120/60 = <<120/60=2>>2 hours to climb the hill. So the answer is 2. ===

Question: Jack is a soccer player. He needs to buy two pairs of socks and a pair of soccer shoes. Each
pair of socks cost $9.50, and the shoes cost $92. Jack has $40. How much more money does Jack
need?
Answer: The total cost of two pairs of socks is $9.50 x 2 = $<<9.5*2=19>>19. The total cost of the
socks and the shoes is $19 + $92 = $<<19+92=111>>111. Jack need $111 - $40 = $<<111-40=71>>71
more. So the answer is
71. ===

Question: Marty has 100 centimeters of ribbon that he must cut into 4 equal parts. Each of the cut
parts must be divided into 5 equal parts. How long will each final cut be?
Answer:
42
Zero-shot CoT prompting
Use a natural language statement like "Let's think step by step" to explicitly encourage the model to first generate reasoning chains, and then prompt with
• "Therefore, the answer is" to produce the answer (Kojima et al., 2022).
• Similar statements, e.g., "Let's work this out in a step by step way to be sure we have the right answer" (Zhou et al., 2022).
• ... and many follow-up works.

Question: Marty has 100 centimeters of ribbon that he must cut into 4 equal parts. Each
of the cut parts must be divided into 5 equal parts. How long will each final cut be?

Answer: Let's think step by step.

Meta-cognition of LLMs
43
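A sketch of the two-stage zero-shot CoT procedure from Kojima et al. (2022): first elicit a reasoning chain, then append an answer-extraction cue. `complete` is a placeholder for any text-completion call, not a real library function.

```python
# Two-stage zero-shot chain-of-thought (Kojima et al., 2022), schematic only.
def zero_shot_cot(question, complete):
    # Stage 1: elicit a reasoning chain.
    reasoning = complete(f"Question: {question}\nAnswer: Let's think step by step.")
    # Stage 2: extract the final answer from the generated reasoning.
    answer = complete(
        f"Question: {question}\nAnswer: Let's think step by step. {reasoning}\n"
        "Therefore, the answer is"
    )
    return answer
```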
More Advanced Prompting Techniques

44
Self-consistency sampling (Wang et al., 2022a)
45
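The core of self-consistency (Wang et al., 2022a) is easy to sketch: sample several reasoning chains at non-zero temperature and return the majority answer. `sample` and `extract_answer` are placeholders for your own generation and answer-parsing code, not functions from the paper.

```python
# Self-consistency sketch: majority vote over independently sampled CoT answers.
from collections import Counter

def self_consistency(prompt, sample, extract_answer, n=10):
    answers = [extract_answer(sample(prompt, temperature=0.7)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]    # majority-vote answer
```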
Generated Knowledge Prompting (Liu et al., 2022)
Generate knowledge before making a prediction: how helpful is this for tasks such as commonsense reasoning?
46

(Figure slides 47-50: Generated Knowledge Prompting examples from Liu et al., 2022)
50
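A simplified sketch of the generate-then-answer pattern: first ask the model for relevant background facts, then answer with those facts in the context. The original method uses few-shot demonstrations to generate knowledge and aggregates answers across knowledge statements; here `complete` is a placeholder for any LLM call and the prompt wording is illustrative.

```python
# Generated-knowledge prompting, schematic version only.
def generated_knowledge_answer(question, complete, n_knowledge=3):
    knowledge = [
        complete(f"Generate a relevant fact about: {question}\nFact:", temperature=0.8)
        for _ in range(n_knowledge)
    ]
    context = "\n".join(knowledge)                       # knowledge goes into the context
    return complete(f"{context}\nQuestion: {question}\nAnswer:")
```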
Advanced Chain-of-thought Prompting

51
Tree of Thoughts (Yao et al., 2023)
❑ In a tree of thoughts, "thoughts" are coherent language sequences that serve as intermediate steps toward solving a problem.
❑ The LM's ability to generate and evaluate thoughts is then combined with search algorithms (e.g., BFS, DFS) to enable systematic exploration of thoughts with lookahead and backtracking.
52
Tree of Thoughts (Yao et al., 2023)
53
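A highly simplified breadth-first sketch in the spirit of Tree of Thoughts: at each depth, expand every kept state with candidate thoughts and retain only the best-scoring partial solutions. `propose` and `score` stand in for prompted LLM calls; this is not the authors' implementation.

```python
# Beam-style BFS over "thoughts", schematic only.
def tot_bfs(problem, propose, score, depth=3, beam=5):
    frontier = [[]]                                # each state is a list of thoughts so far
    for _ in range(depth):
        candidates = [
            state + [thought]
            for state in frontier
            for thought in propose(problem, state)  # LLM proposes next-step thoughts
        ]
        # Keep the `beam` highest-scoring partial solutions (lookahead via `score`).
        frontier = sorted(candidates, key=lambda s: score(problem, s), reverse=True)[:beam]
    return frontier[0]                              # best chain of thoughts found
```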
Multi-modal CoT (Zhang et al., 2023)
5
Advanced Chain-of-thought Prompting
❑ Chain-of-thought prompting
❑ Self-consistency
❑ Tree-of-Thoughts
❑ Multimodal Chain-of-Thought
❑ Automatic-Chain-of-Thought
❑ Program-of-Thoughts
❑ Graph-of-Thoughts
❑ Algorithm-of-Thoughts
❑ Skeleton-of-Thought
❑…
55
Pointers to other tricks
❑ Self-Taught Reasoner (Zelikman et al., 2022; Fu et al., 2023)
❑ Complexity-based consistency (Fu et al., 2023; Shum et al., 2023)
❑ Explanation-augmented prompting (Ye & Durrett (2022))
❑ Self-Ask (Press et al. 2022)
❑ Interleaving Retrieval CoT (Trivedi et al. 2022)
❑ ReAct (Reason + Act) (Yao et al. 2023)
❑ Automatic Prompt Engineer (Zhou et al. 2022)
o APS (Augment-Prune-Select); Shum et al. (2023)
o Clustering-based generation Zhang et al. (2023)
56
Risks & Misuses

57
We need to watch out for …
❑ Factually wrong generations (i.e., “Hallucinations”)
❑ Biases and unethical generations
❑ Generations that violate privacy & intellectual property
❑ Other problems..? (HW5!)

58
Factually wrong generations

*Hinton received the 2018 Turing Award, together with Yoshua Bengio and Yann LeCun, for their work on deep learning.

59
Adversarial prompting
❑ Prompt Injection
❑ Prompt Leaking
❑ Jailbreaking

60
Prompt injection
Prompt injection tricks LLMs into behaving in an undesired or irregular manner.

61
Prompt leakage
❑ A form of prompt injection, characterized by attacks aimed at divulging details from prompts
o Potentially exposing confidential or proprietary information that was not meant for public disclosure.

Collection of leaked prompts of GPTs: https://github.com/linexjlin/GPTs?tab=readme-ov-file
"Bing Chat spills its secrets via prompt injection attack"
62
Jailbreaking
❑ LLMs are safeguarded from responding to unethical commands.
o However, their resistance can be circumvented if the request is cleverly framed within a context.

source:
https://www.reddit.com/r/ChatGPT/comments/10tevu1/new_jailbreak_proudly_unveiling_the_tried_and/?rdt=60884
63
Jailbreaking by many-shot

https://www.anthropic.com/research/many-shot-jailbreaking

64
Rachel Skilton
65
Other resources
❑ OpenAI Cookbook has many in-depth examples of how to use LLMs efficiently.
❑ LangChain, a library for combining language models with other
components to build applications.
❑ Prompt Engineering Guide repo contains a pretty comprehensive
collection of education materials on prompt engineering.
❑ learnprompting.org
❑ PromptPerfect
❑ Semantic Kernel
❑ https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/
66
