
Can ChatGPT Understand Too?

A Comparative Study on ChatGPT and Fine-tuned BERT

Qihuang Zhong♦♠∗, Liang Ding♠♣, Juhua Liu♦, Bo Du♦, Dacheng Tao♠♣

♦ Wuhan University   ♠ JD Explore Academy, JD.com Inc.   ♣ University of Sydney
[email protected], [email protected]
https://github.com/WHU-ZQH/ChatGPT-vs.-BERT
arXiv:2302.10198v2 [cs.CL] 2 Mar 2023

∗ Work was done when Qihuang was interning at JD Explore Academy.

Abstract

Recently, ChatGPT has attracted great attention, as it can generate fluent and high-quality responses to human inquiries. Several prior studies have shown that ChatGPT attains remarkable generation ability compared with existing models. However, the quantitative analysis of ChatGPT's understanding ability has been given little attention. In this report, we explore the understanding ability of ChatGPT by evaluating it on the most popular GLUE benchmark, and comparing it with 4 representative fine-tuned BERT-style models. We find that: 1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT achieves comparable performance compared with BERT on sentiment analysis and question-answering tasks. Additionally, by combining some advanced prompting strategies, we show that the understanding ability of ChatGPT can be further improved.

1 Introduction

Large language models (LLMs), such as GPT-3 (Brown et al., 2020) and InstructGPT (Ouyang et al., 2022), have swept the natural language processing (NLP) community. Due to their emergent abilities (Wei et al., 2022a), these LLMs can achieve impressive few-shot and zero-shot performance on a variety of NLP tasks. More recently, ChatGPT (https://chat.openai.com), developed by OpenAI upon InstructGPT (Ouyang et al., 2022), has attracted great attention. Encouragingly, unlike prior public chatbots, ChatGPT is able to generate fluent and comprehensive responses to various human inquiries, and can even correct inappropriate human questions.

In light of the conventional wisdom that "GPT-style models work well on generation tasks, but perform poorly on understanding tasks, even worse than the base-sized BERT (Devlin et al., 2019)", we wonder whether there is a similar phenomenon in the ChatGPT scenario. Regarding the generation ability of ChatGPT, several prior studies (Jiao et al., 2023; Bang et al., 2023; Wang et al., 2023) have shown that ChatGPT can achieve comparable or even better performance than existing LLMs on several generation tasks. However, it is still unclear whether ChatGPT also works well on natural language understanding (NLU) tasks.

In this report, we provide a systematic study to explore the question: "can ChatGPT understand too?". We answer this question by evaluating ChatGPT on the authoritative and popular GLUE (Wang et al., 2019) benchmark, spanning 8 representative understanding tasks, i.e., sentiment analysis, linguistic acceptability, paraphrase, textual similarity, natural language inference, and question answering. For reference, we also compare it with 4 representative BERT-style models. Through a series of experiments and analyses, we find that:

• ChatGPT falls short in handling paraphrase and similarity tasks. Specifically, ChatGPT performs poorly on negative paraphrase samples and on neutral similarity samples, respectively.

• ChatGPT outperforms all BERT-style models on inference tasks by a large margin, indicating its impressive reasoning ability.

• ChatGPT achieves comparable performance to BERT-base on sentiment analysis and question-answering tasks.

• Despite its good performance on inference tasks, ChatGPT may generate contradictory or unreasonable responses, which is a potential limitation.

Furthermore, in addition to analyzing ChatGPT itself, we also explore the complementarity
of ChatGPT and some advanced prompting strategies, i.e., standard few-shot prompting (also known as in-context learning) (Brown et al., 2020), manual few-shot chain-of-thought (CoT) prompting (Wei et al., 2022b), and zero-shot CoT prompting (Kojima et al., 2022). Empirically, we find that ¶ all these prompting strategies can consistently improve ChatGPT, among which manual-CoT brings the largest performance benefit. Interestingly, we also observe that · the performance of in-context learning is relatively sensitive to the provided examples, especially in the 1-shot scenario, which is similar to the findings of Agrawal et al. (2022). One possible reason is that the performance of in-context learning is (highly) related to the correlation (e.g., similarity) between the provided examples and the test data.

To summarize, the zero-shot performance of ChatGPT is comparable to that of the baseline fine-tuned BERT-base model. With the help of advanced prompting strategies, ChatGPT shows better understanding ability, and even outperforms the powerful RoBERTa-large model on some NLU tasks. However, there is still a performance gap between ChatGPT and fine-tuned RoBERTa-large in terms of average performance. That said, while ChatGPT can solve many NLP problems quite well, it still fails to beat the current SOTA models (He et al., 2021; Wang et al., 2020; Zhong et al., 2022d; Patra et al., 2022; Zhong et al., 2023), especially on some NLU tasks.

The remainder of this report is organized as follows. We present the evaluation settings and comparative results in Section 2. In Section 3, we explore whether ChatGPT can be improved with advanced prompting strategies. In Section 4, we briefly review the related works. Conclusions are presented in Section 5.

2 ChatGPT vs. BERT

In this section, we first introduce the evaluation setting (§2.1) and present the major results (§2.2). Then, some analyses of why ChatGPT performs well or poorly are provided (§2.3). Lastly, we show some failure cases of ChatGPT to explore its potential limitations (§2.4).

2.1 Evaluation Setting

Here, we briefly introduce the evaluation setting, including the downstream tasks and datasets, baselines, and prompts for ChatGPT. More detailed descriptions are given in Appendix A.1.

Tasks and Datasets. Following many prior works (Zhong et al., 2022a, 2023), we use the widely-used GLUE benchmark (Wang et al., 2019) for model evaluation. As one of the most popular NLU benchmarks, GLUE consists of several challenging NLU tasks, including linguistic acceptability (CoLA, Warstadt et al. (2019)), sentiment analysis (SST-2, Socher et al. (2013)), paraphrase (MRPC, Dolan and Brockett (2005)), textual similarity (STS-B, Cer et al. (2017)), question paraphrase (QQP), textual entailment (MNLI, Williams et al. (2018); RTE, Giampiccolo et al. (2007)) and question-answer entailment (QNLI, Rajpurkar et al. (2016)). Considering the limits of testing ChatGPT, we follow Jiao et al. (2023) and randomly sample a subset of the dev set as the evaluation data for each task. Specifically, since most GLUE tasks are classification tasks (except STS-B, which is a regression task), we randomly sample 25 instances for each class from the dev set. For STS-B, we randomly sample 50 instances from a uniform distribution over the score range. Table 1 shows the task descriptions and statistics.

For evaluation, we report performance with the Accuracy ("Acc.") metric for most tasks, except the Pearson and Spearman correlations ("Pear./Spea.") for STS-B, the Matthews correlation ("Mcc.") for CoLA, and the additional F1 score for MRPC and QQP.

Baselines. We compare ChatGPT (Jan 31 Version) with 4 representative BERT-style models, as BERT models are commonly used as baselines for evaluating understanding ability (Zhong et al., 2022b). Specifically, base-sized and large-sized BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) are used. All models are fine-tuned on the full training set for each task, with the same fine-tuning hyper-parameters as in Zhong et al. (2022c). To estimate the lower bound of ChatGPT's understanding ability, we mainly focus on the comparison between ChatGPT and the basic base-sized BERT.

Prompts for ChatGPT. For each task, we design task-specific prompts to trigger the understanding ability of ChatGPT. Specifically, inspired by Jiao et al. (2023), we also ask ChatGPT to generate the prompts for each task, by inputting the following human inquiry:

> provide five concise prompts or templates that can make you deal with the [x] task

where [x] is the task slot. Taking the sentiment analysis task as an example, we show this process in Figure 1. We evaluated ChatGPT on the sentiment analysis task with these five candidate prompts in preliminary experiments and found only slight performance differences. Thus, for simplicity, we choose one typical prompt for each task and show the chosen prompts in Table 1.

Figure 1: Prompts for sentiment analysis, generated by ChatGPT.
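To make the querying pipeline concrete, the following is a minimal sketch of how a template from Table 1 (below) can be instantiated and how ChatGPT's free-form reply can be mapped back to a label. The paper does not release its answer-extraction code, so the first-keyword parsing rule here is our assumption:

```python
import re

# Template for MRPC/QQP, copied from Table 1; text_1/text_2 are the input slots.
PARAPHRASE_TEMPLATE = (
    'For the sentence pair "{text_1}" and "{text_2}", '
    "do these two sentences have the same semantics?"
)

def build_prompt(template: str, **slots: str) -> str:
    """Fill the input slots of a Table 1 template."""
    return template.format(**slots)

def parse_yes_no(response: str):
    """Map ChatGPT's free-form reply to a binary label.

    Assumed heuristic: take the first "yes"/"no" among the leading words.
    """
    for token in re.findall(r"[a-z]+", response.lower())[:10]:
        if token == "yes":
            return 1
        if token == "no":
            return 0
    return None  # unparsable reply; flag it for manual inspection
```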
Task | #Pos. | #Neg. | #Neu. | Description | Template Prompt

Single-Sentence Tasks
CoLA | 25 | 25 | - | acceptability | For the sentence: "[text]", is the sentence grammarly correct?
SST-2 | 25 | 25 | - | sentiment | For the sentence: "[text]", is the sentiment in this sentence positive or negative?

Similarity and Paraphrase Tasks
MRPC | 25 | 25 | - | paraphrase | For the sentence pair "[text_1]" and "[text_2]", do these two sentences have the same semantics?
STS-B | total of 50 | - | - | similarity | Determine the similarity between the following two sentences: "[text_1]" and "[text_2]". The score should be ranging from 0.0 to 5.0, and can be a decimal.
QQP | 25 | 25 | - | paraphrase | For the sentence pair "[text_1]" and "[text_2]", do these two sentences have the same semantics?

Inference Tasks
MNLI | 25 | 25 | 25 | NLI | Given the sentence "[text_1]", determine if the following statement is entailed or contradicted or neutral: "[text_2]"
QNLI | 25 | 25 | - | QA/NLI | Given the question "[text_1]", determine if the following sentence contains the corresponding answer: "[text_2]"
RTE | 25 | 25 | - | NLI | Given the sentence "[text_1]", determine if the following statement is entailed: "[text_2]"

Table 1: Task statistics, descriptions and prompts. All tasks are single-sentence or sentence-pair classification, except STS-B, which is a regression task. For ease of illustration, we use "#Pos./#Neg./#Neu." to denote the numbers of positive, negative and neutral instances for each task. Considering the limits of ChatGPT, we randomly sample 25 instances for each class from the dev set of each task for evaluation, except for STS-B, where we randomly sample 50 instances from a uniform distribution. In the prompts, [text], [text_1] and [text_2] are input slots.
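The per-class counts in Table 1 come from the dev-set subsampling protocol of §2.1. A rough sketch of that procedure follows (our reconstruction rather than the authors' code; the list-of-dicts data layout and the equal-width score bins for STS-B are assumptions):

```python
import random
from collections import defaultdict

def sample_per_class(dev_set, per_class=25, seed=42):
    """Draw `per_class` random instances for each label; `dev_set` is
    assumed to be a list of dicts with a "label" key."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in dev_set:
        by_label[example["label"]].append(example)
    subset = []
    for examples in by_label.values():
        subset.extend(rng.sample(examples, per_class))
    return subset

def sample_sts_b(dev_set, total=50, n_bins=5, seed=42):
    """Draw `total` instances roughly uniformly over the 0-5 score range
    by stratifying the dev set into equal-width score bins."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for example in dev_set:
        bins[min(int(example["score"]), n_bins - 1)].append(example)
    subset = []
    for examples in bins.values():
        subset.extend(rng.sample(examples, min(total // n_bins, len(examples))))
    return subset
```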

2.2 Main Results

The full results on the GLUE benchmark are shown in Table 2. Overall, ChatGPT achieves comparable average performance to BERT-base (78.7% vs. 79.2%), but still underperforms the other, more powerful BERT-style models (e.g., RoBERTa-large, 87.8%) by a clear margin. These results show that ChatGPT attains basic understanding ability, but that there is still quite some room for improvement.

Specifically, comparing ChatGPT with BERT-base on individual tasks, we find that: 1) ChatGPT performs poorly on the paraphrase and similarity tasks, i.e., MRPC and STS-B, where the performance drop is up to 24% score; 2) ChatGPT surpasses all BERT-style models on the natural language inference tasks, i.e., MNLI and RTE, indicating its superiority in inference/reasoning; 3) ChatGPT is comparable to BERT-base on the single-sentence classification tasks, i.e., sentiment analysis (SST-2) and linguistic acceptability (CoLA), and the QA-related task, i.e., QNLI.

2.3 Analysis

As seen in Table 2, ChatGPT works well on inference tasks, but falls short in handling paraphrase and similarity tasks. Here, we investigate how ChatGPT behaves on these tasks in more detail.
Method | CoLA Mcc. | SST-2 Acc. | MRPC Acc. | MRPC F1 | STS-B Pear. | STS-B Spea. | QQP Acc. | QQP F1 | MNLI m. | MNLI mm. | QNLI Acc. | RTE Acc. | GLUE avg.
BERT-base | 56.4 | 88.0 | 90.0 | 89.8 | 83.0 | 81.9 | 80.0 | 80.0 | 82.7 | 82.7 | 84.0 | 70.0 | 79.2
BERT-large | 62.4 | 96.0 | 92.0 | 91.7 | 88.3 | 86.8 | 88.0 | 88.5 | 82.7 | 88.0 | 90.0 | 82.0 | 85.4
RoBERTa-base | 61.8 | 96.0 | 90.0 | 90.6 | 90.2 | 89.1 | 84.0 | 84.0 | 84.0 | 88.0 | 92.0 | 78.0 | 84.7
RoBERTa-large | 65.3 | 96.0 | 92.0 | 92.0 | 92.9 | 91.1 | 90.0 | 89.4 | 88.0 | 90.7 | 94.0 | 84.0 | 87.8
ChatGPT | 56.0 | 92.0 | 66.0∗ | 72.1∗ | 80.9 | 72.4∗ | 78.0 | 79.3 | 89.3∗ | 81.3 | 84.0 | 88.0∗ | 78.7

Table 2: Overall comparison between ChatGPT and fine-tuned BERT-style models on the GLUE benchmark. The results in green denote that ChatGPT surpasses the BERT-base model by a clear margin (> 2% (↑) score), while the red results denote that ChatGPT under-performs BERT-base (> 2% (↓) score). More specifically, "∗" means that the performance difference between ChatGPT and BERT-base is larger than 10%.
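For reference, the metrics reported in Table 2 can all be computed with standard libraries; a sketch, assuming parallel lists of gold labels/scores and parsed predictions:

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef
from scipy.stats import pearsonr, spearmanr

def glue_metrics(task, y_true, y_pred):
    """Return the per-task metrics used in Table 2."""
    if task == "CoLA":                       # Matthews correlation (Mcc.)
        return {"mcc": matthews_corrcoef(y_true, y_pred)}
    if task == "STS-B":                      # Pearson/Spearman correlations
        return {"pearson": pearsonr(y_true, y_pred)[0],
                "spearman": spearmanr(y_true, y_pred)[0]}
    metrics = {"acc": accuracy_score(y_true, y_pred)}
    if task in ("MRPC", "QQP"):              # additional F1 for these tasks
        metrics["f1"] = f1_score(y_true, y_pred)
    return metrics
```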

Method | MNLI-m Entailment | MNLI-m Contradiction | MNLI-m Neutral | RTE Entailment | RTE Not_Entailment
BERT-base | 88.0 | 88.0 | 72.0 | 76.0 | 64.0
BERT-large | 76.0 | 92.0 | 80.0 | 80.0 | 84.0
RoBERTa-base | 84.0 | 88.0 | 80.0 | 80.0 | 76.0
RoBERTa-large | 84.0 | 92.0 | 88.0 | 92.0 | 76.0
ChatGPT | 92.0∗ (↑ 4.0) | 96.0∗ (↑ 8.0) | 80.0 (↑ 8.0) | 96.0∗ (↑ 20.0) | 80.0 (↑ 16.0)

Table 3: Per-class accuracy (%) of ChatGPT and BERT-style models on MNLI-m and RTE. The number in parentheses indicates the performance improvement over BERT-base. "∗" denotes that ChatGPT outperforms all BERT-style models.
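The per-class numbers in Table 3 (and in Table 4 below) are plain accuracies restricted to one gold class at a time; a minimal sketch:

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    """Accuracy computed separately for each gold class."""
    correct, total = Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        total[gold] += 1
        correct[gold] += int(gold == pred)
    return {label: correct[label] / total[label] for label in total}
```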

Inference Tasks. To have a closer look at why ChatGPT achieves impressive performance on inference tasks, we report the per-class accuracy of ChatGPT and the compared models on the MNLI and RTE tasks. The results are shown in Table 3. It can be seen that ChatGPT outperforms BERT-base by a large margin in all settings. Especially in the class of "entailment", i.e., where the premise entails the hypothesis, ChatGPT even surpasses all powerful BERT models by a clear margin. These results further demonstrate the effective inference ability of ChatGPT, especially in reasoning over factual input.

Method | Entailment | Not_Entailment
BERT-base | 88.0 | 92.0
BERT-large | 88.0 | 96.0
RoBERTa-base | 96.0 | 84.0
RoBERTa-large | 92.0 | 92.0
ChatGPT | 88.0 (↓ 0) | 44.0 (↓ 47.0)

Table 4: Per-class accuracy (%) of ChatGPT and BERT-style models on MRPC. The number in parentheses indicates the performance drop over BERT-base.

Paraphrase Task. Similar to the above analysis, we also report the per-class accuracy of ChatGPT and the other models on the paraphrase task, i.e., MRPC, in Table 4. Surprisingly, ChatGPT achieves comparable performance to BERT-base on "entailment" samples, but there is a dramatic performance drop (up to 47% score) in the class of "not_entailment", where the sentences in the pair are not semantically equivalent. This indicates that ChatGPT is not sensitive to the semantic difference between a pair of sentences, which might be related to a lack of human feedback on this aspect during model training.

Similarity Task. Since STS-B is a regression task, we choose samples from the uniform similarity distribution, ranging from 0 for no meaning overlap to 5 for meaning equivalence, and show the absolute difference between predictions and ground truths for ChatGPT and BERT-base, respectively. As seen in Figure 2, ChatGPT underperforms BERT-base in most cases, as it generally predicts far from the ground truths. More specifically, we can observe that ChatGPT performs worse when the sentences in the pair have a lower similarity (< 2.5 scores), which is similar to the observation from Table 4. It can also be found that ChatGPT struggles to accurately predict the similarity score for a pair of sentences around the decision boundary (around the 2.5 score). One of the reasons is that ChatGPT is not fine-tuned on the STS-B task and cannot determine a correct decision boundary. We show in Section 3 that ChatGPT can be considerably improved with advanced prompting strategies.
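Our reading of the Figure 2 analysis is that predictions are grouped by the gold similarity score and the absolute error is averaged per group; a sketch under that assumption (the bin edges and width are ours):

```python
import numpy as np

def absolute_error_by_bin(preds, golds, n_bins=5, low=0.0, high=5.0):
    """Average |prediction - gold| per equal-width bin of the gold STS-B score."""
    preds = np.asarray(preds, dtype=float)
    golds = np.asarray(golds, dtype=float)
    edges = np.linspace(low, high, n_bins + 1)
    # Map each gold score to a bin index in 1..n_bins (a score of 5.0 falls in the last bin).
    bins = np.clip(np.digitize(golds, edges), 1, n_bins)
    errors = np.abs(preds - golds)
    return {f"[{edges[i - 1]:.1f}, {edges[i]:.1f}]": float(errors[bins == i].mean())
            for i in range(1, n_bins + 1) if np.any(bins == i)}
```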
Figure 2: The comparison between BERT-base and ChatGPT on STS-B. The x-axis denotes the similarity distribution of STS-B, and the y-axis denotes the absolute difference between prediction and ground truth.

2.4 Case Study

Here, we show some bad cases of ChatGPT to explore its potential limitations, and attempt to explain why ChatGPT falls short in handling the negative samples of the paraphrase task.

Figure 3: Failures of ChatGPT on the inference task. The ground truth for both cases is "not_entailment", but ChatGPT makes "entailment" predictions. (Data: 2022.02.09)

First, while ChatGPT works well on the inference task, it still fails to make the correct predictions in some cases. As seen in Figure 3, ChatGPT can generate fluent responses to both inquiries due to its powerful generation ability. However, we observe that these responses are somewhat contradictory and even unreasonable. For example, in the upper case, ChatGPT says "...Jane was hungry and that this was the reason for giving candy to Joan,...", which is very confusing. If Jane was indeed hungry, Jane would not give candy to Joan, but would eat the candy herself. There is a similar phenomenon in the lower case, where ChatGPT answers with confused logic. In general, ChatGPT is able to generate fluent responses that follow a certain pattern, but appears to have limitations in truly reasoning about the sentences. One piece of evidence is that ChatGPT even fails to answer questions, such as the cases in Figure 3, that are easily answered by humans.

Figure 4: Failures of ChatGPT on the paraphrase task. The ground truth for both cases is "not_entailment", but ChatGPT makes "entailment" predictions. (Data: 2022.02.09)

On the other hand, some example failures of ChatGPT on the paraphrase task are shown in Figure 4. Both cases are in the class of "not_entailment". ChatGPT judges the two sentences to have the same semantics, as both sentences describe a decrease (increase) in a value, which can be viewed as coarse-grained semantic similarity. However, we can easily find that the major difference between the two sentences is the difference in values, which determines the "not_entailment" polarity of these cases. We refer to this value difference as the fine-grained semantic difference. These cases show that such a discrepancy between coarse-grained and fine-grained semantic information might be one of the reasons why ChatGPT struggles with handling negative samples in the paraphrase task. This also indicates that strengthening the ability of ChatGPT to extract fine-grained semantic information would effectively improve its performance on paraphrase tasks.
[Figure 5 shows four panels on the same CoLA test example ("John believes it that Bill is here."): (a) Zero-shot, where ChatGPT answers "yes" directly; (b) Few-shot, where a prompted exemplar leads to the answer "no"; (c) Zero-shot CoT, where the step-by-step trigger elicits a grammatical analysis ending in "no"; and (d) Few-shot CoT, where a worked demonstration leads to step-by-step checks of subject-verb agreement, spelling/punctuation, and word choice, ending in "no".]

Figure 5: Illustrations of ChatGPT equipped with (b) standard few-shot prompting (Brown et al., 2020), (c) zero-shot chain-of-thought (CoT) prompting (Kojima et al., 2022) and (d) manual few-shot CoT prompting (Wei et al., 2022b). This test example is from the dev set of CoLA (Warstadt et al., 2019), while the few-shot examples are from the training set. We can find that, with the help of advanced prompting strategies, ChatGPT shows a better understanding ability.

3 Improving ChatGPT with Advanced Prompting Strategies

As mentioned in Section 2, we mainly focus on the zero-shot learning performance of ChatGPT, and the evaluation results show that there is still a clear margin between ChatGPT and fine-tuned BERT models on some NLU tasks. Inspired by some advanced prompting methods (Brown et al., 2020; Wei et al., 2022b; Kojima et al., 2022) that can effectively exploit the capabilities of LLMs, we attempt to investigate whether these methods can also improve the understanding ability of ChatGPT and narrow its performance gap with powerful BERT models.

3.1 Advanced Prompting Strategies

In this study, we use three popular prompting strategies, as follows (a code sketch of all three is given after the list):

• Standard few-shot prompting: also known as in-context learning (Brown et al., 2020), it simply "prompts" the model with a few input-output exemplars demonstrating the task. Specifically, as shown in Figure 5 (b), it enables ChatGPT to perform a target task by feeding a few prompted examples as part of the input.

• Manual few-shot CoT prompting: chain-of-thought (CoT) prompting, proposed by Wei et al. (2022b), provides manual intermediate reasoning steps (demonstrations) to lead the model to output the final answer step by step. (The human effort in designing these demonstrations for different tasks is nontrivial. In our experience, we can first ask ChatGPT to generate the steps to perform the target task, and manually modify the generated reasoning steps. After obtaining one demonstration, we can encourage ChatGPT to generate similar demonstrations for other input examples.)

• Zero-shot CoT: instead of manually designing the demonstrations, Kojima et al. (2022) propose a zero-shot CoT method, which employs a simple and straightforward template-based prompt for CoT reasoning. Specifically, as shown in Figure 5 (c), we use "Answer (yes or no) the question step by step." to extract step-by-step reasoning.
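As promised above, here is a minimal sketch of how the three prompt styles can be assembled for a CoLA-style yes/no question. The answer-trigger strings are copied from the examples in Figure 5 and Appendix A.2; the function names and exemplar format are ours:

```python
def zero_shot(question: str) -> str:
    # Figure 5 (a): ask for the answer directly.
    return f"Q: {question}\nA: The answer (yes or no) is:"

def zero_shot_cot(question: str) -> str:
    # Figure 5 (c): template-based CoT trigger (Kojima et al., 2022).
    return f"Q: {question}\nA: Answer (yes or no) the question step by step. The answer is:"

def few_shot(exemplars, question: str) -> str:
    # Figure 5 (b)/(d): prepend input-output exemplars; for manual few-shot
    # CoT, each exemplar answer already contains the reasoning steps.
    demos = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{demos}\n\nQ: {question}\nA: The answer (yes or no) is:"
```

For instance, calling few_shot with the two CoLA exemplars from Table 7 reproduces the standard few-shot prompt shown there.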
Method | CoLA Mcc. | SST-2 Acc. | MRPC Acc. | MRPC F1 | STS-B Pear. | STS-B Spea. | QQP Acc. | QQP F1 | MNLI m. | MNLI mm. | QNLI Acc. | RTE Acc. | GLUE avg.
BERT-base | 56.4 | 88.0 | 90.0 | 89.8 | 83.0 | 81.9 | 80.0 | 80.0 | 82.7 | 82.7 | 84.0 | 70.0 | 79.2
RoBERTa-large | 65.3 | 96.0 | 92.0 | 92.0 | 92.9 | 91.1 | 90.0 | 89.4 | 88.0 | 90.7 | 94.0 | 84.0 | 87.8
ChatGPT | 56.0 | 92.0 | 66.0 | 72.1 | 80.9 | 72.4 | 78.0 | 79.3 | 89.3 | 81.3 | 84.0 | 88.0 | 78.7
Standard few-shot prompting (Brown et al., 2020)
-w/ 1-shot | 52.0 | 96.0 | 66.0 | 65.3 | 87.4 | 87.0 | 84.0 | 83.3 | 80.0 | 78.7 | 84.0 | 80.0 | 78.5
-w/ 5-shot | 60.2 | 98.0 | 76.0 | 77.8 | 89.0 | 86.9 | 90.0 | 89.8 | 82.7 | 84.0 | 88.0 | 86.0 | 83.8
Zero-shot CoT (Kojima et al., 2022)
-w/ zero-shot CoT | 64.5 | 96.0 | 78.0 | 76.6 | 87.1 | 87.8 | 80.0 | 80.8 | 86.7 | 89.3 | 86.0 | 90.0 | 83.7
Manual few-shot CoT (Wei et al., 2022b)
-w/ 1-shot CoT | 60.8 | 94.0 | 82.0 | 83.2 | 89.1 | 88.7 | 84.0 | 82.6 | 85.3 | 84.0 | 88.0 | 92.0 | 84.3
-w/ 5-shot CoT | 68.2 | 96.0 | 82.0 | 81.6 | 90.0 | 90.2 | 86.0 | 85.1 | 85.3 | 86.7 | 90.0 | 92.0 | 86.2

Table 5: Results of ChatGPT equipped with advanced prompting strategies. For reference, we also report the results of the baseline BERT-base and the powerful RoBERTa-large. The best results are in bold. We can find that all advanced prompting strategies bring some performance improvements to ChatGPT, among which the manual few-shot CoT is empirically optimal.

To have a closer look, taking the CoLA task as an example, we show illustrations of ChatGPT equipped with these prompting strategies in Figure 5. More input examples for each task can be found in Appendix A.2.

3.2 More Results and Analyses

The overall results of ChatGPT equipped with advanced prompting strategies on the GLUE benchmark are shown in Table 5. For reference, we also compare the improved ChatGPT with the baseline BERT-base and the powerful RoBERTa-large models. Based on these empirical results, we can further find that:

¶ ChatGPT benefits from all these prompting strategies. Compared to the baseline ChatGPT (78.7%), i.e., zero-shot ChatGPT, all these prompting strategies bring some performance improvements. Specifically, standard few-shot prompting and zero-shot CoT improve the overall performance of ChatGPT by +5.1% and +5.0% average score, respectively. More encouragingly, with the help of manual few-shot CoT, ChatGPT achieves up to +7.5% average gains and even outperforms most BERT-style models (except for RoBERTa-large). These results indicate that prompting ChatGPT with manual-CoT could be the Pareto frontier for leveraging its capabilities.

· In the 1-shot scenario, the performance of ChatGPT is relatively sensitive to the given in-context example. Despite the overall performance gains in the few-shot settings, we can find that ChatGPT does not consistently perform better on these NLU tasks, especially in the 1-shot scenario. More specifically, when equipped with standard 1-shot prompting, ChatGPT even performs worse on some tasks, e.g., CoLA, MRPC, MNLI and RTE. We attribute this to the low correlation between the randomly sampled in-context example and the test data, as prior work (Agrawal et al., 2022) shows that a noisy, unrelated 1-shot example can have a catastrophic impact on output quality. (This might also be the reason why 5-shot prompting generally works better, as concatenating multiple random examples could reduce the effect of noise.) To further verify this conjecture, we perform standard 1-shot prompting with different 1-shot examples. Taking the CoLA task as an example, the comparative results are shown in Figure 6. As seen, the 1-shot performance is unstable, and when given a more related 1-shot example, ChatGPT can achieve larger performance gains, confirming our statement.

Figure 6: Analysis of the unstable 1-shot prompting performance on the CoLA task. The x-axis denotes 5 randomly sampled examples. The left y-axis is the performance of ChatGPT, while the right y-axis is the average textual similarity, measured by Sentence-BERT (Reimers and Gurevych, 2019), between the given example and the test data.
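The similarity on Figure 6's right y-axis can be computed with the sentence-transformers package; a minimal sketch (the checkpoint name is our assumption, and the sentences are simply CoLA examples quoted in this report):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed Sentence-BERT checkpoint

# A 1-shot exemplar and some test sentences from the CoLA examples above.
exemplar = "The wagon rumbled down the road."
test_sentences = [
    "Out of the box jumped a little white rabbit.",
    "John believes it that Bill is here.",
]

emb_exemplar = model.encode(exemplar, convert_to_tensor=True)
emb_tests = model.encode(test_sentences, convert_to_tensor=True)

# Average cosine similarity between the exemplar and the test data,
# i.e., the quantity on the right y-axis of Figure 6.
mean_similarity = util.cos_sim(emb_exemplar, emb_tests).mean().item()
print(f"average exemplar-test similarity: {mean_similarity:.3f}")
```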
¸ There is still a performance gap between ChatGPT and fine-tuned RoBERTa-large. With the help of manual-CoT, ChatGPT achieves impressive performance improvements and shows state-of-the-art (SOTA) performance among all comparison models on some tasks, e.g., CoLA, SST-2 and RTE. However, compared with the fine-tuned RoBERTa-large, ChatGPT still underperforms on some tasks, especially the paraphrase task (MRPC), by a clear margin. These results continue to indicate that, although ChatGPT can solve many NLP problems quite well, it still fails to beat the current SOTA models, especially on some NLU tasks.

Method | Few-data | Full-data
ChatGPT | 88.0 | 83.8
Standard few-shot prompting
-w/ 1-shot | 80.0 | 83.4
-w/ 5-shot | 86.0 | 84.4
Zero-shot CoT
-w/ zero-shot CoT | 90.0 | 85.9
Manual few-shot CoT
-w/ 1-shot CoT | 92.0 | 87.0
-w/ 5-shot CoT | 92.0 | 89.9

Table 6: Results of ChatGPT evaluated on the few-data (the setting used in our main experiment) and full-data settings of the RTE task. We can find that there are similar findings in both scenarios.

+ Note. Some readers may be concerned that our work could be a kind of "lottery ticket", as we only evaluate ChatGPT on a part of the validation set for each task. To dispel such doubt, we investigate whether there are similar findings in the full-data setting. Specifically, taking the RTE task as an example, we report the corresponding results of ChatGPT under the few-data and full-data settings, respectively, as shown in Table 6. It can be found that ChatGPT shows similar characteristics (e.g., significantly benefiting from manual-CoT) in both scenarios, indicating the credibility of our work.

4 Related Works

In recent years, we have witnessed numerous Transformer-based pretrained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020; Raffel et al., 2020; Lewis et al., 2020; Zhong et al., 2022a, 2023) that have achieved tremendous success on various natural language processing (NLP) tasks. Based on their model architectures, these PLMs can be classified into three groups: 1) encoder-only PLMs (e.g., BERT (Devlin et al., 2019); we refer to these encoder-only models as BERT-style models, and the decoder-only models as GPT-style models), 2) decoder-only PLMs (e.g., GPT-3 (Brown et al., 2020)) and 3) encoder-decoder PLMs (e.g., T5 (Raffel et al., 2020)). Due to their different pretraining objectives, these PLMs exhibit different abilities when performing NLP tasks. Specifically, the BERT-style models are based on a bidirectional masked language modeling (MLM) objective, which enforces the models to encode context information. Through fine-tuning on a specific task, these BERT-style models can work well on a variety of natural language understanding (NLU) tasks. On the contrary, the GPT-style models aim to predict future words given a sequence of words. Such auto-regressive models are well-suited to language generation, but they are unidirectional and usually fall short in representation learning for understanding a sentence (Liu et al., 2021; Zhong et al., 2022a).

More recently, much work has focused on scaling up PLMs and developing large language models (LLMs) (Ouyang et al., 2022; Chowdhery et al., 2022; Smith et al., 2022; Zhang et al., 2022). Wei et al. (2022a) show that LLMs exhibit emergent abilities, e.g., few-shot and zero-shot learning, when the model sizes are large enough. As a typical LLM, the recently-released ChatGPT has attracted great attention, due to its impressive ability to generate fluent and high-quality responses. There is growing interest in exploring the capabilities, applications, ethics, and failures of ChatGPT (Jiao et al., 2023; Bang et al., 2023; Qin et al., 2023; Zhuo et al., 2023; Wang et al., 2023). Along this research line, we mainly focus on analyzing the understanding ability of ChatGPT in this report, which is important but has been given little attention.
5 Conclusion

In this study, we empirically investigate the language understanding ability of ChatGPT on a diversity of natural language understanding tasks. Through a series of quantitative studies, we find that ChatGPT works well on inference tasks, but falls short in handling paraphrase and similarity tasks, especially for the negative instances. Furthermore, we attempt to improve the understanding ability of ChatGPT with some advanced prompting strategies. The results show that, with the help of these prompting strategies, ChatGPT can achieve significant performance improvements, and even outperforms the powerful RoBERTa-large on some tasks. Overall, ChatGPT attains an understanding ability comparable to that of some fine-tuned BERT-style models, but still fails to beat the current best models on some NLU tasks. We hope our study can facilitate more research on how to address these limitations and improve the understanding performance of ChatGPT.

Limitations

Our work has several potential limitations. First, due to the limits of testing ChatGPT, we mainly evaluate ChatGPT on a part of the validation set for each task. It would be more convincing if we could test on more samples. Second, this report only uses the GLUE benchmark for experiments, in which the task types are somewhat limited. In future work, we would like to evaluate ChatGPT on more NLU tasks and conduct more in-depth analyses and discussions.

References

Sweta Agrawal, Chunting Zhou, Mike Lewis, Luke Zettlemoyer, and Marjan Ghazvininejad. 2022. In-context examples selection for machine translation. arXiv preprint.

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. NeurIPS.

Daniel Cer, Mona Diab, Eneko E Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In SemEval.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Bill Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In IWP.

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and William B Dolan. 2007. The third pascal recognizing textual entailment challenge. In ACL-PASCAL.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. Deberta: Decoding-enhanced bert with disentangled attention. In ICLR.

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is chatgpt a good translator? a preliminary study. arXiv preprint.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In NeurIPS.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, et al. 2022. Training language models to follow instructions with human feedback. In NeurIPS.

Barun Patra, Saksham Singhal, Shaohan Huang, Zewen Chi, Li Dong, Furu Wei, Vishrav Chaudhary, and Xia Song. 2022. Beyond english-centric bitexts for better multilingual language representation learning. arXiv preprint.
Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. Is chatgpt a general-purpose natural language processing task solver? arXiv preprint.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In EMNLP.

Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In EMNLP.

Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. 2022. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019. Glue: A multi-task benchmark and analysis platform for natural language understanding. In ICLR.

Jiaan Wang, Yunlong Liang, Fandong Meng, Zhixu Li, Jianfeng Qu, and Jie Zhou. 2023. Cross-lingual summarization via chatgpt. arXiv preprint.

Wei Wang, Bin Bi, Ming Yan, Chen Wu, Jiangnan Xia, Zuyi Bao, Liwei Peng, and Luo Si. 2020. Structbert: Incorporating language structures into pre-training for deep language understanding. In ICLR.

Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2019. Neural network acceptability judgments. TACL.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022a. Emergent abilities of large language models. TMLR.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint.

Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. 2022a. E2s2: Encoding-enhanced sequence-to-sequence pretraining for language understanding and generation. arXiv preprint.

Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. 2022b. Panda: Prompt transfer meets knowledge distillation for efficient model adaptation. arXiv preprint.

Qihuang Zhong, Liang Ding, Keqin Peng, Juhua Liu, Bo Du, Yibing Zhan, and Dacheng Tao. 2023. Bag of tricks for effective language model pretraining and downstream adaptation: A case study on glue. arXiv preprint.

Qihuang Zhong, Liang Ding, Li Shen, Peng Mi, Juhua Liu, Bo Du, and Dacheng Tao. 2022c. Improving sharpness-aware minimization with fisher mask for better generalization on language models. In Findings of EMNLP.

Qihuang Zhong, Liang Ding, Yibing Zhan, Y. Qiao, Yonggang Wen, Li Shen, Juhua Liu, Baosheng Yu, Bo Du, Yixin Chen, Xinbo Gao, Chun Miao, Xiaoou Tang, and Dacheng Tao. 2022d. Toward efficient language model pretraining and downstream adaptation via self-evolution: A case study on superglue. arXiv preprint.

Terry Yue Zhuo, Yujin Huang, Chunyang Chen, and Zhenchang Xing. 2023. Exploring ai ethics of chatgpt: A diagnostic analysis. arXiv preprint.

A Appendix

A.1 Details of Tasks

In this work, we conduct extensive experiments on the GLUE (Wang et al., 2019) benchmark. Here, we introduce the detailed descriptions of all downstream tasks and datasets as follows:

CoLA The Corpus of Linguistic Acceptability (Warstadt et al., 2019) is a binary single-sentence classification task to determine whether a given sentence is linguistically "acceptable".

SST-2 The Stanford Sentiment Treebank (Socher et al., 2013) is a binary classification task to predict the sentiment of a given sentence.

MRPC The Microsoft Research Paraphrase Corpus (Dolan and Brockett, 2005) is a task to predict whether two sentences are semantically equivalent.

STS-B Semantic Textual Similarity (Cer et al., 2017) is a task to predict how similar two sentences are on a 1-5 scale in terms of semantic meaning.
QQP The Quora Question Pairs dataset is a collection of question pairs from the community question-answering website Quora. The task is to determine whether a pair of questions are semantically equivalent.

MNLI The Multi-Genre Natural Language Inference Corpus (Williams et al., 2018) is a task to predict whether the premise entails the hypothesis, contradicts the hypothesis, or neither, given a premise sentence and a hypothesis sentence.

QNLI Question Natural Language Inference is a binary classification task constructed from SQuAD (Rajpurkar et al., 2016), which aims to predict whether a context sentence contains the answer to a question sentence.

RTE Recognizing Textual Entailment (Giampiccolo et al., 2007), given a premise and a hypothesis, is a task to predict whether the premise entails the hypothesis.

A.2 Input Examples

Here, we present input examples of standard few-shot prompting, zero-shot CoT prompting and manual few-shot CoT prompting used with ChatGPT. Tables 7 to 14 show the detailed examples for each task of the GLUE benchmark.
Table 7: Examples of standard few-shot prompting, zero-shot CoT prompting and manual few-shot CoT prompting
produced by ChatGPT on the CoLA task.

¶ Standard few-shot prompting


Q: For the sentence: "The wagon rumbled down the road.", is the sentence grammarly correct?
A: The answer (yes or no) is: yes.
Q: For the sentence: "Surprised me that you came early.", is the sentence grammarly correct?
A: The answer (yes or no) is: no.
Q: For the sentence: "Out of the box jumped a little white rabbit.", is the sentence grammarly correct?
A: The answer (yes or no) is: ____
· Zero-shot CoT prompting
Q: For the sentence: "Out of the box jumped a little white rabbit.", is the sentence grammarly correct?
A: Answer (yes or no) the question step by step. The answer is: ____
¸ Manual few-shot CoT prompting
Q: For the sentence: "The wagon rumbled down the road.", is the sentence grammarly correct?
A: First, check the subject-verb agreement. The subject is “wagon”, and the verb is “rumbled”. They
agree in tense and number. Second, check for spelling and punctuation errors. There are no spelling
errors and the sentence is punctuated correctly. Third, check for word choice and clarity. The word
choice is appropriate and clear. Therefore, we can conclude that the sentence is grammatically correct.
The answer (yes or no) is: yes.
Q: For the sentence: "Surprised me that you came early.", is the sentence grammarly correct?
A: First, check the subject-verb agreement. The subject is “you”, and there is no verb in the sentence.
This makes the sentence incomplete and ungrammatical. Second, check for spelling and punctuation
errors. There are no spelling errors, but the sentence is missing a comma after “Surprised me”. Third,
check for word choice and clarity. The word choice is awkward and unclear. A better phrasing
would be “I was surprised that you came early.” Therefore, we can conclude that the sentence is not
grammatically correct. The answer (yes or no) is: no.
Q: For the sentence: "Out of the box jumped a little white rabbit.", is the sentence grammarly correct?
A: The answer (yes or no) is: ____
Table 8: Examples of standard few-shot prompting, zero-shot CoT prompting and manual few-shot CoT prompting
produced by ChatGPT on the SST-2 task.

¶ Standard few-shot prompting


Q: For the sentence: "’ll keep you wide awake and ... very tense ", is the sentiment in this sentence
positive or negative?
A: The answer (positive or negative) is: positive.
Q: For the sentence: "is to see two academy award winning actresses ( and one academy award
winning actor ) succumb to appearing in this junk that ’s tv sitcom material at best . ", is the sentiment
in this sentence positive or negative?
A: The answer (positive or negative) is: negative.
Q: For the sentence: "entertains by providing good , lively company . ", is the sentiment in this
sentence positive or negative?
A: The answer (positive or negative) is: ____
· Zero-shot CoT prompting
Q: For the sentence: "entertains by providing good , lively company . ", is the sentiment in this
sentence positive or negative?
A: Answer (positive or negative) the question step by step. The answer is: ____
¸ Manual few-shot CoT prompting
Q: For the sentence: ", while it may not rival the filmmaker ’s period pieces , is still very much worth
seeing ", is the sentiment in this sentence positive or negative?
A: First, identify the main clause and its subject and verb. Main clause: "is still very much worth
seeing". Subject: "it". Verb: "is". Second, determine the tone of the main clause. The tone of the
main clause is positive because it states that "it is still very much worth seeing.". Third, determine
the tone of the entire sentence. The sentence acknowledges that the film may not be as good as
the filmmaker’s other works, but still recommends it as being worth seeing. The tone of the entire
sentence is somewhat positive. Therefore, we can conclude that the sentiment of this sentence is
positive. The answer (positive or negative) is: positive.
Q: For the sentence: "anyone could desire to the cheap , graceless , hackneyed sci-fi serials of the
’30s and ’40s ", is the sentiment in this sentence positive or negative?
A: First, identify the main clause and its subject and verb. Main clause: "anyone could desire".
Subject: "anyone". Verb: "could desire". Second, determine the tone of the main clause. The tone of
the main clause is negative because it states that "anyone could desire." Third, determine the tone of
the entire sentence. The sentence is negative overall because it implies that it would be undesirable
for anyone to desire the described sci-fi serials. Therefore, we can conclude that the sentiment of this
sentence is negative. The answer (positive or negative) is: negative.
Q: For the sentence: "entertains by providing good , lively company . ", is the sentiment in this
sentence positive or negative?
A: The answer (positive or negative) is: ____
Table 9: Examples of standard few-shot prompting, zero-shot CoT prompting and manual few-shot CoT prompting
produced by ChatGPT on the MRPC task.

¶ Standard few-shot prompting


Q: For the sentence pair "He found that men who had ejaculated more than five times a week in
their 20s were a third less likely to develop aggressive prostate cancer later in life ." and "Those who
ejaculated more than five times a week were a third less likely to develop serious prostate cancer in
later life .", do these two sentences have the same semantics?
A: The answer (yes or no) is: yes.
Q: For the sentence pair "Analysts say Davis , who faces a historic recall election in October , could
get a boost in the polls with a budget plan in place ." and "Analysts say Davis , a Democrat , could get
a boost in the polls if the 29-day-old budget crisis is resolved without further delay .", do these two
sentences have the same semantics?
A: The answer (yes or no) is: no.
Q: For the sentence pair "Terri Schiavo , 39 , underwent the procedure at the Tampa Bay area hospice
where she has been living for several years , said her father , Bob Schindler ." and "The tube was
removed Wednesday from Terri Schiavo , 39 , at the Tampa Bay-area hospice where she has lived for
several years .", do these two sentences have the same semantics?
A: The answer (yes or no) is: ____
· Zero-shot CoT prompting
Q: For the sentence pair "Terri Schiavo , 39 , underwent the procedure at the Tampa Bay area hospice
where she has been living for several years , said her father , Bob Schindler ." and "The tube was
removed Wednesday from Terri Schiavo , 39 , at the Tampa Bay-area hospice where she has lived for
several years .", do these two sentences have the same semantics?
A: Answer (yes or no) the question step by step. The answer is: ____
¸ Manual few-shot CoT prompting
Q: For the sentence pair "He found that men who had ejaculated more than five times a week in
their 20s were a third less likely to develop aggressive prostate cancer later in life ." and "Those who
ejaculated more than five times a week were a third less likely to develop serious prostate cancer in
later life .", do these two sentences have the same semantics?
A: First, identify the key differences between the two sentences. Second, consider the impact of the
difference in wording. Third, consider the overall meaning of the two sentences. Therefore, given that
the two sentences convey the same general idea, despite the difference in wording, we can conclude
that they have the same semantics. The answer (yes or no) is: yes.
Q: For the sentence pair "Sen. Bob Graham , Florida Democrat , raised $ 2 million after getting a late
start ." and "Further back , Sen. Bob Graham of Florida reported about $ 1.7 million on hand .", do
these two sentences have the same semantics?
A: First, identify the key differences between the two sentences. Second, consider the impact of
the difference in wording. Third, consider the overall meaning of the two sentences. While both
sentences are about Sen. Graham’s financial situation, they are focused on different aspects of it and
do not convey the same information. Therefore, we can conclude that the two sentences do not have
the same semantics. The answer (yes or no) is: no.
Q: For the sentence pair "Terri Schiavo , 39 , underwent the procedure at the Tampa Bay area hospice
where she has been living for several years , said her father , Bob Schindler ." and "The tube was
removed Wednesday from Terri Schiavo , 39 , at the Tampa Bay-area hospice where she has lived for
several years .", do these two sentences have the same semantics?
A: The answer (yes or no) is: ____
Table 10: Examples of standard few-shot prompting, zero-shot CoT prompting and manual few-shot CoT prompt-
ing produced by ChatGPT on the STS-B task.

¶ Standard few-shot prompting


Q: Determine the similarity between the following two sentences: "’Scores of bodies’ found in Syria"
and "Eight more bodies found on ship". The score should be ranging from 0.0 to 5.0, and can be a
decimal.
A: The answer (decimals ranging from 0 to 5) is: 1.4
Q: Determine the similarity between the following two sentences: "The man cut some carpet with a
knife." and "A man is cutting carpet with a knife.". The score should be ranging from 0.0 to 5.0, and
can be a decimal.
A: The answer (decimals ranging from 0 to 5) is: 4.750
Q: Determine the similarity between the following two sentences: "At least 38 Morsi supporters die
in clashes" and "Dozens of Morsi supporters killed in Egypt clashes". The score should be ranging
from 0.0 to 5.0, and can be a decimal.
A: The answer (decimals ranging from 0 to 5) is: ____
· Zero-shot CoT prompting
Q: Determine the similarity between the following two sentences: "At least 38 Morsi supporters die
in clashes" and "Dozens of Morsi supporters killed in Egypt clashes". The score should be ranging
from 0.0 to 5.0, and can be a decimal.
A: Answer (ranging from 0 to 5) the question step by step. The answer is: ____
¸ Manual few-shot CoT prompting
Q: Determine the similarity between the following two sentences: "Ernst & Young has denied any
wrongdoing and plans to fight the allegations." and "Ernst & Young has denied the SEC’s claims, and
called its recommendations "irresponsible".". The score should be ranging from 0.0 to 5.0, and can be
a decimal.
A: First, preprocessing the two sentences to make them more suitable for comparison. We can do
this using the NLTK toolkit. Second, choosing a similarity metric. Here, we use the popular cosine
similarity metric. Third, computing the similarity score between two preprocessed sentence. After
obtaining the cosine similarity score, we simply multiply it by 5 to convert the similarity score to a
0.0 to 5.0 scale. Based on these, the answer (ranging from 0 to 5) is: 3.0
Q: Determine the similarity between the following two sentences: "I’m not outraged, I’m disap-
pointed." and "God bless I’m not outraged, I’m disappointed.". The score should be ranging from 0.0
to 5.0, and can be a decimal.
A: First, preprocessing the two sentences to make them more suitable for comparison. We can do
this using the NLTK toolkit. Second, choosing a similarity metric. Here, we use the popular cosine
similarity metric. Third, computing the similarity score between two preprocessed sentences. After
obtaining the cosine similarity score, we simply multiply it by 5 to convert the similarity score to a 0.0
to 5.0 scale. When we preprocess the two sentences and compute their cosine similarity score, we get
a score of 1.0. Multiplying this score by 5 gives us a similarity score of 5. Therefore, the similarity
score between the two sentences is 5 out of 5.0.
Q: Determine the similarity between the following two sentences: "At least 38 Morsi supporters die
in clashes" and "Dozens of Morsi supporters killed in Egypt clashes". The score should be ranging
from 0.0 to 5.0, and can be a decimal.
A: The answer (decimals ranging from 0 to 5) is: ____
Table 11: Examples of standard few-shot prompting, zero-shot CoT prompting and manual few-shot CoT prompt-
ing produced by ChatGPT on the QQP task.

¶ Standard few-shot prompting


Q: For the sentence pair "Who is winning the presidential election, Trump or Clinton?" and "Who’s
winning the election, Trump or Clinton?", do these two sentences have the same semantics?
A: The answer (yes or no) is: yes.
Q: For the sentence pair "How should ask my crush for her number?" and "What should I ask my
crush?", do these two sentences have the same semantics?
A: The answer (yes or no) is: no.
Q: For the sentence pair "What are your favorite photos of yourself?" and "Which is your favourite
picture of yourself?", do these two sentences have the same semantics?
A: The answer (yes or no) is: ____
· Zero-shot CoT prompting
Q: For the sentence pair "What are your favorite photos of yourself?" and "Which is your favourite
picture of yourself?", do these two sentences have the same semantics?
A: Answer (yes or no) the question step by step. The answer is: ____
¸ Manual few-shot CoT prompting
Q: For the sentence pair "Who is winning the presidential election, Trump or Clinton?" and "Who’s
winning the election, Trump or Clinton?", do these two sentences have the same semantics?
A: First, identify the key differences between the two sentences. Second, consider the impact of the
difference in wording. Third, consider the overall meaning of the two sentences. Both sentences ask
the same question about the current status of the election and the relative positions of the candidates.
Therefore, given that the two sentences ask the same question and convey the same general meaning,
despite the difference in wording and context, we can conclude that they have the same semantics.
The answer (yes or no) is: yes.
Q: For the sentence pair "How do I know if I really want to become a doctor?" and "How do I know
if I want to be a doctor?", do these two sentences have the same semantics?
A: First, identify the key differences between the two sentences. Second, consider the impact of the
difference in wording. Third, consider the overall meaning of the two sentences. Both sentences ask
how one can determine whether they want to become a doctor. However, the inclusion of "really" in
the first sentence may imply a deeper level of introspection or a more significant decision. Therefore,
given the difference in emphasis and potential implications, we can conclude that these two sentences
do not have the same semantics. The answer (yes or no) is: no.
Q: For the sentence pair "What are your favorite photos of yourself?" and "Which is your favourite
picture of yourself?", do these two sentences have the same semantics?
A: The answer (yes or no) is: ____
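All three variants above instantiate the same Q/A scaffold, so the prompts can be assembled mechanically from a list of demonstrations plus the test pair. The following sketch shows one way to build the standard few-shot prompt; the function and constant names are ours, not taken from the paper's released code.

    # Hypothetical builder for the standard few-shot QQP prompt shown above.
    QQP_QUESTION = ('Q: For the sentence pair "{a}" and "{b}", do these two '
                    'sentences have the same semantics?')

    def build_few_shot_prompt(demos, test_pair):
        # demos: list of ((sent_a, sent_b), "yes"/"no") demonstration pairs.
        lines = []
        for (a, b), label in demos:
            lines.append(QQP_QUESTION.format(a=a, b=b))
            lines.append("A: The answer (yes or no) is: {}.".format(label))
        a, b = test_pair
        lines.append(QQP_QUESTION.format(a=a, b=b))
        lines.append("A: The answer (yes or no) is:")  # completed by the model
        return "\n".join(lines)

Passing the two demonstration pairs from the table plus the test pair reproduces the ❶ prompt above.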
Table 12: Examples of standard few-shot prompting, zero-shot CoT prompting, and manual few-shot CoT prompting produced by ChatGPT on the MNLI task.
❶ Standard few-shot prompting
Q: Given the sentence "well how do you think women’s roles have changed" determine if the following
statement is entailed or contradicted or neutral: "How have the roles of women changed?"
A: The answer (entailed or contradicted or neutral) is: entailed.
Q: Given the sentence "Should further revelations warrant impeachment, Congress may have lost
the necessary credibility." determine if the following statement is entailed or contradicted or neutral:
"Further news will never do anything to stifle the notion of Congress credibility."
A: The answer (entailed or contradicted or neutral) is: contradicted.
Q: Given the sentence "Others watched them with cold eyes and expressionless faces." determine if
the following statement is entailed or contradicted or neutral: "Some people who were not emotive
were watching."
A: The answer (entailed or contradicted or neutral) is: ____
❷ Zero-shot CoT prompting
Q: Given the sentence "Others watched them with cold eyes and expressionless faces." determine if
the following statement is entailed or contradicted or neutral: "Some people who were not emotive
were watching."
A: Answer (entailed or contradicted or neutral) the question step by step. The answer is: ____
❸ Manual few-shot CoT prompting
Q: Given the sentence "well how do you think women’s roles have changed" determine if the following
statement is entailed or contradicted or neutral: "How have the roles of women changed?"
A: First, identify the main proposition or claim of the original sentence. Second, identify the question
being asked in the statement. Third, compare the main proposition or claim of the original sentence
with the question being asked in the statement. The main proposition or claim of the original sentence
and the question being asked in the statement are similar in meaning and are asking about the same
topic. Lastly, determine the relationship between the original sentence and the statement. The
statement is asking a question that is similar in meaning to the main proposition or claim of the
original sentence. Therefore, we can conclude that the statement is entailed by the original sentence.
The answer (entailed or contradicted or neutral) is: entailed.
Q: Given the sentence "Should further revelations warrant impeachment, Congress may have lost
the necessary credibility." determine if the following statement is entailed or contradicted or neutral:
"Further news will never do anything to stifle the notion of Congress credibility."
A: First, identify the main proposition or claim of the original sentence. Second, identify the question
being asked in the statement. Third, compare the main proposition or claim of the original sentence
with the question being asked in the statement. The main proposition of the original sentence and the
claim in the statement are contradictory because the original sentence implies that further revelations
could harm Congress’s credibility, while the statement suggests that they could not. Lastly, determine
the relationship between the original sentence and the statement. The statement contradicts the
main proposition or claim of the original sentence. Therefore, we can conclude that the statement is
contradicted by the original sentence. The answer (entailed or contradicted or neutral) is: contradicted.
Q: Given the sentence "Others watched them with cold eyes and expressionless faces." determine if
the following statement is entailed or contradicted or neutral: "Some people who were not emotive
were watching."
A: The answer (entailed or contradicted or neutral) is: ____
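Because the CoT variants let the model reason freely before committing to a label, the MNLI prediction must be recovered from unconstrained text. One simple heuristic, given here purely as an illustration of the post-processing such prompts require, is to take the last label word after the answer marker:

    import re

    MNLI_LABELS = ("entailed", "contradicted", "neutral")

    def parse_mnli_label(response):
        # Prefer whatever follows the final "is:" answer marker, if present.
        tail = response.rsplit("is:", 1)[-1].lower()
        hits = [lab for lab in MNLI_LABELS if re.search(r"\b%s\b" % lab, tail)]
        if len(hits) == 1:
            return hits[0]
        # Otherwise fall back to the last label word anywhere in the response.
        matches = re.findall(r"\b(entailed|contradicted|neutral)\b",
                             response.lower())
        return matches[-1] if matches else None  # None = unparseable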
Table 13: Examples of standard few-shot prompting, zero-shot CoT prompting, and manual few-shot CoT prompting produced by ChatGPT on the QNLI task.
❶ Standard few-shot prompting
Q: Given the question "The "To Secure These Rights" findings wanted desegregation of which
institution?" determine if the following sentence contains the corresponding answer: "These Rights
also called for desegregation of the Armed Forces."
A: The answer (yes or no) is: yes.
Q: Given the question "Where is the Bay of Pasaia located?" determine if the following sentence
contains the corresponding answer: "A series of expeditions to St. John’s by Portuguese from the
Azores took place in the early 16th century, and by 1540 French, Spanish and Portuguese ships
crossed the Atlantic annually to fish the waters off the Avalon Peninsula."
A: The answer (yes or no) is: no.
Q: Given the question "Where is English spoken the most?" determine if the following sentence con-
tains the corresponding answer: "English is widely spoken in commerce, schooling and government."
A: The answer (yes or no) is: ____
❷ Zero-shot CoT prompting
Q: Given the question "Where is English spoken the most?" determine if the following sentence con-
tains the corresponding answer: "English is widely spoken in commerce, schooling and government."
A: Answer (yes or no) the question step by step. The answer is: ____
❸ Manual few-shot CoT prompting
Q: Given the question "How large was the fleet that attacked the Bithynian coast along with the
Pecheneg allies?" determine if the following sentence contains the corresponding answer: "A navy of
10,000 vessels, including Pecheneg allies, landed on the Bithynian coast and devastated the Asiatic
shore of the Bosphorus."
A: First, identify the key information in the question: "How large was the fleet that attacked the
Bithynian coast along with the Pecheneg allies?" Second, look for the corresponding information in
the given sentence: "A navy of 10,000 vessels, including Pecheneg allies, landed on the Bithynian
coast and devastated the Asiatic shore of the Bosphorus." Third, compare the key information in the
question to the corresponding information in the sentence to determine if the sentence contains the
answer. The fleet that attacked the Bithynian coast along with the Pecheneg allies was a navy of
10,000 vessels, according to the sentence. Therefore, we can conclude that the sentence contains the
answer. The answer (yes or no) is: yes.
Q: Given the question "Heading east from Southampton, what city is connected by rail?" determine if
the following sentence contains the corresponding answer: "The route to London was opened in 1840
by what was to become the London and South Western Railway Company."
A: First, identify the key information in the question: "Heading east from Southampton, what city
is connected by rail?" Second, look for the corresponding information in the given sentence: "The
route to London was opened in 1840 by what was to become the London and South Western Railway
Company." Third, compare the key information in the question to the corresponding information in
the sentence to determine if the sentence contains the answer. The sentence mentions that the route to
London was opened by the London and South Western Railway Company, but it does not mention if
there are any other cities connected by rail from Southampton. Therefore, we can conclude that the
sentence does not contain the corresponding answer to the question. The answer (yes or no) is: no.
Q: Given the question "Where is English spoken the most?" determine if the following sentence con-
tains the corresponding answer: "English is widely spoken in commerce, schooling and government."
A: The answer (yes or no) is: ____
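Unlike the few-shot variants, zero-shot CoT needs no demonstrations: the same single template is instantiated for every test instance. Below is a minimal sketch for QNLI, with the template wording mirroring the ❷ example above; the constant name is ours.

    # Zero-shot CoT template for QNLI; the wording mirrors the example above.
    ZERO_SHOT_COT_QNLI = (
        'Q: Given the question "{question}" determine if the following '
        'sentence contains the corresponding answer: "{sentence}"\n'
        "A: Answer (yes or no) the question step by step. The answer is:"
    )

    prompt = ZERO_SHOT_COT_QNLI.format(
        question="Where is English spoken the most?",
        sentence="English is widely spoken in commerce, schooling and "
                 "government.",
    )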
Table 14: Examples of standard few-shot prompting, zero-shot CoT prompting, and manual few-shot CoT prompting produced by ChatGPT on the RTE task.
❶ Standard few-shot prompting
Q: Given the sentence "Harvey Weinstein, the co-chairman of Miramax, who was instrumental in
popularizing both independent and foreign films with broad audiences, agrees." determine if the
following statement is entailed: "Harvey Weinstein is the co-chairman of Miramax."
A: The answer (yes or no) is: yes.
Q: Given the sentence "INS predicts that the smuggling will continue to increase and that alien
smuggling organizations will become more sophisticated, organized, and complex." determine if the
following statement is entailed: "Steps are being taken to stop the smuggling of aliens."
A: The answer (yes or no) is: no.
Q: Given the sentence "Hepburn’s family will receive the proceeds from the sale." determine if the
following statement is entailed: "Proceeds go to Hepburn’s family."
A: The answer (yes or no) is: ____
❷ Zero-shot CoT prompting
Q: Given the sentence "Hepburn’s family will receive the proceeds from the sale." determine if the
following statement is entailed: "Proceeds go to Hepburn’s family."
A: Answer (yes or no) the question step by step. The answer is: ____
❸ Manual few-shot CoT prompting
Q: Given the sentence "Anna Politkovskaya was found shot dead on Saturday in a lift at her block
of flats in the Russian capital, Moscow." determine if the following statement is entailed: "Anna
Politkovskaya was murdered."
A: First, is there any mention of Anna Politkovskaya’s death in the sentence? Yes, the sentence
mentions that Anna Politkovskaya was found shot dead. Second: does the sentence suggest how Anna
Politkovskaya died? Yes, the sentence suggests that Anna Politkovskaya died from being shot. Third,
does the sentence use any words that indicate that Anna Politkovskaya’s death was intentional or
deliberate? Yes, the sentence uses the word "shot" which suggests that Anna Politkovskaya’s death
was intentional. Lastly, based on the information provided in the sentence, can we conclude that Anna
Politkovskaya was murdered? Yes, we can conclude that Anna Politkovskaya was murdered because
she was found shot dead, which suggests that someone intentionally and unlawfully caused her death.
Therefore, the answer (yes or no) is: yes.
Q: Given the sentence "Oscar-winning director Franco Zeffirelli has been awarded an honorary
knighthood for his "valuable services to British performing arts"." determine if the following statement
is entailed: "Italian director is awarded an honorary Oscar."
A: First, is there any mention of an Oscar in the sentence? No, there is no mention of an Oscar in the
sentence. Second, does the sentence suggest that Franco Zeffirelli received any award related to film
or cinema? No, the sentence mentions that Franco Zeffirelli was awarded an honorary knighthood
for his services to the British performing arts, but there is no indication that he received an honorary
Oscar. Therefore, based on the information provided in the sentence, we cannot conclude that Franco
Zeffirelli was awarded an honorary Oscar. Therefore, the answer (yes or no) is no.
Q: Given the sentence "Hepburn’s family will receive the proceeds from the sale." determine if the
following statement is entailed: "Proceeds go to Hepburn’s family."
A: The answer (yes or no) is: ____
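Once a binary answer has been parsed from each response, scoring RTE (and likewise QNLI) reduces to plain accuracy against the gold labels. A minimal sketch, where parse_yes_no is a hypothetical helper analogous to the MNLI parser above:

    def parse_yes_no(response):
        # Hypothetical parser: inspect the word after the final answer marker.
        tail = response.rsplit("is:", 1)[-1].strip().lower()
        words = tail.split()
        first = words[0].strip('.",:') if words else ""
        if first == "yes":
            return "yes"
        if first == "no":
            return "no"
        return None  # unparseable responses count as errors

    def accuracy(responses, gold_labels):
        preds = [parse_yes_no(r) for r in responses]
        return sum(p == g for p, g in zip(preds, gold_labels)) / len(gold_labels)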