Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT
Abstract

Prior studies have shown that ChatGPT attains remarkable generation ability compared with existing models. However, the quantitative analysis of ChatGPT's understanding ability has been given little attention. In this report, we explore the understanding ability of ChatGPT by evaluating it on the most popular GLUE benchmark, and comparing it with 4 representative fine-tuned BERT-style models. We find that: 1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT achieves comparable performance to BERT on sentiment analysis and question-answering tasks. Additionally, by combining some advanced prompting strategies, we show that the understanding ability of ChatGPT can be further improved.

1 Introduction

Large language models (LLMs), such as GPT-3 (Brown et al., 2020) and InstructGPT (Ouyang et al., 2022), have swept the natural language processing (NLP) community. Due to their emergent abilities (Wei et al., 2022a), these LLMs can achieve impressive few-shot and zero-shot performance on a variety of NLP tasks. More recently, ChatGPT¹, developed by OpenAI upon InstructGPT (Ouyang et al., 2022), has attracted great attention. Encouragingly, unlike prior public chatbots, ChatGPT is able to generate fluent and comprehensive responses to various human inquiries, and even to correct inappropriate human questions.

In light of the conventional wisdom that GPT-style models work well on generation tasks but fall short on understanding tasks, several prior studies (Jiao et al., 2023; Bang et al., 2023; Wang et al., 2023) have focused on the generation ability of ChatGPT and shown that it can achieve comparable or even better performance than existing LLMs on several generation tasks. However, it is still unclear whether ChatGPT works well on natural language understanding (NLU) tasks too.

In this report, we provide a systematic study to explore the question: "can ChatGPT understand too?". We answer this question by evaluating ChatGPT on the authoritative and popular GLUE benchmark (Wang et al., 2019), spanning 8 representative understanding tasks, i.e., sentiment analysis, linguistic acceptability, paraphrase, textual similarity, natural language inference, and question answering. For reference, we also compare it with 4 representative BERT-style models. Through a series of experiments and analyses, we find that:

• ChatGPT falls short in handling paraphrase and similarity tasks. Specifically, ChatGPT performs poorly on negative paraphrase and neutral similarity samples, respectively.

• ChatGPT outperforms all BERT-style models on inference tasks by a large margin, indicating its impressive reasoning ability.

• ChatGPT achieves comparable performance to BERT-base on sentiment analysis and question-answering tasks.

• Despite its good performance on inference tasks, ChatGPT may generate some contradictory or unreasonable responses, which would be its potential limitation.

* Work was done when Qihuang was interning at JD Explore Academy.
¹ https://chat.openai.com
Furthermore, in addition to analyzing ChatGPT itself, we also explore the complementarity of ChatGPT and some advanced prompting strategies, i.e., standard few-shot prompting (also known as in-context learning) (Brown et al., 2020), manual few-shot chain-of-thought (CoT) prompting (Wei et al., 2022b), and zero-shot CoT prompting (Kojima et al., 2022). Empirically, we find that ❶ all these prompting strategies can consistently improve ChatGPT, among which manual-CoT brings the largest performance benefit. Interestingly, we also observe that ❷ the performance of in-context learning is relatively sensitive to the provided examples, especially in the 1-shot scenario, which is similar to the findings of Agrawal et al. (2022). One possible reason is that the performance of in-context learning is (highly) related to the correlation (e.g., similarity) between the provided examples and the test data.

To summarize, the zero-shot performance of ChatGPT is comparable to that of the baseline fine-tuned BERT-base model. With the help of advanced prompting strategies, ChatGPT shows better understanding ability, and even outperforms the powerful RoBERTa-large model on some NLU tasks. However, there is still a performance gap between ChatGPT and fine-tuned RoBERTa-large in terms of average performance. That is, while ChatGPT can solve many NLP problems quite well, it still fails to beat the current SOTA models (He et al., 2021; Wang et al., 2020; Zhong et al., 2022d; Patra et al., 2022; Zhong et al., 2023), especially on some NLU tasks.

The remainder of this report is organized as follows. We present the evaluation settings and comparative results in Section 2. In Section 3, we explore whether ChatGPT can be improved with advanced prompting strategies. In Section 4, we briefly review related works. Conclusions are given in Section 5.

2 ChatGPT vs. BERT

In this section, we first introduce the evaluation setting (§2.1) and present the major results (§2.2). Then, some analyses of why ChatGPT performs well or poorly are provided (§2.3). Lastly, we show some failure cases of ChatGPT to explore its potential limitations (§2.4).

2.1 Evaluation Setting

Here, we briefly introduce the evaluation setting, including downstream tasks and datasets, baselines, and prompts for ChatGPT.

Tasks and Datasets. Following many prior works (Zhong et al., 2022a, 2023), we use the widely-used GLUE benchmark (Wang et al., 2019) for model evaluation. As one of the most popular NLU benchmarks, GLUE consists of several challenging NLU tasks, including linguistic acceptability (CoLA, Warstadt et al. (2019)), sentiment analysis (SST-2, Socher et al. (2013)), paraphrase (MRPC, Dolan and Brockett (2005)), textual similarity (STS-B, Cer et al. (2017)), question paraphrase (QQP), textual entailment (MNLI, Williams et al. (2018); RTE, Giampiccolo et al. (2007)), and question-answer entailment (QNLI, Rajpurkar et al. (2016)). Considering the limits of testing ChatGPT, we follow Jiao et al. (2023) and randomly sample a subset of the dev set as the evaluation data for each task. Specifically, since most GLUE tasks are classification tasks (except STS-B, which is a regression task), we randomly sample 25 instances for each class from the dev set. For STS-B, we randomly sample 50 instances from a uniform distribution. Table 1 shows the task descriptions and statistics².

For evaluation, we report performance with the Accuracy ("Acc.") metric for most tasks, except the Pearson and Spearman correlations ("Pear./Spea.") for STS-B, the Matthews correlation ("Mcc.") for CoLA, and an additional F1 score for MRPC and QQP.

Baselines. We compare ChatGPT (Jan 31 version) with 4 representative BERT-style models, as BERT models are commonly used as baselines for evaluating understanding ability (Zhong et al., 2022b). Specifically, base-sized and large-sized BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) are used. All models are fine-tuned on the full training set of each task, with the same fine-tuning hyper-parameters as Zhong et al. (2022c). To estimate the lower bound of ChatGPT's understanding ability, we mainly focus on the comparison between ChatGPT and the basic base-sized BERT.

Prompts for ChatGPT. For each task, we design task-specific prompts to trigger the understanding ability of ChatGPT. Specifically, inspired by Jiao et al. (2023), we also ask ChatGPT to generate the prompts for each task, by inputting the following human inquiry:

> provide five concise prompts or templates that can make you deal with

² More detailed descriptions are shown in Appendix A.1.
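The subset construction above is simple enough to sketch. The following is a minimal illustration (ours, not the authors' code), assuming the GLUE dev sets are loaded through the HuggingFace datasets library and that an arbitrary fixed random seed is used; the helper names are hypothetical.

```python
# Hypothetical sketch of the Section 2.1 subset construction:
# 25 dev instances per class for classification tasks, 50 for STS-B.
import random
from datasets import load_dataset  # assumption: GLUE dev sets via HuggingFace datasets

random.seed(42)  # assumption: any fixed seed; the report does not specify one

def sample_classification_subset(config: str, per_class: int = 25, split: str = "validation"):
    """Draw `per_class` dev instances for each label."""
    dev = load_dataset("glue", config, split=split)
    by_label = {}
    for ex in dev:
        by_label.setdefault(ex["label"], []).append(ex)
    subset = []
    for examples in by_label.values():
        subset.extend(random.sample(examples, per_class))
    return subset

def sample_stsb_subset(n: int = 50):
    """Draw STS-B dev instances whose gold scores roughly follow a uniform [0, 5] distribution."""
    dev = list(load_dataset("glue", "stsb", split="validation"))
    subset = []
    for _ in range(n):
        target = random.uniform(0.0, 5.0)
        chosen = min(dev, key=lambda ex: abs(ex["label"] - target))
        dev.remove(chosen)  # avoid picking the same instance twice
        subset.append(chosen)
    return subset

rte_subset = sample_classification_subset("rte")                                 # 2 classes x 25 = 50
mnli_subset = sample_classification_subset("mnli", split="validation_matched")   # 3 classes x 25 = 75
stsb_subset = sample_stsb_subset()                                               # 50 instances
```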
Task    #Pos.   #Neg.   #Neu.   Description     Template Prompt

Single-Sentence Tasks
CoLA    25      25      -       acceptability   For the sentence: "[text]", is the sentence grammarly correct?
SST-2   25      25      -       sentiment       For the sentence: "[text]", is the sentiment in this sentence positive or negative?

Similarity and Paraphrase Tasks
MRPC    25      25      -       paraphrase      For the sentence pair "[text_1]" and "[text_2]", do these two sentences have the same semantics?
STS-B   total of 50             similarity      Determine the similarity between the following two sentences: "[text_1]" and "[text_2]". The score should be ranging from 0.0 to 5.0, and can be a decimal.
QQP     25      25      -       paraphrase      For the sentence pair "[text_1]" and "[text_2]", do these two sentences have the same semantics?

Inference Tasks
MNLI    25      25      25      NLI             Given the sentence "[text_1]", determine if the following statement is entailed or contradicted or neutral: "[text_2]"
QNLI    25      25      -       QA/NLI          Given the question "[text_1]", determine if the following sentence contains the corresponding answer: "[text_2]"
RTE     25      25      -       NLI             Given the sentence "[text_1]", determine if the following statement is entailed: "[text_2]"

Table 1: Task statistics, descriptions and prompts. All tasks are single-sentence or sentence-pair classification, except STS-B, which is a regression task. For ease of illustration, we use "#Pos./#Neg./#Neu." to denote the numbers of positive, negative and neutral instances for each task. Considering the limits of ChatGPT, we randomly sample 25 instances for each class from the dev set of each task for evaluation, except for STS-B, where we randomly sample 50 instances from a uniform distribution. In the prompts, [text], [text_1] and [text_2] are input slots.
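As an illustration only (not the authors' pipeline), the Table 1 templates could be filled as sketched below; the dictionary keys, the helper function, and the toy inputs are our own, while the template strings come from Table 1.

```python
# Hypothetical sketch: filling the Table 1 templates with task inputs.
TEMPLATES = {
    "cola": 'For the sentence: "{text}", is the sentence grammarly correct?',
    "sst2": 'For the sentence: "{text}", is the sentiment in this sentence positive or negative?',
    "rte":  'Given the sentence "{text_1}", determine if the following statement is entailed: "{text_2}"',
    # ... remaining tasks follow the same pattern
}

def build_prompt(task: str, **slots: str) -> str:
    """Fill the [text]/[text_1]/[text_2] slots of a Table 1 template."""
    return TEMPLATES[task].format(**slots)

# Toy inputs (not from GLUE) purely to show the resulting prompt string.
print(build_prompt("rte",
                   text_1="The cat is sleeping on the mat.",
                   text_2="There is a cat on the mat."))
# The returned string is what would be sent to ChatGPT; its free-form answer
# (e.g., "yes" / "no") is then mapped back to a task label.
```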
Table 2: Overall comparison between ChatGPT and fine-tuned BERT-style models on the GLUE benchmark. Results in green denote that ChatGPT surpasses BERT-base by a clear margin (> 2% (↑) score), while results in red denote that ChatGPT under-performs BERT-base (> 2% (↓) score). More specifically, "*" means that the performance difference between ChatGPT and BERT-base is larger than 10%.
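The per-task metrics named above (and reported in Tables 2 and 5) are standard quantities. A minimal sketch of how they could be computed with scikit-learn and SciPy follows; the tooling choice and the helper name are ours, not the authors'.

```python
# Hypothetical sketch of the GLUE metrics used in this report.
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

def glue_metrics(task: str, preds, golds) -> dict:
    """Accuracy for most tasks; Matthews corr. for CoLA; Pearson/Spearman for
    STS-B; an additional F1 for MRPC and QQP (positive class = paraphrase)."""
    if task == "cola":
        return {"mcc": matthews_corrcoef(golds, preds)}
    if task == "stsb":
        return {"pearson": pearsonr(golds, preds)[0],
                "spearman": spearmanr(golds, preds)[0]}
    scores = {"acc": accuracy_score(golds, preds)}
    if task in ("mrpc", "qqp"):
        scores["f1"] = f1_score(golds, preds)
    return scores

print(glue_metrics("mrpc", preds=[1, 0, 1, 1], golds=[1, 0, 0, 1]))  # -> accuracy 0.75, F1 0.8
```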
Method          MNLI-m                                          RTE
                Entailment     Contradiction   Neutral          Entailment      Not_Entailment
BERT-base       88.0           88.0            72.0             76.0            64.0
BERT-large      76.0           92.0            80.0             80.0            84.0
RoBERTa-base    84.0           88.0            80.0             80.0            76.0
RoBERTa-large   84.0           92.0            88.0             92.0            76.0
ChatGPT         92.0* (↑4.0)   96.0* (↑8.0)    80.0 (↑8.0)      96.0* (↑20.0)   80.0 (↑16.0)

Table 3: Per-class accuracy (%) of ChatGPT and BERT-style models on MNLI-m and RTE. The number in parentheses indicates the performance improvement over BERT-base. "*" denotes that ChatGPT outperforms all BERT-style models.
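Per-class accuracy as reported in Table 3 is simply accuracy restricted to the examples carrying a given gold label; a small sketch with hypothetical helper and toy labels:

```python
# Hypothetical sketch: accuracy over the subset of examples whose gold label equals `cls`.
def per_class_accuracy(preds, golds, cls) -> float:
    pairs = [(p, g) for p, g in zip(preds, golds) if g == cls]
    return sum(p == g for p, g in pairs) / len(pairs)

golds = ["entailment", "neutral", "entailment", "contradiction"]
preds = ["entailment", "neutral", "neutral",    "contradiction"]
print(per_class_accuracy(preds, golds, "entailment"))  # 0.5
```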
Inference Tasks. To take a closer look at why ChatGPT achieves impressive performance on inference tasks, we report the per-class accuracy of ChatGPT and the compared models on the MNLI and RTE tasks. The results are shown in Table 3. It can be seen that ChatGPT outperforms BERT-base by a large margin in all settings. Especially in the "entailment" class, i.e., when the premise entails the hypothesis, ChatGPT even surpasses all of the powerful BERT models by a clear margin. These results further demonstrate the effective inference ability of ChatGPT, especially when reasoning over factual input.

Similarity Task. Since STS-B is a regression task, we choose samples from the uniform similarity distribution, ranging from 0 for no meaning overlap to 5 for meaning equivalence, and show the absolute difference between predictions and ground truths for ChatGPT and BERT-base, respectively. As seen in Figure 2, ChatGPT under-performs BERT-base in most cases, as its predictions are generally far from the ground truths. More specifically, we observe that ChatGPT performs worse when the sentences in the pair have a lower similarity (< 2.5), which is similar to the observation from Table 4. It can also be found that ChatGPT has difficulty accurately predicting the similarity score for a pair of sentences around the decision boundary (around 2.5). One of the reasons is that ChatGPT is not fine-tuned on the STS-B task and cannot determine a correct decision boundary. As we show in Section 3, ChatGPT can be considerably improved by advanced prompting strategies.

Figure 2: The comparison between BERT-base and ChatGPT on STS-B. The x-axis denotes the similarity distribution of STS-B, and the y-axis denotes the absolute difference between prediction and ground truth.

2.4 Case Study

Here, we show some bad cases of ChatGPT to explore its potential limitations, and attempt to explain why ChatGPT falls short in handling the negative samples of the paraphrase task.

First, while ChatGPT works well on the inference task, it still fails to make the correct predictions in some cases. As seen in Figure 3, ChatGPT can generate fluent responses to both inquiries due to its powerful generation ability. However, we observe that these responses are somewhat contradictory and even unreasonable. For example, in the upper case, ChatGPT says "...Jane was hungry and that this was the reason for giving candy to Joan,...", which is very confusing: if Jane was indeed hungry, Jane would not give candy to Joan, but would eat the candy herself. There is a similar phenomenon in the lower case, where ChatGPT answers with confused logic. In general, ChatGPT is able to generate fluent responses following a certain pattern, but appears to have limitations in truly reasoning about the sentences. One piece of evidence is that ChatGPT fails to answer questions, such as the cases in Figure 3, that are easily answered by humans.

Figure 3: Failures of ChatGPT in the inference task. The ground truth for both cases is "not_entailment", but ChatGPT makes "entailment" predictions. (Data: 2022.02.09)

On the other hand, some example failures of ChatGPT in the paraphrase task are shown in Figure 4. Both cases are in the class of "not_entailment". ChatGPT thinks the two sentences have the same semantics, as both sentences describe a decrease (increase) in a value, which can be viewed as a coarse-grained semantic similarity. However, we can easily see that the major difference between the two sentences is the difference in the values, which determines the "not_entailment" polarity of these cases. We refer to this value difference as the fine-grained semantic difference. These cases show that such a discrepancy between coarse-grained and fine-grained semantic information might be one of the reasons why ChatGPT struggles with handling negative samples in the paraphrase task. This also indicates that strengthening the ability of ChatGPT to extract fine-grained semantic information would effectively improve its performance on the paraphrase tasks.

Figure 4: Failures of ChatGPT in the paraphrase task. The ground truth for both cases is "not_entailment", but ChatGPT makes "entailment" predictions. (Data: 2022.02.09)
(a) Zero-shot
Q: For the sentence: "John believes it that Bill is here.", is the sentence grammarly correct?
A: The answer (yes or no) is:
(output): yes.

(c) Zero-shot CoT
Q: For the sentence: "John believes it that Bill is here.", is the sentence grammarly correct?
A: Answer (yes or no) the question step by step.
(output): The word "it" is unnecessary and should be removed. The phrase "that Bill is here" is a subordinate clause that functions as the object of the verb "believes." However, in English, it is more natural to use the word "that" to introduce a subordinate clause that functions as the object of a verb of believing. Therefore, the corrected sentence would be: "John believes that Bill is here." Therefore, the answer (yes or no) is: no.

Figure 5: Illustrations of ChatGPT equipped with (b) standard few-shot prompting (Brown et al., 2020), (c) zero-shot chain-of-thought (CoT) prompting (Kojima et al., 2022) and (d) manual few-shot CoT prompting (Wei et al., 2022b). This test example is from the dev set of CoLA (Warstadt et al., 2019), while the few-shot examples (in green) are from the training set. We can find that, with the help of advanced prompting strategies, ChatGPT shows a better understanding ability.
3 Improving ChatGPT with Advanced Prompting Strategies

As mentioned in Section 2, we mainly focus on the zero-shot performance of ChatGPT, and the evaluation results show that there is still a clear margin between ChatGPT and fine-tuned BERT models on some NLU tasks. Inspired by some advanced prompting methods (Brown et al., 2020; Wei et al., 2022b; Kojima et al., 2022) that can effectively exploit the capabilities of LLMs, we here investigate whether these methods can also improve the understanding ability of ChatGPT and narrow its performance gap with the powerful BERT models.

3.1 Advanced Prompting Strategies

In this study, we use three popular prompting strategies, as follows (the sketch after this list illustrates the resulting prompt formats):

• Standard few-shot prompting: also known as in-context learning (Brown et al., 2020), it simply "prompts" the model with a few input-output exemplars demonstrating the task. Specifically, as shown in Figure 5 (b), it enables ChatGPT to perform a target task by feeding a few prompted examples as part of the input.

• Manual few-shot CoT prompting: chain-of-thought (CoT) prompting, proposed by Wei et al. (2022b), provides manual intermediate reasoning steps (demonstrations)³ to lead the model to output the final answer step by step.

• Zero-shot CoT: instead of manually designing the demonstrations, Kojima et al. (2022) propose a zero-shot CoT method, which employs a simple and straightforward template-based prompt for CoT reasoning. Specifically, as shown in Figure 5 (c), we use a simple instruction (e.g., "Answer (yes or no) the question step by step.") to trigger the step-by-step reasoning of ChatGPT.

We show examples of ChatGPT equipped with these prompting strategies in Figure 5. More input examples for each task can be found in Appendix A.2.

³ The human effort in designing these demonstrations for different tasks is nontrivial. In our experience, we can first ask ChatGPT to generate the steps to perform the target task, and then manually modify the generated reasoning steps. After obtaining one demonstration, we can encourage ChatGPT to generate similar demonstrations for other input examples.
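As an illustration (our own hypothetical construction; only the Table 1 template and the Figure 5 trigger phrase are taken from the report), the three strategies differ mainly in how the final prompt string is assembled:

```python
# Hypothetical sketch of how the three prompting strategies wrap a Table 1 template (CoLA).
TEMPLATE = 'For the sentence: "{text}", is the sentence grammarly correct?'

def zero_shot(text: str) -> str:
    return TEMPLATE.format(text=text) + "\nThe answer (yes or no) is:"

def few_shot(text: str, exemplars) -> str:
    """Standard few-shot prompting: prepend (input, label) exemplars."""
    demo = "\n".join(TEMPLATE.format(text=x) + f"\nThe answer (yes or no) is: {y}."
                     for x, y in exemplars)
    return demo + "\n" + zero_shot(text)

def zero_shot_cot(text: str) -> str:
    """Zero-shot CoT: a trigger phrase asks the model to reason step by step."""
    return TEMPLATE.format(text=text) + "\nAnswer (yes or no) the question step by step."

def manual_few_shot_cot(text: str, exemplars_with_rationales) -> str:
    """Manual few-shot CoT: exemplars additionally include hand-written reasoning steps."""
    demo = "\n".join(TEMPLATE.format(text=x)
                     + f"\n{rationale} Therefore, the answer (yes or no) is: {y}."
                     for x, rationale, y in exemplars_with_rationales)
    return demo + "\n" + zero_shot(text)

print(zero_shot_cot("John believes it that Bill is here."))
```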
Method                CoLA   SST-2   MRPC           STS-B           QQP            MNLI          QNLI   RTE    GLUE
                      Mcc.   Acc.    Acc.   F1      Pear.  Spea.    Acc.   F1      m.     mm.    Acc.   Acc.   avg.
BERT-base             56.4   88.0    90.0   89.8    83.0   81.9     80.0   80.0    82.7   82.7   84.0   70.0   79.2
RoBERTa-large         65.3   96.0    92.0   92.0    92.9   91.1     90.0   89.4    88.0   90.7   94.0   84.0   87.8
ChatGPT               56.0   92.0    66.0   72.1    80.9   72.4     78.0   79.3    89.3   81.3   84.0   88.0   78.7
Standard few-shot prompting (Brown et al., 2020)
 -w/ 1-shot           52.0   96.0    66.0   65.3    87.4   87.0     84.0   83.3    80.0   78.7   84.0   80.0   78.5
 -w/ 5-shot           60.2   98.0    76.0   77.8    89.0   86.9     90.0   89.8    82.7   84.0   88.0   86.0   83.8
Zero-shot CoT (Kojima et al., 2022)
 -w/ zero-shot CoT    64.5   96.0    78.0   76.6    87.1   87.8     80.0   80.8    86.7   89.3   86.0   90.0   83.7
Manual few-shot CoT (Wei et al., 2022b)
 -w/ 1-shot CoT       60.8   94.0    82.0   83.2    89.1   88.7     84.0   82.6    85.3   84.0   88.0   92.0   84.3
 -w/ 5-shot CoT       68.2   96.0    82.0   81.6    90.0   90.2     86.0   85.1    85.3   86.7   90.0   92.0   86.2

Table 5: Results of ChatGPT equipped with advanced prompting strategies. For reference, we also report the results of the baseline BERT-base and the powerful RoBERTa-large. The best results are in bold. We can find that all advanced prompting strategies bring some performance improvements to ChatGPT, among which the manual few-shot CoT is empirically optimal.
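The "GLUE avg." column appears consistent with first averaging the two metrics of the two-metric tasks (MRPC, STS-B, QQP, MNLI) and then taking the mean over the 8 tasks; a short sketch under that assumption (the helper is ours):

```python
# Hypothetical sketch: reproducing the "GLUE avg." column, assuming paired
# metrics are averaged per task before averaging over the 8 tasks.
def glue_average(scores: dict) -> float:
    per_task = [s if isinstance(s, (int, float)) else sum(s) / len(s)
                for s in scores.values()]
    return sum(per_task) / len(per_task)

chatgpt = {"CoLA": 56.0, "SST-2": 92.0, "MRPC": (66.0, 72.1), "STS-B": (80.9, 72.4),
           "QQP": (78.0, 79.3), "MNLI": (89.3, 81.3), "QNLI": 84.0, "RTE": 88.0}
print(round(glue_average(chatgpt), 1))  # 78.7, matching the ChatGPT row in Table 5
```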
Figure 6: Analysis of the unstable 1-shot prompting performance on the CoLA task. The x-axis denotes 5 randomly sampled examples. The left y-axis is the performance of ChatGPT, while the right y-axis is the average textual similarity, measured by Sentence-BERT (Reimers and Gurevych, 2019), between the given example and the test data.

3.2 More Results and Analyses

The overall results of ChatGPT equipped with advanced prompting strategies on the GLUE benchmark are shown in Table 5. For reference, we also compare the improved ChatGPT with the baseline BERT-base and the powerful RoBERTa-large models. Based on these empirical results, we can further find that:

❶ ChatGPT benefits from all of these prompting strategies. Compared to the baseline, i.e., zero-shot ChatGPT (78.7%), all of these prompting strategies bring some performance improvements. Specifically, standard few-shot prompting and zero-shot CoT improve the overall performance of ChatGPT by +5.1% and +5.0% average score, respectively. More encouragingly, with the help of manual few-shot CoT, ChatGPT achieves up to +7.5% average gains and even outperforms most BERT-style models (except RoBERTa-large). These results indicate that prompting ChatGPT with manual-CoT could be the Pareto frontier for leveraging its capabilities.

❷ In the 1-shot scenario, the performance of ChatGPT is relatively sensitive to the given in-context example. Despite the overall performance gains in few-shot settings, we find that ChatGPT does not consistently perform better on these NLU tasks, especially in the 1-shot scenario. More specifically, when equipped with standard 1-shot prompting, ChatGPT even performs worse on some tasks, e.g., CoLA, MRPC, MNLI and RTE. We attribute this to the lower correlation between the randomly sampled in-context example and the test data, as prior work (Agrawal et al., 2022) shows that a noisy, unrelated 1-shot example can have a catastrophic impact on output quality⁴. To further verify this conjecture, we use different 1-shot examples to perform the standard 1-shot prompting. Taking the CoLA task as an example, the comparative results are shown in Figure 6. As seen, the 1-shot performance is unstable, and when given a more related 1-shot example, ChatGPT can achieve larger performance gains, confirming our statement.

⁴ This might also be the reason why 5-shot prompting generally works better, as concatenating multiple random examples could reduce the effect of noise.
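The similarity measure in Figure 6 (average Sentence-BERT similarity between the in-context example and the test instances) could be computed roughly as follows; the specific checkpoint name is our assumption and is not stated in the report.

```python
# Hypothetical sketch: average Sentence-BERT cosine similarity between a 1-shot
# in-context example and the test instances (cf. Figure 6).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any Sentence-BERT checkpoint

def avg_similarity(example_text: str, test_texts: list) -> float:
    ex_emb = model.encode(example_text, convert_to_tensor=True)
    test_embs = model.encode(test_texts, convert_to_tensor=True)
    return util.cos_sim(ex_emb, test_embs).mean().item()

score = avg_similarity("The cat sat on the mat.",
                       ["A cat is on the mat.", "Stocks fell sharply on Monday."])
print(round(score, 3))
```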
❸ There is still a performance gap between ChatGPT and fine-tuned RoBERTa-large. With the help of manual-CoT, ChatGPT achieves impressive performance improvements and shows state-of-the-art (SOTA) performance among all compared models on some tasks, e.g., CoLA, SST-2 and RTE. However, compared with the fine-tuned RoBERTa-large, ChatGPT still underperforms on some tasks, especially the paraphrase task (MRPC), by a clear margin. These results again indicate that, although ChatGPT can solve many NLP problems quite well, it still fails to beat the current SOTA models, especially on some NLU tasks.

Note: Some readers may be concerned that our work could be a kind of "lottery ticket", as we only evaluate ChatGPT on a part of the validation set for each task. To dispel such doubt, we investigate whether there are similar findings in the full-data setting. Specifically, taking the RTE task as an example, we report the corresponding results of ChatGPT under the few-data and full-data settings, respectively, as shown in Table 6. It can be found that ChatGPT shows similar characteristics (e.g., significantly benefiting from manual-CoT) in both scenarios, indicating the credibility of our work.

Method                  Few-data   Full-data
ChatGPT                 88.0       83.8
Standard few-shot prompting
 -w/ 1-shot             80.0       83.4
 -w/ 5-shot             86.0       84.4
Zero-shot CoT
 -w/ zero-shot CoT      90.0       85.9
Manual few-shot CoT
 -w/ 1-shot CoT         92.0       87.0
 -w/ 5-shot CoT         92.0       89.9

Table 6: Results of ChatGPT evaluated on the few-data (the setting used in our main experiments) / full-data settings of the RTE task. We can find that there are similar findings in both scenarios.

4 Related Works

In recent years, we have witnessed numerous Transformer-based pretrained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020; Raffel et al., 2020; Lewis et al., 2020; Zhong et al., 2022a, 2023) that have achieved tremendous success in various natural language processing (NLP) tasks. Based on their model architectures, these PLMs can be classified into three groups: 1) encoder-only PLMs (e.g., BERT (Devlin et al., 2019))⁵, 2) decoder-only PLMs (e.g., GPT-3 (Brown et al., 2020)), and 3) encoder-decoder PLMs (e.g., T5 (Raffel et al., 2020)). Due to their different pretraining objectives, these PLMs exhibit different abilities when performing NLP tasks. Specifically, the BERT-style models are based on a bidirectional masked language modeling (MLM) objective, which enforces the models to encode context information. Through fine-tuning on a specific task, these BERT-style models can work well on a variety of natural language understanding (NLU) tasks. On the contrary, the GPT-style models aim to predict future words given a sequence of words. Such auto-regressive models are well-suited for language generation, but they are unidirectional and usually fall short in representation learning for understanding a sentence (Liu et al., 2021; Zhong et al., 2022a).

More recently, a lot of work has focused on scaling up PLMs and developing large language models (LLMs) (Ouyang et al., 2022; Chowdhery et al., 2022; Smith et al., 2022; Zhang et al., 2022). Wei et al. (2022a) show that LLMs exhibit emergent abilities, e.g., few-shot and zero-shot learning, when the model sizes are large enough. As a typical LLM, the recently released ChatGPT has attracted great attention, due to its impressive ability to generate fluent and high-quality responses. There is growing interest in exploring the capabilities, applications, ethics, and failures of ChatGPT (Jiao et al., 2023; Bang et al., 2023; Qin et al., 2023; Zhuo et al., 2023; Wang et al., 2023). Along this research line, we mainly focus on analyzing the understanding ability of ChatGPT in this report, which is important but has been given little attention.

⁵ We refer to these encoder-only models as BERT-style models, and the decoder-only models as GPT-style models.
5 Conclusion

In this study, we empirically investigate the language understanding ability of ChatGPT on a variety of natural language understanding tasks. Through a series of quantitative studies, we find that ChatGPT works well on inference tasks, but falls short in handling paraphrase and similarity tasks, especially for the negative instances. Furthermore, we attempt to improve the understanding ability of ChatGPT with some advanced prompting strategies. The results show that, with the help of these prompting strategies, ChatGPT can achieve significant performance improvements, and even outperforms the powerful RoBERTa-large on some tasks. Overall, ChatGPT attains an understanding ability comparable to that of some fine-tuned BERT-style models, but still fails to beat the currently best models on some NLU tasks. We hope our study can facilitate more research on how to address the limitations and improve the understanding performance of ChatGPT.

Limitations

Our work has several potential limitations. First, due to the limits of testing ChatGPT, we mainly evaluate ChatGPT on a part of the validation set for each task. It would be more convincing if we could test on more samples. On the other hand, this report only uses the GLUE benchmark for experiments, in which the task types are somewhat limited. In future work, we would like to evaluate ChatGPT on more NLU tasks and conduct more in-depth analyses and discussions.

References

Sweta Agrawal, Chunting Zhou, Mike Lewis, Luke Zettlemoyer, and Marjan Ghazvininejad. 2022. In-context examples selection for machine translation. arXiv preprint.
Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In NeurIPS.
Daniel Cer, Mona Diab, Eneko E Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In SemEval.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
Bill Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In IWP.
Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and William B Dolan. 2007. The third pascal recognizing textual entailment challenge. In ACL-PASCAL.
Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. Deberta: Decoding-enhanced bert with disentangled attention. In ICLR.
Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is chatgpt a good translator? A preliminary study. arXiv preprint.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In NeurIPS.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL.
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, et al. 2022. Training language models to follow instructions with human feedback. In NeurIPS.
Barun Patra, Saksham Singhal, Shaohan Huang, Zewen Chi, Li Dong, Furu Wei, Vishrav Chaudhary, and Xia Song. 2022. Beyond english-centric bitexts for better multilingual language representation learning. arXiv preprint.
Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. Is chatgpt a general-purpose natural language processing task solver? arXiv preprint.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In EMNLP.
Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In EMNLP.
Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. 2022. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019. Glue: A multi-task benchmark and analysis platform for natural language understanding. In ICLR.
Jiaan Wang, Yunlong Liang, Fandong Meng, Zhixu Li, Jianfeng Qu, and Jie Zhou. 2023. Cross-lingual summarization via chatgpt. arXiv preprint.
Wei Wang, Bin Bi, Ming Yan, Chen Wu, Jiangnan Xia, Zuyi Bao, Liwei Peng, and Luo Si. 2020. Structbert: Incorporating language structures into pre-training for deep language understanding. In ICLR.
Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2019. Neural network acceptability judgments. TACL.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022a. Emergent abilities of large language models. TMLR.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS.
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint.
Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. 2022a. E2s2: Encoding-enhanced sequence-to-sequence pretraining for language understanding and generation. arXiv preprint.
Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. 2022b. Panda: Prompt transfer meets knowledge distillation for efficient model adaptation. arXiv preprint.
Qihuang Zhong, Liang Ding, Keqin Peng, Juhua Liu, Bo Du, Yibing Zhan, and Dacheng Tao. 2023. Bag of tricks for effective language model pretraining and downstream adaptation: A case study on glue. arXiv preprint.
Qihuang Zhong, Liang Ding, Li Shen, Peng Mi, Juhua Liu, Bo Du, and Dacheng Tao. 2022c. Improving sharpness-aware minimization with fisher mask for better generalization on language models. In Findings of EMNLP.
Qihuang Zhong, Liang Ding, Yibing Zhan, Y. Qiao, Yonggang Wen, Li Shen, Juhua Liu, Baosheng Yu, Bo Du, Yixin Chen, Xinbo Gao, Chun Miao, Xiaoou Tang, and Dacheng Tao. 2022d. Toward efficient language model pretraining and downstream adaptation via self-evolution: A case study on superglue. arXiv preprint.
Terry Yue Zhuo, Yujin Huang, Chunyang Chen, and Zhenchang Xing. 2023. Exploring ai ethics of chatgpt: A diagnostic analysis. arXiv preprint.

A Appendix

A.1 Details of Tasks

In this work, we conduct extensive experiments on the GLUE (Wang et al., 2019) benchmark. Here, we introduce detailed descriptions of all downstream tasks and datasets as follows:

CoLA: The Corpus of Linguistic Acceptability (Warstadt et al., 2019) is a binary single-sentence classification task to determine whether a given sentence is linguistically "acceptable".

SST-2: The Stanford Sentiment Treebank (Socher et al., 2013) is a binary classification task to predict the sentiment of a given sentence.

MRPC: The Microsoft Research Paraphrase Corpus (Dolan and Brockett, 2005) is a task to predict whether two sentences are semantically equivalent.

STS-B: Semantic Textual Similarity (Cer et al., 2017) is a task to predict how similar two sentences are on a 0-5 scale in terms of semantic meaning.
QQP: The Quora Question Pairs dataset is a collection of question pairs from the community question-answering website Quora. The task is to determine whether a pair of questions are semantically equivalent.

MNLI: The Multi-Genre Natural Language Inference Corpus (Williams et al., 2018) is a task to predict whether the premise entails the hypothesis, contradicts the hypothesis, or neither, given a premise sentence and a hypothesis sentence.

QNLI: Question Natural Language Inference is a binary classification task constructed from SQuAD (Rajpurkar et al., 2016), which aims to predict whether a context sentence contains the answer to a question sentence.

RTE: Recognizing Textual Entailment (Giampiccolo et al., 2007) is a task to predict, given a premise and a hypothesis, whether the premise entails the hypothesis.
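For readers who want to reproduce the setup, the dev sets described above are available, for example, through the HuggingFace datasets hub; the snippet below is our own illustration and is not part of the original report.

```python
# Hypothetical sketch: loading the GLUE dev sets described in Appendix A.1.
from datasets import load_dataset

GLUE_CONFIGS = ["cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte"]

for config in GLUE_CONFIGS:
    # MNLI exposes matched/mismatched dev splits instead of a single "validation" split.
    split = "validation_matched" if config == "mnli" else "validation"
    dev = load_dataset("glue", config, split=split)
    print(config, len(dev), dev.column_names)
```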