Text Classification via Large Language Models

Xiaofei Sun♦*, Xiaoya Li♣*, Jiwei Li♦, Fei Wu♦, Shangwei Guo▲, Tianwei Zhang♥, Guoyin Wang⋆

♦ Zhejiang University, ♣ Shannon.AI, ⋆ Amazon, ♥ Nanyang Technological University, ▲ Chongqing University
{xiaofei_sun, wufei, jiwei_li}@zju.edu.cn, [email protected], [email protected], [email protected], [email protected]
* indicates equal contributions.

arXiv:2305.08377v3 [cs.CL] 9 Oct 2023

Abstract

Despite the remarkable success of large-scale Language Models (LLMs) such as GPT-3, their performance still significantly lags behind fine-tuned models on the task of text classification. This is due to (1) the lack of reasoning ability needed to address complex linguistic phenomena (e.g., intensification, contrast, irony, etc.); and (2) the limited number of tokens allowed in in-context learning.

In this paper, we introduce Clue And Reasoning Prompting (CARP). CARP adopts a progressive reasoning strategy tailored to the complex linguistic phenomena involved in text classification: CARP first prompts LLMs to find superficial clues (e.g., keywords, tones, semantic relations, references, etc.), based on which a diagnostic reasoning process is induced for the final decision. To further address the limited-token issue, CARP uses a model fine-tuned on the supervised dataset for kNN demonstration search in in-context learning, allowing the model to take advantage of both the LLM's generalization ability and the task-specific evidence provided by the full labeled dataset.

Remarkably, CARP yields new SOTA performances on 4 out of 5 widely-used text-classification benchmarks: 97.39 (+1.24) on SST-2, 96.40 (+0.72) on AGNews, 98.78 (+0.25) on R8 and 96.95 (+0.6) on R52, and a performance comparable to SOTA on MR (92.39 v.s. 93.3). More importantly, we find that CARP delivers impressive abilities in low-resource and domain-adaptation setups. Specifically, using 16 examples per class, CARP achieves performance comparable to supervised models trained with 1,024 examples per class. Code and data are available at github.com/ShannonAI/GPT-CLS-CARP.

1 Introduction

Large language models (LLMs) (Radford et al., 2019a; Xue et al., 2020; Zhang et al., 2022a; Rae et al., 2021; Brown et al., 2020; Chowdhery et al., 2022; Ouyang et al., 2022; Thoppilan et al., 2022) have shown the ability for in-context learning (ICL). Given a few demonstration examples, LLMs are prompted to generate results for a new test example, and have achieved performance comparable to supervised baselines or even state-of-the-art results in a variety of natural language processing (NLP) tasks such as question answering (Trivedi et al., 2022), natural language inference (Schick and Schütze, 2020), named entity recognition (Wang et al., 2023), relation extraction (Wan et al., 2023) and information extraction (Han et al., 2021).

In spite of this success, LLMs with ICL still significantly underperform fine-tuned models for text classification. This is due to two reasons: (1) text classification requires models with more powerful reasoning abilities to resolve complex linguistic phenomena including clause composition (e.g., concession, negation, intensification), irony, etc. Recent efforts to improve LLMs' reasoning capabilities (Wei et al., 2022b; Kojima et al., 2022; Ye and Durrett, 2022; Zhang et al., 2022b) mainly focus on tackling math problems, and are thus not tailored to the reasoning process needed for the multitude of intricate linguistic phenomena in text classification; (2) the number of demonstration examples allowed in in-context learning is limited, e.g., the longest context allowed for GPT-3 is 4,096 subtokens. Therefore, LLMs are only able to take advantage of a small proportion of the training set, performing well below supervised baselines.
This is an overall sentiment classifier for movie reviews. Classify the overall SENTIMENT of the INPUT as Positive or
Negative.
(a) INPUT: press the delete key
SENTIMENT: Neutral

This is an overall sentiment classifier for movie reviews. Classify the overall SENTIMENT of the INPUT as Positive or
Negative.
(b) INPUT: press the delete key
SENTIMENT: Let's think step-by-step. The input does not contain any words that would indicate a sentiment, so it is
not possible to classify the sentiment as either positive or negative.

This is an overall sentiment classifier for movie reviews.


First, list CLUES (i.e., keywords, phrases, contextual information, semantic relations, semantic meaning, tones,
references) that support the sentiment determination of the input.
Second, deduce the diagnostic REASONING process from premises (i.e., clues, input) that supports the INPUT
sentiment determination (Limit the number of words to 130).
Third, based on clues, reasoning and input, determine the overall SENTIMENT of INPUT as Positive or Negative.
(c) INPUT: press the delete key
CLUES: delete key
REASONING: The phrase "delete key" implies an action of removing something, which could be interpreted as a
negative sentiment.
SENTIMENT: Negative

Figure 1: Examples of zero-shot prompting methods for the text classification task: (a) the vanilla prompting
method; (b) the Chain-of-Thought (CoT) (Kojima et al., 2022) prompting method; (c) the proposed CARP
prompting method.

In this paper, we introduce Clue And Reasoning Prompting (CARP), an extensible, annotation-free and efficient framework for text classification via large language models. To address the reasoning process needed for handling the linguistic phenomena in text classification, CARP decomposes the process into three steps: LLMs are first prompted to find superficial clues (e.g., keywords, tones, semantic relations, etc.) in the given text; next, CARP treats the clues and the input as premises and induces a diagnostic reasoning process; and finally it determines the label based on the above two steps. We find this progressive reasoning strategy effective in enhancing LLMs' ability in the language reasoning involved in text classification. Due to the limited number of tokens allowed in context, a more effective demonstration search is needed. CARP uses a model fine-tuned on the supervised dataset for kNN demonstration search for in-context learning. Since the fine-tuned model is trained on task-specific labels, it guarantees that retrieved samples are close to the input sequence with respect to the task. Using fine-tuned models for demonstration search provides a channel to connect LLMs with the full training set, in spite of the limited number of tokens allowed in demonstrations. This strategy lets the model take advantage of both the LLMs' generalization abilities and all task-specific evidence provided by the training dataset.

Remarkably, CARP yields new SOTA performances on four out of 5 widely-used text-classification benchmarks: 97.39 (+1.24) on SST-2, 96.40 (+0.72) on AGNews, 98.78 (+0.25) on R8 and 96.95 (+0.6) on R52, and a performance comparable to SOTA on MR (92.39 v.s. 93.3). More importantly, we find that CARP delivers impressive abilities in low-resource and domain-adaptation setups with orders of magnitude fewer training examples. Specifically, CARP achieves performance comparable to supervised models trained on the full training set, which contains more than one thousand examples per class, using only 16 examples per class. This demonstrates the capabilities of CARP in real-world text classification cases where training data is limited.

2 Related Work

2.1 Large Language Models

Large language models (LLMs) are models trained with self-supervised objectives on large unlabeled corpora. With emergent capabilities (Xie et al., 2021; Wei et al., 2022a), LLMs achieve significant performance boosts on NLP tasks.
This is an overall sentiment classifier for movie reviews. Classify the overall SENTIMENT of the INPUT as Positive or
Negative.
INPUT: noyce films it more as a shocking history lesson than as drama.
SENTIMENT: Positive

INPUT: press the delete key


SENTIMENT: Neutral

(a)
This is an overall sentiment classifier for movie reviews. Classify the overall SENTIMENT of the INPUT as Positive
or Negative.
INPUT: noyce films it more as a shocking history lesson than as drama.
SENTIMENT: Let's think step-by-step. The input states that the films are "more" something than something else.
This implies that the films are at least somewhat of the first thing mentioned. In this case, the films are more of a
"shocking history lesson" than "drama". This implies that the sentiment is positive, as the films are more of a history
lesson than drama. The overall sentiment is Positive.

INPUT: press the delete key


SENTIMENT: Let's think step-by-step. The input does not provide any context or opinion, so it is impossible to
determine the sentiment. Therefore, the overall sentiment is Neutral.

(b)

This is an overall sentiment classifier for movie reviews.


First, present CLUES (i.e., keywords, phrases, contextual information, semantic relations, semantic meaning,
tones, references) that support the sentiment determination of input.
Second, deduce a diagnostic REASONING process from premises (i.e., clues, input) that supports the sentiment
determination (Limit the number of words to 130).
Third, determine the overall SENTIMENT of INPUT as Positive or Negative considering CLUES, the REASONING
process and the INPUT.
INPUT: noyce films it more as a shocking history lesson than as drama.
CLUES: - Positive: "more," "shocking," "history lesson," "drama." - Negative: None.
REASONING: The input contains positive words such as "more," "shocking," and "drama," which suggest that the
sentiment is positive. Additionally, the phrase "history lesson" implies that the film is educational and informative,
which is generally seen as a positive thing. Therefore, the sentiment of the input is positive.
SENTIMENT: Positive

INPUT: press the delete key


CLUES: - Positive: None. - Negative: "delete key."
REASONING: The input contains the phrase "delete key," which implies that something is being removed or
erased. This is generally seen as a negative action, suggesting that the sentiment of the input is negative.
SENTIMENT: Negative

(c)

Figure 2: Examples of few-shot (k=1) prompting methods for the text classification task: (a) the vanilla prompting
method; (b) the Chain-of-Thought (CoT) (Kojima et al., 2022) prompting method; (c) the proposed CARP
prompting method.

LLMs can be broadly divided into three categories based on the model architecture. The first category is encoder-only models like BERT (Devlin et al., 2018). BERT (300M) (Devlin et al., 2018) and its variants (Liu et al., 2019; Sun et al., 2020; Clark et al., 2020; Feng et al., 2020; Sun et al., 2021) adopt the pre-training then fine-tuning paradigm for NLP tasks: they use masked language modeling as the main training objective for pretraining, and fine-tune the pretrained model on annotated downstream datasets.

The second category is decoder-only models like GPT (Radford et al., 2019a). GPT (Radford et al., 2019a) uses the decoder of an auto-regressive transformer (Vaswani et al., 2017) as the model for predicting the next token in a sequence. GPT (Radford et al., 2019a) and its variants (Dai et al., 2019; Keskar et al., 2019; Radford et al., 2019b; Chowdhery et al., 2022; Zhang et al., 2022a) also follow the pre-training then fine-tuning paradigm. GPT-3 (175B) (Brown et al., 2020) proposes to formalize all NLP tasks as generating textual responses conditioned on the given prompt.

The third category is encoder-decoder models like T5 (Raffel et al., 2020). T5 (11B) (Raffel et al., 2020) and its variants (Lewis et al., 2019; Xue et al., 2020) are encoder-decoder transformer models that generate new sentences conditioned on a given input, also following the pre-training then fine-tuning paradigm.
2.2 In-context Learning

Unlike the pre-training then fine-tuning paradigm (Devlin et al., 2018), which saves model weights and uses task-specific datasets (i.e., train/valid/test sets), in-context learning (ICL) generates textual responses (i.e., label words) conditioned on a given prompt, usually with a few annotated examples, for downstream tasks. Li and Liang (2021); Zhong et al. (2021); Qin and Eisner (2021) propose to optimize prompts in the continuous space. Rubin et al. (2021); Das et al. (2021); Liu et al. (2021); Su et al. (2022) introduce different strategies for selecting in-context examples. Lampinen et al. (2022) show that explanations of examples in a few-shot prompt lead to a performance boost. Marasović et al. (2021) find that GPT-3 outperforms other models by a large margin on the explanation generation task. Wei et al. (2022b) propose chain-of-thought reasoning and use <input, chain-of-thought, output> triples as the prompt for LLMs. Wiegreffe et al. (2021) train a supervised filter to select explanations generated by GPT-3 on the SNLI and CommonsenseQA tasks.

2.3 Text Classification

Text classification is a task that aims to assign predefined labels (e.g., sentiment polarity, topic, etc.) to a given text. Earlier work decouples the task into two steps: (1) extract features using neural models such as RNNs (Irsoy and Cardie, 2014; Yang et al., 2016; Wang et al., 2018; Liu et al., 2016; Xie et al., 2020), CNNs (Kim, 2014; Zhang et al., 2015; Lai et al., 2015; Conneau et al., 2016; Wei and Zou, 2019), GCNs (Yao et al., 2019) or pretrained language models (Howard and Ruder, 2018; Sun et al., 2019; Chai et al., 2020; Chen et al., 2020; Lin et al., 2021); and (2) feed the extracted features into a classifier (Joulin et al., 2016) to obtain the final label.

Recently, in-context learning has achieved success and changed the paradigm of the text classification task. Schick and Schütze (2020) reformulate input examples into cloze-style phrases and annotate unlabeled text. Han et al. (2021) design sub-prompts and apply logic rules to compose sub-prompts into final prompts. Liu et al. (2021) retrieve examples that are semantically similar to a test sample to formulate its corresponding prompt. Shi et al. (2022) retrieve label-words-similar examples as demonstrations in prompts.

3 Prompt Construction

3.1 Overview

We follow the standard prompt-based in-context learning paradigm. Given an input sequence x_input = {x_1, x_2, ..., x_l}, the task of assigning a text-class label to the input is transformed into generating a pre-defined textual response y ∈ Y_verb (e.g., positive, negative, etc.) conditioned on the prompt x_prompt using a language model.

3.2 Prompt Construction

The prompt x_prompt, which is constructed based on x_input, consists of the following three components:

(1) Task description x_desc generally describes the task. Descriptions differ across classification tasks, e.g., sentiment classification, topic classification, etc. Taking sentiment classification as an example, the task description is given as follows:

Classify the overall sentiment of the input as positive or negative

(2) Demonstration consists of a sequence of annotated examples:

{(x_demo^1, y_demo^1), ..., (x_demo^k, y_demo^k)}

where x_demo^j, 1 ≤ j ≤ k, denotes the j-th input sequence and y_demo^j denotes the text transformed from its label, e.g., positive or negative for the binary sentiment classification task. Demonstrations serve two purposes: (1) providing the LLM with evidence to consult for decision making, which significantly boosts performance; and (2) providing an output format that the LLM's outputs need to follow, so that the output, which takes the form of natural language, can easily be transformed into labels. It is worth noting that demonstrations are only needed for the few-shot learning setup, not for the zero-shot learning setup.

(3) Input x_input is the test text sequence to classify.

The prompt x_prompt for a test input is constructed by concatenating the task description x_desc, a sequence of demonstrations {(x_demo^1, y_demo^1), ..., (x_demo^k, y_demo^k)}, and the test sequence x_test, which can be given as follows:

{x_desc; \n; <demo>_1; \n; ...; <demo>_k; \n; x_test}
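A minimal sketch of the concatenation in Section 3.2, assuming demonstrations are already available as (text, label-word) pairs; the function name and the SENTIMENT field formatting are illustrative choices, not taken from the released code.

```python
from typing import List, Tuple

def build_prompt(task_desc: str,
                 demos: List[Tuple[str, str]],
                 test_text: str) -> str:
    """Concatenate {x_desc; \\n; <demo>_1; ...; <demo>_k; \\n; x_test} as in Section 3.2."""
    parts = [task_desc]
    for demo_text, demo_label in demos:          # k = 0 reduces to the zero-shot prompt
        parts.append(f"INPUT: {demo_text}\nSENTIMENT: {demo_label}")
    parts.append(f"INPUT: {test_text}\nSENTIMENT:")
    return "\n".join(parts)

# Example with a single demonstration (k = 1):
prompt = build_prompt(
    "Classify the overall sentiment of the input as positive or negative.",
    [("noyce films it more as a shocking history lesson than as drama.", "Positive")],
    "press the delete key",
)
```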
3.3 Demonstration Sampling

The few-shot setup requires demonstrations sampled from the training set. Strategies that we explore include:

Random Sampling A straightforward strategy is to randomly sample k examples {(x^1, y^1), ..., (x^k, y^k)} from the training set D_train for a test sequence x_test.

kNN Sampling The key disadvantage of random sampling is that there is no guarantee that the selected samples are semantically related to the input sequence. A straightforward alternative is to sample examples that are similar to the test sequence using kNN search (Khandelwal et al., 2020). In this process, the test sequence x_test is first mapped to a vector v_test using an encoder model f. Then, using v_test as the query, we search through the entire training set D_train to retrieve the k nearest text sequences, obtaining k data examples N = {x_j, y_j}_{j=1}^{k} as demonstrations. We use the following encoder models to obtain sentence representations and similarity scores:

SimCSE (Gao et al., 2021) is a contrastive learning model for sentence embeddings. We use the Sup-SimCSE-RoBERTa-Large model as the encoder, which is initialized with RoBERTa-Large (Liu et al., 2019) and fine-tuned on natural language inference datasets. SimCSE (Gao et al., 2021) is a semantics-based model and retrieves semantically similar examples, but not necessarily examples with the same labels.

Finetuned Model (FT for short) The key disadvantage of SimCSE (Gao et al., 2021) and other general semantic encoding models (Reimers and Gurevych, 2019; Seonwoo et al., 2022; Sun et al., 2022) is that they measure general semantic similarity but are not specifically tailored to the text classification task. To resolve this issue, CARP uses a model fine-tuned on the training dataset as the kNN encoder. Specifically, we first fine-tune a RoBERTa model on the training data. Next, we use the [CLS] embedding as the sentence-level representation for kNN search. Since the fine-tuned model is trained on task-specific labels, it guarantees that retrieved samples are close to the input sequence with respect to the task. Using the fine-tuned model provides a channel to connect LLMs with the full training set, in spite of the limited number of tokens allowed in demonstrations. This strategy lets the model take advantage of both the LLMs' generalization abilities and all task-specific evidence provided by the training dataset. A sketch of this FT kNN-sampler is shown below.
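A minimal sketch of the FT kNN-sampler under stated assumptions: a RoBERTa checkpoint already fine-tuned on the task's training set (the path is a placeholder), the [CLS] vector used as the sentence representation, and cosine similarity as the retrieval score.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumes FT_MODEL_DIR points to a RoBERTa model fine-tuned on the task's training set.
FT_MODEL_DIR = "path/to/finetuned-roberta"   # hypothetical path
tokenizer = AutoTokenizer.from_pretrained(FT_MODEL_DIR)
encoder = AutoModel.from_pretrained(FT_MODEL_DIR).eval()

@torch.no_grad()
def cls_embed(texts):
    """Encode texts and return L2-normalized [CLS] (first-token) hidden states."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state          # (batch, seq_len, dim)
    return torch.nn.functional.normalize(hidden[:, 0], dim=-1)

def knn_demonstrations(test_text, train_texts, train_labels, k=16):
    """Retrieve the k training examples whose [CLS] embeddings are closest to the test input."""
    train_vecs = cls_embed(train_texts)                   # in practice, cache these once
    query_vec = cls_embed([test_text])
    scores = query_vec @ train_vecs.T                     # cosine similarity (vectors are normalized)
    top = scores.squeeze(0).topk(k).indices.tolist()
    return [(train_texts[i], train_labels[i]) for i in top]
```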
4 Clues Collecting and Reasoning

To enhance the model's reasoning ability in addressing the linguistic phenomena involved in text classification, we propose a progressive reasoning strategy that involves clue collection, reasoning and decision making. This process also mimics how humans make decisions: we first collect evidence from the input, separating the wheat from the chaff; next we piece together local evidence to form a global picture, which leads to the final decision. Below we first give an overview of the clue collecting and reasoning process, and then describe implementation details.

4.1 Overview

Collecting Clues For a test sequence, clues are local factual evidence such as keywords, phrases, contextual information, semantic meaning, semantic relationships, tones, references, etc. The following is an example of clues for an input:

Input: Steers turns in a snappy screenplay that curls at the edges; it's so clever you want to hate it.
Clues: "snappy", "clever", "want to hate it" are clues for determining the sentiment of the input sentence.

Reasoning For reasoning, the LLM is prompted to go beyond superficial keywords to mine deeper perspectives, considering linguistic phenomena such as negation, intensification, irony, etc., and to piece together local evidence to form the final decision. The following example shows the reasoning process for deciding the sentiment of the above example based on the evidence collected:

1. The phrase "snappy screenplay" implies that the screenplay is of a high quality and is well-crafted.
2. The phrase "curls at the edges" implies that the screenplay is cleverly written.
3. The phrase "so clever you want to hate it" is a paradoxical statement, which suggests that the sentiment is positive despite the use of the word "hate".

Decision Making Based on the reasoning process, the model makes the decision for the sentiment of the given input:

Overall, the clues and reasoning process point to a positive sentiment for the input sentence.
The merits of incorporating clue finding and reasoning are as follows: (1) it prompts the model to progressively think and make decisions: clue finding focuses more on superficial features such as keywords, while reasoning makes deeper justifications based on those superficial features. This process better mimics how humans decide; (2) clue finding and reasoning serve as a tunnel for human intervention: in the few-shot setup, where clues and reasons need to be prepared in advance for demonstrations, we can modify them as we see fit. This is extremely helpful for troubleshooting and error correction in the prompt-construction stage; (3) from an interpretation and uncertainty-estimation perspective, clues and reasoning in few-shot setups are human-readable influence functions; (4) in contrast to listing annotated (text, label) pairs in few-shot setups, incorporating the clues and reasoning process in prompts aligns more closely with the instruction tuning objective, reducing the discrepancy between LLMs' training objectives and in-context learning for downstream tasks.

4.2 Collecting Clues and Reasoning in Zero-shot

In the zero-shot setup, as no demonstration is allowed, no concrete example of clues and reasons can be provided. In this case, we only add requests asking the model to output clues and reasons in the prompt. The prompt is given as follows:

This is an overall sentiment classifier for opinion snippets.
First, list CLUES (i.e., keywords, phrases, contextual information, semantic relations, semantic meaning, tones, references) for determining the overall sentiment of the input.
Next, deduce a diagnostic reasoning process from clues and the input to determine the overall sentiment.
Finally, determine the sentiment of input as Positive or Negative considering clues, the reasoning process and the input.
INPUT: <text>
CLUES:

4.2.1 Clue Collecting and Reasoning in Few-shot

In the few-shot setup, we need to prepare clues and reasonings for all examples in the training set in advance, as every training example has a chance to be selected as a demonstration given different test inputs. Previous efforts on math problems (Wei et al., 2022b; Kojima et al., 2022; Ye and Durrett, 2022; Zhang et al., 2022b) prepare hand-crafted reasoning for a few examples and always use these examples as demonstrations. This strategy does not fit our situation, as it is extremely time-intensive to manually generate clues and reasonings for all training examples. To resolve this issue, we harness LLMs for automatic clue and reasoning generation, where we ask LLMs to generate clues and reasoning based on both the input and its corresponding label.

Clue Generation For a given training example <text> paired with the label word <label-word> (e.g., positive), we ask the LLM to generate clues that indicate the label:

List CLUES (i.e., keywords, phrases, contextual information, semantic meaning, semantic relationships, tones, references) that support the sentiment determination of the input (limit to 15 words).
INPUT: <text>
SENTIMENT: <label-word>

Reasoning Generation Based on the generated clues, the input, and the label, we ask LLMs to generate reasoning details:3

Based on the input and clues, articulate the diagnostic reasoning process that supports the sentiment determination of the input.
INPUT: <text>
LABEL: <label-word>
CLUES: <clues>
REASONING:

Given the generated clues and reasonings for all training examples, at test time, when the k-nearest examples are selected as demonstrations, their corresponding clues and reasons are concatenated to the demonstrations. In this way, each demonstration example is composed of a (text, clues, reasons, golden label word) tuple; Table 1 shows examples of generated clues and reasoning.

3 LLMs often generate long responses; in order to fit more demonstrations in prompts, we use "limit to 50 words". After analyzing the generated responses, we find that LLMs can explain the reason within the word limit.
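A minimal sketch of the automatic clue and reasoning generation in Section 4.2.1, run once over the training set. The two templates echo the prompt wording above, and `query_llm` is again a hypothetical completion call rather than the authors' released interface.

```python
CLUE_PROMPT = (
    "List CLUES (i.e., keywords, phrases, contextual information, semantic meaning, "
    "semantic relationships, tones, references) that support the sentiment determination "
    "of the input (limit to 15 words).\n"
    "INPUT: {text}\nSENTIMENT: {label}\nCLUES:"
)

REASON_PROMPT = (
    "Based on the input and clues, articulate the diagnostic reasoning process that "
    "supports the sentiment determination of the input (limit to 50 words).\n"
    "INPUT: {text}\nLABEL: {label}\nCLUES: {clues}\nREASONING:"
)

def annotate_training_set(examples, query_llm):
    """Attach generated clues and reasoning to every (text, label) training example."""
    annotated = []
    for text, label in examples:
        clues = query_llm(CLUE_PROMPT.format(text=text, label=label)).strip()
        reasoning = query_llm(
            REASON_PROMPT.format(text=text, label=label, clues=clues)).strip()
        annotated.append({"text": text, "label": label,
                          "clues": clues, "reasoning": reasoning})
    return annotated
```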
Example 1
INPUT & GOLD LABEL: johnnie to and wai ka fai are sure to find an enthusiastic audience among american action adventure buffs, but the film's interests may be too narrow to attract crossover viewers. GOLD LABEL: Negative
CLUES: - Positive clues: enthusiastic, action, adventure. - Negative clues: narrow, crossover.
REASONING: The diagnostic reasoning process supporting the sentiment determination of the input is that the phrase "johnnie to and wai ka fai are sure to find an enthusiastic audience among american action adventure buffs" suggests that the film is likely to be well-received by this specific demographic. However, the subsequent phrase "but the film's interests may be too narrow to attract crossover viewers" implies that the film is unlikely to appeal to a broader audience, suggesting a negative sentiment. Therefore, the overall sentiment of the input is negative.

Example 2
INPUT & GOLD LABEL: moore sell globe unit moore resources inc said agreed principle sell globe inc unit cyprus minerals co undisclosed terms sale expected close mid april subject certain conditions including board approval companies globe produces silicon metals products used chemical aluminum iron rubber semiconductor industries reuter. GOLD LABEL: Acquisitions
CLUES: - Acquisition (agreed, sell, unit, sale, close, board approval) - Companies (Moore Resources Inc, Globe Inc, Cyprus Minerals Co) - Products (silicon, metals, chemical, aluminum, iron, rubber, semiconductor) - Reuter (news agency)
REASONING: The input mentions an acquisition agreement between Moore Resources Inc and Globe Inc, and the sale is expected to close in mid-April, suggesting an Acquisitions topic. The input also mentions Cyprus Minerals Co, silicon and metals products which are used in chemical, aluminum, iron, rubber, and semiconductor industries, and the Reuter news agency, all of which support the Acquisitions topic.

Table 1: Examples of generated clues and reasoning for demonstrations.

The prompt is thus given as follows:

This is a sentiment classifier for input opinion snippets.
List CLUES (i.e., keywords, phrases, contextual information, semantic meaning, semantic relationships, tones, references) that support the sentiment determination of the input.
Next, deduce the diagnostic REASONING process from premises (i.e., clues, input) that support the sentiment determination.
Finally, based on clues, the reasoning and the input, categorize the overall SENTIMENT of input as Positive or Negative.
input: <demo-text-1>
clues: <demo-clues-1>
reasoning: <demo-reason-1>
sentiment: <demo-label-word-1>
input: <demo-text-2>
clues: <demo-clues-2>
reasoning: <demo-reason-2>
sentiment: <demo-label-word-2>
... ...
input: <demo-text-n>
clues: <demo-clues-n>
reasoning: <demo-reason-n>
sentiment: <demo-label-word-n>
input: <text>

Examples of prompts with clues and reasons are shown in Figure 2. In this way, for a test example, by following the format of the demonstrations, the LLM will first output clues, then reasons, and at last the decision.

4.3 Voting

Unlike conventional discriminative models for text classification, which produce deterministic results at inference, LLMs used for in-context learning are generative models and can generate distinct textual responses under diverse sampling strategies across multiple runs. We consider the following voting strategies in this paper (a sketch of both is given after this section):
• Majority Vote: the final result is the most frequent prediction among multiple runs.
• Weighted Probability Vote: the final result is the label with the largest probability summed over multiple runs.

5 Experiments

In order to evaluate the effectiveness of the proposed method, we conduct experiments in two setups: (1) the full-training setup, where the model has access to the full training data; and (2) the low-resource setup, where the model can only access part of the training dataset. The low-resource setup better mimics real-world situations where training data is limited. For the full-training setup, we follow the standard train/dev/test split. For the low-resource setup, we randomly sample n instances per class (n in {16, 128, 256, 512, 1024}) from the benchmark training set. The sampled subset forms a new training set used to test different models' abilities in low-resource situations. During experiments, we train models and sample demonstrations with the new training set.
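For the two voting strategies of Section 4.3, a minimal sketch under the assumption that each run returns a predicted label, optionally paired with a probability or confidence score (e.g., derived from token log-probabilities); how that score is obtained is left open here.

```python
from collections import Counter, defaultdict

def majority_vote(predictions):
    """predictions: list of label strings from multiple runs."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_probability_vote(predictions):
    """predictions: list of (label, probability) pairs; sums probability mass per label."""
    mass = defaultdict(float)
    for label, prob in predictions:
        mass[label] += prob
    return max(mass, key=mass.get)

# Example: three runs on one test input.
print(majority_vote(["Positive", "Negative", "Positive"]))                    # Positive
print(weighted_probability_vote([("Positive", 0.55), ("Negative", 0.90),
                                 ("Positive", 0.60)]))                        # Positive (1.15 > 0.90)
```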
SST-2 AGNews R8 R52 MR Average
Supervised Methods
RoBERTa-Large (Liu et al., 2019) 95.99 95.55 97.76 96.42 91.16 95.38
DeBERTa (He et al., 2020) 94.75 95.32 98.33 96.32 90.19 94.99
RoBERTa-GCN (Lin et al., 2021) 95.80 95.68* 98.2 96.1 89.7 95.10
XLNet (Yang et al., 2019) 96.10* 95.55 - - - -
VLAWE (Ionescu and Butnaru, 2019) - - - - 93.3* -
GCN-SB (Zeng et al., 2022) - - 98.53* 96.35* 87.59 -
Zero-shot Setting
Vanilla (Brown et al., 2020) 91.55 90.72 90.19 89.06 88.69 90.04
CoT (Kojima et al., 2022) 92.11 91.25 90.48 91.24 89.37 90.89
CARP 93.01 92.60 91.75 91.80 89.94 91.82
Few-shot Setting (k=16)
Random Sampler
Vanilla (Brown et al., 2020) 92.36 91.74 91.58 91.56 89.15 91.28
CoT (Kojima et al., 2022) 94.56 95.02 92.49 92.03 89.91 92.80
CARP 96.20 95.18 97.60 96.19 90.03 95.04
SimCSE kNN-Sampler
Vanilla (Brown et al., 2020) 93.90 93.50 94.36 92.40 89.59 94.05
CoT (Kojima et al., 2022) 94.21 94.28 95.07 92.98 90.27 93.69
CARP 95.69 95.25 97.83 96.27 90.74 95.16
FT kNN-Sampler
Vanilla (Brown et al., 2020) 94.01 94.14 95.57 95.79 90.90 94.08
CoT (Kojima et al., 2022) 95.48 94.89 95.59 95.89 90.17 94.40
CARP 96.80 95.99 98.29 96.82 91.90 95.97
CARP (WP Vote) 97.39 96.40 98.78 96.95 92.39 96.38

Table 2: Accuracy of different settings on the benchmarks (mean over 5 runs). GPT-3 denotes text-davinci-003.
In few-shot experiments, we sample 16 annotated examples (k=16) for every test instance. * indicates previous
state-of-the-art results. "MJ Vote" is short for majority vote; "WP Vote" denotes weighted probability vote.

We conduct experiments on five widely-used datasets: SST-2 (Socher et al., 2013), R8, R52,4 AGNews (Zhang et al., 2015) and Movie Review (MR) (Pang and Lee, 2005). More details of the benchmarks and the low-resource datasets can be found in the Appendix. For zero-shot and few-shot experiments, we use InstructGPT-3 (Ouyang et al., 2022) (text-davinci-003, 175B) as the backbone. Due to the input token limitation, we use k = 16 for the few-shot setups. Prompts for the five datasets are shown in the Appendix. Model hyper-parameters can be found in Table 3.5 We use Vanilla to denote the conventional ICL approach where LLMs are directly prompted to generate labels, CoT (Kojima et al., 2022) to denote the baseline that mimics the chain-of-thought strategy, and CARP to denote the proposed method.

4 R8 and R52 originally come from https://www.cs.umb.edu/~smimarog/textmining/datasets/
5 During experiments, we find that CARP is robust to different hyper-parameters. Experimental results can be found in Appendix B.2.

5.1 Models for Comparison

Supervised models trained on the training set naturally constitute baselines to compare with. We use the following models as baselines; more details on hyper-parameters are given in Appendix B.1:
• RoBERTa-Large: we fine-tune RoBERTa-Large (Liu et al., 2019) on the training set.
• RoBERTa-GCN: Lin et al. (2021) construct heterogeneous graph networks on top of the RoBERTa-Large (Liu et al., 2019) model.
• DeBERTa: He et al. (2020) improve RoBERTa by using a disentangled attention mechanism and an enhanced mask decoder.
• XLNet: Yang et al. (2019) propose a generalized autoregressive pretraining method that enables learning bidirectional contexts.
Parameter Value
Engine Name text-davinci-003
Max Tokens 200
Temperature 0.7
Top P 1
Frequency Penalty 0.0
Presence Penalty 0.0
Best Of 1

Table 3: OpenAI API hyper-parameters.

• GCN-SB: Zeng et al. (2022) propose a simplified boosting algorithm, which makes a CNN learn again the samples misclassified by the GCN.
• VLAWE: Ionescu and Butnaru (2019) obtain document embeddings by aggregating the differences between each codeword vector and each word vector (from the document) associated with the respective codeword.

Few-shot Setup For demonstration sampling strategies in the few-shot setup, we consider the following for comparison (more details can be found in Section 3.3):
• Random Sampler: randomly samples k examples.
• SimCSE kNN-Sampler: samples the k nearest examples based on SimCSE (Gao et al., 2021) representations.6
• FT kNN-Sampler: samples the k nearest examples using fine-tuned RoBERTa-Large representations.

6 Specifically, we use Sup-SimCSE-RoBERTa-Large as the text encoder.

5.2 Results on the full training set

Experimental results are shown in Table 2. As can be seen, few-shot setups consistently outperform zero-shot setups. In terms of sampling strategies in the few-shot setups, we observe that the SimCSE kNN-sampler outperforms the random sampler, illustrating the importance of adding demonstrations that are relevant to the test input in the few-shot setup. We also observe that the FT kNN-sampler consistently outperforms the SimCSE kNN-sampler. This shows that the fine-tuned model, which takes advantage of the full training set, serves as a better retriever for task-specific demonstration retrieval than the general-purpose SimCSE retriever.

For different reasoning strategies, we first observe that the CoT strategy outperforms the vanilla strategy, which straightforwardly asks LLMs to generate results without further reasoning steps. CARP consistently outperforms CoT across all benchmarks, i.e., +1.48, +0.97, +2.76, +3.29 and +0.47 respectively on the SST-2, AGNews, R8, R52 and MR datasets. This demonstrates the necessity of modeling the complex linguistic phenomena involved in text classification, and the effectiveness of CARP in doing so.

Compared with supervised learning baselines, we find that the vanilla LLM approach underperforms supervised baselines, while few-shot CoT is able to obtain slightly worse or comparable results against supervised baselines. Notably, single-run CARP outperforms fine-tuned RoBERTa on all benchmarks. Using the WP voting strategy, CARP yields new SOTA performances on four out of the 5 datasets: 97.39 (+1.24) on SST-2, 96.40 (+0.72) on AGNews, 98.78 (+0.25) on R8 and 96.95 (+0.6) on R52, and a performance comparable to SOTA on MR (92.39 v.s. 93.3).

5.3 Results on low-resource settings

To simulate low-resource circumstances, we sample n = {16, 128, 256, 512, 1024} instances per class as low-resource setups. Experimental results are shown in Table 4. As can be seen, when the training set size is extremely small (i.e., 16 or 128 sentences), the performance of the supervised model is far below CARP. Even with only 16 examples to train on, the accuracy of CARP on SST-2 is already around 90%, whereas the supervised model's performance is similar to random guessing. This demonstrates the strong generalization ability of CARP in the low-resource setup. As anticipated, kNN search becomes more effective as the amount of training data increases: enlarging the training dataset increases the chance that the chosen examples correspond well to the input, resulting in improved results. Specifically, using 16 examples per class, CARP achieves performance comparable to supervised models trained with 1,024 examples per class; using 512 annotated instances per class, CARP achieves performance comparable to supervised models trained on the full set.
Dataset Model n=16 n=128 n=256 n=512 n=1024

SST-2
FT RoBERTa 51.52 52.31 53.89 70.49 90.30
GPT-3 Vanilla 90.15 90.36 91.70 93.86 94.68
GPT-3 Zero-shot-CoT 89.66 90.19 90.80 94.42 94.89
GPT-3 CARP 90.48 91.07 91.77 94.03 95.20

AGNews
FT RoBERTa 21.87 38.19 40.08 50.18 78.09
GPT-3 Vanilla 89.47 89.63 90.54 93.02 94.79
GPT-3 Zero-shot-CoT 89.66 90.16 91.70 94.86 95.28
GPT-3 CARP 90.16 90.94 91.07 94.08 95.48

R8
FT RoBERTa 11.29 48.19 60.18 70.70 88.68
GPT-3 Vanilla 89.15 90.27 91.70 94.00 94.91
GPT-3 Zero-shot-CoT 90.49 90.88 91.81 95.42 95.75
GPT-3 CARP 90.23 91.03 91.77 95.56 96.67

R52
FT RoBERTa 38.29 39.10 59.18 67.19 81.53
GPT-3 Vanilla 89.15 90.04 90.29 91.88 92.06
GPT-3 Zero-shot-CoT 89.46 90.02 90.73 93.20 94.12
GPT-3 CARP 90.82 91.00 95.85 94.36 96.27

MR
FT RoBERTa 51.20 52.11 53.58 68.29 88.37
GPT-3 Vanilla 86.04 88.68 88.99 89.80 90.18
GPT-3 Zero-shot-CoT 86.26 89.00 90.01 90.16 90.89
GPT-3 CARP 86.54 87.19 89.63 90.01 91.20

Table 4: Experimental results in low-resource (n examples per class) settings. We compare fine-tuned RoBERTa-
Large with the 16-shot GPT-3 setting. For GPT-3, we use SimCSE (Gao et al., 2021) to retrieve 16 annotated examples
from the low-resource training set. "cls" represents GPT-3 making decisions by directly generating label words; "reason-cls"
denotes that GPT-3 first generates the reasoning process and then makes the decision; "clue-reason-cls" represents that
GPT-3 finds clues in the given text, then explains the reasoning process, and finally makes the decision.
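A minimal sketch of how the low-resource subsets used in Table 4 could be drawn, i.e., n examples per class sampled from the benchmark training set; the fixed seed is an illustrative choice, not taken from the paper.

```python
import random
from collections import defaultdict

def sample_per_class(train_examples, n, seed=42):
    """train_examples: list of (text, label); returns a subset with n examples per class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in train_examples:
        by_label[label].append((text, label))
    subset = []
    for label, items in by_label.items():
        subset.extend(rng.sample(items, min(n, len(items))))
    rng.shuffle(subset)
    return subset
```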

5.4 Domain Adaptation

It is unclear whether it is essential to train the retrieval model on the specific dataset used for retrieving demonstrations. In this subsection, we conduct an analysis of using demonstrations from out-of-distribution datasets.

We use SST-2 and Yelp, where the task is to determine the positive or negative polarity of the given text. SST-2 and Yelp are from different domains: SST-2 consists of snippets from Rotten Tomatoes,7 whereas Yelp8 consists of product reviews from the online website. Experimental results are shown in Table 5. "SST-2 train & SST-2 test" means that demonstrations are from the SST-2 dataset and testing is performed on the SST-2 dataset; "Yelp train & SST-2 test" means that demonstrations are from Yelp and testing is performed on the SST-2 dataset. We see a significant decrease (-7.2%, 95.99% v.s. 88.78%) in performance when switching from SST-2 train to Yelp train for supervised RoBERTa, which illustrates that supervised models are very sensitive to out-of-distribution data. On the contrary, we only observe a slight decrease in performance (-0.5%, 96.80% v.s. 96.29%) when switching from SST-2 train to Yelp train on the SST-2 test set for CARP, illustrating the greater capability of CARP in domain-adaptation situations. This means CARP is very robust when training and test data are not from the same domain.

FT RoBERTa: SST-2 Train / Yelp Train
SST-2 Test 95.99 / 88.78
Yelp Test 92.38 / 96.04
CARP: SST-2 demon. / Yelp demon.
SST-2 Test 96.80 / 96.29
Yelp Test 95.94 / 96.32

Table 5: Results on the SST-2 and Yelp test sets when using in-domain/out-of-domain demonstration sources. We use the FT kNN Sampler to retrieve demonstrations from the corresponding training set.

7 https://www.rottentomatoes.com/
8 https://drive.google.com/drive/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M?resourcekey=0-TLwzfR2O-D2aPitmn5o9VQ&usp=share_link

6 Ablation Studies

In this section, we conduct comprehensive ablation studies to get a better understanding of the different elements of CARP.
SST-2 AGNews R8 R52 MR Average
Supervised Methods
RoBERTa-Large 95.99 95.55 97.76 96.42 91.16 95.38
RoBERTa-GCN 95.80 95.68 98.2 96.1 89.7 95.10
Zero-shot Setting
Vanilla 91.55 90.72 90.19 89.06 88.69 90.04
Zero-shot-CoT 92.11 91.25 90.48 91.24 89.37 90.89
CARP 94.41 93.18 93.29 92.69 90.03 92.72
Few-shot Setting
Random Sampler
Vanilla 91.36 91.48 90.60 90.68 89.15 90.65
Zero-shot-CoT 92.56 92.65 92.49 92.03 89.91 91.93
CARP 94.41 93.18 93.29 92.69 90.03 92.72
SimCSE kNN-Sampler
Vanilla 93.90 93.50 94.36 92.40 89.59 92.75
Zero-shot-CoT 94.21 94.28 95.07 92.98 90.27 93.36
CARP 95.99 95.53 95.31 93.84 90.64 94.26
FT kNN-Sampler
Vanilla 94.01 94.14 95.57 95.79 90.90 94.08
Zero-shot-CoT 95.48 94.89 95.59 95.89 90.17 94.40
CARP 96.62 95.97 98.13 96.12 91.86 95.74

Table 6: Accuracy performances of different settings on test subsets (results are over 5 runs). GPT-3 denotes
text-davinci-003. In few-shot experiments, we sample 16 annotated examples (k=16) per prompt. "MJ Vote"
is short for majority vote. "WP Vote" denotes weighted probability vote.

Figure 3: Performances v.s. the number of demonstrations in few-shot prompts. (x-axis: Number of Demonstrations, 0-24; y-axis: Test Accuracy (%); curves: Random Sampler, SimCSE kNN-Sampler, FT kNN-Sampler.)

Figure 4: Performances v.s. the number of demonstrations in few-shot prompts for the CARP strategy, where LLMs are first asked to generate evidence, then to reason, and at last to generate the final results. (x-axis: Number of Demonstrations, 0-24; y-axis: Test Accuracy (%); curves: Random Sampler, SimCSE kNN-Sampler, FT kNN-Sampler.)

6.1 Impact of the number of demonstrations

We explore the effect of the number of demonstrations in prompts. We conduct experiments on the SST-2 dataset. Results for the vanilla prompting and the CARP schemas using different sampling strategies are shown in Figure 3 and Figure 4, respectively. As can be seen, performance improves as the number of demonstrations increases for both the vanilla and the CARP schemas.

6.2 The effect of components in demonstrations

CARP uses (text, clues, reasons, golden label word) tuples as demonstrations. In this subsection, we examine the influence of each component in (text, clues, reasons, golden label word) by removing it from the prompts. Experimental results are shown in Table 7. As shown in Table 7, the text in demonstrations has the biggest impact on the final results. When (text, clue, reason) are used as demonstrations, the label has only a small effect on the performance.

Prompts SST-2 R8
CARP 96.80 98.29
w/o Text 92.28 94.18
w/o Clue 95.48 95.29
w/o Reason 95.72 97.82
w/o Label 96.53 98.18

Table 7: The effect of demonstration components on the SST-2 and R8 datasets.

6.3 The effect of different types of label words

Label words denote words generated by LLMs that indicate the label of the input. In this subsection, we explore the impact of using different kinds of label words:
• Position index: index numbers, i.e., one, two, three, etc., to denote the label.
• Annotation words: words used to refer to the category in the annotation file, e.g., positive, negative.9
• Synonym words: synonyms of the annotation words, e.g., great, terrible.
• Flipped words: words that are contrary to the original target meanings, e.g., "positive" to denote the negative polarity and "negative" to denote the positive polarity.
• Random words: randomly chosen words from the vocabulary, e.g., order, number.
• Special tokens: tokens that do not have semantic meaning; they are independent of the input and added for a certain purpose, e.g., <cls>, <mask>.

9 GPT-3 generates the same label words for the binary sentiment classification task.

Strategy Label Words (+, -) CARP
Position Index One, Two 95.66
Annotation Words Positive, Negative 96.86
Synonym Words Great, Terrible 96.27
Flipped Words Negative, Positive 64.63
Random Words Cf, Ng 95.06
Special Tokens <POS>, <NEG> 96.65

Table 8: Label words and results on the SST-2 dataset with different strategies. "+" represents positive polarity; "-" denotes negative polarity.

Results are shown in Table 8. As can be seen, few-shot ICL with annotation words as label words achieves the best performance. It is also worth noting that we observe a significant performance decrease when flipped words are used as label words in demonstrations.

6.4 The influence of clues

As mentioned in Section 3, clues are keywords, phrases, contextual information, semantic meaning, semantic relationships, tones and references that support making decisions. We remove different types of clues and evaluate the influence on the SST-2 and R8 datasets. Editing the prompts achieves this goal: the original prompt for clue collecting is "List CLUES (i.e., keywords, phrases, contextual information, semantic meaning, semantic relationships, tones, references) that support the sentiment determination of the input." If we want to remove keywords & phrases, we simply remove them from the prompt.
• w/o keywords & phrases: keywords and phrases are surface evidence for making decisions, such as "like", "hate".
• w/o contextual information & semantic meaning: contextual information and semantic meaning are meanings expressed in sentences/paragraphs, such as The author expresses his happiness.
• w/o semantic relationships: semantic relationships refer to relations between subjects, such as "emotional danger" suggests a romantic and thrilling relationship between Idemoto and Kim that creates a positive sentiment.
• w/o tones: tones are the general mood of the text, such as The sentence is expressed in an objective tone.
• w/o references: references are mentions of commonsense facts or books, such as The reference to the popular, comedic character "Ferris Bueller" implies that the kid is seen in a positive light.

Prompts SST-2 R8
Full Clues 96.80 98.29
w/o keyword&phrase 96.21 96.91
w/o contextual info. 96.23 97.10
w/o semantic relations 96.30 97.38
w/o tones 96.40 97.35
w/o reference 96.50 97.19

Table 9: Results on the SST-2 and R8 datasets when removing different types of clues from the clue-collecting prompt.

Experimental results are shown in Table 9. For the R8 and SST-2 datasets, keywords play the key role for GPT predictions.
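Whichever label-word scheme from Table 8 is used, the generated text still has to be mapped back to a class id; a minimal sketch of that mapping, with the verbalizer dictionaries shown here as illustrative values rather than the paper's exact configuration.

```python
# Example verbalizers for some of the strategies in Table 8 (illustrative values).
VERBALIZERS = {
    "annotation": {"positive": 1, "negative": 0},
    "synonym":    {"great": 1, "terrible": 0},
    "special":    {"<pos>": 1, "<neg>": 0},
}

def parse_label(completion: str, strategy: str = "annotation"):
    """Map the first recognized label word in the completion to a class id (None if absent)."""
    mapping = VERBALIZERS[strategy]
    for token in completion.lower().replace(":", " ").split():
        if token in mapping:
            return mapping[token]
    return None

print(parse_label("SENTIMENT: Positive"))   # 1
```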
Ranking SimCSE FT
Random 95.39 95.99
High-to-Low 95.22 96.71
Low-to-High 96.39 96.80

Table 10: Accuracy scores of CARP on SST-2 when assembling demonstrations with different ranking strategies.

Dataset Reliability(%) ↑ Fluency(ppl) ↓ Logic Faithful(%) ↑
SST-2 96.18 3.89 95.20
R8 95.34 3.29 94.55

Table 11: Results for evaluating the quality of the generated reasoning explanations. We sample 500 (text, reason) instances for SST-2 and R8.

6.5 The effect of demonstration order

During experiments, we find that the order of demonstrations affects the final results. In this subsection, we further investigate the influence of the order of demonstrations. As mentioned in Section 3.3, we retrieve k data instances N = {x_j, y_j}_{j=1}^{k} according to their cosine similarity with the test sequence. The demonstration orderings in the prompt that we investigate include:
• Random: randomly shuffle the retrieved demonstrations.
• Low-to-High: demonstrations with lower similarity scores come first. Therefore, demonstrations with higher similarity scores are placed closer to the test sequence, which is placed at the end of the prompt.
• High-to-Low: demonstrations with lower similarity scores are placed closer to the test sequence.

As shown in Table 10, performance is sensitive to the ordering of the demonstrations. The low-to-high ordering achieves the best performance compared to the random and high-to-low orderings.

6.6 Quality of the reasoning process

In this paper, we use LLMs to generate rationale explanations instead of human editing. Therefore, the quality of the generated reasoning process affects the final results. In this subsection, we sample 500 training (text, clues, reason, label) tuples and evaluate the generated reasoning processes from the following perspectives:

(1) Reliability: Inspired by the emergent generalization ability of LLMs, we use zero-shot GPT-3 (175B) as a self-critique model to evaluate the quality of the generated reasoning processes. To be specific, we ask GPT-3 to return yes/no depending on whether the generated reasoning process supports making the decision for the input text. If GPT-3 returns "yes", it denotes that the reasoning process is reliable for making the decision; if GPT-3 returns "no", it represents that the reasoning process is not reliable. The prompt for SST-2 is shown as follows:

Is the following REASONING process supporting determinate sentiment label to INPUT? Please answer Yes or No.
INPUT: <text>
REASONING: <reasoning-process>

where <text> is the text sequence of the data and <reasoning-process> is the generated reasoning process.

(2) Fluency: using LLMs to generate reasoning explanations is a reference-free text generation task. We use perplexity to evaluate the generated text.

(3) Logic Faithfulness: previous work often uses models trained on natural language inference datasets to determine whether a given "hypothesis" logically follows from the "premise". However, lacking annotated datasets, NLI-trained models cannot generalize across multiple domains (e.g., opinions, reviews, news). Hence, we use 16-shot ICL with GPT-3 to evaluate whether the generated rationale explanations can be entailed from the input text. If InstructGPT responds with "entailment", it denotes that the generated reasoning process is logically faithful to the text; otherwise, it represents that the reasoning process is not faithful to the text. We sample training instances from the SNLI dataset (Bowman et al., 2015) as demonstrations. The prompt is shown as follows:

Given the premise and hypothesis, please justify whether the HYPOTHESIS can be entailed from the PREMISE. Please return yes or no.
PREMISE: <text>
HYPOTHESIS: <reasoning-process>

Evaluation results are shown in Table 11. As can be seen, the reliability percentages for SST-2 and R8 are higher than 95%. This indicates that it is feasible to use the model-generated reasoning process as part of the prompts to augment ICL performance. The perplexity of the generated reasoning text is smaller than 4, which indicates that the generated reasoning text is fluent. The logic faithfulness scores are larger than 93%, which is in line with our expectation that LLMs can generate reasonable explanations.
7 Conclusion

In this paper, we introduce Clue And Reasoning Prompting (CARP) for the text classification task. CARP yields new SOTA performances on 4 out of 5 widely-used text-classification benchmarks. More importantly, we find that CARP delivers impressive abilities in low-resource and domain-adaptation setups. In the future, we would like to explore CARP on more natural language understanding tasks.

References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Duo Chai, Wei Wu, Qinghong Han, Fei Wu, and Jiwei Li. 2020. Description based text classification with reinforcement learning. In International Conference on Machine Learning. PMLR.

Jiaao Chen, Zichao Yang, and Diyi Yang. 2020. Mixtext: Linguistically-informed interpolation of hidden space for semi-supervised text classification. arXiv preprint arXiv:2004.12239.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.

Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2016. Very deep convolutional networks for text classification. arXiv preprint arXiv:1606.01781.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.

Rajarshi Das, Manzil Zaheer, Dung Thai, Ameya Godbole, Ethan Perez, Jay-Yoon Lee, Lizhen Tan, Lazaros Polymenakos, and Andrew McCallum. 2021. Case-based reasoning for natural language queries over knowledge bases. arXiv preprint arXiv:2104.08762.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155.

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.

Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021. Ptr: Prompt tuning with rules for text classification. arXiv preprint arXiv:2105.11259.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. Deberta: Decoding-enhanced bert with disentangled attention. ArXiv.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.

Radu Tudor Ionescu and Andrei M Butnaru. 2019. Vector of locally-aggregated word embeddings (vlawe): A novel document-level representation. arXiv preprint arXiv:1902.08850.

Ozan Irsoy and Claire Cardie. 2014. Deep recursive neural networks for compositionality in language. Advances in Neural Information Processing Systems, 27.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. 2019. Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.

Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Nearest neighbor machine translation. arXiv preprint arXiv:2010.00710.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Conference on Empirical Methods in Natural Language Processing.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. ArXiv.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence.

Andrew K Lampinen, Ishita Dasgupta, Stephanie CY Chan, Kory Matthewson, Michael Henry Tessler, Antonia Creswell, James L McClelland, Jane X Wang, and Felix Hill. 2022. Can language models learn from explanations in context? arXiv preprint arXiv:2204.02329.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.

Yuxiao Lin, Yuxian Meng, Xiaofei Sun, Qinghong Han, Kun Kuang, Jiwei Li, and Fei Wu. 2021. Bertgcn: Transductive text classification by combining gcn and bert. arXiv preprint arXiv:2105.05727.

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2021. What makes good in-context examples for gpt-3? arXiv preprint arXiv:2101.06804.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Ana Marasović, Iz Beltagy, Doug Downey, and Matthew E Peters. 2021. Few-shot self-rationalization with natural language prompts. arXiv preprint arXiv:2111.08284.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. arXiv preprint cs/0506075.

Guanghui Qin and Jason Eisner. 2021. Learning how to ask: Querying lms with mixtures of soft prompts. arXiv preprint arXiv:2104.06599.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019a. Language models are unsupervised multitask learners.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019b. Language models are unsupervised multitask learners. OpenAI blog.

Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.

Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.

Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2021. Learning to retrieve prompts for in-context learning. arXiv preprint arXiv:2112.08633.

Timo Schick and Hinrich Schütze. 2020. Exploiting cloze questions for few shot text classification and natural language inference. arXiv preprint arXiv:2001.07676.

Yeon Seonwoo, Guoyin Wang, Sajal Choudhary, Changmin Seo, Jiwei Li, Xiang Li, Puyang Xu, Sunghyun Park, and Alice Oh. 2022. Ranking-enhanced unsupervised sentence representation learning. arXiv preprint arXiv:2209.04333.

Weijia Shi, Julian Michael, Suchin Gururangan, and Luke Zettlemoyer. 2022. Nearest neighbor zero-shot inference. arXiv preprint arXiv:2205.13792.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Hongjin Su, Jungo Kasai, Chen Henry Wu, Weijia Shi, Tianlu Wang, Jiayi Xin, Rui Zhang, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, et al. 2022. Selective annotation makes language models better few-shot learners. arXiv preprint arXiv:2209.01975.

Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune bert for text classification? In Chinese Computational Linguistics: 18th China National Conference, CCL 2019, Kunming, China, October 18–20, 2019, Proceedings 18. Springer.

Xiaofei Sun, Yuxian Meng, Xiang Ao, Fei Wu, Tianwei Zhang, Jiwei Li, and Chun Fan. 2022. Sentence similarity based on contexts. Transactions of the Association for Computational Linguistics.

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng,

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022a. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022b. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.

Jason Wei and Kai Zou. 2019. Eda: Easy data
Hao Tian, Hua Wu, and Haifeng Wang. 2020. augmentation techniques for boosting performance
Ernie 2.0: A continual pre-training framework for on text classification tasks. arXiv preprint
language understanding. In Proceedings of the AAAI arXiv:1901.11196.
conference on artificial intelligence, volume 34.
Sarah Wiegreffe, Jack Hessel, Swabha Swayamdipta,
Zijun Sun, Xiaoya Li, Xiaofei Sun, Yuxian Meng, Mark Riedl, and Yejin Choi. 2021. Reframing
Xiang Ao, Qing He, Fei Wu, and Jiwei Li. 2021. human-ai collaboration for generating free-text
Chinesebert: Chinese pretraining enhanced by explanations. arXiv preprint arXiv:2112.08674.
glyph and pinyin information. arXiv preprint
arXiv:2106.16038. Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and
Quoc Le. 2020. Unsupervised data augmentation for
Jian Tang, Meng Qu, and Qiaozhu Mei. 2015. consistency training. Advances in neural information
Pte: Predictive text embedding through large-scale processing systems.
heterogeneous text networks. In Proceedings of
Sang Michael Xie, Aditi Raghunathan, Percy Liang,
the 21th ACM SIGKDD international conference on
and Tengyu Ma. 2021. An explanation of in-
knowledge discovery and data mining, pages 1165–
context learning as implicit bayesian inference. arXiv
1174.
preprint arXiv:2111.02080.
Romal Thoppilan, Daniel De Freitas, Jamie Hall, Linting Xue, Noah Constant, Adam Roberts, Mihir Kale,
Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and
Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, Colin Raffel. 2020. mt5: A massively multilingual
et al. 2022. Lamda: Language models for dialog pre-trained text-to-text transformer. arXiv preprint
applications. arXiv preprint arXiv:2201.08239. arXiv:2010.11934.
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, Zhilin Yang, Zihang Dai, Yiming Yang, Jaime
and Ashish Sabharwal. 2022. Interleaving retrieval Carbonell, Russ R Salakhutdinov, and Quoc V Le.
with chain-of-thought reasoning for knowledge- 2019. Xlnet: Generalized autoregressive pretraining
intensive multi-step questions. arXiv preprint for language understanding. Advances in neural
arXiv:2212.10509. information processing systems, 32.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong
Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz He, Alex Smola, and Eduard Hovy. 2016.
Kaiser, and Illia Polosukhin. 2017. Attention is all Hierarchical attention networks for document
you need. Advances in neural information processing classification. In Proceedings of the 2016 conference
systems, 30. of the North American chapter of the association
for computational linguistics: human language
Zhen Wan, Fei Cheng, Zhuoyuan Mao, Qianying technologies, pages 1480–1489.
Liu, Haiyue Song, Jiwei Li, and Sadao Kurohashi.
2023. Gpt-re: In-context learning for relation Liang Yao, Chengsheng Mao, and Yuan Luo. 2019.
extraction using large language models. arXiv Graph convolutional networks for text classification.
preprint arXiv:2305.02105. In Proceedings of the AAAI conference on artificial
intelligence, volume 33, pages 7370–7377.
Guoyin Wang, Chunyuan Li, Wenlin Wang, Yizhe
Zhang, Dinghan Shen, Xinyuan Zhang, Ricardo Xi Ye and Greg Durrett. 2022. The unreliability
Henao, and Lawrence Carin. 2018. Joint embedding of explanations in few-shot prompting for textual
of words and labels for text classification. arXiv reasoning. Advances in neural information
preprint arXiv:1805.04174. processing systems.

Shuhe Wang, Xiaofei Sun, Xiaoya Li, Rongbin Ouyang, Fang Zeng, Niannian Chen, Dan Yang, and Zhigang
Fei Wu, Tianwei Zhang, Jiwei Li, and Guoyin Wang. Meng. 2022. Simplified-boosting ensemble
2023. Gpt-ner: Named entity recognition via large convolutional network for text classification. Neural
language models. arXiv preprint arXiv:2304.10428. Process. Lett., 54(6).
Table 12: Benchmark Dataset
Dataset  Task       # Label  Source   # Train  # Dev   # Test
SST-2    sentiment  2        review   6,920    872     1,821
AGNews   topic      4        news     96,000   24,000  7,600
R8       topic      8        news     4,941    544     2,189
R52      topic      52       news     5,905    627     2,568
MR       sentiment  2        reviews  6,398    710     3,554

Table 13: Dataset Subsets
Dataset  Task       # Label  Source   # Train  # Dev   # Subtest
SST-2    sentiment  2        review   6,920    872     728
AGNews   topic      4        news     96,000   24,000  760
R8       topic      8        news     4,941    544     875
R52      topic      52       news     5,905    627     1,027
MR       sentiment  2        reviews  6,398    710     888

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022a. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28.
Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022b. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493.
Zexuan Zhong, Dan Friedman, and Danqi Chen. 2021. Factual probing is [mask]: Learning vs. learning to recall. arXiv preprint arXiv:2104.05240.

A Dataset

We evaluate on five widely used benchmarks: SST-2 (Socher et al., 2013), R8, R52, AGNews (Zhang et al., 2015) and MR (Movie Review) (Pang and Lee, 2005).

• SST-2: The original data in SST-2 are sampled from snippets of Rotten Tomatoes HTML files. We use the same train/dev/test splits as Socher et al. (2013).
• R8 and R52: R8 and R52 (available at https://www.cs.umb.edu/~smimarog/textmining/datasets/) are two subsets of the Reuters collection, containing 8 and 52 categories, respectively. The R8 dataset is composed of 5,485 documents for training and 2,189 documents for testing. The R52 dataset is composed of 6,532 training and 2,568 test documents.
• AGNews: AGNews consists of news articles from the AG's corpus. The dataset contains 30,000 training and 1,900 test examples per class.
• MR (Movie Review): MR contains movie reviews labeled as having positive or negative sentiment. The corpus has 10,662 reviews. We follow Tang et al. (2015) and use the same train/test split.

B Hyper-parameters

B.1 Fine-tuning Hyper-parameters
We fine-tune RoBERTa and RoBERTa-GCN on 4 NVIDIA 3090 GPUs with FP16. Model hyper-parameters are tuned on the validation set: the learning rate is selected from {2e-5, 3e-5, 4e-5} and the batch size from {16, 32, 32}, with a dropout rate of 0.3, a weight decay of 0.01, and a warmup proportion of 0.01.

B.2 The influence of hyper-parameters
We investigate the effect of generation hyper-parameters, including the temperature and the frequency penalty. We conduct experiments with InstructGPT-3 on the SST-2 dataset.

Temperature: The temperature τ controls the diversity of the generated text when the nucleus-sampling parameter top_p is set to 1: the higher τ is, the more variety is introduced. When τ is close to 0, the model produces the same result as greedy decoding. To examine the effect of τ, we vary it from 0 to 1.0; results are shown in the table below. We tokenize the response text with the GPT tokenizer (https://platform.openai.com/tokenizer) and count the number of tokens.

τ      SST-2 Accuracy
0.0    96.39
0.2    96.48
0.4    96.40
0.6    96.59
0.8    96.68
1.0    96.70
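For concreteness, the fine-tuning setup in B.1 can be realized roughly as follows. This is a minimal sketch that assumes the HuggingFace Transformers Trainer and a roberta-large checkpoint; the framework, checkpoint name, epoch count, and output directory are illustrative assumptions rather than details reported in the paper.

```python
# Minimal sketch of the B.1 fine-tuning configuration (assumes HuggingFace Transformers;
# checkpoint name, epochs and paths are illustrative, not taken from the paper).
from transformers import (AutoTokenizer, RobertaForSequenceClassification,
                          Trainer, TrainingArguments)

def make_trainer(train_dataset, dev_dataset, num_labels,
                 learning_rate=2e-5, batch_size=16):
    tokenizer = AutoTokenizer.from_pretrained("roberta-large")
    model = RobertaForSequenceClassification.from_pretrained(
        "roberta-large",
        num_labels=num_labels,
        hidden_dropout_prob=0.3,          # dropout rate reported in B.1
    )
    args = TrainingArguments(
        output_dir="./checkpoints",        # placeholder
        learning_rate=learning_rate,       # searched over {2e-5, 3e-5, 4e-5}
        per_device_train_batch_size=batch_size,
        weight_decay=0.01,
        warmup_ratio=0.01,
        fp16=True,                         # FP16 training, as in B.1
        num_train_epochs=10,               # placeholder; not reported
        evaluation_strategy="epoch",       # monitor the dev set each epoch
    )
    return Trainer(model=model, args=args, train_dataset=train_dataset,
                   eval_dataset=dev_dataset, tokenizer=tokenizer)
```

In this setup, the reported grid search amounts to sweeping learning_rate and batch_size over the sets above and keeping the configuration with the best dev accuracy.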
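The temperature sweep in B.2 can be scripted along the following lines. This is a sketch rather than the authors' evaluation code: query_llm is a hypothetical wrapper around the InstructGPT-3 completion endpoint, and only the sweep over τ (with top_p fixed to 1) and the token counting with tiktoken are shown.

```python
# Sketch of the B.2 temperature sweep on SST-2 (query_llm is a hypothetical wrapper
# around the completion API; only the sweep and token counting are illustrated).
import tiktoken

def query_llm(prompt: str, temperature: float, top_p: float = 1.0) -> str:
    """Hypothetical stand-in for an InstructGPT-3 completion call."""
    raise NotImplementedError

def sweep_temperature(prompts, gold_labels,
                      temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    enc = tiktoken.get_encoding("p50k_base")   # GPT-3-era tokenizer
    results = {}
    for tau in temperatures:
        correct, total_tokens = 0, 0
        for prompt, gold in zip(prompts, gold_labels):
            response = query_llm(prompt, temperature=tau)
            total_tokens += len(enc.encode(response))    # count response tokens
            pred = "Positive" if "positive" in response.lower() else "Negative"
            correct += int(pred == gold)
        results[tau] = {"accuracy": correct / len(gold_labels),
                        "avg_response_tokens": total_tokens / len(gold_labels)}
    return results
```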
SST-2 : positive/negative sentiment analysis
Label Word Map {0: Negative, 1: Positive}
Zero-Shot
Classify Prompt: Please classify the overall SENTIMENT polarity of the INPUT sentence as Positive or Negative.
INPUT: <sent>
SENTIMENT:
Reason-Classify Prompts: Please classify the overall SENTIMENT polarity of the INPUT sentence as Positive or Negative.
INPUT: <sent>

Findclue-Reason-Classify Step 1:
Please classify the overall SENTIMENT polarity of the INPUT sentence as Positive or Negative.
INPUT: <sent>

Step 2:
Please classify the overall SENTIMENT polarity of the INPUT sentence as Positive or Negative.
INPUT: <sent>
CLUES: <step-1-response>

Few-Shot
Classify Prompt: Please classify the overall SENTIMENT polarity of the INPUT sentence as Positive or Negative.
INPUT: <demo-sent>
SENTIMENT: <demo-label-word>

INPUT: <demo-sent>
SENTIMENT: <demo-label-word>

INPUT: <sent>
SENTIMENT:
Reason-Classify Prompts: Step 1:
Classify the sentiment of the input sentence as positive or negative.
INPUT: <demo-sent>

Step 2:
Classify the sentiment of the input sentence as positive or negative.

INPUT: <demo-sent>
REASONING: <step-1-generated>
SENTIMENT: <demo-label-word>

INPUT: <demo-sent>
REASONING: <step-1-generated>
SENTIMENT: <demo-label-word>

INPUT: <test-sent>
Findclue-Reason-Classify Prompts: Step 1:
Classify the sentiment of the input sentence as positive or negative.
INPUT: <demo-sent>

Step 2:
Classify the sentiment of the input sentence as positive or negative.

INPUT: <demo-sent>
REASONING: <step-1-generated>
SENTIMENT: <demo-label-word>

INPUT: <demo-sent>
REASONING: <step-1-generated>
SENTIMENT: <demo-label-word>

INPUT: <test-sent>

Table 14: Examples of SST-2 prompts for the setups in Section 3.
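The few-shot templates above are plain string assemblies. The sketch below shows one way to build the Step-2 prompt of the Reason-Classify / Findclue-Reason-Classify setups from retrieved demonstrations; it is illustrative only, and the demonstration field names (sent, reasoning, label_word) are assumptions rather than identifiers from the paper.

```python
# Illustrative assembly of the few-shot Step-2 prompt from Table 14
# (not the authors' code; demonstration field names are assumed).
SST2_TASK = "Classify the sentiment of the input sentence as positive or negative."

def build_step2_prompt(demos, test_sent, task=SST2_TASK):
    """demos: list of dicts with keys 'sent', 'reasoning' (Step-1 output), 'label_word'."""
    lines = [task, ""]
    for demo in demos:
        lines += [f"INPUT: {demo['sent']}",
                  f"REASONING: {demo['reasoning']}",
                  f"SENTIMENT: {demo['label_word']}",
                  ""]
    lines.append(f"INPUT: {test_sent}")
    return "\n".join(lines)

# Example with two demonstrations:
# prompt = build_step2_prompt(
#     demos=[{"sent": "...", "reasoning": "...", "label_word": "Positive"},
#            {"sent": "...", "reasoning": "...", "label_word": "Negative"}],
#     test_sent="the film is a delight from start to finish .")
```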


R8 : topic classification
Label Word Map {0: Money/Foreign Exchange, 1: Acquisitions, 2: Trade, 3: Interest Rates,
4: Shipping, 5: Earnings and Earnings Forecasts, 6: Grain, 7: Crude Oil}
Zero-Shot
Classify Prompt: Please classify the TOPIC of the INPUT article as Money/Foreign Exchange, Acquisitions, Trade, Interest Rates, Shipping, Earnings and Earnings Forecasts, Grain, or Crude Oil.
INPUT: <sent>
TOPIC:
Reason-Classify Prompts: Please classify the TOPIC of the INPUT article as Money/Foreign Exchange, Acquisitions, Trade, Interest Rates, Shipping, Earnings and Earnings Forecasts, Grain, or Crude Oil.
INPUT: <sent>

Findclue-Reason-Classify Step 1:
Please classify the TOPIC of the INPUT article as Money/Foreign Exchange, Acquisitions, Trade, Interest Rates, Shipping, Earnings and Earnings Forecasts, Grain, or Crude Oil.
INPUT: <sent>

Step 2:
Please classify the TOPIC of the INPUT article as Money/Foreign Exchange, Acquisitions, Trade, Interest Rates, Shipping, Earnings and Earnings Forecasts, Grain, or Crude Oil.
INPUT: <sent>
CLUES: <step-1-response>

Few-Shot
Classify Prompt: Please classify the TOPIC of the INPUT article as Money/Foreign Exchange, Acquisitions, Trade, Interest Rates, Shipping, Earnings and Earnings Forecasts, Grain, or Crude Oil.
INPUT: <demo-sent>
TOPIC: <demo-label-word>

INPUT: <demo-sent>
TOPIC: <demo-label-word>

INPUT: <sent>
TOPIC:
Reason-Classify Prompts: Step 1:
Classify the topic of the input article as one of: Money/Foreign Exchange, Acquisitions, Trade, Interest Rates, Shipping, Earnings and Earnings Forecasts, Grain, or Crude Oil.
INPUT: <demo-sent>

Step 2:
Classify the topic of the input article as one of: Money/Foreign Exchange, Acquisitions, Trade, Interest Rates, Shipping, Earnings and Earnings Forecasts, Grain, or Crude Oil.

INPUT: <demo-sent>
REASONING: <step-1-generated>
TOPIC: <demo-label-word>

INPUT: <demo-sent>
REASONING: <step-1-generated>
TOPIC: <demo-label-word>

INPUT: <test-sent>
Findclue-Reason-Classify Prompts: Step 1:
Classify the topic of the input article as one of: Money/Foreign Exchange, Acquisitions, Trade, Interest Rates, Shipping, Earnings and Earnings Forecasts, Grain, or Crude Oil.
INPUT: <demo-sent>

Step 2:
Classify the topic of the input article as one of: Money/Foreign Exchange, Acquisitions, Trade, Interest Rates, Shipping, Earnings and Earnings Forecasts, Grain, or Crude Oil.

INPUT: <demo-sent>
REASONING: <step-1-generated>
TOPIC: <demo-label-word>

INPUT: <demo-sent>
REASONING: <step-1-generated>
TOPIC: <demo-label-word>

INPUT: <test-sent>

Table 15: Examples of R8 prompts for the setups in Section 3.
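Because the LLM returns label words rather than class ids for R8, the Label Word Map at the top of Table 15 has to be inverted when scoring predictions. The snippet below is a small post-processing sketch (an assumption about how responses can be parsed, not code from the paper).

```python
# Sketch: map a generated R8 label word back to its class id
# (post-processing assumption; the fallback class is arbitrary, not from the paper).
R8_LABEL_WORDS = {
    0: "Money/Foreign Exchange", 1: "Acquisitions", 2: "Trade", 3: "Interest Rates",
    4: "Shipping", 5: "Earnings and Earnings Forecasts", 6: "Grain", 7: "Crude Oil",
}
WORD_TO_ID = {word.lower(): idx for idx, word in R8_LABEL_WORDS.items()}

def parse_r8_label(response: str, fallback: int = 0) -> int:
    """Return the id of the first label word that appears in the LLM response."""
    text = response.lower()
    for word, idx in WORD_TO_ID.items():
        if word in text:
            return idx
    return fallback  # no label word found; fall back to an arbitrary default
```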


MR : positive/negative sentiment analysis
Label Word Map {0: Negative, 1: Positive}
Zero-Shot
Classify Prompt: Please classify the overall SENTIMENT polarity of the INPUT sentence as Positive or Negative.
INPUT: <sent>
SENTIMENT:
Reason-Classify Prompts: Please classify the overall SENTIMENT polarity of the INPUT sentence as Positive or Negative.
INPUT: <sent>

Findclue-Reason-Classify Step 1:
Please classify the overall SENTIMENT polarity of the INPUT sentence as Positive or Negative.
INPUT: <sent>

Step 2:
Please classify the overall SENTIMENT polarity of the INPUT sentence as Positive or Negative.
INPUT: <sent>
CLUES: <step-1-response>

Few-Shot
Classify Prompt: Please classify the overall SENTIMENT polarity of the INPUT sentence as Positive or Negative.
INPUT: <demo-sent>
SENTIMENT: <demo-label-word>

INPUT: <demo-sent>
SENTIMENT: <demo-label-word>

INPUT: <sent>
SENTIMENT:
Reason-Classify Prompts: Step 1:
Classify the sentiment of the input sentence as positive or negative.
INPUT: <demo-sent>

Step 2:
Classify the sentiment of the input sentence as positive or negative.

INPUT: <demo-sent>
REASONING: <step-1-generated>
SENTIMENT: <demo-label-word>

INPUT: <demo-sent>
REASONING: <step-1-generated>
SENTIMENT: <demo-label-word>

INPUT: <test-sent>
Findclue-Reason-Classify Prompts: Step 1:
Classify the sentiment of the input sentence as positive or negative.
INPUT: <demo-sent>

Step 2:
Classify the sentiment of the input sentence as positive or negative.

INPUT: <demo-sent>
REASONING: <step-1-generated>
SENTIMENT: <demo-label-word>

INPUT: <demo-sent>
REASONING: <step-1-generated>
SENTIMENT: <demo-label-word>

INPUT: <test-sent>

Table 16: Examples of MR prompts for the setups in Section 3.
