
Data Generation Using Large Language Models for Text Classification:

An Empirical Case Study

Yinheng Li 1 Rogerio Bonatti 1 Sara Abdali 1 Justin Wagle 1 Kazuhito Koishida 1

1 Microsoft Corporation, Redmond, WA, USA. Correspondence to: Kazuhito Koishida <[email protected]>.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Using Large Language Models (LLMs) to generate synthetic data for model training has become increasingly popular in recent years. While LLMs are capable of producing realistic training data, the effectiveness of data generation is influenced by various factors, including the choice of prompt, task complexity, and the quality, quantity, and diversity of the generated data. In this work, we focus exclusively on using synthetic data for text classification tasks. Specifically, we use natural language understanding (NLU) models trained on synthetic data to assess the quality of synthetic data from different generation approaches. This work provides an empirical analysis of the impact of different factors and offers recommendations for better data generation practices.

1. Introduction

Data augmentation is a method that uses existing data to generate additional training data without collecting more data (Feng et al., 2021). It is an effective way to improve model performance when limited data is available (Xie et al., 2020). With the emergence of large language models, data augmentation has become even more accessible and has been successfully applied to training language models (Gunasekar et al., 2023; Liu et al., 2024).

Using LLMs to generate or annotate data is a cost-efficient alternative to human-labeled data. While human-labeled data tends to have higher quality, leveraging LLMs with well-designed prompts can also produce data that achieves comparable model performance at a much lower cost. As estimated in (Ding et al., 2023), labeling 3,000 samples for the SST-2 task (Socher et al., 2013) would cost between 221 and 300 USD and take around 1,000 minutes; in contrast, generating the same amount of data with GPT-3 costs only 14.37 USD and takes 46 minutes. With only 6,000 samples generated by GPT-3, the model achieves 76% accuracy, compared to 88% with human-curated data.

Our research focuses on synthetic data generation using large language models (LLMs) for text classification tasks, specifically tasks that use natural language understanding models with a transformer encoder architecture. In the scope of this study, we use the terms data augmentation and data generation interchangeably: LLMs often require a few in-context samples to generate data, and the data produced in this way can be considered augmented from these in-context samples. We also focus solely on tasks that have limited or no data at all, as our experiments have shown that tasks with sufficient data receive minimal improvement from additional synthetic data. Numerous studies have proposed frameworks to improve the quality of synthetic data generation (Wang et al., 2023; Gao et al., 2023; Gupta et al., 2023). However, to the best of our knowledge, few works have addressed the fundamental questions associated with using LLMs for data generation. These questions include:

• What is the optimal amount of data to generate, and does increasing the volume of synthetic data improve model performance?

• Can in-context learning (generation) enhance the quality of synthetic data? Does providing a few examples lead to higher-quality data than zero-shot generation?

• Does the LLM's performance on a particular task directly influence the quality of the synthetic data it generates for that task?

• Is combining synthetic data with raw data beneficial for model training?

• Is the diversity of the synthetic data an important factor for model performance?

We experimented with six common NLP tasks (Table 1) and different data generation methods. We found it very challenging to pinpoint definitive answers to the questions above that apply universally to all NLP tasks, due to their inherent differences. Nevertheless, the findings from the six tasks offer valuable insights into practical data generation techniques.

2. Related Work

Data Augmentation The goal of data augmentation is to increase the diversity of existing data by exposing the model to unseen data. This method has been applied to many domains in computer vision (Yang et al., 2023) and natural language processing (Li et al., 2022). In (Feng et al., 2021), augmentation techniques are categorized into rule-based generation and model-based generation. Rule-based generation is used in computer vision tasks and includes image transformations such as rotation, flipping, and cropping (Mikołajczyk & Grochowski, 2018), while model-based generation has been widely used in natural language processing tasks, such as rephrasing and back translation (Kumar et al., 2019; Yang et al., 2020; Cai et al., 2020; Ye et al., 2022; Okur et al., 2022b).

Large Language Models (LLMs) With the development of large language models, model-based data augmentation for NLP has become trivial (Zhou et al., 2024). By instructing an LLM with a proper prompt, it can generate new examples in human-like text. While this is easy to implement, the synthetic data generated by an LLM is usually noisy and has a different distribution from the raw data, which hampers training performance. A considerable body of work has explored ways to deal with this issue. The work of (Veselovsky et al., 2023) uses techniques such as grounding, providing a taxonomy, and filtering to ensure the quality of LLM-generated synthetic data. Synthesis Step by Step (Wang et al., 2023) iteratively refines the prompt based on misclassified golden data to reduce the gap between the synthesized data distribution and the gold distribution. SunGen (Gao et al., 2023) uses a weighted loss to reduce the impact of noise from synthetic data during training.

3. Methods

We follow the workflow in Figure 1 for our experiments and explore the following in-context data generation methods. The term "in-context generation" refers to using an LLM to generate training data given a specific context, similar to in-context learning (Brown et al., 2020). The methods we investigate can be categorized as follows:

• Zero-shot in-context generation: Provide the task description in the prompt and ask the LLM to generate a similar example.

• One-shot in-context generation: Provide the task description and one example, prompting the LLM to generate a similar example.

• Few-shot in-context generation: Provide the task description and a few examples, prompting the LLM to generate a similar example.

Inspired by the work of (Yu et al., 2023), we also experiment with an additional method called zero-shot topic in-context generation:

• Zero-shot topic in-context generation: Use the LLM to generate a list of topics (see Appendix A). Provide the task description and sample one topic from the list to prompt the LLM to generate a similar example.

To evaluate the success of synthetic data generation, we train an NLU model on the synthetic data and assess its performance on the task's validation set. We then compare the performance of the model trained on synthetic data with that of the model trained on the original data. Following the practice established in previous work (Li et al., 2023), we consider the generated data to be better if it results in better model performance.

4. Experiments

In our experiments, GPT-3.5 Turbo (version 2024-02-15 preview, accessed from Azure OpenAI Studio) is used for all data generation except topic generation (see Appendix A). Although more powerful models such as GPT-4 are available, we chose GPT-3.5 Turbo due to resource constraints, especially since we needed to run a large number of inferences for our data generation experiments. Overall, GPT-3.5 Turbo is a well-rounded model with competitive performance across multiple benchmarks (Liang et al., 2023). It would be interesting to compare the quality of synthetic data generated by different LLMs, which we plan to explore in the future.

Existing work (Gupta et al., 2023) has used common NLP benchmarks, such as SuperGLUE (Wang et al., 2019), as evaluation tasks, or has employed a customized selection of existing benchmarks (Gao et al., 2023; Ye et al., 2022).

We select six common tasks for evaluation: SST-2 (Socher et al., 2013; Wang et al., 2019), Twitter Emotion Classification (EMO) (Saravia et al., 2018), New York Times News Classification (NYT) (Stefano, 2021), Review (Amazon Review Classification) (Keung et al., 2020), RTE (Recognizing Textual Entailment) (Bentivogli et al., 2009; Wang et al., 2019), and BoolQ (Clark et al., 2019; Wang et al., 2019). The goal is to select diverse tasks that represent a wide range of popular NLP corpora (Table 1).


[Figure 1: workflow diagram. For a low-resource task, the original training data supplies in-context examples; an LLM generates synthetic data via zero-shot generation, zero-shot topic generation (using random topics produced by the LLM), and one/few-shot in-context generation; the synthetic data is combined with the original training data into an augmented training set used to train the classification model (RoBERTa).]

Figure 1. Pipeline for Data Augmentation using LLM
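To make the workflow in Figure 1 concrete, the following is a minimal sketch of the class-conditioned zero-shot generation step for an SST-2-style sentiment task, assuming the OpenAI Python SDK (v1-style chat completions); the paper ran GPT-3.5 Turbo through Azure OpenAI, so the client, file names, and sampling settings here are illustrative assumptions rather than the exact setup. The prompt string follows the SST-2 zero-shot template in Appendix C.

```python
# Minimal sketch of the Figure 1 generation step (assumptions: OpenAI Python SDK v1,
# an SST-2-style binary sentiment task, and a simple class-conditioned prompt).
import json
import random
from openai import OpenAI  # assumed client; the paper used Azure OpenAI instead

client = OpenAI()  # reads OPENAI_API_KEY from the environment
LABELS = ["positive", "negative"]

def generate_example(label: str) -> dict:
    """Ask the LLM for one synthetic example whose label is fixed in the prompt."""
    prompt = f"Please generate a sentence that contains a {label} sentiment. Sentence:"
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    return {"text": resp.choices[0].message.content.strip(), "label": label}

def generate_dataset(n: int = 1000) -> list[dict]:
    """Generate n synthetic examples with labels sampled uniformly at random."""
    return [generate_example(random.choice(LABELS)) for _ in range(n)]

if __name__ == "__main__":
    synthetic = generate_dataset(n=10)  # small test run; the paper used 1,000 per task
    with open("synthetic_sst2.jsonl", "w") as f:
        for row in synthetic:
            f.write(json.dumps(row) + "\n")
```

The one-shot and few-shot variants differ only in that the prompt also embeds one or more raw examples, and the zero-shot topic variant prepends a topic sampled from an LLM-generated list (Appendices A and C).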

Additionally, we try to include challenging tasks for which current NLU models do not perform well when provided with limited training data. Therefore, we do not use the entire GLUE benchmark, as models like BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019) can easily achieve high accuracy on such tasks. We also do not use the complete SuperGLUE task collection, since some of its tasks require token-level classification. In this work, we focus on single-sequence and sequence-pair classification tasks. The six selected tasks cover common web data, such as news and Wikipedia, as well as popular user data, like Twitter posts, movie reviews, and product reviews. They span binary classification, multi-class classification, and question-answering tasks.

For the evaluation metric, the default is accuracy, but we use F1 or Macro-F1 to measure performance, since these metrics provide a more balanced and comprehensive assessment of classification performance, taking into account both precision and recall, especially for multi-class classification tasks. In our experiments, RoBERTa is selected as the NLU model for all tasks, as it is a commonly used benchmark model for these tasks.

We experiment with five in-context generation methods for each task: zero-shot, zero-shot topic, one-shot, few-shot with 3 examples, and few-shot with 5 examples. The prompts used for generation can be found in Appendix C.

We generate 1,000 synthetic data points per task, as we found that the benefit of additional synthetic data diminishes after that. To simulate a low-resource setting, we allow only 100 raw examples to be used for one-shot and few-shot generation. For zero-shot topic generation, we generate 500 random topics related to the task domain. Details can be found in Appendix A.

5. Key Findings

In this section, we present the key findings from our experiments.

5.1. Mixing Raw Data is Necessary

To assess the effectiveness of data augmentation, we train models with pure synthetic data and with augmented data. For the augmented setting, 100 raw data points are mixed with 1,000 synthetic data points. In the data generation stage, we use only the same 100 raw data points used for in-context generation, to prevent the model from accessing additional data. As shown in Figure 2, we observe significant improvements across all tasks for most prompting methods when incorporating raw data into training. Even as few as 100 raw data points can boost performance compared to using synthetic data alone.

5.2. Impact of Bias

In the BoolQ task, we found that the zero-shot generation method outperforms the other methods, which contrasts with the results obtained for the rest of the tasks. This finding is intriguing, since zero-shot data exhibits the highest repetition rate, which is detrimental to model training. Upon further examination, we noticed that only in the datasets generated using one-shot or few-shot methods do terms like "not," "significant," "only," "just," "few," and "little" frequently appear in the generated questions. These terms create a tone that can be used to imply the answer to the question (which is often False). Table 2 provides examples of such trivial questions, and Table 3 provides statistics for these questions under the different prompting methods.


Corpus | Training Size | Test Size | Task | Metrics | Domain
SST-2 | 67k | 1.8k | Binary Classification | F1 | Movie Reviews
EMO | 16k | 2k | Multi-class Classification | Macro-F1 | Twitter
NYT | 256k | 3k | Multi-class Classification | Macro-F1 | News
Review | 200k | 5k | Multi-class, Ordinal Regression | Macro-F1 | Amazon Review
RTE | 2.5k | 3k | Pair Classification, Question Answering | Macro-F1 | News, Wikipedia
BoolQ | 16k | 3.2k | Pair Classification, Question Answering | Macro-F1 | News, Wikipedia, Web Query

Table 1. Summary of datasets and tasks.
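As described in Section 4, the datasets in Table 1 are evaluated by fine-tuning RoBERTa on the generated (or augmented) data and scoring it with F1/Macro-F1. The sketch below illustrates that evaluation step with the Hugging Face transformers, datasets, and scikit-learn libraries; the hyperparameters, file names, and label count are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the evaluation step: fine-tune RoBERTa on (synthetic or augmented) data
# and report Macro-F1 on the task's validation split. Hyperparameters are assumptions.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# "train.jsonl" is assumed to hold {"text": ..., "label": <int>} rows produced earlier;
# "validation.jsonl" holds the task's original validation split in the same format.
ds = load_dataset("json", data_files={"train": "train.jsonl",
                                      "validation": "validation.jsonl"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

ds = ds.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"macro_f1": f1_score(labels, preds, average="macro")}

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=ds["train"],
                  eval_dataset=ds["validation"], tokenizer=tokenizer,
                  compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())
```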

Did the Mars Exploration Rover mission only involve one rover? – False
Did scientists in the 20th century make no significant discoveries or advancements? – False

Table 2. Examples of trivial questions – questions containing terms such as "not," "significant," "only," "just," "few," and "little".

We hypothesize that this pattern introduces bias into model training by encouraging the model to search for specific keywords in the question rather than reading the passage. To test this hypothesis, we instruct the LLM to rephrase each synthetic question to read like "what people would search online" (see Appendix B). We found that performance significantly improved for the zero-shot topic and one-shot methods after rephrasing. The work of (Okur et al., 2022a) has also shown the effectiveness of paraphrasing in other data augmentation techniques.

Although we only detected synthetic bias in the BoolQ task, it remains an important factor to consider during data generation. The rephrasing technique might not be applicable to other cases, but ensuring that synthetic data does not contain unwanted patterns is necessary.

For all remaining experiments, the results for the BoolQ task are reported under the question rephrasing setting unless otherwise specified.

5.3. Relationship between LLM Performance and Data Quality

While it may seem intuitive that the effectiveness of using LLMs to generate data for model training depends on the LLM's knowledge of a specific task, our research shows that this is not always the case. The zero-shot or few-shot performance of an LLM on a task does not necessarily determine the performance of a model (specifically, the RoBERTa model used in our experiments) trained with data generated by the LLM. In other words, the fact that an LLM performs well on a task does not guarantee that models finetuned with data generated by the LLM will also perform well. Additionally, for tasks where the LLM performs poorly, models finetuned on the synthetic data generated by the LLM can actually outperform the LLM itself. The former scenario could be due to the fact that the ability of an LLM to generate good examples for a task does not always correspond to its ability to solve the task itself. The latter scenario is also plausible, as an LLM may be proficient at generating examples with a given label, but not as good at predicting the label given the task itself.

The results of our experiment can be found in Table 4. For each task, we prompted the LLM (GPT-3.5-turbo) with zero/one/three/five-shot learning and report the best performance achieved across all in-context learning methods. We did not optimize the prompt or use any advanced prompting methods in our evaluation of the LLM, and it is possible that the LLM could achieve better performance with more advanced prompting techniques. However, the results obtained from the most basic in-context learning methods (see Appendix D) do provide valuable insight into this question.

For the SST-2, BoolQ, NYT, and Review tasks, we found a performance gap of 10-15% between the LLM's in-context learning performance on the task and the fine-tuned language model (RoBERTa) trained on synthetic data. For the RTE and EMO tasks, the LLM does not perform well, but the data generated by the LLM leads to much better performance. Therefore, even for tasks that LLMs struggle to solve, using LLM-generated synthetic data can still be helpful.

5.4. Synthetic Data is Helpful Mostly in Low-Resource Settings

Previous work has shown that it is challenging for models trained with synthetic data to perform as well as models trained with the same amount of original data (Li et al., 2023; Ding et al., 2023). However, when human-annotated data is limited, synthetic data augmentation can improve model performance. In fact, this technique is most effective in low-resource settings. For all tasks with 100 raw data points, we found that synthetic data augmentation yields improvements from 3% to 26%.


[Figure 2: bar charts of F1 scores per task (BoolQ, NYT, EMO, Review, RTE, SST-2) for each prompting method, comparing the synthetic-only and augmented settings.]
Figure 2. Performance of different prompting methods with and without augmentation. Synthetic only: use 1000 synthetic data only.
Augmented: 1000 synthetic data plus 100 raw data
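Concretely, the "augmented" bars in Figure 2 correspond to simply pooling the 100 raw examples (the same ones used as in-context examples during generation) with the 1,000 synthetic examples before fine-tuning. A minimal sketch of that mixing step, assuming JSONL files with text/label fields:

```python
# Sketch of the augmented setting from Section 5.1: mix the same 100 raw examples
# used for in-context generation with 1,000 synthetic examples. File names are assumed.
import json
import random

def read_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

raw = read_jsonl("raw_100.jsonl")               # the 100 human-labeled examples
synthetic = read_jsonl("synthetic_1000.jsonl")  # LLM-generated examples

augmented = raw + synthetic
random.shuffle(augmented)  # interleave raw and synthetic examples before training

with open("train_augmented.jsonl", "w") as f:
    for row in augmented:
        f.write(json.dumps(row) + "\n")
```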

When the raw training data increases from 100 to 1,000 points, only four tasks show improvements, and those improvements are less than 5% (Figure 3). There are no established rules for determining what amount of raw data counts as low-resource. For all six tasks in our experiment, 1,000 data points represent a small portion of the training data, and we found that the model continues to improve as we increase the amount of raw training data. However, the performance gain obtained from increasing training data also depends on other factors, such as task and model complexity. Based on this observation, we treat 100 raw data points as the low-resource setting, which is used as the default augmented setting in all experiments.

5.5. A Comparison Between Different Prompting Methods

In the synthetic-data-only setting, the one-shot or zero-shot topic methods rank in the top two for all tasks except the Review task (Figure 2). In the augmented setting, the few-shot and zero-shot topic generation methods demonstrate good performance across all tasks. In the BoolQ, EMO, and RTE tasks, zero-shot topic methods outperform the other prompting methods. In the SST-2 and NYT tasks, few-shot generation methods perform best. The performance of zero-shot methods is sub-optimal across all tasks.

Of the five prompting methods we experimented with, zero-shot topic generation typically produces the most diverse dataset, because different topics are sampled each time during generation. Zero-shot methods generate the least diverse dataset, as the prompt remains the same for each generation. One-shot and few-shot methods also generate repeated examples, due to the limited pool of in-context examples. We found that, for most tasks, a diverse dataset tends to benefit model training.

As shown in Figure 2, in the non-augmented setting zero-shot generation shows the worst performance for RTE, EMO, Review, and SST-2, while zero-shot topic generation outperforms the other methods (or is at least comparable to them) for the BoolQ, NYT, RTE, and EMO tasks. This effect does not appear on all tasks, as there may be other factors that impact model performance. Meanwhile, the effect of diversity diminishes when we mix synthetic data with raw data. Therefore, training with both raw data and synthetic data can help when the synthetic data is less diverse.

While they do not generate the most diverse datasets, one-shot and few-shot generation methods typically help LLMs better understand the task description and generate examples similar to the original examples (Li, 2023; Song et al., 2022). In the EMO and Review tasks, we observe an advantage of few-shot generation over the other prompting methods. We suspect this is because both tasks are more subjective than the rest: EMO contains Twitter posts, and the Review task is made up of customer reviews and ratings.


Method | Trivial Q. Count: Raw | Trivial Q. Count: Rephrased | F1: Raw (SD) | F1: Rephrased (SD) | F1: Raw (AD) | F1: Rephrased (AD)
Zero-Shot Topic | 230 | 208 | 0.19 | 0.77 | 0.75 | 0.77
One-Shot | 131 | 74 | 0.38 | 0.74 | 0.76 | 0.77
Few-Shot (3 ex.) | 90 | 30 | 0.55 | 0.51 | 0.70 | 0.72
Few-Shot (5 ex.) | 57 | 28 | 0.53 | 0.48 | 0.75 | 0.73
Zero-Shot | 11 | - | 0.71 | - | 0.73 | -
Raw Data | 31 | - | - | 0.768 | - | -

Table 3. BoolQ trivial-question counts and F1 score comparison. SD: 1000 synthetic data points. AD: 100 raw data points plus 1000 synthetic data points. Raw Data: a model trained on 1000 raw data points only, without question rephrasing; this score is used as a baseline.
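The trivial-question counts in Table 3 can be approximated by a simple keyword scan over the generated BoolQ questions; the sketch below illustrates one such check (the file name, field name, and tokenization are assumptions).

```python
# Sketch: count BoolQ questions containing the hedge/restrictor terms from Table 2,
# which tend to leak the (usually False) answer. The input file name is an assumption.
import json
import re

TRIVIAL_TERMS = {"not", "significant", "only", "just", "few", "little"}

def is_trivial(question: str) -> bool:
    tokens = set(re.findall(r"[a-z']+", question.lower()))
    return bool(tokens & TRIVIAL_TERMS)

with open("boolq_synthetic.jsonl") as f:
    questions = [json.loads(line)["question"] for line in f]

count = sum(is_trivial(q) for q in questions)
print(f"{count} of {len(questions)} generated questions contain trivial-question terms")
```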

Task | GPT-3.5-turbo | RoBERTa on Synthetic Data | RoBERTa on Augmented Data
SST-2 | 0.956 | 0.845 | 0.874
BoolQ | 0.870 | 0.641 | 0.742
NYT | 0.729 | 0.604 | 0.742
Review | 0.603 | 0.475 | 0.527
RTE | 0.345 | 0.574 | 0.653
EMO | 0.300 | 0.404 | 0.568

Table 4. LLM performance vs. models trained on synthetic data across the 6 tasks. RoBERTa scores are the average F1 over the 5 prompting methods under (1) Synthetic Data (1000 synthetic data points) and (2) Augmented Data (1000 synthetic data points + 100 raw data points).

[Figure 3: bar chart of F1 per task (BoolQ, EMO, NYT, Review, RTE, SST-2) for four settings: raw data (100), augmented (100), raw data (1000), augmented (1000).]

Figure 3. Improvement with different raw data amounts. raw data (x): only x raw data points are used for training. augmented (x): x raw data points plus 100 synthetic data points are used. The augmented F1 score is the average model performance over the data generated by the 5 different prompting methods.

5.6. Synthetic Data Diversity and Similarity to Raw Data

In this section, we examine the diversity of our training data using inter-sample semantic similarity. To calculate this similarity, we use the vector embeddings proposed in (Reimers & Gurevych, 2019) and average the similarity score across all example pairs, following (Yu et al., 2023). Figure 4 displays the inter-sample similarity for each task, comparing the data generated by the five prompting methods. On the x-axis, we show the performance of the finetuned model trained on the 1,000 synthetic data points only. Figure 4 shows that for BoolQ, NYT, and SST-2, lower inter-sample similarity results in a better F1 score. However, for the other tasks the correlation is weak, due to the existence of outliers (especially for RTE) and the possible impact of other factors, such as task complexity.

We also calculated the similarity between the synthetic data and the actual raw data using the same method, and found that the synthetic data generated by the five different prompting methods had similar similarity scores to the raw data. However, it is not clear whether synthetic data that closely resembles the raw data leads to better model performance. This could be due to the limitations of our similarity measure, which only considers semantic similarity, as discussed in (Steck et al., 2024). Many NLP tasks rely on subtle contextual cues and nuanced wording; in the SST-2 task, for example, changes to wording can affect the sentiment of the text more than contextual semantics. Our measurement also does not account for other aspects of similarity, such as structural or lexical similarity, as discussed in (Wang et al., 2020; Ayeldeen et al., 2014). Lastly, due to the limited number of data points and the potential variation in synthetic data, one should be cautious about generalizing our findings to other tasks or domains.

5.7. Synthetic Data Quantity

We have found that increasing the amount of synthetic data in model training improves its performance.


[Figure 4: scatter plots per task (BoolQ, EMO, NYT, RTE, SST-2, Review) of similarity score (y-axis) versus F1 (x-axis), showing the inter-sample similarity and the similarity between synthetic and raw samples, each with a linear trend line.]

Figure 4. Synthetic Data Similarity
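The inter-sample similarity on the y-axis of Figure 4 is an average pairwise similarity of sentence embeddings (Reimers & Gurevych, 2019). A minimal sketch of that computation, assuming the sentence-transformers package, cosine similarity, and an arbitrary Sentence-BERT checkpoint:

```python
# Sketch of the diversity measure in Section 5.6: average pairwise cosine similarity
# of sentence embeddings over a set of generated texts. The model name is an assumption.
import numpy as np
from sentence_transformers import SentenceTransformer

def inter_sample_similarity(texts: list[str]) -> float:
    model = SentenceTransformer("all-MiniLM-L6-v2")       # any Sentence-BERT checkpoint
    emb = model.encode(texts, normalize_embeddings=True)  # unit-norm vectors
    sim = emb @ emb.T                                     # cosine similarity matrix
    n = len(texts)
    # average over distinct pairs only (exclude the diagonal of self-similarities)
    return float((sim.sum() - n) / (n * (n - 1)))

if __name__ == "__main__":
    sample = ["the movie was great", "a wonderful film", "the plot made no sense"]
    print(inter_sample_similarity(sample))
```

Lower values indicate a more diverse set of generated examples; an analogous cross-set average between synthetic and raw texts gives the synthetic-to-raw similarity discussed above.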

Figure 5 shows the relationship between the model's performance (measured by the F1 score) on the y-axis and the total number of training data points on the x-axis. In the augmented scenario, we mixed 100 raw data points with varying amounts of synthetic data. The performance is the average of the model's F1 score over the 5 prompting methods for each data amount. For the raw data scenario, only real-world data was used in model training. Our graph indicates that the performance obtained from raw data serves as an upper bound for the augmented setting in almost all tasks. Moreover, we observe a diminishing marginal gain from additional training data for both raw and synthetic data. For the BoolQ and SST-2 tasks, we observed this phenomenon at the same data size. As such, the raw data size at which the marginal improvement in model performance appears can be used as a reference point when increasing the amount of synthetic data.

6. Data Generation Techniques in Practice

In the process of using LLMs to generate data for this study, we identified several useful techniques. These practices lack sufficient theoretical support, and their effectiveness can depend on the choice of large language model or the requirements of a specific task.

6.1. Condition on Label

Typically, there are two ways to generate a classification dataset: condition on the label, or left-to-right (see Table 5).

Left-to-right prompt: generate an example text first and then generate its class label.
Class-conditioned prompt: generate an example text where the label must be Class X.

Table 5. Left-to-right prompt vs. class-conditioned prompt.

We recommend conditioning on the label for each generation, as it saves the effort of parsing the label and avoids the LLM generating unknown labels. It also gives the user control over the label distribution in the synthetic dataset.

It is worth noting that class-conditioned generation is more likely to introduce bias and to reduce the difficulty of the synthetic examples. When the class label is visible, the LLM might leak the label information during content generation. In the BoolQ example, the LLM hints at the answer "False" via the use of certain words in the question it generates (e.g., the word "only"). In this case, rephrasing the question with the class label hidden improves performance, which essentially amounts to left-to-right generation.

6.2. Generation on Target Corpus

It is critical to provide topics or descriptions closely related to the use case when generating examples. Ensuring that the topics are relevant to the use case significantly improves the quality of the generated data. For example, when creating examples that resemble Twitter posts, it is beneficial to first generate common topics found on Twitter; when generating Amazon customer reviews, it is effective to generate an Amazon product catalog as a list of potential topics.
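A rough sketch of how the two practices above combine for the Review task: conditioning on the label (Section 6.1) while grounding generation in a target-corpus topic list (Section 6.2). The prompts are abridged from Appendices A and C; the client, model names, and topic-list parsing are illustrative assumptions.

```python
# Sketch combining Sections 6.1 and 6.2 for the Review task: first ask an LLM for a
# product-category topic list, then generate class-conditioned reviews grounded in a
# sampled category. Client, model names, and parsing are illustrative assumptions.
import random
from openai import OpenAI

client = OpenAI()

def chat(system: str, user: str, model: str = "gpt-3.5-turbo") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

# Step 1 (Appendix A, abridged): build a topic list aligned with the target corpus.
catalog = chat(
    system="You are an AI assistant that knows Amazon product categories.",
    user="Please generate 500 amazon different product categories",
)
topics = [t.strip() for t in catalog.replace("\n", ",").split(",") if t.strip()]

# Step 2 (Appendix C, zero-shot topic): class-conditioned review for a sampled topic.
rating = random.choice([1, 2, 3, 4, 5])
prompt = (
    "The Amazon customer review has a rating ranges from 1 to 5, 1 being the lowest "
    f"and 5 being the highest. Please generate a customer review with a rating of "
    f"{rating} for a specific product under {random.choice(topics)}. "
    "Please use a fake product name. Content:"
)
print({"text": chat("You are a helpful assistant.", prompt), "label": rating})
```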


[Figure 5: line plots per task (BoolQ, EMO, NYT, Review, RTE, SST-2) of F1 (y-axis) versus number of training data points (100 to 1100, x-axis) for the raw data and augmented data settings.]

Figure 5. Impact on Synthetic Data Quantity

This approach ensures that the synthetic data is more closely aligned with the target corpus, leading to better performance on classification tasks.

6.3. Iterative Data Generation and Prompt Refinement

Generating synthetic data can be both time-consuming and resource-intensive. To maximize efficiency and ensure high-quality data, we recommend adopting an iterative approach: initially generate a small number of examples and evaluate their quality; if the quality of these initial data points is low, refine the prompt before generating more data. It is unlikely that simply generating more data points with the same prompt will magically produce high-quality data.

7. Conclusion

In this work, we analyzed different factors that influence data generation using LLMs. We found that data generation is most effective in low-resource settings. Increasing the amount of synthetic data does not necessarily lead to continuous improvements in model performance. It is beneficial to combine synthetic data with raw data during training. Additionally, it is crucial to be vigilant for patterns or biases in synthetic data that may hinder model training. Overall, using LLMs for data augmentation has great potential for model training: with a carefully tuned prompt, the data generated by an LLM can achieve performance comparable to human-annotated data, but at a much lower cost.

The domain of data generation for classification tasks is highly complex. Due to the diversity of NLP tasks, it is challenging to find rules that generalize well across all tasks. However, our findings can still serve as a valuable resource for researchers and practitioners looking to use synthetic data for training classification models. For future work, it would be valuable to study the effects of more advanced prompting methods, such as Chain of Thought (Wei et al., 2023), and of LLM hyperparameters, such as temperature, on the quality of synthetic data.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Ayeldeen, H., Hassanien, A. E., and Fahmy, A. A. Lexical similarity using fuzzy euclidean distance. In 2014 International Conference on Engineering and Technology (ICET), pp. 1–6, 2014. doi: 10.1109/ICEngTechnol.2014.7016801.

Bentivogli, L., Dagan, I., Dang, H. T., Giampiccolo, D., and Magnini, B. The fifth PASCAL recognizing textual entailment challenge. 2009.


Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners, 2020.

Cai, H., Chen, H., Song, Y., Zhang, C., Zhao, X., and Yin, D. Data manipulation: Towards effective instance learning for neural dialogue generation via learning to augment and reweight. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6334–6343, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.564. URL https://aclanthology.org/2020.acl-main.564.

Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of NAACL-HLT 2019, 2019.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.

Ding, B., Qin, C., Liu, L., Chia, Y. K., Joty, S., Li, B., and Bing, L. Is gpt-3 a good data annotator?, 2023.

Feng, S. Y., Gangal, V., Wei, J., Chandar, S., Vosoughi, S., Mitamura, T., and Hovy, E. A survey of data augmentation approaches for nlp, 2021.

Gao, J., Pi, R., Lin, Y., Xu, H., Ye, J., Wu, Z., Zhang, W., Liang, X., Li, Z., and Kong, L. Self-guided noise-free data generation for efficient zero-shot learning, 2023.

Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Giorno, A. D., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., Salim, A., Shah, S., Behl, H. S., Wang, X., Bubeck, S., Eldan, R., Kalai, A. T., Lee, Y. T., and Li, Y. Textbooks are all you need, 2023.

Gupta, H., Scaria, K., Anantheswaran, U., Verma, S., Parmar, M., Sawant, S. A., Baral, C., and Mishra, S. Targen: Targeted data generation with large language models, 2023.

Keung, P., Lu, Y., Szarvas, G., and Smith, N. A. The multilingual amazon reviews corpus. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020.

Kumar, A., Bhattamishra, S., Bhandari, M., and Talukdar, P. Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3609–3619, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1363. URL https://aclanthology.org/N19-1363.

Li, B., Hou, Y., and Che, W. Data augmentation approaches in natural language processing: A survey. AI Open, 3:71–90, 2022. ISSN 2666-6510. doi: 10.1016/j.aiopen.2022.03.001. URL https://www.sciencedirect.com/science/article/pii/S2666651022000080.

Li, Y. A practical survey on zero-shot prompt design for in-context learning. In Proceedings of the Conference Recent Advances in Natural Language Processing - Large Language Models for Natural Language Processings, RANLP. INCOMA Ltd., Shoumen, BULGARIA, 2023. doi: 10.26615/978-954-452-092-2_069. URL http://dx.doi.org/10.26615/978-954-452-092-2_069.

Li, Z., Zhu, H., Lu, Z., and Yin, M. Synthetic data generation with large language models for text classification: Potential and limitations, 2023.

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., Newman, B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C., Manning, C. D., Ré, C., Acosta-Navas, D., Hudson, D. A., Zelikman, E., Durmus, E., Ladhak, F., Rong, F., Ren, H., Yao, H., Wang, J., Santhanam, K., Orr, L., Zheng, L., Yuksekgonul, M., Suzgun, M., Kim, N., Guha, N., Chatterji, N., Khattab, O., Henderson, P., Huang, Q., Chi, R., Xie, S. M., Santurkar, S., Ganguli, S., Hashimoto, T., Icard, T., Zhang, T., Chaudhary, V., Wang, W., Li, X., Mai, Y., Zhang, Y., and Koreeda, Y. Holistic evaluation of language models, 2023.

Liu, R., Wei, J., Liu, F., Si, C., Zhang, Y., Rao, J., Zheng, S., Peng, D., Yang, D., Zhou, D., and Dai, A. M. Best practices and lessons learned on synthetic data for language models, 2024.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. Roberta: A robustly optimized bert pretraining approach, 2019.

Mikołajczyk, A. and Grochowski, M. Data augmentation for improving deep learning in image classification problem. In 2018 International Interdisciplinary PhD Workshop (IIPhDW), pp. 117–122, 2018. doi: 10.1109/IIPHDW.2018.8388338.


Okur, E., Sahay, S., and Nachman, L. Data augmentation with paraphrase generation and entity extraction for multimodal dialogue system. In Calzolari, N., Béchet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., Odijk, J., and Piperidis, S. (eds.), Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 4114–4125, Marseille, France, June 2022a. European Language Resources Association. URL https://aclanthology.org/2022.lrec-1.437.

Okur, E., Sahay, S., and Nachman, L. Data augmentation with paraphrase generation and entity extraction for multimodal dialogue system, 2022b.

Reimers, N. and Gurevych, I. Sentence-bert: Sentence embeddings using siamese bert-networks, 2019.

Saravia, E., Liu, H.-C. T., Huang, Y.-H., Wu, J., and Chen, Y.-S. CARER: Contextualized affect representations for emotion recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3687–3697, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1404. URL https://www.aclweb.org/anthology/D18-1404.

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Yarowsky, D., Baldwin, T., Korhonen, A., Livescu, K., and Bethard, S. (eds.), Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://aclanthology.org/D13-1170.

Song, Y., Wang, T., Mondal, S. K., and Sahoo, J. P. A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities, 2022.

Steck, H., Ekanadham, C., and Kallus, N. Is cosine-similarity of embeddings really about similarity? arXiv preprint arXiv:2403.05440v1, 2024.

Stefano, D. D. New york times topics. https://huggingface.co/datasets/dstefa/New_York_Times_Topics, 2021.

Veselovsky, V., Ribeiro, M. H., Arora, A., Josifoski, M., Anderson, A., and West, R. Generating faithful synthetic data with large language models: A case study in computational social science, 2023.

Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint 1905.00537, 2019.

Wang, R., Zhou, W., and Sachan, M. Let's synthesize step by step: Iterative dataset synthesis with large language models by extrapolating errors from small models, 2023.

Wang, Z., Zhang, Y., and Wu, H. Structural-aware sentence similarity with recursive optimal transport, 2020.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models, 2023.

Xie, Q., Dai, Z., Hovy, E., Luong, M.-T., and Le, Q. V. Unsupervised data augmentation for consistency training, 2020.

Yang, S., Xiao, W., Zhang, M., Guo, S., Zhao, J., and Shen, F. Image data augmentation for deep learning: A survey, 2023.

Yang, Y., Malaviya, C., Fernandez, J., Swayamdipta, S., Le Bras, R., Wang, J.-P., Bhagavatula, C., Choi, Y., and Downey, D. Generative data augmentation for commonsense reasoning. In Cohn, T., He, Y., and Liu, Y. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1008–1025, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.90. URL https://aclanthology.org/2020.findings-emnlp.90.

Ye, J., Gao, J., Li, Q., Xu, H., Feng, J., Wu, Z., Yu, T., and Kong, L. Zerogen: Efficient zero-shot learning via dataset generation, 2022.

Yu, Y., Zhuang, Y., Zhang, J., Meng, Y., Ratner, A., Krishna, R., Shen, J., and Zhang, C. Large language model as attributed training data generator: A tale of diversity and bias, 2023.

Zhou, Y., Guo, C., Wang, X., Chang, Y., and Wu, Y. A survey on data augmentation in large model era, 2024.


A. Appendix
Prompts used for topic generation in the zero-shot topic method, along with example LLM outputs. GPT-4 is used to generate 500 random topics per task:

Task Role Message


BoolQ, RTE, NYT, SST-2, Emo System You are an AI assistant that generates random topics. There is no limit
on the number of topics you can generate.
BoolQ, RTE, NYT User Please generate 500 topics
BoolQ, RTE, NYT LLM Output example: The world’s most beautiful sculptures, The role of
technology in modern education ...
SST-2, Emo User Please generate 500 twitter post topics
SST-2, Emo LLM Output example: Lunch break, Online dating ...
Review System You are an AI assistant that knows Amazon product categories. The user
will ask you to generate a list of categories. It is your responsibility to
generate the entire list of categories.
Review User Please generate 500 amazon different product categories
Review LLM Output example: Baby Products, Clothing, Jewelry ...

B. Appendix
Prompt for Question Rephrasing in Section 5.2

Please rephrase the question as if you are typing it in a search engine. Make sure the answer can only be true or false, Input: question
Output:
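Applied to every synthetic BoolQ question, this rephrasing step (Section 5.2) might be wired up roughly as follows; the client, model name, and JSONL field names are illustrative assumptions.

```python
# Sketch of the question-rephrasing step from Section 5.2 / Appendix B: rewrite each
# generated BoolQ question as a search-engine-style query so that wording cues do not
# leak the label. Client and field names are assumptions.
import json
from openai import OpenAI

client = OpenAI()

REPHRASE_PROMPT = (
    "Please rephrase the question as if you are typing it in a search engine. "
    "Make sure the answer can only be true or false, Input: {question} Output:"
)

def rephrase(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": REPHRASE_PROMPT.format(question=question)}],
    )
    return resp.choices[0].message.content.strip()

with open("boolq_synthetic.jsonl") as f:
    rows = [json.loads(line) for line in f]

for row in rows:
    row["question"] = rephrase(row["question"])

with open("boolq_synthetic_rephrased.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```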


C. Appendix
Prompt used for data generation for each task:


Task Prompt Type Prompt


BoolQ zero-shot
Step 1 Please generate a random short passage. Passage:
Step 2 Please generate a True or False question based on the passage.
The answer to the question must be [random([True, False])] Passage:
[passage from step 1] Question:
BoolQ zero-shot topic
Step 1 Please generate a short passage about this topic: [topic sampled
from a topic list] Passage:
Step 2 Please generate a True or False question based on the passage.
The answer to the question must be [random([True, False])] Passage:
[passage from step 1] Question:
BoolQ one-shot
Step 1 Please generate a Passage, a Question and the Label to the
question following this example: [example from raw data: Passage,
Question, Label] Please generate a similar passage. Passage:
Step 2 Please generate a True or False question based on the passage.
The answer to the question must be [label from example in Step 1]
Passage: [passage generated in Step 1] Question:
BoolQ few-shot (3 or 5)
Step 1 Please generate a Passage, a Question and the Label to the
question. Here are some examples: [examples from raw data: Passage,
Question, Label] Please generate a similar example. Make sure the
question is a True or False question and the answer to the question is
[random([True, False])]. Passage:
EMO zero-shot
Step 1 Please generate a twitter post with the emotion of [ran-
dom(label)]. Text:
EMO zero-shot topic
Step 1 Please consider this topic for generation: [topic sampled from
a topic list]. Please generate a twitter post with the emotion of [ran-
dom(label)]. Text:
EMO one-shot
Step 1 The task is to predict the emotion of a twitter post. The emotion
contains six categories: sadness, joy, love, anger, fear, surprise. Here is
an example. Text: [example from raw data] Emotion: [example label
from raw data] Please generate another example for the same emotion.
Text:
EMO few-shot (3 or 5)
Step 1 The task is to predict the emotion of a twitter post. The emotion
contains six categories: sadness, joy, love, anger, fear, surprise. Here
are some examples: [examples: Text, Emotion] Please generate a twitter
post with the emotion of [first label from examples]. Text:


Task Prompt Type Prompt


NYT zero-shot
Step 1 Please generate a news title for [random(label)] category. Head-
line:
NYT zero-shot topic
Step 1 Please consider this sentence for generation: [topic sampled
from topic list]. Please generate a news headline for [random(label)]
category. Headline:
NYT one-shot
Step 1 The task is to predict the topic of a news headline. The topics
contain ’sports’, ’arts, culture and entertainment’, ’business and finance’,
’health and wellness’, ’lifestyle and fashion’, ’science and technology’,
’politics’, ’crime’. Here is an example News: [example news] Topic:
[example topic] Please generate another news on [example topic]. Head-
line:
NYT few-shot (3 or 5)
Step 1 The task is to predict the topic of a news headline. The topics
contain ’sports’, ’arts, culture and entertainment’, ’business and finance’,
’health and wellness’, ’lifestyle and fashion’, ’science and technology’,
’politics’, ’crime’. Here are some examples: [examples: Headline, Topic]
Please generate a news headline for [first topic from examples] category.
News:
Review zero-shot
Step 1 The Amazon customer review has a rating ranges from 1 to 5,
1 being the lowest and 5 being the highest. Please generate a customer
review with a rating of [random(label)]. Content:
Review zero-shot topic
Step 1 The Amazon customer review has a rating ranges from 1 to 5,
1 being the lowest and 5 being the highest. Please generate a customer
review with a rating of [random(label)] for a specific product under [a
product category sampled from topic list]. Please use a fake product
name. Content:
Review one-shot
Step 1 The task is to predict the rating of an Amazon customer review
based on the content. The rating ranges from 1 to 5, 1 being the lowest
and 5 being the highest. Here is a review example. Content: [example
content] Rating: [example rating] Please generate another example for a
similar product. Make sure the rating for the review is [example rating].
Content:
Review few-shot (3 or 5)
Step 1 The Amazon customer review has a rating ranges from 1 to
5, 1 being the lowest and 5 being the highest. Here are some examples
Content: [examples: Content, Rating] Please generate a customer review
with a rating [first rating from examples]. Content:


Task Prompt Type Prompt


RTE zero-shot
Step 1 Given a premise and a hypothesis, a model needs to predict
whether the hypothesis can be logically inferred from the premise. The
response should be either True if the hypothesis can be inferred from
the premise, or False if it cannot be inferred. Here is the output format:
Premise: Hypothesis: Label: True or False Please generate an example
where the Label is [random(label)]. Premise:
RTE zero-shot topic
Step 1 Given a premise and a hypothesis, a model needs to predict
whether the hypothesis can be logically inferred from the premise. The
response should be either True if the hypothesis can be inferred from
the premise, or False if it cannot be inferred. Here is the output format:
Premise: Hypothesis: Label: True or False Please generate an example
about [premise] where the Label is [random(label)]. Premise:
RTE one-shot
Step 1 Given a premise and a hypothesis, a model needs to predict
whether the hypothesis can be logically inferred from the premise. The
response should be either True if the hypothesis can be inferred from
the premise, or False if it cannot be inferred. Here is an example:
Premise: [example premise] Hypothesis: [example hypothesis] Label:
[example label] Please generate another similar example where the Label
is [example label]. Premise:
RTE few-shot (3 or 5)
Step 1 Given a premise and a hypothesis, a model needs to predict
whether the hypothesis can be logically inferred from the premise. The
response should be either True if the hypothesis can be inferred from the
premise, or False if it cannot be inferred. Here are some examples: [ex-
amples: Premise, Hypothesis, Label] Please generate a similar example.
Make sure the label is [first label from examples]. Premise:
SST-2 zero-shot
Step 1 Please generate a sentence that contains a [random(label)]
sentiment. Sentence:
SST-2 zero-shot topic
Step 1 Please consider this topic for generation: [topic from the topic
list]. Please generate a sentence that contains a [random(label)] senti-
ment. Sentence:
SST-2 one-shot
Step 1 The task is to predict whether the following sentence is positive
or negative sentiment. Sentence: [example sentence] Label:[example
label] Please generate a similar example on the same topic, including a
Sentence and a Label. Sentence:
SST-2 few-shot (3 or 5)
Step 1 The task is to predict whether the following sentence is positive
or negative sentiment. [examples: Sentence, Label] Please generate a
similar example, including a Sentence and a Label. Sentence:
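As a worked example of the two-step prompts in the tables above, the sketch below chains the BoolQ zero-shot Step 1 and Step 2 prompts, fixing the answer in advance so the example is class-conditioned; the OpenAI-style client is an illustrative assumption (the paper used Azure OpenAI).

```python
# Sketch of the two-step BoolQ zero-shot generation prompt from the table above:
# Step 1 asks for a passage, Step 2 asks for a True/False question whose answer is
# fixed in advance. The OpenAI-style client is an assumption (the paper used Azure).
import random
from openai import OpenAI

client = OpenAI()

def complete(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def generate_boolq_example() -> dict:
    label = random.choice(["True", "False"])
    passage = complete("Please generate a random short passage. Passage:")
    question = complete(
        "Please generate a True or False question based on the passage. "
        f"The answer to the question must be {label} Passage: {passage} Question:"
    )
    return {"passage": passage, "question": question, "label": label}

if __name__ == "__main__":
    print(generate_boolq_example())
```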


D. Appendix
Prompt used to evaluate LLM performance on each task.


Task Prompt Type Prompt


RTE zero-shot
Step 1 Given a premise and a hypothesis, a model needs to predict
whether the hypothesis can be logically inferred from the premise. The
response should be either True if the hypothesis can be inferred from the
premise, or False if it cannot be inferred. Premise: [premise], Hypothesis:
[hypothesis], Label:
RTE 0/1/3/5-shot
Step 1 Given a premise and a hypothesis, a model needs to predict
whether the hypothesis can be logically inferred from the premise. The
response should be either True if the hypothesis can be inferred from
the premise, or False if it cannot be inferred. Here are some examples:
[example premise, hypothesis, label] Premise: [premise], Hypothesis:
[hypothesis], Label:
BoolQ zero-shot
Step 1 The task is to answer a question which is solely based on the
content provided. Passage: [passage] , Question: [question], Label:
BoolQ 0/1/3/5-shot
Step 1 The task is to answer a question which is solely based on the
content provided. Here are some examples: [example passage, question,
label] Passage: [passage], Question: [question], Label:
Review zero-shot
Step 1 The task is to predict the rating of an Amazon customer review
based on the content. The rating ranges from 1 to 5, with 1 being the
lowest and 5 being the highest. Text: [text] , Label:
Review 0/1/3/5-shot
Step 1 The task is to predict the rating of an Amazon customer review
based on the content. The rating ranges from 1 to 5, with 1 being the
lowest and 5 being the highest. Here are some examples: [example text,
label] Text: [text], Label:
NYT zero-shot
Step 1 The task is to predict the topic of a news headline. The topics
include: ’sports’, ’arts, culture and entertainment’, ’business and finance’,
’health and wellness’, ’lifestyle and fashion’, ’science and technology’,
’politics’, ’crime’. Text:[text], Label:
NYT 0/1/3/5-shot
Step 1 The task is to predict the topic of a news headline. The topics
include: ’sports’, ’arts, culture and entertainment’, ’business and finance’,
’health and wellness’, ’lifestyle and fashion’, ’science and technology’,
’politics’, ’crime’. Here are some examples: [example text, label] Text:
[text], Label:
EMO zero-shot
Step 1 The task is to predict the emotion of a Twitter text. The emotions
include six categories: sadness, joy, love, anger, fear, surprise. Text:
[text], Label:
EMO 0/1/3/5-shot
Step 1 The task is to predict the emotion of a Twitter text. The emotions
include six categories: sadness, joy, love, anger, fear, surprise. Here are
some examples: [example text, label] Text: [text], Label:
SST-2 zero-shot
Step 1 The task is to predict whether the given sentence has a positive
or negative sentiment. Sentence: [sentence], Label:
SST-2 0/1/3/5-shot
Step 1 The task is to predict whether the given sentence has a positive
or negative sentiment. Here are some examples: [example sentence,
label], Sentence: [sentence], Label:
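These evaluation prompts can be scored by sending each test example to the LLM, mapping its free-text reply back to a label, and computing Macro-F1, as reported in Table 4. Below is a minimal sketch for zero-shot SST-2; the reply-parsing heuristic, file name, and client are illustrative assumptions.

```python
# Sketch of the LLM evaluation from Table 4 / Appendix D: zero-shot SST-2 classification
# by prompting the LLM and mapping its reply to a label. Parsing is a simple assumption.
import json
from openai import OpenAI
from sklearn.metrics import f1_score

client = OpenAI()

PROMPT = ("The task is to predict whether the given sentence has a positive or "
          "negative sentiment. Sentence: {sentence}, Label:")

def predict(sentence: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT.format(sentence=sentence)}],
        temperature=0,
    )
    answer = resp.choices[0].message.content.lower()
    return "positive" if "positive" in answer else "negative"

# Assumed format: one {"text": ..., "label": "positive" | "negative"} object per line.
with open("sst2_validation.jsonl") as f:
    data = [json.loads(line) for line in f]

preds = [predict(row["text"]) for row in data]
gold = [row["label"] for row in data]
print("macro-F1:", f1_score(gold, preds, average="macro"))
```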
