Data Generation Using Large Language Models For Text Classification
Abstract

Using Large Language Models (LLMs) to gener- [...]

[...] generating the same amount of data using GPT-3 only costs 14.37 USD and takes 46 minutes. With only 6000 samples [...]

We experimented with six common NLP tasks (Table 1) with different data generation methods. We found it is very challenging to pinpoint a definitive answer to the questions above that applies universally to all NLP tasks due to their inherent differences. Nevertheless, the findings from the six tasks offer valuable insights into practical data generation techniques.

2. Related Work

Data Augmentation. The goal of data augmentation is to increase the diversity of existing data by exposing the model to unseen data. This method has been applied to many domains in computer vision (Yang et al., 2023) and natural language processing (Li et al., 2022). In (Feng et al., 2021), augmentation techniques are categorized into rule-based generation and model-based generation. Rule-based generation is used in computer vision tasks and includes image transformations such as rotation, flipping, and cropping (Mikołajczyk & Grochowski, 2018), while model-based generation has been widely used in natural language processing tasks, such as rephrasing and back translation (Kumar et al., 2019; Yang et al., 2020; Cai et al., 2020; Ye et al., 2022; Okur et al., 2022b).

Large Language Models (LLMs). With the development of large language models, model-based data augmentation for NLP has become straightforward (Zhou et al., 2024). By instructing an LLM with a proper prompt, it can generate a new example in human-like text. While this is easy to implement, the synthetic data generated by an LLM is usually noisy and has a different distribution from the raw data, which hampers training performance. A large body of work has explored ways to deal with this issue. The work from (Veselovsky et al., 2023) uses techniques such as grounding, providing a taxonomy, and filtering to ensure the quality of synthetic data generated by an LLM. Synthesis Step by Step (Wang et al., 2023) uses an iterative procedure to create prompts based on misclassified gold data, reducing the gap between the synthesized data distribution and the gold distribution. SunGen (Gao et al., 2023) uses a weighted loss to reduce the impact of noise from synthetic data during training.

3. Methods

We follow the workflow in Figure 1 for our experiment. We explore the following in-context data generation methods; a minimal sketch of these prompting strategies is given at the end of this section. The term "in-context generation" refers to using an LLM to generate training data given a specific context, similar to in-context learning (Brown et al., 2020). The methods we investigate can be categorized as follows:

• Zero-shot in-context generation: Provide the task description in the prompt and ask the LLM to generate a similar example.

• One-shot in-context generation: Provide the task description and one example, prompting the LLM to generate a similar example.

• Few-shot in-context generation: Provide the task description and a few examples, prompting the LLM to generate a similar example.

Inspired by the work of (Yu et al., 2023), we also experiment with an additional method, which we call zero-shot topic in-context generation:

• Zero-shot topic in-context generation: Use the LLM to generate a list of topics (see Appendix A). Provide the task description and sample one topic from the list to prompt the LLM to generate a similar example.

To evaluate the success of synthetic data generation, we train an NLU model on the synthetic data and assess its performance on the task's validation set. We then compare the performance of the model trained on synthetic data with that of the model trained on the original data. Following the practice established in previous works (Li et al., 2023), we consider the generated data better if it results in better model performance.
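As an illustration, the following is a minimal sketch of how the four prompting strategies could be assembled and sent to an LLM. The prompt wording, helper names, and use of the `openai` Python client are assumptions made here for illustration; they are not the paper's exact prompts (those are listed in Appendix C), and the paper accessed GPT-3.5 turbo through Azure OpenAI Studio rather than the client shown below.

```python
import random
from openai import OpenAI  # assumes the standard OpenAI client; the paper used Azure OpenAI Studio

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def build_prompt(task_description, examples=None, topic=None):
    """Assemble a generation prompt for zero-shot, one/few-shot, or zero-shot topic generation.

    examples: raw examples for one-shot (1 example) or few-shot (3 or 5 examples) generation.
    topic: a topic sampled from an LLM-generated topic list (zero-shot topic generation).
    """
    parts = [f"Task: {task_description}"]
    if topic is not None:
        parts.append(f"The example should be about the following topic: {topic}")
    if examples:
        parts.append("Here are some examples:")
        parts.extend(f"- {ex}" for ex in examples)
    parts.append("Generate one new, similar example for this task.")
    return "\n".join(parts)


def generate_example(task_description, examples=None, topic=None, model="gpt-3.5-turbo"):
    """Call the LLM once and return the generated example as text."""
    prompt = build_prompt(task_description, examples, topic)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()


# Usage sketch: in the paper, 1,000 synthetic examples are generated per task, with
# in-context examples sampled from the 100 available raw data points (low-resource setting).
task = "Classify the sentiment of a movie review as positive or negative."   # hypothetical description
raw_pool = ["a gorgeous, witty, seductive movie.", "the movie is a disaster."]  # stand-in raw examples
topics = ["acting", "cinematography", "plot twists"]                            # stand-in topic list (Appendix A)

zero_shot = generate_example(task)
few_shot = generate_example(task, examples=random.sample(raw_pool, 2))
topic_shot = generate_example(task, topic=random.choice(topics))
```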
4. Experiments

In our experiments, GPT-3.5 turbo¹ is selected for the entire data generation process except for topic generation (see Appendix A). Although more powerful models like GPT-4 are available, we decided to use GPT-3.5 turbo due to resource constraints, especially since we need to run a large number of inferences for our data generation experiments. Overall, GPT-3.5 turbo is a well-rounded model with competitive performance across multiple benchmarks (Liang et al., 2023). It would be interesting to compare the quality of synthetic data generated by different LLMs, which we plan to explore in the future.

¹GPT-3.5 version: 2024-02-15 preview, accessed from Azure OpenAI Studio.

Existing work (Gupta et al., 2023) has utilized common NLP benchmarks, such as SuperGLUE (Wang et al., 2019), as tasks for evaluation, or has employed a customized selection of existing benchmarks (Gao et al., 2023; Ye et al., 2022).

We select six common tasks for evaluation: SST-2 (Socher et al., 2013; Wang et al., 2019), Twitter Emotion Classification (EMO) (Saravia et al., 2018), New York Times News Classification (NYT) (Stefano, 2021), Review (Amazon Review Classification) (Keung et al., 2020), RTE (Recognizing Textual Entailment) (Bentivogli et al., 2009; Wang et al., 2019), and BoolQ (Clark et al., 2019; Wang et al., 2019). The goal is to select diverse tasks that represent a wide range of popular NLP corpora (Table 1).
Figure 1. Data generation workflow. Starting from a low-resource task with limited original training data, an LLM generates synthetic data via zero-shot generation, zero-shot topic generation (using random topics produced by the LLM), or one/few-shot in-context generation (using examples sampled from the original training data). The synthetic data from the LLM is combined with the original training data to form the augmented training data used to train the classification model (RoBERTa).
Additionally, we try to include challenging tasks for which current NLU models do not perform well when provided with limited training data. Therefore, we do not use the entire GLUE benchmark, as models like BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019) can easily achieve high accuracy on such tasks. We also do not use the complete SuperGLUE task collection, since some of its tasks require token-level classification. In this work, we focus on sequence and sequence-pair classification tasks. The six selected tasks cover common web data, such as news and Wikipedia, as well as popular user data, like Twitter, movie reviews, and product reviews. They cover binary classification, multi-class classification, and question-answering tasks.

For the evaluation metric, the default is accuracy, but we use F1 or macro-F1 to calculate performance, since these metrics provide a more balanced and comprehensive assessment of classification performance, taking into account both precision and recall, especially in the case of multi-class classification tasks. In our experiments, RoBERTa is selected as the NLU model for all tasks, as it is a model commonly used for benchmarking on these tasks.

We experiment with five in-context generation methods for each task: zero-shot, zero-shot topic, one-shot, few-shot with 3 examples, and few-shot with 5 examples. The prompts used in the generation can be found in Appendix C.

In our experiments, we generate 1,000 synthetic data points per task, as we found the benefit of additional synthetic data diminishes after that. To simulate a low-resource setting, we allow only 100 raw examples to be used for one-shot and few-shot generation. For zero-shot topic generation, we generate 500 random topics related to the task domain. Details can be found in Appendix A. A sketch of the training and evaluation setup is given below.
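The following is a minimal sketch of the training and evaluation setup described above, assuming the augmented setting (100 raw data points mixed with 1,000 synthetic data points), RoBERTa fine-tuning, and a (macro-)F1 score on the task's validation set. The checkpoint name, hyperparameters, and helper structure are assumptions for illustration; the paper does not specify them here.

```python
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)


def train_and_evaluate(train_texts, train_labels, val_texts, val_labels,
                       num_labels, checkpoint="roberta-base"):
    """Fine-tune RoBERTa on the given training set and return macro-F1 on the validation set."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

    train_ds = Dataset.from_dict({"text": train_texts, "label": train_labels}).map(tokenize, batched=True)
    val_ds = Dataset.from_dict({"text": val_texts, "label": val_labels}).map(tokenize, batched=True)

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        preds = np.argmax(logits, axis=-1)
        return {"macro_f1": f1_score(labels, preds, average="macro")}

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=train_ds,
        eval_dataset=val_ds,
        compute_metrics=compute_metrics,
    )
    trainer.train()
    return trainer.evaluate()["eval_macro_f1"]


# Augmented setting: 100 raw data points mixed with 1,000 synthetic data points.
# The texts/labels below are placeholders for a task's actual data.
# augmented_texts = raw_texts[:100] + synthetic_texts[:1000]
# augmented_labels = raw_labels[:100] + synthetic_labels[:1000]
# score = train_and_evaluate(augmented_texts, augmented_labels, val_texts, val_labels, num_labels=2)
```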
5. Key Findings

In this section, we present the key findings from our experiments.

5.1. Mixing Raw Data is Necessary

To assess the effectiveness of data augmentation, we train models with pure synthetic data and with augmented data. For the augmented setting, 100 raw data points are mixed with 1,000 synthetic data points. In the data generation stage, we use only the same 100 raw data points used for in-context generation, to prevent the model from accessing additional data. As shown in Figure 2, we observe significant improvements across all tasks for most prompting methods when incorporating raw data into training. Even as few as 100 data points can boost synthetic data performance compared to using only synthetic data.

5.2. Impact of Bias

In the BoolQ task, we found that the zero-shot generation method outperforms the other methods, which contrasts with the results obtained for the rest of the tasks. This finding is intriguing, since zero-shot data exhibits the highest repetition rate, which is detrimental to model training. Upon further examination, we noticed that only in the datasets generated using one-shot or few-shot methods do terms like "not," "significant," "only," "just," "few," and "little" frequently appear in the generated questions. These terms create a tone that can be used to imply the answer to the question (which is often False). Table 2 provides an example of such a trivial question. Table 3 provides statistics for such questions from the different prompting methods; a sketch of how such terms can be counted is given below.

We hypothesize that this pattern introduces bias in model [...]
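As a rough illustration of how statistics like those in Table 3 could be gathered, the sketch below counts how often the answer-implying terms appear in the generated BoolQ questions for each prompting method. The function and variable names are our own; the paper does not describe its exact counting procedure.

```python
import re
from collections import Counter

# Terms observed to imply the (usually False) answer in generated BoolQ questions.
TRIGGER_TERMS = {"not", "significant", "only", "just", "few", "little"}


def trigger_term_stats(questions):
    """Return (per-term counts, fraction of questions containing at least one trigger term)."""
    counts = Counter()
    flagged = 0
    for question in questions:
        tokens = set(re.findall(r"[a-z']+", question.lower()))
        hits = tokens & TRIGGER_TERMS
        counts.update(hits)
        flagged += bool(hits)
    return counts, flagged / max(len(questions), 1)


# Usage: compare prompting methods on their generated questions, e.g.
# synthetic_boolq = {"zero-shot": [...], "one-shot": [...], "few-shot (5)": [...]}
# for method, questions in synthetic_boolq.items():
#     counts, frac = trigger_term_stats(questions)
#     print(method, dict(counts), f"{frac:.1%} of questions contain a trigger term")
```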
Figure 2. Performance (F1 score per task) of different prompting methods with and without augmentation. Synthetic only: use 1000 synthetic data only. Augmented: 1000 synthetic data plus 100 raw data.
[...] improvements are less than 5% (Figure 3). There are no established rules for determining what amount of raw data counts as low-resource. For all six tasks in our experiment, 1,000 data points represent a small portion of the training data. We found the model continues to improve as we increase the amount of raw data used for training. However, the performance gain obtained from increasing the training data also depends on other factors, such as task and model complexity. Based on this observation, we consider 100 raw data points as the low-resource setting, which is used as the default augmented setting in all experiments.

5.5. A Comparison Between Different Prompting Methods

In the synthetic-data-only setting, the one-shot or zero-shot topic methods rank in the top two for all tasks except the Review task (Figure 2). In the augmented setting, the few-shot and zero-shot topic generation methods demonstrate good performance across all tasks. In the BoolQ, EMO, and RTE tasks, zero-shot topic methods outperform the other prompting methods. In the SST-2 and NYT tasks, few-shot generation methods perform best. The performance of zero-shot methods is sub-optimal across all tasks.

Of the five prompting methods we experimented with, zero-shot topic generation typically produces the most diverse dataset, because different topics are sampled each time during generation. Zero-shot methods generate the least diverse dataset, as the prompt remains the same for each generation. One-shot and few-shot methods also generate repeated examples due to the limitation of in-context examples. We found that, for most tasks, a diverse dataset tends to benefit model training.

As shown in Figure 2, in the non-augmented setting, zero-shot generation shows the worst performance for RTE, EMO, Review, and SST-2, while zero-shot topic generation outperforms the other methods (or is at least comparable to them) for the BoolQ, NYT, RTE, and EMO tasks. This effect does not appear on all tasks, as there might be other factors that impact model performance. Meanwhile, the effect of diversity diminishes when we mix synthetic data with raw data. Therefore, training with both raw data and synthetic data could help when the synthetic data is less diverse.

While not generating the most diverse dataset, one-shot or few-shot generation methods typically help the LLM better understand the task description and generate examples similar to the original examples (Li, 2023; Song et al., 2022). In the EMO and Review tasks, we observe an advantage of few-shot generation over the other prompting methods. We suspect this is because both tasks are more subjective than the rest: EMO contains Twitter posts, and the Review task is made up of customer reviews and ratings.
Table 3. BoolQ trivial questions and F1 score comparison. SD: 1000 synthetic data. AD: 100 raw data plus 1000 synthetic data. Raw data: model uses 1000 raw data only, without question rephrasing; this score is used as a baseline.

Table 4. LLM performance vs. a model trained on synthetic data, on the 6 tasks. Average F1 score over the 5 prompting methods under (1) Synthetic Data (1000 synthetic data) and (2) Augmented Data (1000 synthetic data + 100 raw data).
Figure 3. Improvement with different amounts of raw data. Raw data (x): using only x raw data points. Augmented (x): using x raw data points plus 100 synthetic data points. For the augmented F1 score, we report the average model performance on the data generated by the 5 different prompting methods (per-task panels: BoolQ, EMO, NYT, Review, RTE, SST-2).

5.6. Synthetic Data Diversity and Similarity to Raw Data

In this section, we examine the diversity of our training data using inter-sample semantic similarity. To calculate this similarity, we use the vector embeddings proposed in (Reimers & Gurevych, 2019) and average the similarity score across all example pairs, following (Yu et al., 2023); a sketch of this measurement is given below. Figure 4 displays the inter-sample similarity for each task, comparing data [...] and the actual raw data using the same method; we found that the synthetic data generated by the five different prompting methods had similar similarity scores to the raw data. However, it is not clear whether synthetic data that closely resembles the raw data would lead to better model performance. This could be due to the limitations of our similarity measure, which only considers semantic similarity, as discussed in (Steck et al., 2024). Many NLP tasks rely on subtle contextual cues and nuanced wording; in the SST-2 task, for example, changes to wording can affect the sentiment of the text more than contextual semantics. Our measurement also does not account for other aspects of similarity, such as structural or lexical similarity, as discussed in (Wang et al., 2020; Ayeldeen et al., 2014). Lastly, due to the limited number of data points and the potential variation in synthetic data, caution is needed when generalizing our findings to other tasks or domains.
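A minimal sketch of the inter-sample similarity measurement described above, using Sentence-BERT embeddings (Reimers & Gurevych, 2019) and averaging cosine similarity over all example pairs. The specific checkpoint name is an assumption; the paper does not state which Sentence-BERT model it used.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any Sentence-BERT checkpoint can be plugged in here; "all-MiniLM-L6-v2" is an assumption.
encoder = SentenceTransformer("all-MiniLM-L6-v2")


def inter_sample_similarity(texts):
    """Average pairwise cosine similarity over all example pairs in a dataset."""
    # Normalizing the embeddings makes the dot product equal to cosine similarity.
    emb = encoder.encode(texts, convert_to_numpy=True, normalize_embeddings=True)
    sim = emb @ emb.T                      # full pairwise cosine-similarity matrix
    n = len(texts)
    upper = sim[np.triu_indices(n, k=1)]   # keep each unordered pair once, drop the diagonal
    return float(upper.mean())


# Usage: compare the diversity of data generated by each prompting method
# (a lower score means more diverse, i.e., less mutually similar, examples).
# for method, texts in synthetic_sst2.items():
#     print(method, round(inter_sample_similarity(texts), 3))
```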
5.7. Synthetic Data Quantity

We have found that increasing the amount of synthetic data used in model training improves its performance. Figure 5 [...]
Figure 5 (per-task panels): F1 score as a function of the number of data points (100 to 1100).
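To reproduce a curve like the one in Figure 5, one can train the same model on increasing amounts of synthetic data and record the validation F1 at each size. The sketch below assumes a `train_and_evaluate` routine such as the RoBERTa fine-tuning sketch shown earlier; the step size is our own choice, not a value from the paper.

```python
def f1_vs_synthetic_quantity(synthetic_texts, synthetic_labels,
                             val_texts, val_labels, num_labels,
                             train_and_evaluate, sizes=range(100, 1101, 200)):
    """Train on the first n synthetic examples for each n in `sizes` and collect the F1 scores."""
    results = {}
    for n in sizes:
        results[n] = train_and_evaluate(synthetic_texts[:n], synthetic_labels[:n],
                                        val_texts, val_labels, num_labels)
    return results

# Usage: plug in the fine-tuning routine from the earlier sketch and plot the resulting curve.
# curve = f1_vs_synthetic_quantity(syn_texts, syn_labels, val_texts, val_labels, 2, train_and_evaluate)
```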
[...] This approach ensures that the synthetic data is more closely aligned with the target corpus, leading to better performance in classification tasks.

6.3. Iterative Data Generation and Prompt Refinement

Generating synthetic data can be both time-consuming and resource-intensive. To maximize efficiency and ensure high-quality data, it is recommended to adopt an iterative approach, as sketched below: initially, generate a small number of examples and evaluate their quality. If the quality of these initial data points is low, refine the prompt before generating more data. It is unlikely that simply generating more data points with the same prompt will magically produce high-quality data.
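A minimal sketch of such an iterative loop. The `generate_examples`, `estimate_quality`, and `refine_prompt` helpers are hypothetical placeholders (for example, the generation call from the Section 3 sketch and a small manual or model-based quality check); the quality threshold and batch sizes are our own choices, not values from the paper.

```python
def iterative_generation(prompt, generate_examples, estimate_quality, refine_prompt,
                         pilot_size=20, target_size=1000,
                         quality_threshold=0.8, max_rounds=3):
    """Generate a small pilot batch, refine the prompt until its quality is acceptable,
    then spend the remaining budget on the full dataset."""
    for _ in range(max_rounds):
        pilot = generate_examples(prompt, pilot_size)
        if estimate_quality(pilot) >= quality_threshold:
            # The prompt looks good: generate the rest of the dataset with it.
            return pilot + generate_examples(prompt, target_size - len(pilot))
        # Low-quality pilot: fix the prompt instead of generating more data with it.
        prompt = refine_prompt(prompt, pilot)
    raise RuntimeError("Prompt quality did not reach the threshold; inspect the pilot batches.")
```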
7. Conclusion

In this work, we analyzed the different factors that influence data generation using LLMs. We found that data generation is most effective in low-resource settings. Increasing the amount of synthetic data does not necessarily lead to continuous improvements in model performance. It is beneficial to combine synthetic data with raw data during training. Additionally, it is crucial to be vigilant for patterns or biases in synthetic data that may hinder model training. Overall, using LLMs for data augmentation has great potential for model training. With a carefully tuned prompt, the data generated by an LLM can obtain performance comparable to human-annotated data, but at a much lower cost.

The domain of data generation for classification tasks is highly complex. Due to the diversity of NLP tasks, it is challenging to find rules that generalize well across all tasks. However, our findings can still serve as a valuable resource for researchers and practitioners looking to use synthetic data for training classification models. For future work, it would be valuable to study the effects of more advanced prompting methods, such as Chain of Thought (Wei et al., 2023), or of LLM hyperparameters, such as temperature, on the quality of synthetic data.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Ayeldeen, H., Hassanien, A. E., and Fahmy, A. A. Lexical similarity using fuzzy euclidean distance. In 2014 International Conference on Engineering and Technology (ICET), pp. 1–6, 2014. doi: 10.1109/ICEngTechnol.2014.7016801.

Bentivogli, L., Dagan, I., Dang, H. T., Giampiccolo, D., and Magnini, B. The fifth PASCAL recognizing textual entailment challenge. 2009.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners, 2020.

Cai, H., Chen, H., Song, Y., Zhang, C., Zhao, X., and Yin, D. Data manipulation: Towards effective instance learning for neural dialogue generation via learning to augment and reweight. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6334–6343, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.564. URL https://aclanthology.org/2020.acl-main.564.

Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of NAACL-HLT 2019, 2019.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.

Ding, B., Qin, C., Liu, L., Chia, Y. K., Joty, S., Li, B., and Bing, L. Is GPT-3 a good data annotator?, 2023.

Feng, S. Y., Gangal, V., Wei, J., Chandar, S., Vosoughi, S., Mitamura, T., and Hovy, E. A survey of data augmentation approaches for NLP, 2021.

Gao, J., Pi, R., Lin, Y., Xu, H., Ye, J., Wu, Z., Zhang, W., Liang, X., Li, Z., and Kong, L. Self-guided noise-free data generation for efficient zero-shot learning, 2023.

Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Giorno, A. D., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., Salim, A., Shah, S., Behl, H. S., Wang, X., Bubeck, S., Eldan, R., Kalai, A. T., Lee, Y. T., and Li, Y. Textbooks are all you need, 2023.

Gupta, H., Scaria, K., Anantheswaran, U., Verma, S., Parmar, M., Sawant, S. A., Baral, C., and Mishra, S. TarGEN: Targeted data generation with large language models, 2023.

Keung, P., Lu, Y., Szarvas, G., and Smith, N. A. The multilingual Amazon reviews corpus. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020.

Kumar, A., Bhattamishra, S., Bhandari, M., and Talukdar, P. Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3609–3619, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1363. URL https://aclanthology.org/N19-1363.

Li, B., Hou, Y., and Che, W. Data augmentation approaches in natural language processing: A survey. AI Open, 3:71–90, 2022. ISSN 2666-6510. doi: https://doi.org/10.1016/j.aiopen.2022.03.001. URL https://www.sciencedirect.com/science/article/pii/S2666651022000080.

Li, Y. A practical survey on zero-shot prompt design for in-context learning. In Proceedings of the Conference Recent Advances in Natural Language Processing - Large Language Models for Natural Language Processings, RANLP. INCOMA Ltd., Shoumen, Bulgaria, 2023. doi: 10.26615/978-954-452-092-2_069. URL http://dx.doi.org/10.26615/978-954-452-092-2_069.

Li, Z., Zhu, H., Lu, Z., and Yin, M. Synthetic data generation with large language models for text classification: Potential and limitations, 2023.

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., Newman, B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C., Manning, C. D., Ré, C., Acosta-Navas, D., Hudson, D. A., Zelikman, E., Durmus, E., Ladhak, F., Rong, F., Ren, H., Yao, H., Wang, J., Santhanam, K., Orr, L., Zheng, L., Yuksekgonul, M., Suzgun, M., Kim, N., Guha, N., Chatterji, N., Khattab, O., Henderson, P., Huang, Q., Chi, R., Xie, S. M., Santurkar, S., Ganguli, S., Hashimoto, T., Icard, T., Zhang, T., Chaudhary, V., Wang, W., Li, X., Mai, Y., Zhang, Y., and Koreeda, Y. Holistic evaluation of language models, 2023.

Liu, R., Wei, J., Liu, F., Si, C., Zhang, Y., Rao, J., Zheng, S., Peng, D., Yang, D., Zhou, D., and Dai, A. M. Best practices and lessons learned on synthetic data for language models, 2024.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach, 2019.

Mikołajczyk, A. and Grochowski, M. Data augmentation for improving deep learning in image classification problem. In 2018 International Interdisciplinary PhD Workshop (IIPhDW), pp. 117–122, 2018. doi: 10.1109/IIPHDW.2018.8388338.

Okur, E., Sahay, S., and Nachman, L. Data augmentation with paraphrase generation and entity extraction for multimodal dialogue system. In Calzolari, N., Béchet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., Odijk, J., and Piperidis, S. (eds.), Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 4114–4125, Marseille, France, June 2022a. European Language Resources Association. URL https://aclanthology.org/2022.lrec-1.437.

Okur, E., Sahay, S., and Nachman, L. Data augmentation with paraphrase generation and entity extraction for multimodal dialogue system, 2022b.

Reimers, N. and Gurevych, I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks, 2019.

Saravia, E., Liu, H.-C. T., Huang, Y.-H., Wu, J., and Chen, Y.-S. CARER: Contextualized affect representations for emotion recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3687–3697, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1404. URL https://www.aclweb.org/anthology/D18-1404.

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Yarowsky, D., Baldwin, T., Korhonen, A., Livescu, K., and Bethard, S. (eds.), Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://aclanthology.org/D13-1170.

Song, Y., Wang, T., Mondal, S. K., and Sahoo, J. P. A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities, 2022.

Steck, H., Ekanadham, C., and Kallus, N. Is cosine-similarity of embeddings really about similarity? arXiv preprint arXiv:2403.05440v1, 2024. arXiv.org perpetual non-exclusive license.

Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint 1905.00537, 2019.

Wang, R., Zhou, W., and Sachan, M. Let's synthesize step by step: Iterative dataset synthesis with large language models by extrapolating errors from small models, 2023.

Wang, Z., Zhang, Y., and Wu, H. Structural-aware sentence similarity with recursive optimal transport, 2020.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models, 2023.

Xie, Q., Dai, Z., Hovy, E., Luong, M.-T., and Le, Q. V. Unsupervised data augmentation for consistency training, 2020.

Yang, S., Xiao, W., Zhang, M., Guo, S., Zhao, J., and Shen, F. Image data augmentation for deep learning: A survey, 2023.

Yang, Y., Malaviya, C., Fernandez, J., Swayamdipta, S., Le Bras, R., Wang, J.-P., Bhagavatula, C., Choi, Y., and Downey, D. Generative data augmentation for commonsense reasoning. In Cohn, T., He, Y., and Liu, Y. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1008–1025, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.90. URL https://aclanthology.org/2020.findings-emnlp.90.

Ye, J., Gao, J., Li, Q., Xu, H., Feng, J., Wu, Z., Yu, T., and Kong, L. ZeroGen: Efficient zero-shot learning via dataset generation, 2022.

Yu, Y., Zhuang, Y., Zhang, J., Meng, Y., Ratner, A., Krishna, R., Shen, J., and Zhang, C. Large language model as attributed training data generator: A tale of diversity and bias, 2023.

Zhou, Y., Guo, C., Wang, X., Chang, Y., and Wu, Y. A survey on data augmentation in large model era, 2024.
A. Appendix
Prompt for topic generation used by the zero-shot topic method, with LLM output examples. GPT-4 is used to generate 500 random topics per task:
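The sketch below is a hypothetical illustration of how 500 topics per task could be requested from GPT-4 and parsed, using the `openai` client; the prompt wording here is our own and is not the paper's actual topic-generation prompt. In practice the request may need to be split into smaller batches to fit the model's output limit.

```python
from openai import OpenAI

client = OpenAI()


def generate_topics(task_description, n_topics=500, model="gpt-4"):
    """Ask the LLM for a newline-separated list of topics related to the task domain."""
    # Hypothetical prompt wording, used only for illustration.
    prompt = (f"Task: {task_description}\n"
              f"List {n_topics} diverse topics related to this task's domain, one per line.")
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    lines = response.choices[0].message.content.splitlines()
    # Strip list numbering and bullets before returning the topic strings.
    return [line.strip(" -0123456789.") for line in lines if line.strip()]
```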
B. Appendix
Prompt for Question Rephrasing in Section 5.2
Please rephrase the question as if you are typing it in a search engine. Make sure the answer can only be true or false.
Input: question
Output:
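A minimal sketch of applying this rephrasing prompt to raw BoolQ questions (the "without question rephrase" baseline in Table 3 skips this step). The use of the `openai` client and GPT-3.5 turbo here is an assumption for illustration.

```python
from openai import OpenAI

client = OpenAI()

# The prompt text follows the wording given above.
REPHRASE_PROMPT = ("Please rephrase the question as if you are typing it in a search engine. "
                   "Make sure the answer can only be true or false.\n"
                   "Input: {question}\nOutput:")


def rephrase_question(question, model="gpt-3.5-turbo"):
    """Return a search-engine-style rephrasing of a yes/no question."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": REPHRASE_PROMPT.format(question=question)}])
    return response.choices[0].message.content.strip()
```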
C. Appendix
Prompt used for data generation for each task:
D. Appendix
Prompt used to evaluate LLM performance on each task.
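As a hypothetical illustration only (not the paper's evaluation prompt), the sketch below shows how an LLM could be queried directly on a classification task and scored with F1, the comparison reported in Table 4.

```python
from openai import OpenAI
from sklearn.metrics import f1_score

client = OpenAI()


def llm_classify(text, labels, task_instruction, model="gpt-3.5-turbo"):
    """Ask the LLM to pick one label for a single input; hypothetical prompt wording."""
    prompt = (f"{task_instruction}\nAnswer with exactly one of: {', '.join(labels)}.\n"
              f"Input: {text}\nLabel:")
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    answer = response.choices[0].message.content.strip().lower()
    # Fall back to the first label if the reply does not match any label exactly.
    return answer if answer in labels else labels[0]


# Usage sketch (SST-2 style): macro-F1 of the LLM's direct predictions on the validation set.
# preds = [llm_classify(t, ["positive", "negative"],
#                       "Classify the sentiment of the movie review.") for t in val_texts]
# print(f1_score(val_labels, preds, average="macro"))
```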