A Survey on Data Synthesis and Augmentation for Large Language Models

Ke Wang ([email protected])
Hangzhou Innovation Institute, Beihang University
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University

Jiahui Zhu ([email protected])
Hangzhou Innovation Institute, Beihang University

Minjie Ren ([email protected])
Hangzhou Innovation Institute, Beihang University
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
adversarial examples [162], limiting their effectiveness for training LLMs.

To overcome these challenges, researchers have increasingly turned to LLM-oriented data synthesis and augmentation techniques, recognizing the ability of LLMs to model complex patterns from large datasets and generate synthetic data that closely mirrors real-world distributions while introducing valuable variations [37, 175, 260]. These studies reduce the reliance on manually curated datasets and enable the generation of high-quality, diverse data that meets the evolving demands of LLMs throughout their lifecycle and functions. To capture the breadth of these efforts, we collected papers related to LLM-oriented data synthesis and augmentation by searching Google Scholar using keywords such as "data synthesis," "data augmentation," and "large models." Figure 1 illustrates the publication trends by year and venue, reflecting the increasing interest in this field. As of October 2024, we identified 250 unique publications covering diverse research topics and venues. Summarizing these efforts provides critical insights into the progress made and the challenges that remain, offering a foundation for future research.

Despite these advancements, several key challenges remain in LLM-oriented data synthesis and augmentation. The misuse of synthetic data poses risks, particularly in spreading misinformation and raising ethical concerns around manipulating public opinion. Additionally, synthetic data often introduces ambiguity when aligning AI models with human values, potentially leading to biased outcomes. Evaluating models trained on synthetic data is also complex, as traditional benchmarks may not fully capture the nuances of this data. Ensuring reliability is another concern, as biases and inaccuracies from original datasets can persist in synthetic data, limiting its generalization across domains. Moreover, the computational demands of LLMs, along with challenges in handling less common languages or novel instructions, complicate broader applications. Finally, the lack of a unified framework for organizing and comparing the methods proposed in both academia and industry remains a barrier for researchers navigating this rapidly evolving field.

This survey aims to address these gaps by providing a comprehensive overview of LLM-oriented data synthesis and augmentation techniques. As shown in Figure 2, unlike previous surveys [43, 140, 147, 214, 271], which primarily focus on applying these methods to support specific downstream tasks or particular stages of LLMs, our work emphasizes the direct role of LLM-oriented techniques in improving the overall performance of LLMs across various stages of their lifecycle and core functions. In contrast to the work [137], which focuses on practices for synthetic data generation to address challenges like data scarcity and privacy, our survey extends beyond practical guidance by categorizing methods aimed at improving LLM performance holistically. We examine not only data generation but also how these techniques enhance LLMs across all stages and functions, offering a more integrated, data-centric framework for advancing LLMs. Specifically, we systematically review and categorize existing research from two key perspectives: the lifecycle of LLMs (from pre-training to fine-tuning and application) and their core functions (understanding, logic, memory, and generation). By framing the discussion around these dual perspectives, we offer clearer insights into the development, interconnections, and practical applications of different approaches. Moreover, we identify critical challenges, explore emerging research directions, and highlight potential breakthroughs that could further drive advancements in LLM performance through data-centric methods.

Figure 2: A comparison between existing surveys on data synthesis and augmentation techniques and our work. Previous surveys primarily focus on LLM-based data synthesis and augmentation methods aimed at supporting downstream tasks. In contrast, our work emphasizes LLM-oriented data synthesis and augmentation, systematically covering the full lifecycle of LLMs—from data preparation to applications—and addressing core LLM functions such as understanding and generation, with the ultimate goal of improving LLMs themselves through data-centric techniques.

The contributions of this survey are summarized as follows:

• First survey: To our knowledge, we present the first comprehensive survey focused on advancing LLMs through data synthesis and augmentation, systematically covering the entire lifecycle stages and core functions of LLMs. This survey provides an in-depth analysis of current methodologies and highlights the unique challenges at each stage.
• New taxonomy: We introduce an innovative organizational framework that categorizes existing research from two key perspectives: the lifecycle stages of LLMs and their core functions. This taxonomy offers a clearer understanding of the progression, interconnections, and applicability of different approaches, providing valuable insights into both the developmental and functional aspects of LLM-oriented data synthesis and augmentation.
• New frontiers: We identify critical challenges, explore emerging research directions, and highlight potential breakthroughs in LLM-oriented data synthesis and augmentation. This discussion aims to inspire future research and guide developments in data-centric techniques for LLM advancement.
• Abundant resources: We organize and maintain a dedicated repository to support ongoing research and collaboration in LLM-oriented data synthesis and augmentation. This resource includes a curated collection of related papers, multiple leaderboards tracking the latest advancements, and regular updates to foster innovation, guide future research directions, and accelerate breakthroughs in the field.

By offering a comprehensive overview of LLM-oriented data synthesis and augmentation approaches, this survey aims to clarify the current state of the field and inspire future research directions that can further enhance LLM capabilities through data synthesis and augmentation methodologies.

We organize the remainder of this survey as follows: Section 2 categorizes the primary areas of LLM-oriented data synthesis and augmentation, providing an overview of the foundational techniques. Section 3 discusses current LLM-oriented data synthesis and augmentation methods from the perspective of the full lifecycle of LLMs, detailing how these techniques are employed at different stages of model development. In Section 4, we review these methods from the viewpoint of core LLM functions, exploring how data synthesis and augmentation enhance key capabilities such as understanding, logic, memory, and generation. Section 5 delves into evaluation strategies for LLM-oriented data synthesis and augmentation, addressing benchmarks, evaluation metrics, and leaderboards used to assess and compare the effectiveness of existing approaches. Finally, Section 6 provides insights into challenges and emerging trends in LLM-oriented data synthesis and augmentation, offering recommendations for future research directions that can contribute to the continued advancement of LLMs through data synthesis and augmentation methodologies.

2 Taxonomy

Data generation methods play a pivotal role in addressing data scarcity and imbalance, thereby improving model performance and generalization. As shown in Fig. 4, we summarize the development and evolution of data augmentation and synthesis techniques in recent years. This section introduces the current classification of data generation methods, distinguishing between data augmentation, which enhances existing data samples through transformations, and data synthesis, which creates entirely new samples from scratch or with generative models. The two differ in how they acquire data but share the goal of expanding datasets. Furthermore, data augmentation and synthesis methods can be categorized into subclasses along multiple dimensions. Each approach has unique strengths and applications, enabling researchers to tailor their data generation strategies to specific needs and goals.

[Figure 3: Overview taxonomy of the survey (reconstructed from the original tree figure). Branches include data augmentation (§2.1) with data labeling (e.g., T-SciQ [205]; ChatGPT-based labeling [3, 63, 275]), data reformation (e.g., Mosaic [90], CORE [45], ALIA [51], ChatAug [37]), and co-annotation (e.g., CoAnnotating [116], ToolCoder [259]); data synthesis with general model distillation (e.g., TinyStories [53], Phi-1 [67, 120], Alpagasus [22], WizardLM [223]), domain model distillation (e.g., BAD [225], BEAVERTAILS [85], PRM800K [124], WebGPT [156]), and model self-improvement (e.g., OAIF [69], SELF-JUDGE [235], SALMON [193], SteerLM [47]); pre-training data augmentation (e.g., WRAP [150], KMLM [133], bioR [276], physics-based synthesis [134]); lifecycle stages such as preference alignment (§3.5); core functions such as understanding (§4.1, e.g., Alpaca [196], WizardLM [223], WRAP [150], LLaVA [130], ChartLlama [73], Genixer [266]); and challenges and limitations (§5), covering synthesizing and augmenting methods (§5.1, e.g., d-RLAIF [106], LLM2LLM [107], WizardMath [145], STaR [248], SciGLM [253], ChemLLM [254]), the impact of data synthesis and augmentation (§5.3, e.g., DataDreamer [159], HARMONIC [209]), the impact on different applications and tasks (e.g., PANDA [127], REGA [206]), and future directions (§5.5, e.g., TabSynDex [32], CoLa-Diff [87], WizardCoder [146], WebGPT [156]).]

2.1 Data Augmentation

Data augmentation, a data-to-data generation approach, generally involves manipulating the original data to increase its diversity and quantity without significantly altering its essential characteristics. Techniques used in data augmentation are designed to enhance the richness of existing data samples through transformations or perturbations. Across different modalities, data augmentation techniques often exhibit similarities. For instance, in image data, augmentation operations encompass mosaic [90], flipping [184], copy-pasting [61], adding noise [149], pairing [84], and so forth. Similarly, in text data, augmentation operations involve synonym replacement [95], copy-pasting [185], etc. Moreover, to cater to the demands of multimodal learning, existing research has addressed cross-modal information alignment during data augmentation. MixGen [75] generates new training samples by linearly interpolating images and concatenating text sequences from two existing image-text pairs, so that the semantic relationship within the newly generated image-text pair remains consistent and matched.
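For illustration, the sketch below shows a MixGen-style augmentation step under simplifying assumptions: images are float arrays of equal shape, captions are plain strings, and the mixing ratio `lam` and helper names are ours rather than from [75].

```python
import numpy as np

def mixgen(image_a, image_b, text_a: str, text_b: str, lam: float = 0.5):
    """MixGen-style multimodal augmentation (a sketch, after [75]):
    linearly interpolate the two images and concatenate the two captions,
    so the new image-text pair stays semantically matched."""
    mixed_image = lam * image_a + (1.0 - lam) * image_b  # pixel-level interpolation
    mixed_text = f"{text_a} {text_b}"                    # simple caption concatenation
    return mixed_image, mixed_text

# Usage: mix random pairs from an image-text dataset.
rng = np.random.default_rng(0)
img_a, img_b = rng.random((224, 224, 3)), rng.random((224, 224, 3))
new_img, new_txt = mixgen(img_a, img_b, "a dog on grass", "a red kite in the sky")
```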
Recently, in the rapidly advancing landscape of LLMs, data augmentation has emerged as a cornerstone for bolstering model performance through the diversification of training exemplars, circumventing the necessity for extensive additional data gathering. From a data-centric perspective, we systematically categorize existing research on data augmentation into three distinct categories: data labeling [3, 63, 94, 136, 198, 275], data reformation [45, 51, 143, 237], and co-annotation [11, 43, 116].

2.1.1 Data Labeling. Data labeling endeavors to leverage the comprehensive language understanding capabilities of LLMs to annotate vast unlabeled datasets. This methodology is particularly beneficial in fields that possess a substantial unlabeled data corpus, encompassing domains such as cross-lingual processing and multimodal learning [3, 63, 275], where the automation of annotation can significantly expedite the data preparation process. Recent research studies the zero-shot annotation ability of LLMs, such as GPT-4 for labeling political Twitter messages [198]. Moreover, Khan et al. [94] focus on visual question answering (VQA) tasks, generating pseudo-labeled data from unlabeled images with the SelTDA framework.

2.1.2 Data Reformation. Data reformation involves transforming and restructuring existing data into a broader spectrum of variations, thereby facilitating more fine-grained data augmentation [45, 51]. This approach aims to enrich the training landscape with diverse yet pertinent examples, enhancing the model's robustness and generalization capabilities. Classic methods such as rotation [92], color channel transformation [64], and synonym replacement [95] are commonly used. Recently, approaches utilizing LLMs have also emerged. For example, Chen et al. [27] propose DISCO, an approach that harnesses LLMs to produce large-scale, high-quality counterfactual data.

2.1.3 Co-Annotation. Co-annotation designates the collaborative effort between human annotators and LLMs in the annotation process [11]. By integrating the strengths of both annotation methodologies, co-annotation not only mitigates annotation costs but also concurrently enhances annotation performance, fostering a more efficient and effective approach to data annotation. Li et al. [116] introduce CoAnnotating, a framework that strategically assigns data points for annotation either to humans or to LLMs, based on an assessment of the LLM's annotation uncertainty.
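As a minimal sketch of this uncertainty-based allocation, assuming a hypothetical `llm_label_dist` helper that estimates a label distribution (e.g., from several prompt variations) and a threshold of our own choosing:

```python
import math
from typing import Callable, Dict, List, Tuple

def allocate_for_annotation(texts: List[str],
                            llm_label_dist: Callable[[str], Dict[str, float]],
                            entropy_threshold: float = 0.5) -> Tuple[List[str], List[str]]:
    """CoAnnotating-style work allocation (a sketch, after [116]):
    route instances where the LLM's label distribution is uncertain
    (high entropy) to human annotators, and the rest to the LLM."""
    to_human, to_llm = [], []
    for text in texts:
        dist = llm_label_dist(text)
        entropy = -sum(p * math.log(p) for p in dist.values() if p > 0)
        (to_human if entropy > entropy_threshold else to_llm).append(text)
    return to_human, to_llm
```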
2.2 Data Synthesis

Data synthesis, on the other hand, aims to create entirely new data, either from scratch or with generative models, whose distribution is similar to that of real data. In recent years, with the explosion and advancements in generative AI [13, 41, 42, 78, 139, 161, 169], there have been significant strides in the quality and generation efficiency of synthetic data. Based on the requirements of LMs, this paper categorizes data synthesis methods into three main types: general model distillation [22, 53, 120, 263, 266], domain model distillation [108, 145, 146, 215], and model self-improvement [54, 150, 210, 248].

2.2.1 General Model Distillation. Among these, general model distillation involves leveraging powerful general models, typically featuring larger parameter counts and superior performance, such as StableVicuna, ChatGPT, and GPT-4, to generate datasets that can enhance the capabilities of weaker models. There are various ways to employ these powerful models, such as using predefined templates to generate tiny stories [53] and leveraging the LLMs themselves to evaluate the quality of the generated data. Phi-1 and its successors [67, 120] have demonstrated that a small amount of high-quality data can also train a powerful model, by leveraging the comprehensive generation of textbooks and exercises from GPT-3.5. Other methods have also achieved performance improvements by generating instruction datasets and fine-tuning models after improving the quality of these datasets [22, 80, 196].

2.2.2 Domain Model Distillation. Domain model distillation pertains to the utilization of models that are tailored to generate data within a particular domain. This approach is often necessary when general models fail to meet the specific needs of industry applications. For instance, in the context of code programming, domain model distillation can be employed to generate instructional data tailored to specific coding tasks [146, 215]. In the realm of mathematics, methods such as Minerva [108] and DeepSeekMath [220] are designed to generate solutions to mathematical problems while ensuring their accuracy and diversity. Additionally, industry data often presents barriers, such as limited data scales and the inaccessibility of data held within specific enterprises in the domain. These factors necessitate the adoption of domain-specific models that can effectively address the unique challenges posed by these scenarios.

2.2.3 Model Self-Improvement. Model self-improvement refers to the process where a model generates higher-quality data to enhance its own capabilities. For instance, leveraging existing instructions to adjust the model and prompting it to paraphrase web documents in specific styles, such as Wikipedia-style or QA-style, can be used to jointly pre-train LLMs on both authentic and synthetically paraphrased text [150]. Self-Instruct [210] enhances LMs themselves by auto-generating and refining instructional data, boosting performance with minimal human intervention.

3 Data Synthesis and Augmentation in the Full Lifecycle of LLMs

From the perspective of the full lifecycle of LLMs, we divide the existing investigations into six stages: data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. The present section introduces relevant research in each stage.

3.1 Data Preparation

In the data preparation phase, data synthesis and augmentation aim to generate diverse, high-quality datasets for the training of LLMs, addressing the challenge of the scarcity of real-world data. Following the taxonomy discussed in Section 2, we divide the present subsection into general model distillation and data augmentation.
Figure 4: Illustration of the evolutionary steps in the development of data synthesis and augmentation techniques for large
models.
3.1.1 General Model Distillation. This approach aims to leverage the powerful capabilities of general LLMs to distill high-quality data. According to the synthesis approach and data modality, we further divide general model distillation into five categories: synthesize from seeds, synthesize reasoning steps, synthesize with controllability, synthesize from scratch, and synthesize multimodal data. Table 1 summarizes representative methods.

Table 1: Data synthesis and augmentation in data preparation. In the table, method outlines the techniques presented by each research work. Data source and synthetic data indicate the original data used to generate synthetic data and the synthetic data created for training purposes, respectively. A dash (-) in any cell denotes that the respective content was not mentioned in the cited literature.

Synthesize from Seeds. To synthesize datasets for specific tasks, prompting LLMs with a small number of relevant examples can effectively produce high-quality datasets at low cost. For instance, to investigate "how small can an LLM be to achieve certain capabilities", TinyStories [53] is constructed by instructing an LLM to generate stories that combine three words randomly chosen from 1,500 basic words; the resulting dataset can be used to train and evaluate language models. Building on a large-scale collection of functions, Case2Code [180] uses LLMs to generate suitable inputs for these functions and a code interpreter to compute the corresponding outputs. Because single-round synthetic data may be insufficient in quantity and diversity, methods for iterative data synthesis have been investigated. For example, Self-Instruct [210] can be repeated over many iterations to accumulate a substantial volume of tasks: in each iteration, an LLM is prompted to generate new instructions from a small seed set and then to create input-output instances for each instruction independently. Similarly, Evol-Instruct [223] can be conducted over multiple rounds to gather a sufficient dataset encompassing various complexities; in each evolution, in-depth and in-breadth evolving are employed to either enhance basic instructions into more sophisticated ones or to innovate entirely new directives.
Synthesize Reasoning Steps. To enhance the reasoning capability of LLMs, additional reasoning steps are generated in the process of data synthesis. The synthetic question-response pairs in MMIQC [131] are iteratively constructed by augmenting the initial problems and adding additional reasoning steps without altering their intrinsic logical structure. Similarly, an effective generation strategy has been put forward in which an LLM is requested to synthesize chain-of-thought (CoT) answers after question generation and verification [109]. Building on the generation of question-CoT pairs through Self-Instruct, MathInstruct [244] further supplements the Program-of-Thought (PoT) rationale to simplify the math-solving process.

Synthesize with Controllability. To control the quality of synthetic data, research has been conducted on techniques for controllable data synthesis. Driven by the goal of reducing the potential bias of synthetic data, OSS-Instruct [215] utilizes open-source seed code snippets to prompt an LLM to generate coding problems and corresponding solutions. The seed snippets provide controllability over the generation and encourage the LLM to synthesize a variety of coding problems. Similarly, Genie [236] prompts an LLM with four content-example pairs to generate a synthetic example matching extracted content that lacks an example. In addition, seeded with a few annotated dialogues, DIALOGIC [122] guides GPT-3 to synthesize annotated dialogues in a controllable way, in which an auxiliary generator and a slot-value match filter are utilized to mitigate de-generation and over-generation issues, respectively.

Synthesize from Scratch. Another line of work avoids reliance on a seed dataset, synthesizing data from scratch. For instance, UltraChat [44] is composed of questions about the world, creation and generation, and assistance on existing materials, among which the questions about the world request ChatGPT to generate meta topics, subtopics, and questions about concepts from scratch. As aligned LLMs can generate user queries due to their auto-regressive nature, Magpie [227] directly constructs instruction data by prompting aligned LLMs with a pre-query template to generate instructions along with their corresponding responses. Focusing on generating large instruction datasets from a single prompt, generator prompts [19] boost output diversity by asking an LLM to produce a long list of possible choices and selecting one of the candidates at random.

Synthesize Multimodal Data. As in the unimodal case, prompting powerful LLMs like GPT to synthesize data based on seed sets is also the most common method for multimodal data synthesis. For instance, ShareGPT4V [21] consists of 100K high-quality captions produced by employing data-specific prompts to guide GPT-4-Vision in generating comprehensive descriptions for supervised fine-tuning; additionally, an alternative caption model is fine-tuned on the 100K high-quality captions to expand the number of captions for pre-training. Taking into account the possible issue of overly simplistic cross-modal instructions generated by LLMs, ComVint [49], when presented with an image that has available annotations, adopts a pipeline that includes synthesis, complication, and reformulation. Standing distinct from the previous data synthesis approaches, StableLlava [121] synchronously synthesizes images and dialogues: the method first employs ChatGPT to craft prompts for image generation and to develop content-rich dialogues, and subsequently leverages Stable Diffusion to synthesize images from these prompts. Multimodal data can also be synthesized from scratch. To create an image and a corresponding instruction from scratch, Multi-modal Self-Instruct [263] first instructs the LLM to conceive an original visual concept and then produce detailed code to visualize the idea; once the desired image is synthesized, the LLM self-instructs multiple sets of high-quality question-answer pairs tailored to the visual content. Likewise, in the field of interpreting chart figures, ChartLlama [73] initially leverages the capabilities of GPT-4 to generate chart data by providing specific attributes like topics, distributions, and trends; following this, GPT-4 is further employed to generate both the chart figure and the associated instruction-answer data. AnyInstruct-108k [251] is constructed by a two-stage approach comprising the generation of text-based conversations incorporating multimodal elements from meta topics, followed by text-to-multimodality conversion. Additionally, there are methods that empower MLLMs to synthesize data rather than prompting GPT. Genixer [266], a holistic data generation pipeline, consists of four key steps: instruction data collection, instruction template design, empowering MLLMs, and data generation and filtering.
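A rough sketch of this code-then-render pattern for from-scratch multimodal synthesis, assuming a hypothetical `render` helper that executes plotting code in a sandbox and returns image bytes; the prompts are illustrative rather than taken from [263]:

```python
from typing import Callable, List, Tuple

def synthesize_chart_sample(llm: Callable[[str], str],
                            render: Callable[[str], bytes],
                            n_questions: int = 3) -> Tuple[bytes, List[str]]:
    """Multi-modal Self-Instruct-style synthesis from scratch (sketch,
    after [263]): the LLM invents a visual concept, writes plotting code,
    the code is rendered to an image, and the LLM then writes Q&A pairs
    about the image it specified."""
    concept = llm("Propose a simple chart idea (data + chart type):").strip()
    code = llm(f"Write matplotlib code that draws this chart:\n{concept}")
    image = render(code)  # execute in a sandbox; failed renders are discarded
    qas = [llm(f"The image shows: {concept}\nWrite question-answer pair "
               f"#{i + 1} about it:") for i in range(n_questions)]
    return image, qas
```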
3.1.2 Data Augmentation. Data augmentation aims to further process existing data to obtain a more diverse set of high-quality data. In the present survey, we divide the existing methods of data augmentation into four groups: data labeling, data reformation, co-annotation, and non-LLM-driven methods.

Data Labeling. Data labeling aims to harness the language comprehension abilities of LLMs for annotating unlabeled datasets [3, 63]. Based on a balanced sample of 500 tweets from Republican and Democratic politicians, ChatGPT-4 has been utilized to annotate political affiliation [198]; the results indicate that LLM annotations display higher accuracy and lower bias than human classifiers. To apply to distinctive annotation tasks, one approach [275] begins by creating a generalized prompt template, which is then used to generate a ChatGPT prompt aimed at extracting labels in alignment with the dataset's original annotation methodology. Similarly, another method involves initially creating dialogues that closely match the content of a reference dialogue, and subsequently prompting the LLM to label the generated conversation using the same annotation schema provided in the existing repository [99]. Further, scholars have studied methods to improve the quality of LLM-labeled data using additional information. For instance, FullAnno [74] instructs LLMs to produce comprehensive annotations for images with prompts that include the category and position of objects, region descriptions, and text information within the image. In the field of speech emotion recognition, one approach [103] first annotates samples based solely on text, and then enhances the annotation by integrating audio features and gender information alongside the textual data; a VQ-VAE is employed to produce a 64-dimensional discrete representation of the audio, which is supplied to the LLM.
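A minimal sketch of this kind of LLM-based labeling, assuming a generic `llm` text-completion callable and a closed label set; the prompt wording is illustrative rather than taken from [198] or [275]:

```python
from typing import Callable, List, Optional

def llm_label(llm: Callable[[str], str], text: str,
              labels: List[str]) -> Optional[str]:
    """Zero-shot annotation sketch: ask the LLM to pick one label
    from a closed set and keep the answer only if it is valid."""
    prompt = (f"Classify the following text with exactly one label "
              f"from {labels}.\nText: {text}\nLabel:")
    answer = llm(prompt).strip()
    return answer if answer in labels else None  # discard malformed outputs

# Usage: labeled = [(t, llm_label(llm, t, ["Republican", "Democrat"])) for t in tweets]
```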
Data Reformation. Data reformation attempts to transform existing data into a wider array of variations, and it typically involves prompt engineering to guide LLMs in generating reformatted data. For example, TinyGSM [128] is constructed by prompting an LLM to generate problem variants from GSM8K and subsequently filtering out low-quality instances. Similarly, GPT3Mix [237] extracts augmentations from the generations of an LLM by constructing a task-specific prompt from selected examples and meta-information about the dataset. Data reformation is also widely applied to generating counterfactually augmented data. Specifically, CORE [45] first learns to retrieve relevant text excerpts; the retrieved excerpts, along with instructions and demonstrations, are then supplied as a prompt to GPT-3 to counterfactually edit the input text. In addition, DISCO [27] first decomposes given task instances into spans using linguistic processing tools, and subsequently employs prompt engineering and in-context learning with an LLM to overgenerate a diverse set of perturbations for these instances. For multi-modality, ALIA [51] first generates captions for each image and summarizes the captions into a short list of domain descriptions with an LLM, then uses these descriptions to generate edits of the training data with Stable Diffusion.
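As a rough sketch of prompt-driven reformation in the style of TinyGSM [128] or GPT3Mix [237], assuming a generic `llm` callable; the prompt text and the crude length filter are our own stand-ins for the model-based filtering used in practice:

```python
from typing import Callable, List

def reform_variants(llm: Callable[[str], str], problem: str,
                    n: int = 3, min_len: int = 20) -> List[str]:
    """Generate reworded variants of an existing problem, then apply
    a simple quality filter before keeping them as augmented data."""
    variants = []
    for i in range(n):
        variant = llm(f"Rewrite the following problem so it tests the same "
                      f"skill but with a different surface form (version {i + 1}):\n"
                      f"{problem}").strip()
        if len(variant) >= min_len and variant != problem:  # drop degenerate outputs
            variants.append(variant)
    return variants
```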
Co-Annotation. Co-annotation refers to the process where humans and LLMs annotate unlabeled data together. For instance, ToolCoder [259] utilizes human-written input-output examples within the prompt to direct ChatGPT in annotating a tool-augmented dataset. To address the problem of low agreement between annotators, ChatGPT has been used to augment the annotation process with phrases identified as relevant features and brief explanations [11]. Investigations into allocating data within the same dataset between humans and ChatGPT show that this can achieve more efficient and more accurate annotation: specifically, the CoAnnotating [116] framework automatically decides whether each data instance should be annotated by humans or by the LLM by computing the uncertainty level of the LLM's annotations. There are also iterative co-annotation methods for data augmentation. For example, initiated with a task prompt consisting of a description of the desired dialogue and the dialogue history, Dialgen [142] first proposes a candidate subdialogue; human reviewers then validate, edit, and annotate the generated subdialogue before requesting a continuation via an updated prompt to the LLM. The process can be repeated multiple times until the dialogue is complete.

Non-LLM-Driven. Some methods do not use LLMs to synthesize or filter high-quality data. For example, AMPS [77], an extensive and varied corpus for mathematics pretraining, comprises over 5 million problems generated with Mathematica scripts based on 100 meticulously crafted modules. In the field of physics, Mind's Eye [136] utilizes a computational physics engine to produce ground-truth answers for the UTOPIA multi-task physics alignment dataset, designed to assess the ability of LLMs to understand fundamental physical laws. Besides, filtering and pruning strategies are utilized for data augmentation. For instance, Proof-Pile-2 [6] is an extensive dataset with 55B tokens of mathematics and mathematical code, enriched by filtering high-quality data from publicly available resources. Recognizing substantial redundancies in synthetic training data, an efficient and scalable pruning strategy [200] has been proposed that encompasses encoding, dimensionality reduction, clustering, and pruning.

3.2 Pre-Training

During the pre-training stage, data synthesis and augmentation can provide LLMs with abundant, diverse, and controllable training data cost-effectively and efficiently, thereby enhancing model performance and reducing bias. We discuss the existing methods from three perspectives: model self-improvement, general model distillation, and data augmentation.

3.2.1 Model Self-Improvement. In the pre-training phase, model self-improvement denotes synthesizing data with an LLM and then using the synthetic data to pre-train the same LLM. For instance, VILA-2 [54] utilizes a self-augmenting process in which the current round of VILA is used to generate long, detailed captions, with an appropriate prompt choice and conversation template, for the next round of pretraining.
Table 2: Data synthesis and augmentation in pre-training. Method outlines the techniques presented by each research work. Data source and synthetic data indicate the original data used to generate synthetic data and the synthetic data created for pre-training, respectively. Base model and pre-trained model indicate the foundational models and the models that have undergone pre-training, respectively. A dash (-) in any cell denotes that the respective content was not mentioned in the cited literature.

Category | Modality | Method | Data Source | Synthetic Data | Base Model | Pre-trained Model | Date
Model self-improvement | Multi-modality | - | MMC4, Coyo, ShareGPT4V | Images with re-captioned texts | VILA | VILA-2 [54] | 07/2024
General model distillation | Uni-modality | - | Code snippets | 1B tokens of Python textbooks | 1.3B-parameter model | phi-1-Base [67] | 06/2023
General model distillation | Uni-modality | - | 20k topics | 20B tokens of textbook data | 1.3B-parameter model | phi-1.5 [120] | 09/2023
General model distillation | Uni-modality | - | Conversation seed | TinyDialogues [56] | GPT-2 | - | 08/2024
General model distillation | Uni-modality | - | Scientific corpora | SciLitIns | Qwen2-Base | SciLitLLM [118] | 08/2024
General model distillation | Uni-modality | TRAIT [123] | GSM8k, OpenWebMath | Task-oriented synthetic passages | Mistral | - | 06/2024
General model distillation | Multi-modality | - | 100 meta topics | AnyInstruct-108k | Llama 2 | AnyGPT [251] | 02/2024
General model distillation | Multi-modality | - | Sub-molecule groups | Physics-based synthetic data [134] | MoLFormer | - | 07/2024
Augmentation | Uni-modality | WRAP [150] | C4 | Re-phrased texts | Decoder-only transformers | - | 01/2024
Augmentation | Uni-modality | - | CC100 | Knowledge-intensive multilingual data | mBERT, XLM-R | KMLM [133] | 11/2021
Augmentation | Uni-modality | - | bioS | bioR [276] | Llama | - | 09/2023

3.2.2 General Model Distillation. General model distillation denotes the utilization of a general LLM with strong capabilities to distill high-quality data. To demonstrate the power of high-quality data in breaking existing scaling laws, phi-1 [67] is pre-trained on a code dataset of "textbook quality", both synthetically generated with GPT-3.5 and filtered from web sources. Following the method of phi-1 and extending the task to common sense reasoning in
natural language, phi-1.5 [120] is pre-trained on a combination of phi-1's training data and newly created synthetic data. Inspired by TinyStories, TinyDialogues [56] is created by prompting GPT-4 to generate realistic dialogues featuring children of various ages as the main participants. In the continual pre-training stage of SciLitLLM [118], Llama3-7B-Instruct is utilized to correct the errors introduced during PDF parsing, followed by supervised transfer learning on a classifier to filter out low-quality texts from the dataset. For multi-modality, AnyGPT [251], pre-trained on a multimodal text-centric dataset, is an any-to-any multimodal language model capable of comprehending and producing diverse modalities. Utilizing the open-source text-to-image generation model GLIDE, it has been demonstrated that synthetic data significantly aids classifier learning and holds substantial promise for model pre-training [76].

3.2.3 Data Augmentation. Data augmentation aims to further process existing data to obtain a more diverse dataset. In the pre-training stage, there are mainly two kinds of methods: data reformation and non-LLM-driven methods.

Data Reformation. Data reformation transforms the original dataset to obtain a new dataset with diversity and quality. For instance, WRAP [150] employs an off-the-shelf instruction-tuned model to paraphrase web documents in various styles, thereby pre-training LLMs on a combination of real and synthetic rephrases. Similarly, the BIO dataset bioR [276] is constructed by rewriting a synthetic dataset of 100k biographies using Llama to bring them close to a real-life biography style.
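A minimal WRAP-style rephrasing step, assuming an instruction-tuned `llm` callable; the style names follow the spirit of [150], but the exact prompts are our own:

```python
from typing import Callable, Dict, List

STYLES: Dict[str, str] = {
    "wikipedia": "Rewrite this web text in a neutral, encyclopedic style:",
    "qa": "Rewrite this web text as a question followed by its answer:",
}

def wrap_rephrase(llm: Callable[[str], str], documents: List[str]) -> List[str]:
    """WRAP-style corpus augmentation (sketch, after [150]): keep the real
    documents and add style-controlled paraphrases, so that pre-training
    sees a mixture of real and synthetic rephrases."""
    corpus = list(documents)  # the real data is retained
    for doc in documents:
        for prompt in STYLES.values():
            corpus.append(llm(f"{prompt}\n{doc}"))
    return corpus
```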
Non-LLM-Driven. Other methods augment the original dataset without using LLMs. For example, Code Llama undergoes further pretraining on Proof-Pile-2, a 55B-token augmented dataset consisting of scientific papers and web data, yielding LLEMMA [6]. KMLM [133] is pre-trained on massive multilingual knowledge graph triples, which are constructed by converting the structured knowledge from knowledge graphs into sequential data. To tackle the pathology of data scarcity, a physics-based modeling framework [134] has been proposed that generates a multitude of synthetic data to align the LLM to a physically consistent initial state.

3.3 Fine-Tuning

In the fine-tuning phase, data synthesis and augmentation refer to the employment of generated data to fine-tune LLMs. It has been proven that generated data can effectively contribute to the fine-tuning of LLMs [210, 223]. We discuss the existing methods from three perspectives: model self-improvement, general model distillation, and data augmentation.

3.3.1 Model Self-Improvement. The model self-improvement approach enables the LLM to learn from its own outputs through a feedback process, thus eliminating the need for external support. Based on whether a method uses iterative self-improvement and on the modality of the synthetic data, we group the existing self-improvement strategies into three categories: single-shot self-improvement, iterative self-improvement, and multi-modal self-improvement.

Single-Shot Self-Improvement. Single-shot self-improvement denotes the process of synthesizing data with an LLM and then performing a single round of fine-tuning on the same LLM with the synthesized data. One category of methods supplements information in the training dataset. For example, one approach [81] involves using an LLM to create "high-confidence" rationale-augmented responses for unlabeled questions through chain-of-thought prompting and self-consistency checks. Based on the observation that model performance has a log-linear relation with the amount of supervised data, rejection sampling fine-tuning (RFT) [242] utilizes the LLM's capabilities to generate and compile accurate reasoning trajectories as an augmented dataset. Similarly, Self-Translate-Train [170] capitalizes on the LLM's translation prowess to produce synthetic training datasets in the desired target language, subsequently employing this self-generated data for fine-tuning. On the other hand, another category of methods synthesizes new samples based on existing seed data. For instance, CodeRL [104] introduces an actor-critic framework that employs an LLM as the actor network to produce synthetic samples, while a separate neural network serves as a critic model to assess the quality of these samples. Further, Self-Instruct [210] prompts an LLM to generate new instructions and corresponding instances, which can be used for instruction tuning of the LLM itself. Based on Self-Instruct, Code Llama-Instruct [174] is enhanced through fine-tuning on a combination of proprietary instructional data and a synthetic self-instruct dataset, which is crafted by prompting Llama 2 for coding issues and Code Llama for corresponding unit tests and solutions.
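A sketch of the rejection-sampling idea behind RFT [242], assuming a sampling function and a programmatic answer check; the original method's deduplication by reasoning-path equations is reduced here to simple string dedup:

```python
from typing import Callable, List, Tuple

def rft_collect(sample: Callable[[str], str], check: Callable[[str, str], bool],
                question: str, gold: str, k: int = 8) -> List[Tuple[str, str]]:
    """Rejection-sampling data collection (sketch, after [242]): draw k
    reasoning paths per question and keep the distinct ones whose final
    answer matches the reference; the survivors become fine-tuning data."""
    kept, seen = [], set()
    for _ in range(k):
        path = sample(question)                       # one sampled chain of thought
        if check(path, gold) and path not in seen:    # correct and not a duplicate
            seen.add(path)
            kept.append((question, path))
    return kept
```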
Table 3: Data synthesis and augmentation in fine-tuning. In the table, method outlines the techniques presented by each research work. Data source and synthetic data indicate the original data used to generate synthetic data and the synthetic data created for fine-tuning, respectively. Base model and fine-tuned model indicate the foundational models and the models that have undergone fine-tuning, respectively. A dash (-) in any cell denotes that the respective content was not mentioned in the cited literature.

Category | Modality | Method | Data Source | Synthetic Data | Base Model | Fine-tuned Model | Date
Model self-improvement | Uni-modality | STaR [248] | Arithmetic, etc. | A rationale dataset | GPT-J | - | 03/2022
Model self-improvement | Uni-modality | - | GSM8k, etc. | Reasoning dataset | PaLM-540B | LMSI [81] | 10/2022
Model self-improvement | Uni-modality | ReST [66] | IWSLT 2014, etc. | - | Transformer | - | 08/2023
Model self-improvement | Uni-modality | ReST-EM [187] | MATH, APPS | Synthetic math and code | PaLM 2 | - | 12/2023
Model self-improvement | Uni-modality | - | RLHF V5 | Self-instruction dataset | Llama 2 | Code Llama-Instruct [174] | 08/2023
Model self-improvement | Uni-modality | - | miniF2F, FIMO | Theorem-proof pairs | DeepSeekMath | DeepSeek-Prover [220] | 05/2024
Model self-improvement | Uni-modality | Self-Translate [170] | SQuAD, MultiNLI | Translated synthetic data | Llama 2 | - | 06/2024
Model self-improvement | Uni-modality | RFT [242] | GSM8k | Rejection-sampled samples | Llama | - | 08/2023
Model self-improvement | Uni-modality | CodeRL [104] | Public code | Synthetic samples | CodeT5 | - | 07/2022
Model self-improvement | Uni-modality | SPIN [26] | Ultrachat200k | 100k synthetic dataset | zephyr | - | 01/2024
Model self-improvement | Uni-modality | Self-Instruct [210] | 175 human-written samples | 52k instructions | GPT-3 | GPT-3-self-inst | 12/2022
Model self-improvement | Uni-modality | Impossible Distillation [91] | Contextual constraints | DIMPLE | T5-large | Impossible-T5 | 05/2023
General model distillation | Uni-modality | - | 175 instruction-response pairs | Instruction-following samples | Llama | Alpaca [196] | 2023
General model distillation | Uni-modality | LAB [191] | A taxonomy | Synthetic instruction dataset | Llama-2, Mistral | Labradorite, Merlinite | 03/2024
General model distillation | Uni-modality | GLAN [111] | A taxonomy | Synthetic instruction dataset | Mistral | - | 02/2024
General model distillation | Uni-modality | - | Code Alpaca | Instruction-following data | StarCoder | WizardCoder [146] | 06/2023
General model distillation | Uni-modality | CLINGEN [226] | Clinically relevant knowledge | Synthetic clinical data | PubMedBERT | - | 11/2023
General model distillation | Uni-modality | Self-chat | MedQuAD | 111.5k dialogues | Llama | Baize [222] | 04/2023
General model distillation | Uni-modality | LLM2LLM [107] | GSM8k, etc. | Task-specific datasets | Llama 2 | - | 03/2024
General model distillation | Uni-modality | - | CMeKG | Question-answer instances | Llama | HuaTuo [204] | 04/2023
General model distillation | Uni-modality | - | FLAN-v2 | Query-response pairs | Llama | Orca [154] | 06/2023
General model distillation | Uni-modality | - | FLAN-v2, Orca 2 dataset | 1 million data | Llama 2 | Orca 2 [152] | 11/2023
General model distillation | Uni-modality | Evol-Instruct [223] | Alpaca | 250k instructions | Llama | WizardLM | 04/2023
General model distillation | Uni-modality | OSS-Instruct [215] | starcoderdata | Coding instruction data | CodeLlama, etc. | Magicoder | 12/2023
General model distillation | Uni-modality | - | MATH, etc. | MathInstruct | Llama | MAmmoTH [244] | 09/2023
General model distillation | Multi-modality | NExT-GPT [219] | Multi-modal data | MosIT dataset | - | - | 09/2023
General model distillation | Multi-modality | - | COCO | Instruction-following data | Vicuna | LLaVA [138] | 04/2023
General model distillation | Multi-modality | - | PMC-15M | Instruction-following data | LLaVA | LLaVA-Med [110] | 06/2023
General model distillation | Multi-modality | - | Specific characteristics | ChartLlama dataset | LLaVA | ChartLlama [73] | 11/2023
General model distillation | Multi-modality | - | 100k images | ShareGPT4V | LLaVA | ShareGPT4V [21] | 11/2023
Augmentation | Uni-modality | - | NCBI Disease, GAD | Structured information [195] | BERT, etc. | - | 03/2023
Augmentation | Uni-modality | - | MedQA | UltraMedical | Llama 3 | Llama-3-UltraMedical [258] | 06/2024
Augmentation | Uni-modality | - | GSM8k, MATH | MetaMathQA | Llama | MetaMath [238] | 09/2023
Augmentation | Uni-modality | Symbol tuning [213] | NLP datasets | Input-label pairs | Flan-PaLM | - | 05/2023
Augmentation | Uni-modality | - | MedMCQA | DISC-Med-SFT | Baichuan-13B-Base | DISC-MedLLM [10] | 08/2023
Augmentation | Uni-modality | MathGenie [144] | GSM8k, MATH | Problem-solution pairs | Llama 2 | - | 02/2024
Augmentation | Uni-modality | - | Scientific papers, web data | Proof-Pile-2 | Code Llama | Llemma [6] | 10/2023
Augmentation | Uni-modality | - | MedDialog-CN, IMCS-V2, etc. | BianQueCorpus | ChatGLM | BianQue [25] | 10/2023
Augmentation | Multi-modality | SelTDA [94] | A-OKVQA, AQUA | Question-answer pairs | ViT-B/16 | - | 2023
Iterative Self-Improvement. Furthermore, to improve the quality, diversity, and amount of synthetic data, various approaches iteratively synthesize datasets and continuously fine-tune the LLM in improvement loops. To this end, a self-improvement strategy [71] has been proposed in which the LLM generates its own synthetic puzzle-solution pairs, which are filtered before being used to fine-tune the LLM itself. Further, STaR [248] constructs an augmented dataset by using the LLM's rationale generation ability and justifying ground-truth answers to problems the model failed to solve. Besides, ReST [66] first augments the training dataset by generating multiple output predictions with an LLM, and then fine-tunes the same LLM on the filtered dataset with an offline reinforcement learning objective. Following ReST, ReST-EM [187] refrains from augmenting the dataset in the generate step with human-generated outputs, and fine-tunes the base LLM instead of the model obtained from the previous iteration in the improve step. In addition, DeepSeek-Prover [220] is repeatedly fine-tuned on synthetic data generated through autoformalization, quality filtering, and statement proving, and the updated model is then utilized for the subsequent iteration. In SPIN [26], a self-play strategy is implemented in which an LLM is fine-tuned to distinguish the response of the opponent player (the LLM from the previous iteration) from the target data distribution, thereby iteratively aligning the LLM with the target data distribution.
3.3.2 General Model Distillation. General model distillation denotes distilling high-quality fine-tuning data from a powerful LLM. In the present survey, we divide the existing methods of general model distillation into five categories: synthesize with seeds, synthesize data iteratively, synthesize reasoning steps, taxonomy-driven synthesis, and synthesize multimodal data.
Synthesize with Seeds. Synthesizing data from existing instance examples or data seeds is the most common approach [17, 196, 215]. For instance, Unnatural Instructions [80] gathers examples by initially providing a language model with three seed examples to generate a fourth; the dataset is then expanded by prompting the model to rephrase each instruction. By utilizing self-chat to generate a multi-turn chat corpus with ChatGPT, Baize [222] is obtained through parameter-efficient tuning and self-distillation with feedback. Moreover, CLINGEN [226] capitalizes on clinical knowledge extraction to contextualize prompts; the strategy encompasses creating clinical topics from both knowledge graphs and LLMs, as well as extracting writing style recommendations from LLMs. Similarly, HuaTuo [204] is a Llama-based model that has undergone SFT using generated QA data, which is synthesized by extracting knowledge instances from a knowledge graph and creating additional instances with the help of ChatGPT. In addition, Impossible Distillation [91] enhances small, low-quality language models by utilizing paraphrase proximity and critic-guided distillation to create a high-quality paraphrase dataset.

Synthesize Data Iteratively. To construct high-quality and diverse data, some approaches build frameworks that can be executed multiple times. For instance, WizardLM [223], fine-tuned with 70k synthetic data generated by the Evol-Instruct method, achieves state-of-the-art results on high-complexity tasks and remains competitive on other metrics. Evol-Instruct has been further adapted to the code domain, with modifications including refined evolutionary instructions, simplified evolutionary prompt formats, and the incorporation of code debugging and time-space complexity constraints [146]. Moreover, LLM2LLM [107] fine-tunes a student model on an initial dataset, then identifies its errors and augments the training data with synthetic examples from a teacher LLM based on those errors; the process repeats to train the student model on increasingly targeted data points.
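For illustration, a minimal Evol-Instruct-style evolution round, assuming a generic `llm` callable; the two evolution prompt lists are paraphrases in the spirit of [223], not the paper's actual prompts:

```python
import random
from typing import Callable, List

IN_DEPTH = [
    "Add one more constraint or requirement to this instruction:",
    "Rewrite this instruction to require deeper, multi-step reasoning:",
]
IN_BREADTH = ["Write a brand-new instruction inspired by, but different from, this one:"]

def evolve(llm: Callable[[str], str], instructions: List[str],
           rounds: int = 2) -> List[str]:
    """Evol-Instruct-style evolution (sketch, after [223]): in each round,
    rewrite every instruction either in depth (harder) or in breadth (new)."""
    pool = list(instructions)
    for _ in range(rounds):
        evolved = [llm(f"{random.choice(IN_DEPTH + IN_BREADTH)}\n{inst}").strip()
                   for inst in pool]
        pool.extend(evolved)  # a quality filter would prune failed evolutions here
    return pool
```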
Synthesize Reasoning Steps. Recent studies have concentrated on improving the performance of LLMs through imitation learning, drawing on the outputs generated by large foundation models (LFMs). However, smaller language models tend to imitate the style, but not the reasoning process, of LFMs. To this end, GAIR-Abel [28] found that the structure of the augmented responses significantly impacts overall performance: answers that begin with a paraphrase of the question and proceed with a step-by-step solution show superior performance to those in the standard format. Further, Orca [154] is explanation-tuned on query-response pairs augmented with detailed responses from GPT-4. This approach allows Orca to learn from rich signals, including explanation traces, step-by-step thought processes, and other complex instructions. In Orca 2 [152], various reasoning techniques, including step-by-step, recall-then-generate, recall-reason-generate, and direct answer, are investigated, endowing the model with better reasoning capability. Additionally, MAmmoTH [244], a series of open-source LLMs specifically tailored for general math problem-solving, are fine-tuned on MathInstruct, a hybrid instruction tuning dataset with CoT and PoT rationales.

Taxonomy-Driven Synthesis. The aforementioned methods mostly synthesize datasets from seeds, while recent research has adopted another novel approach: synthesizing datasets through a taxonomy-driven method. To address the tail phenomenon, LAB [191] replaces the random sampling in existing synthetic data generation methods with a taxonomy-driven approach to guide the sampling of synthetic data. Similar to LAB, GLAN [111] utilizes a semi-automatic approach to synthesize large-scale datasets, using a human-curated taxonomy to generate instruction-tuning data from a teacher model. Further, SciLitIns [118] is constructed by a three-step pipeline: collecting a probability table of domain keywords, compiling a list of task descriptions, and prompting GPT-4o to generate the data.

Synthesize Multimodal Data. General model distillation also holds great potential for multimodal applications. For instance, visual instruction tuning [130] extends instruction-tuning to the language-image multimodal space by leveraging language-only GPT-4 for multimodal instruction-following data generation. By fine-tuning LLaVA on multimodal synthetic data, ChartLlama [73] is developed with a broad spectrum of chart understanding and generation capabilities. Besides, a curriculum learning method [110] has been introduced that involves first fine-tuning LLaVA to align biomedical vocabulary and then continuing to train the model on instruction-following data generated by GPT-4. Further, ShareGPT4V-7B [21] is fine-tuned on the ShareGPT4V dataset and demonstrates impressive performance across various multimodal benchmarks. In addition, NExT-GPT [219] leverages a modality-switching instruction tuning (MosIT) methodology, which prompts GPT-4 to generate multimodal dialogues under various scenarios based on template dialogue examples.

3.3.3 Data Augmentation. Data augmentation involves enhancing existing data through various techniques to create a more extensive and varied dataset. In the fine-tuning stage, there are mainly two kinds of methods: data labeling and data reformation.

Data Labeling. Data labeling denotes generating annotations for unlabeled data. For instance, a pipeline [195] has been proposed to generate a large volume of synthetic data with labels using LLMs and to further eliminate low-quality or duplicated samples; the results demonstrate the effectiveness of the method compared to the LLM's zero-shot performance. Furthermore, Llama-3-UltraMedical [258] is obtained by supervised fine-tuning on the UltraMedical dataset, which includes instructions annotated with completions from various LLMs and with preferences annotated by GPT-4. For multi-modality, SelTDA [94] uses the vision-language model and the target dataset to build a teacher model that can generate question-answer pseudo-labels conditioned directly on an image alone.

Data Reformation. Data reformation refers to transforming existing data into a more diverse form, thereby augmenting the data. For example, symbol tuning [213] fine-tunes language models on input-label pairs presented in-context, where natural language labels are remapped to arbitrary symbols. Further, DISC-MedLLM [10] is fine-tuned on an SFT dataset constructed by utilizing medical knowledge graphs, reconstructing real-world dialogues, and rephrasing human-guided preferences. Besides, MetaMath [238] is fine-tuned on the MetaMathQA dataset, which is constructed by rewriting questions with both forward and backward reasoning paths through LLMs. In addition, MathGenie [144] consists of three components, namely iterative solution augmentation, question back-translation, and verification-based solution filtering, to create diverse and reliable data. Moreover, BianQueCorpus [25] is constructed by collecting real-world multi-turn health conversations, building an automatic data cleaning process, and using ChatGPT to polish the doctors' suggestions in multi-turn conversations.

3.4 Instruction-Tuning

In the instruction tuning phase, data synthesis aims at exploring synthetic instructions or prompt contents to generate high-quality instruction-following data via LLMs. According to the way the data are synthesized, the methods fall into the following categories: (1) general model distillation, (2) model self-improvement, and (3) data augmentation, as shown in Table 4.
Table 4: Data synthesis and augmentation in instruction-tuning. A dash (-) indicates no relevant content.

Category | Subcategory | Method | Data Source | Synthetic Data | Base Model | Target Model
General model distillation | Uni-modality | Alpaca [196] | 175 instruction-response pairs | 52k instruction-following samples | GPT-3.5 | LLaMA
General model distillation | Uni-modality | Vicuna [29] | User-shared conversations from ShareGPT.com | 9k instruction-following samples | ChatGPT | LLaMA
General model distillation | Uni-modality | WizardLM [223] | 175 instruction-response pairs | 250k instructions | ChatGPT | LLaMA
General model distillation | Uni-modality | Orca [154] | FLAN-v2 | 5 million query-response pairs | GPT-4 | LLaMA
General model distillation | Uni-modality | Orca 2 [152] | FLAN-v2 & Orca 2 dataset | 1 million data | GPT-4 | LLaMA
General model distillation | Multi-modality | LLaVA [130] | Image-text pairs | Text-only prompt-answer pairs | ChatGPT and GPT-4 | Vicuna and CLIP
Model self-improvement | Uni-modality | Self-Instruct [210] | 175 human-written samples | 52k instructions | GPT-3 | GPT-3
Model self-improvement | Uni-modality | Backtranslation [119] | 13.2k instruction examples | 502k instruction-output pairs | LLaMA | LLaMA
Model self-improvement | Uni-modality | SPIN [26] | 50k prompts from Ultrachat200k | 100k synthetic dataset | zephyr-7b-sft-full | zephyr-7b-sft-full
Model self-improvement | Uni-modality | ReST-EM [187] | Hendrycks' MATH and APPS datasets | 32 or 64 solutions per problem | PaLM 2 | PaLM 2
Model self-improvement | Uni-modality | CAI [8] | 16 principles or instructions | 318k harmful and helpful instructions | 52B SL-CAI | 52B SL-CAI
Model self-improvement | Multi-modality | SelTDA [94] | Unlabeled images | Image-conditional question-answer pairs | ViT-B/16 | ViT-B/16
Data augmentation | Data labeling | [198] | Political Twitter messages | Annotated political Twitter messages | ChatGPT-4 | -
Data augmentation | Data labeling | Machine Translation [252] | Monolingual data | Back-/forward-translated monolingual data | GLM-130B | -
Data augmentation | Data labeling | T-SciQ [205] | Original question-answer data | Planning-based CoT rationales | GPT-3.5 | -
Data augmentation | Data reformation | CORE [45] | Task-related unlabeled texts | Diverse counterfactual perturbations | GPT-3 | -
Data augmentation | Data reformation | ALIA [51] | Image data | Language-guided image edits | GPT-4 | -
Data augmentation | Data reformation | ChatAug [37] | Text data | Semantically similar sentences | ChatGPT | -
Data augmentation | Co-annotation | CoAnnotating [116] | Unstructured texts | Responses via different prompt variations | ChatGPT | -
Data augmentation | Co-annotation | [11] | Sponsored content on social media | Generated explanations | ChatGPT | -
Data augmentation | Co-annotation | ToolCoder [259] | Source code | API-augmented code | ChatGPT | -

3.4.1 General Model Distillation. To obtain diverse data, a popular method adopts a stronger LLM to synthesize data and perform
instruction-tuning for a weaker LLM [80], including uni-modal LLMs via reverse instructions and builds an instruction-following
synthesis and multi-modal synthesis. text dataset, offering a cost-effective and fast approach to perform
Uni-Modality. Uni-modality synthesizes a specific type of data instruction tuning and output high-quality synthetic data. To fine-
via teacher LLMs. Alpaca [196] first generates instruction-following tune instructions to optimize Code Large Language Models (Code
demonstrations via GPT-3.5 (text-davinci-003) and then fine-tunes LLMs), WizardCoder [146] enhances the fine-tuning of complex
llama-7b to create a replicable instruction-following model. Next, instructions for Code LLMs by adapting the Evol-Instruct method
Alpagasus [22] discovers that the instruction-tuning dataset used by to the code domain. WizardCoder produces intricate code instruc-
Alpaca contains many incorrect or irrelevant low-quality instances. tion set to improve StarCoder [117] model through code-specific
In response, they design a quality filtering strategy that leverages Evol-Instruct [223].
powerful LLMs like ChatGPT to automatically identify and filter out Multi-Modality. Multi-modality generates cross-modality data
low-quality data. The results demonstrated that a small amount of via LLMs [50]. As a typical method, LLaVA [130] is the first attempt
high-quality data was sufficient to train a model with even stronger to extend instruction-tuning to the language-image multimodal
performance. Based on Alpaca, Vicuna [29] gathers user-shared domain. It uses ChatGPT and GPT-4 to convert image-text pairs
conversations from ShareGPT.com to build an open-Source chatbot. into multimodal instruction-following data and then fine-tunes on
Given that the production of high-complexity instructions may the generated instructional vision-language data by combining the
pose a challenge for humans, WizardLM [223] proposes an Evol- visual encoder CLIP [164] and language decoder Vicuna [29]. On
Instruct strategy to rewrite and produce more complex instructions. this basis, LLaVA-1.5 [129], LLaVA-Plus [138], LLaVA-Interactive
Evol-Instruct uses LLMs to automatically mass-produce various [24], and LLaVA-Med [110] further extend LLaVA to a variety of
instructions at different levels, including two evolution strategies: multimodal tasks and design specialized prompt templates for bet-
in-depth evolution and in-breadth evolution. ter fine-tuning. For example, LLaVA-Plus is dedicated for tool and
While the above models produce abundant data via LLMs, the generated data often lack the reasoning and comprehension skills displayed by stronger LLMs. To this end, Orca [154] and Orca2 [152] imitate the reasoning process of stronger LLMs via explanation traces to output synthetic samples. Compared to vanilla instruction tuning, Orca leverages system instructions to augment query-response pairs with detailed reasoning explanations. Based on Orca, Orca2 further introduces various reasoning strategies, such as step-by-step and recall-reason-generate, to learn to determine the most effective solution strategy for each task. Orca2 distills a synthetic dataset by collecting the FLAN-v2 Collection [141], a 55K few-shot dataset, the Math dataset [176], and 2,000 doctor-patient conversations, creating cautious system instructions to achieve cautious reasoning. Moreover, Baize [222] proposes an open-source chat model that generates a high-quality multi-round dialogue corpus by leveraging ChatGPT to engage in a conversation with itself. Baize employs questions from Quora and Stack Overflow as seeds to generate 111.5k dialogues through self-chat. LongForm [96] generates instructions via web documents, instead of generating responses from instructions.

Multi-Modality. Multi-modality generates cross-modality data via LLMs [50]. As a typical method, LLaVA [130] is the first attempt to extend instruction-tuning to the language-image multimodal domain. It uses ChatGPT and GPT-4 to convert image-text pairs into multimodal instruction-following data and then fine-tunes on the generated instructional vision-language data by combining the visual encoder CLIP [164] and the language decoder Vicuna [29]. On this basis, LLaVA-1.5 [129], LLaVA-Plus [138], LLaVA-Interactive [24], and LLaVA-Med [110] further extend LLaVA to a variety of multimodal tasks and design specialized prompt templates for better fine-tuning. For example, LLaVA-Plus is dedicated to tool and skill use in human-AI interaction sessions by incorporating user instructions that request multimodal tools, along with their execution results, into LLaVA. LLaVA-Med generates instruction-following data from image captions through GPT-4 to capture open-ended conversational semantics, building a vision-language conversational assistant that answers questions about biomedical images.

3.4.2 Model Self-Improvement. Model self-improvement aims at bootstrapping synthetic data from the model itself, including uni-modal synthesis and multi-modal synthesis.

Uni-Modality. This category generates uni-modality data to implement instruction-tuning via the LLM itself. For example, Self-Instruct [210] prompts an off-the-shelf GPT-3 to generate both new instructions and corresponding instances. It enhances GPT-3's ability to follow instructions by leveraging its own generated outputs. Motivated by backtranslation methods, Instruction Backtranslation [119] generates instructions from human-written "responses" drawn from web documents. It adopts a self-curation step to select high-quality pairs and produce augmented instruction-response pairs. In addition, SPIN [26] designs a self-play mechanism, akin to generative adversarial networks (GANs), to achieve instruction tuning. It adopts instances of the same LLM from different iterations to play the roles of the player (discriminator) and the opponent (generator). To move beyond human data, ReST^EM [187] proposes a two-step self-training method via expectation-maximization for reinforcement learning. It first generates multiple samples for each instruction and filters them through binary feedback to create synthetic data, and then fine-tunes on the model-generated synthetic data. To explore harmlessness from AI feedback, CAI [8] generates self-critiques and revised responses via designed principles or instructions (i.e., a constitution) and then fine-tunes the original model to achieve self-improvement in harmlessness. To better teach LLMs to use tools, Toolformer [177] lets LLMs automatically transform the original input into input for API calls via prompts. In this way, LLMs can teach themselves to use external tools via simple APIs in a self-supervised way.
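The generate-filter-finetune loop described above for ReST^EM can be sketched compactly. All helpers here (`model.sample`, `reward_fn`, `model.finetune`) are assumed placeholders for illustration, not the paper's API:

```python
def rest_em(model, prompts, reward_fn, n_samples=8, iterations=3):
    """ReST^EM-style self-training sketch: the E-step samples candidate
    answers, binary rewards keep only verified ones, and the M-step
    fine-tunes on the retained model-generated data."""
    for _ in range(iterations):
        dataset = []
        for prompt in prompts:
            # E-step: draw several candidate responses per prompt.
            candidates = [model.sample(prompt) for _ in range(n_samples)]
            # Binary feedback (e.g., unit tests or answer checking)
            # filters candidates down to verified ones.
            dataset += [(prompt, c) for c in candidates if reward_fn(prompt, c)]
        # M-step: fine-tune on reward-filtered synthetic data only.
        model = model.finetune(dataset)
    return model
```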
Multi-Modality. The above works design various instruction samples to improve the LLM's alignment ability. However, they are usually text-only. Another category synthesizes multi-modality data via the LLM itself. To fine-tune large Vision-Language Models (VLMs) for Visual Question Answering (VQA), SelTDA [94] designs a self-taught data augmentation strategy for fine-tuning large VLMs on small-scale VQA datasets. SelTDA generates question-answer pseudo-labels directly conditioned on an image alone by prompting the BLIP [112] VLM to caption images, allowing unlabeled images to be pseudo-labeled.

3.4.3 Data Augmentation. Data augmentation aims at enhancing model performance by diversifying training samples without requiring extra data. It leverages high-quality instructions or prompts to generate augmented data that users expect and that matches the target tasks. There are three main types: data labeling, data reformation, and co-annotation.

Data Labeling. Data labeling employs the language comprehension abilities of LLMs to annotate unlabeled examples [275]. For example, ChatGPT-4 is used to classify political Twitter messages through specific instructions [198]. Subsequently, some works reveal that such data augmentation derived from open-source LLMs outperforms manual labeling in many high-resource annotation tasks [3, 275]. An LLM-based prompt strategy evaluates various factors of the prompt template and demonstrations to augment machine translation data [252]. It adopts pseudo-parallel prompts from monolingual data via zero-shot prompting to improve translation performance. To achieve multi-modal data augmentation, T-SciQ [205] teaches science question answering with a chain-of-thought (CoT) prompting format to generate high-quality CoT rationales, which are eventually used to train much smaller models to perform CoT reasoning in complex modalities.

Data Reformation. Data reformation transforms existing data into other variations to meet the data format requirements of the target task. For instance, CORE [45] proposes a retrieval-augmented generation approach that creates diverse counterfactual perturbations for counterfactual data augmentation. It incorporates counterfactual excerpts into prompts to GPT-3, whose few-shot learning abilities are used for counterfactual editing. Moreover, ALIA [51] summarizes the captions of images into short descriptions by prompting GPT-4, and then performs language-guided image editing of the training data with Stable Diffusion [171]. In NLP tasks, ChatAug [37] transforms every sentence within the training samples into numerous conceptually akin yet semantically distinct instances.

Co-annotation. Co-annotation aims to collaboratively annotate data by both humans and LLMs. As a representative method, CoAnnotating designs a human-LLM co-annotation paradigm that uses variational prompts to generate responses and utilizes uncertainty to estimate LLMs' annotation abilities [116]. To improve annotation accuracy, ChatGPT is used to augment annotation data with phrases identified as pertinent attributes and brief explanations [11]. To fine-tune code generation models [23, 157] for high-quality code generation, ToolCoder [259] develops an automated data annotation approach that incorporates tool usage information into source code through API-augmented prompts.

3.5 Preference Alignment
Preference alignment is achieved by systematically refining large models to match complex human preferences [14, 59]. This process starts with general model distillation [4, 5, 7, 30, 35, 97, 106, 114, 148, 212], which synthesizes broad preference data, providing foundational alignment across diverse tasks. Domain model distillation [38, 60, 85, 145, 146, 189, 225] then optimizes models with specialized datasets, enhancing performance in specific domains. Model self-improvement [8, 22, 69, 79, 106, 193, 235, 241] allows models to iteratively refine their capabilities with minimal human intervention, using self-generated feedback. Data augmentation further strengthens model generalization by expanding and diversifying the training data. These interconnected methods form a coherent framework for optimizing model alignment with both general and domain-specific human preferences.

3.5.1 General Model Distillation. General model distillation aims to generate high-quality preference data by leveraging large language models (LLMs) and external tools to better align models with complex human preferences [5]. This process is crucial for improving LLM performance in practical applications, particularly in areas like safety, reliability, and ethical considerations [30]. One of the primary challenges in this approach is the bias and limitations inherent in strong models [7]. To address this, distillation from multiple strong models, rather than reliance on a single one, can be employed to reduce bias and increase the diversity of responses.

Building upon these strategies, several approaches have been developed to refine preference alignment and mitigate the aforementioned challenges. For instance, RLAIF [106] synthesizes preference data using sources like Reddit TL;DR [202] for summarization tasks and aligns dialogue generation with human preferences through Anthropic's Helpful and Harmless Human Preferences. Similarly, ULTRAFEEDBACK [35] utilizes GPT-4 to generate over a million feedback points. By employing techniques such as best-of-n sampling [156] and Proximal Policy Optimization (PPO) [101], it enhances feedback quality and minimizes annotation bias.
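Among these techniques, best-of-n sampling is the most mechanical to state. The sketch below is a generic illustration with assumed `generate` and `reward_model` callables; it is not the ULTRAFEEDBACK implementation:

```python
def best_of_n(prompt, generate, reward_model, n=16):
    """Draw n candidate responses and keep the one the reward model
    scores highest; the winner and a loser can also be reused as a
    synthetic preference pair for later training."""
    candidates = [generate(prompt) for _ in range(n)]
    scored = sorted(candidates, key=lambda c: reward_model(prompt, c))
    best, worst = scored[-1], scored[0]
    # (prompt, best, worst) forms a synthetic preference triple.
    return best, (prompt, best, worst)
```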
In addition to these methods, large-scale datasets have been created to enhance model alignment through crowd-sourced annotations. For example, Open Assistant [97] was developed with contributions from over 13,500 volunteers, resulting in more than 161,000 messages across 66,497 conversation trees. Each message is annotated for quality, creativity, and harmfulness. Furthermore, HelpSteer [212] enhances data quality by annotating 37,000 conversations for attributes like helpfulness, correctness, coherence, complexity, and verbosity. Another crucial technique for improving model alignment is Selective Reflection-Tuning [114], which refines responses by filtering the teacher model's outputs before using them for distillation. The filtering is based on the student model's r-IFD score, ensuring that only the most challenging and appropriate responses are retained for training. Additionally, models like LEMA [4] enhance the refinement process by using GPT-4 to identify and correct errors in the student model's responses. These corrections are then used to fine-tune the student model, making the alignment process more accurate and effective.

The refinement and critique capabilities of LLMs are critical to improving alignment. SELF-REFINE [148] allows models to critique their own responses, generating improved outputs based on their own feedback. Furthermore, evaluation methods such as MetaCritique from the Critique of Critique [192] provide metrics to assess how effectively a model's critique improves refinement. CriticBench [126] also explores the relationship between generative capacity and the ability to critique and correct responses, offering insights into model performance.

3.5.2 Domain Model Distillation. Domain model distillation focuses on optimizing models for specific tasks by training them on specialized, domain-specific datasets, often using reinforcement learning and preference modeling techniques. This approach enables models to perform well across various domains, enhancing their ability to handle complex, specialized tasks. Through this process, models are refined to meet the requirements of various fields, including safety-oriented scenarios [38, 60, 85, 225], summarization [189], mathematical problem solving [124], search-based question answering [156], as well as code generation and logical reasoning [145, 146].

Safety-oriented Scenarios. In sensitive or adversarial environments, ensuring safe deployment is essential. BAD [225] addresses this by collecting adversarial dialogue data in which annotators intentionally provoke unsafe behaviors in chatbots, helping train models to detect and prevent harmful responses. Datasets like REALTOXICITYPROMPTS [60] provide 100K sentence-level prompts annotated for toxicity. Inspired by Safe RLHF [38], BEAVERTAILS [85] synthesizes data from over 44,000 red-teaming prompts, generating QA pairs with safety meta-labels to reduce harmful content and improve safety alignment.

Summarization. In text summarization, Stiennon et al. [189] generate high-quality summaries by comparing pairs of summaries from the Reddit TL;DR [202] and CNN/DM [179] datasets. Human evaluators provide pairwise comparisons for these summaries, which are then used to train a reward model. The policy model is further fine-tuned using reinforcement learning against this reward model, refining summarization outputs to align with human preferences for clarity and quality, particularly in domains like news and social media.
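The pairwise comparisons described above are typically converted into a reward model with a Bradley-Terry-style objective. The PyTorch sketch below shows that single step under an assumed `reward_model` callable; it is illustrative rather than Stiennon et al.'s code:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, chosen, rejected):
    """Bradley-Terry pairwise loss: push the scalar reward of the
    human-preferred summary above that of the rejected one."""
    r_chosen = reward_model(prompt, chosen)      # scalar tensor
    r_rejected = reward_model(prompt, rejected)  # scalar tensor
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The trained reward model then serves as the optimization target for the reinforcement learning stage.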
Mathematical Problem Solving. For mathematical reasoning tasks, PRM800K [124] offers a large dataset of 800,000 step-level labels across 75,000 solutions to 12,000 math problems. Labelers assign positive, negative, or neutral labels to each step, allowing models to focus on logical consistency and correctness. This approach reinforces the model's ability to solve complex problems through step-wise reasoning, improving mathematical problem-solving capabilities.

Search-based Question Answering. For search-based QA, WebGPT [156] trains models using long-form question-answering datasets such as ELI5, TriviaQA, and ARC. By interacting with a web-browsing environment, the model generates answers and compares them with human evaluations. This feedback loop improves the model's search capabilities and performance, particularly in tasks that require sourcing answers from the internet.

Code Generation and Logical Reasoning. WizardCoder [146] and WizardMath [145] have also advanced instruction generation in specific domains like coding and logic-based tasks. By extending the initial WizardLM framework, these models improve the diversity of instructions for code generation and math problem solving, helping the model handle a wide range of domain-specific challenges.

3.5.3 Model Self-Improvement. Model self-improvement focuses on enabling weaker LLMs to iteratively enhance their performance without requiring additional human-annotated data. This approach consists of two categories: self-feedback loops [8, 69, 79, 106, 193, 235, 241], where the model autonomously refines its outputs based on self-generated feedback, and external evaluation models [22, 26, 47, 119, 241], which rely on external evaluators to assess the model's responses. Both methods aim to create a scalable system of improvement by reducing dependency on human intervention, allowing models to continuously optimize their performance through internal adjustments or external guidance.

Self-Feedback Loops. One of the early methods exemplifying self-improvement through feedback loops is CAI [8], which synthesizes alignment datasets by blending human- and model-generated prompts using few-shot prompting. By focusing on tasks such as red teaming and helpfulness, CAI enables iterative improvement through AI self-critique and chain-of-thought reasoning, reducing reliance on human feedback. This laid the foundation for RLAIF [106], where AI selects preference data that aligns with constitutional requirements. However, early RLAIF faced challenges like distribution shifts due to offline data generation. To address this, methods like OAIF [69] and SELF-JUDGE [235] introduced on-policy preference selection, where a pre-trained judge model selects preferences in real time, ensuring alignment with the LLM's current state. A key aspect of self-feedback loops is the role of reward models in refining responses. In earlier methods, reward models were often static, leading to problems such as reward hacking [188]. Self-Rewarding [241] introduced a dynamic solution by allowing the LLM to act as its own reward model, iteratively selecting preference data and improving itself through Direct Preference Optimization (DPO) [165] training. This approach ensures that the model and the reward mechanism evolve together, maintaining alignment throughout the training process. Another method for dynamic preference adjustment within self-feedback loops is SALMON [193], which introduced an Instructable Reward Model that allows flexible scoring of responses based on different principles. This adaptability enables more precise preference alignment during training. Additionally, CycleAlign [79] uses the probability of LLM-generated outputs to rank similar responses, selecting the longest common subsequence of two ranking results to refine the final ordering.
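One round of such a self-rewarding loop can be summarized in a few lines. The schematic sketch below uses assumed helpers (`generate`, `judge`, `dpo_finetune`); it mirrors the LLM-as-its-own-judge idea rather than any specific released implementation:

```python
def self_rewarding_iteration(model, prompts, k=4):
    """One Self-Rewarding-style round: the model answers each prompt
    several times, scores its own answers (LLM-as-a-judge), and the
    best/worst pair per prompt becomes DPO training data."""
    pairs = []
    for p in prompts:
        responses = [model.generate(p) for _ in range(k)]
        scores = [model.judge(p, r) for r in responses]  # self-assigned rewards
        ranked = [r for _, r in sorted(zip(scores, responses))]
        pairs.append({"prompt": p, "chosen": ranked[-1], "rejected": ranked[0]})
    # Model and reward signal evolve together across iterations.
    return model.dpo_finetune(pairs)
```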
External Evaluation Models. External evaluation models play an important role in several self-improvement frameworks. For example, ALPAGASUS [22] employs a strong LLM like ChatGPT to filter out low-quality data, significantly reducing the training set size while improving model performance. This method demonstrates how external evaluation can enhance model refinement by focusing only on high-quality inputs. Another prominent technique is Instruction Backtranslation [119], which generates instruction prompts for unlabeled web data and selects high-quality pairs for fine-tuning. This approach boosts the model's ability to follow instructions without requiring large-scale human annotations. SteerLM [47] takes this a step further by fine-tuning models based on specific attributes like humor, creativity, and toxicity. This is done through an attribute prediction model that evaluates responses and refines them using datasets such as Open Assistant [97] and HH-RLHF [7].

Recent approaches like Self-Rewarding [241] and SPIN [26] further integrate preference optimization with iterative feedback systems. Self-Rewarding employs the DPO framework, where LLMs generate new prompts and candidate responses, which are then judged by the LLM itself, continuously improving alignment. SPIN, in contrast, eliminates the need for reward models by relying entirely on human-annotated data for iterative improvement. However, as SPIN notes, this approach can become bottlenecked once the model reaches human-level performance, as further improvement requires continuous human intervention.

3.5.4 Data Augmentation. Data augmentation is essential for enhancing large model alignment by creating task-specific variations of existing data, which strengthens model generalization and robustness. This approach increases the diversity of training data without the need for additional data collection. Techniques like data labeling [126, 240, 250, 273], data reformation [80, 119, 210, 218, 250], and co-annotation [85, 211] are employed to ensure that the augmented data remains relevant and consistent, contributing to more precise model performance across various tasks.

Data Labeling. Data labeling plays a crucial role in aligning models with human preferences by providing structured, high-quality feedback to guide learning. Starling-7B [273] is an example of this, collecting ranked data from chat prompts and generating millions of pairwise comparisons to refine model alignment. A more dynamic approach, inspired by Instruction Evolution [250], is seen in UltraInteract [240], which constructs a preference tree from LLM responses. This tree refines incorrect responses based on feedback from models like GPT-4, creating more diverse and robust preference data. These advancements reflect the need for dynamic evolution in instructions to enhance preference labeling and refinement, as CriticBench [126] suggests that feedback may be more effective than generation in certain knowledge domains.

Data Reformation. Data reformation is the process of restructuring existing data to better align with task-specific objectives, enabling models to improve their adaptability and performance. A prominent method in this area is in-context learning [46], where examples embedded in prompts guide LLMs to generate responses that reflect the provided patterns. This closely aligns with Instruction Evolution, which emphasizes increasing the complexity and diversity of instructions. Early works like Self-Instruct [210] and Unnatural Instructions [80] relied on task pools with hand-crafted seed examples, while LaMini-LM [218] expanded this approach by incorporating rich data from Wikipedia to generate more diverse instructions. Auto Evol-Instruct [250], initially developed to evolve instructions, automates the optimization of evolution rules by allowing an optimizer LLM to iteratively improve the rules based on evolving feedback data. Additionally, Instruction Backtranslation [119] enhances instruction-following abilities by generating instruction-response pairs from unannotated data, reducing the need for manual annotation. This ongoing refinement in data reformation is crucial for boosting performance across a variety of tasks.

Co-Annotation. Co-annotation refers to the collaborative process of combining human- and machine-generated annotations to improve the quality and safety of model outputs. For instance, BEAVERTAILS [85] demonstrates how synthesizing data from over 44,000 adversarial prompts, with both human and machine input, generates QA pairs with safety meta-labels, helping models avoid harmful outputs. Similarly, HelpSteer2 [211] blends human expertise with LLM-generated responses to refine multi-turn conversations, ensuring that models adhere more closely to ethical guidelines.

3.6 Applications
Most large language models (LLMs) are pretrained and finetuned on general-purpose corpora. To apply them effectively to downstream tasks, continuous task-specific pretraining or finetuning is often required, where data quality and relevance become critical for performance on specialized tasks. However, unlike the abundance of general-purpose data, domain-specific datasets are often scarce due to the intensive knowledge required for their creation. To address this problem, many studies have explored synthesizing specialized data with diverse characteristics tailored to each application.

3.6.1 Math. Applying LLMs in mathematical scenarios, which involves question understanding and answering, requires intensive logical reasoning. Many researchers have proposed that generating more rationale corpora [145, 197, 220, 238, 245, 248] and diverse questions and answers [1, 131] in the training corpus helps the model to better understand and reason.

Some studies focus on generating chains of thought (CoTs) that explicitly outline the reasoning steps, either by data augmentation or through LLMs. Galactica [197] proposes a working-memory token, wrapping step-by-step reasoning within "<work> </work>" for 4 datasets to form a new mathematical reasoning dataset. Additionally, recent literature harnesses the reasoning capability of advanced closed-source LLMs. MAmmoTH [245] compiles 13 math datasets with intermediate rationales to form the MathInstruct dataset, and extends them with hybrid CoT and PoT (Program-of-Thought) rationales with the help of GPT-4.

Diverse questions and solutions can also enhance the math understanding and reasoning capability of LLMs. General-purpose LLMs like GPT-4 are utilized to model linear programming of a specific real-world problem [1] through conversations between two LLM agents, supervised by another GPT-4 agent for quality and correctness. MetaMath [238] increases the diversity of questions by rewriting mathematical questions from two reasoning paths and rephrasing them using LLMs. Some methods attempt to compose questions from seed questions. For example, Liu et al. [131] propose an Iterative Question Composing (IQC) method, which uses an LLM to compose questions from seed questions and employs another LLM to conduct rejection sampling.

As mathematical questions and answers are verifiable, some approaches expand the training corpus through self-generated formulated problems and proofs that have been verified by the model itself or by external tools and models. STaR [248] generates rationales to answer many questions, using a few rationale examples as a prompt, to bootstrap the reasoning capability of GPT-J [203]. The answers are verified and, if a generated answer is wrong, the model retries until the rationale yields the correct answer. All correct rationales are used for fine-tuning. DeepSeekProver [220] is initialized from DeepSeekMath-Base [181] finetuned on the MMA dataset [86] to acquire basic auto-formalization capability. The model generates formulated math problems and proofs, and filters the generated data with miniF2F-valid as few-shot examples. The proofs are generated for both original theorems and their negations as an augmentation approach. All valid generated data are used to train the model, and the process is repeated several times until only a marginal performance gain is observed. WizardMath [145] generates evolved instructions using ChatGPT and also provides supervision during the generation process. It further evolves the Wizard-E model through Reinforcement Learning from Evol-Instruct Feedback (RLEIF), alongside Evol-Instruct and reinforcement learning, to improve LLM reasoning performance.
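The verify-and-retry loop at the heart of STaR-style bootstrapping fits in a few lines. The following sketch assumes hypothetical `generate_rationale` and `extract_answer` helpers and a known ground-truth answer for checking; it illustrates the scheme rather than reproducing the original implementation:

```python
def star_bootstrap(model, problems, max_retries=4):
    """STaR-style rationale bootstrapping: keep only rationales whose
    final answer matches the ground truth; on failure, retry with the
    gold answer revealed (STaR's 'rationalization' fallback)."""
    keep = []
    for question, gold in problems:
        for attempt in range(max_retries):
            hint = gold if attempt > 0 else None
            rationale = model.generate_rationale(question, hint=hint)
            if model.extract_answer(rationale) == gold:
                keep.append((question, rationale))
                break
    return keep  # fine-tune the model on these verified rationales
```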
3.6.2 Science. Scientific applications require a deep understanding of knowledge-intensive concepts and reasoning, which demands high-quality datasets for effective instruction fine-tuning. However, generating such datasets is challenging because formats vary across disciplines and the underlying logic can be difficult to articulate.

Unifying the format of different disciplines is the first step in processing science-related corpora, by transforming structured data into readable texts [118, 253, 254] or by using a specialized tokenizer [197]. Instruction-tuning datasets are then generated from the collected raw data. To ensure diversity, SciLitLLM [118] proposes a three-step pipeline to generate diverse and high-quality instructions for scientific contexts. It first collects keywords and task descriptions for different scientific domains from domain literature and datasets, and then prompts GPT-4o with keywords and descriptions sampled from the previously collected data. ChemLLM [254] employs a Play-as-Playwrights CoT style while prompting GPT-4 to promote context richness and logical coherence. Chain-of-thought also aids the model in better understanding scientific topics. SciGLM [253] collects raw questions on college-level physics, chemistry, math, and formal proofs, generating CoTs for them using GPT-4 with a self-correction [115, 183] process. An instruction quality classifier is trained with positive data labeled by LLMs and humans, and negative data identified by LLMs. Galactica [197] focuses on knowledge-intensive scientific tasks, namely equations, chemical reactions, and citation prediction, and proposes a set of tokenizers for various input formats. It also creates a working-memory technique for better in-context reasoning.

3.6.3 Code. Generating synthetic data that enhances coding performance has long been studied; this requires a clear understanding of questions and accurate reasoning to produce correct code. Since the accuracy of code can be easily validated in a simulated coding environment, large-scale instruction-tuning datasets can be generated for coding tasks. Haluptzok et al. [72] propose a self-play method to generate programming puzzles [70] and their solutions, where correctness is verified by a Python interpreter. They further finetune the LLM on the generated data to improve performance. For more time-efficient code generation, Shypula et al. [186] present a data generation method for code optimization through self-play and few-shot chain-of-thought. They prompt GPT-3.5 to generate more optimization problems on the proposed PIE dataset, and the answers are then produced by a GPT-3.5 model finetuned on PIE, resulting in more samples. WizardCoder [146] uses StarCoder 15B as the foundation and finetunes it on code instruction-following data evolved through Evol-Instruct [224]. Code Alpaca [17] is fine-tuned from 7B and 13B LLaMA models on 20K instruction-following examples generated with the techniques of Self-Instruct [208].
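Because code correctness is machine-checkable, the generate-then-verify step reduces to executing candidate solutions. Below is a minimal, illustrative verifier in the spirit of puzzle-style self-play, where a puzzle is a predicate f and a solution g must make f(g()) true; the names and structure are assumptions, not the cited implementations:

```python
def verify_solution(puzzle_src: str, solution_src: str) -> bool:
    """Execute an LLM-generated puzzle f(x) and solution g(), keeping
    the pair only if f(g()) is True. Real pipelines run this inside a
    sandboxed subprocess with time and memory limits."""
    env: dict = {}
    try:
        exec(puzzle_src, env)    # defines f(x) -> bool
        exec(solution_src, env)  # defines g() -> candidate input
        return bool(env["f"](env["g"]()))
    except Exception:
        return False

# Example: a puzzle asking for a 5-character string, and a valid solution.
puzzle = "def f(x): return isinstance(x, str) and len(x) == 5"
solution = "def g(): return 'abcde'"
assert verify_solution(puzzle, solution)
```

Only pairs that pass this check enter the fine-tuning set, which is what makes the coding domain especially amenable to self-generated data.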
Diverse programming problems and corresponding answers promote the capabilities of LLMs. Magicoder [215] utilizes open-source code snippets to generate more diverse, realistic, and controllable coding instruction data. Code Llama [174] constructs a dataset of 14,000 question-test-solution triplets by generating 52,000 unique programming questions with Llama 2 70B and using Code Llama 7B to create unit tests and Python solutions. Phi-1 [67] enhances diversity by constraining topics and target audiences when generating text-code interleaved textbooks with GPT-3.5 as a training corpus. Its successor, Phi-1.5 [120], expands this approach by creating synthetic textbook-style data for common-sense reasoning and world knowledge, while incorporating samples from web datasets to further increase diversity.

3.6.4 Medical. In medical applications, large models mainly serve as medical dialogue chatbots and need to interact with patients through multi-round dialogues. To achieve interactive data synthesis, specialized documents such as medical diagnostic records are first collected as a seed corpus. Based on these, diverse question-and-answer pairs can be generated with the aid of general-purpose large language models and used to improve the understanding ability needed to produce helpful responses.

To simulate conversations between patient and doctor, DISC-MedLLM [10] refines the MedDialog [249] and cMedQA2 [261] datasets by leveraging GPT-3.5 to generate high-quality conversation samples. This process involves removing colloquial expressions, resolving inconsistencies, and standardizing language for greater uniformity. HuatuoGPT [256] utilizes data distilled from ChatGPT through Self-Instruct [208] and conversations between two ChatGPT instances. HuatuoGPT-II [20] introduces a one-stage adaptation strategy that merges traditional continued pre-training and supervised fine-tuning into a single step of supervised fine-tuning. It employs GPT-4 for data unification, converting domain-specific pre-training text into an (instruction, output) format with standardized language and style. ChatCounselor [132] is a mental-health support model trained on GPT-4-refined query-answer pairs based on doctor-patient dialogues, where GPT-4 also generates summaries of key information from each interaction for broader context.

Privacy is also a primary issue when dealing with sensitive medical records. Some studies extract medical knowledge from knowledge graphs and generate synthetic medical text without any personal information. ClinGen [226] alleviates privacy concerns by generating synthetic clinical text from knowledge graphs [190] and ChatGPT. DISC-MedLLM [10] synthesizes medical dialogues by incorporating question-answer pairs from domain-specific knowledge graphs.

3.6.5 Law. LLM-based legal assistants have gained considerable attention for providing affordable and convenient legal services, particularly in legal question answering and consultation. Recent studies focus on the quantity and quality of finetuning datasets, using data synthesis to improve the clarity and formality of responses. DISC-LawLLM [243] designs prompts for GPT-3.5-turbo to refine legal syllogisms, knowledge expansion, and chain-of-thought reasoning for its SFT instruction dataset. LawyerLLaMA [82] uses ChatGPT to generate explanations for each question, or to generate explanations based on questions and answers. In this work, data generated by human experts produces better results than only 6k or 42k automatically generated examples. The authors note that more automatically generated data may achieve better performance, and mark this as future work. LawGPT [272] refines open-source datasets by prompting ChatGPT for instruction-tuning data, generating more formal, polite, and clear answers. WisdomInterrogatory [270] prompts GPT-3.5 models as agents to imitate conversations between a law agent and a user, generating multi-turn instruction text.

3.6.6 Others. In addition to the previous applications, the potential of synthetic datasets has also been explored in finance [12] and education [48, 102]. These areas are more challenging for data synthesis due to their higher knowledge density and quality demands. As research continues, they may become increasingly promising.

4 Functionality
From the functional perspective of LLMs, data synthesis and augmentation can be divided into four categories: understanding, logic, memory, and generation [31, 199]. By exploring these four basic functions of LLMs, data synthesis and augmentation can fully capture the inherent patterns in large-scale data and effectively apply them to downstream applications and tasks.

4.1 Understanding
Understanding functionality leverages the powerful language understanding of LLMs to comprehend the inherent patterns in data. From the perspective of the content being understood, it includes uni-modal and multi-modal understanding. Uni-modal understanding mainly comprehends the semantics of texts, covering text understanding and semantic labeling. Multi-modal understanding combines multiple modalities.

Text Understanding. To improve the instruction-following capacities of LLMs, Alpaca [196] and Alpagasus [22] utilize human-written instruction-response pairs to generate instruction-following demonstrations via few-shot prompting, improving the capabilities of instruction-following models. Along this line, WizardLM [223] generates more diversified and multitask-oriented instructions in both depth and breadth to improve the performance of LLMs. To perform model self-improvement, Self-Instruct seeks to improve the instruction-following capabilities of LLMs by bootstrapping off their own generations [210]. It classifies the generated instructions and generates possible class labels to promote diversity. To further comprehend text content, SPIN [26] treats instruction tuning as a two-player game, using the main player as a discriminator to distinguish the responses of the LLM from those of humans. This strengthens the LLM, without the need for additional human-annotated data, through understanding and discriminating textual responses. Moreover, WRAP [150] capitalizes on the inherent diversity of web articles, enabling LLMs to produce high-quality paraphrases of noisy and unstructured content found online.

Semantic Labeling. For a deeper understanding of text properties, Instruction Backtranslation promotes instruction following in language models by automatically labeling human-written documents with corresponding instructions [119]. Analogously, ChatGPT-4 is used to classify political messages through instructions by labeling them with political tendencies [198], further understanding the semantics of messages. Introducing the data augmentation pattern of ChatGPT annotation, CoAnnotating [116] views the relation between humans and LLMs collaboratively, and its Pareto-efficiency concept enables participants to compare strategies and gain a deeper comprehension of the trade-offs between cost and performance. To offer annotators explanations, ChatGPT is introduced to improve human labeling of sponsored content on social media by designing prompts that generate model explanations [259].

Multi-Modal Understanding. LLaVA [130] is the first work to use LLMs to generate high-quality language-image instruction-following data for general-purpose visual and language understanding. It employs captions and bounding boxes of images as symbolic representations to encode the images as an LLM-recognizable sequence and produces three types of instruction-following data to fully capture the semantic information of the images. It has been demonstrated that synthetic data improves the performance of models on image recognition tasks in data-scarce settings [76]. ChartLlama [73] is trained on a dataset generated through a multi-step process, encompassing a broader variety of chart and task types than those found in existing datasets. Genixer [266] flexibly produces a variety of visual instruction-tuning data from unlabeled images, enhancing capabilities for visual question answering and referring-expression comprehension tasks. The quality of SciLitIns [118] has been evaluated from five aspects, including clarity, complexity, correctness, usefulness, and adaptability, to improve LLMs' ability in scientific literature understanding.

4.2 Logic
Logic functionality fully taps the reasoning and logic capabilities of LLMs in the process of synthesizing and augmenting data. According to the application of logic, there are three categories: code logic, math logic, and reasoning.

Code. To provide complex reasoning capabilities, ReST^EM [187] selects training samples with a binary reward function to output synthetic data. It leverages logical reasoning to improve the performance of LLMs on problem-solving tasks, such as code generation and mathematical problems. To optimize code, ToolCoder [259] utilizes prompts to generate API-augmented code for the API search tool with ChatGPT. The code generation process is enhanced by integrating API search tools, enabling the model to automatically consult the search tool for suggestions when selecting an API. Moreover, OSS-Instruct [215] enlightens LLMs to generate more diverse, realistic, and controllable coding instruction data, which can substantially boost the reasoning performance of various LLMs. The high-quality Case2Code [180] dataset is automatically and efficiently harvested from pre-training code texts by leveraging small LLMs and a code interpreter. An LLM's performance can also be enhanced through fine-tuning on synthetic programs and solutions verified by a Python interpreter [71].

Math. MathInstruct [244] has two main characteristics: broad coverage of different math fields and complexity levels, and hybrid CoT and PoT rationales, enabling MAmmoTH to outperform existing open-source models. Benefiting from the diverse data sources of MMIQC [131] and the effectiveness of iterative question composing, the models fine-tuned on MMIQC achieve new SOTAs on math benchmarks.

Reasoning. For reliable reasoning performance, Orca [154] proposes progressive learning from the complex explanation traces of GPT-4 and learns to imitate the reasoning process of the stronger LLM. Further, Orca2 [152] designs various reasoning strategies, e.g., step-by-step and recall-reason-generate, to enhance the reasoning abilities of smaller LLMs. To generate synthetic data for instruction tuning, CAI [8] samples a critique request from a short list of constitutions and prompts the model to generate a critique of the response, identifying harmful outputs to build a harmless but non-evasive AI assistant [172]. To enhance the science question answering task, T-SciQ [205] uses the CoT reasoning capabilities of LLMs to generate QA-CoT samples as teaching data. Since some questions are exceptionally intricate, T-SciQ further designs a three-step zero-shot prompting approach to produce planning-based CoT rationales, which break down complex problems into simpler subproblems that can be easily solved. Utilizing CoT examples as a guide, the LLM generates high-confidence CoT reasoning paths that serve as the final training samples fed back to the model for fine-tuning [81]. In addition, STaR [248] improves the LLM itself by augmenting a fine-tuning dataset using rationalization, justifying ground-truth answers to problems the model failed to solve. Symbol tuning [213] leverages the intuition that when a model cannot use instructions or natural language labels to figure out a task, it must instead reason with in-context input-label mappings.

Moreover, reasoning can also be extended to multi-modal scenarios. To optimize the visual question answering task, SelTDA [94] prompts a VLM to caption various images and then converts each caption into a boolean question. By this means, SelTDA builds a direct image-to-text task and pseudo-labels unlabeled images by sampling questions and answers. Recognizing that LLMs struggle with abstract image perception and reasoning, multimodal self-instruct [263] creates a benchmark of abstract images tailored for everyday contexts.

4.3 Memory
Memory functionality remembers and utilizes previously learned information in the LLM when synthesizing data [83]. According to the properties of the memory content, memory functionality can be divided into three categories: procedural memory, semantic memory, and episodic memory [201].

Procedural Memory. Procedural memory preserves the specific process of how tasks and actions are executed. To perform the code sorting task for software engineering, a sorting algorithm is used to sort code LLMs by recognizing affiliation categories, including different communities and contributors [269]. To enable LLM agents [265] to operate automatically, a code generation model employs three types of in-context supervision to specify library functions and construct the agent's code for solving user-instructed tasks [160]. Additionally, to explore the internal reasoning processes of LLMs, Quiet-STaR [247], building on the Self-Taught Reasoner [248], generates thoughts and rationales that explain future text after every token in the text. It resorts to procedural reasoning to produce a human-like thinking process.

Semantic Memory. Semantic memory synthesizes symbolically representable data to retain acknowledged world knowledge such as knowledge graphs, documents, and APIs. To achieve autonomous knowledge graph construction and reasoning, AutoKG designs a multi-agent-based approach in which multiple intelligent agents play different roles, such as assistant and domain expert, and collaborate to complete tasks [274]. Meanwhile, AutoKG incorporates external knowledge bases and Internet resources into the knowledge graph to compensate for the limitations of LLMs. To explore semantic content from general documents, KPC designs a knowledge-driven, prompt-chaining-based code generation model [167]. It disaggregates the process of code generation into iterative validation and revision phases and employs fine-grained exception-handling knowledge derived from documentation to facilitate LLMs in code generation. To effectively utilize existing APIs and prevent the model from inventing non-existent APIs, De-Hallucinator autonomously recognizes API references pertinent to specific projects based on the model's preliminary predictions, subsequently incorporating these references into the prompt [52].

Episodic Memory. Episodic memory remembers contextual content that is closely related to the current state and personal experiences. To create diverse synthetic data showcasing diverse personas, Persona Hub proposes a persona-driven data synthesis methodology to capture the knowledge and experiences of real-world users [16]. Persona Hub generates 1 billion diverse personas from massive web data through text-to-persona and persona-to-persona approaches. To obtain accurate responses, AceCoder proposes a prompt-enhanced technique to capture context information. It first tells the LLM to analyze and clarify the requirements and then selects similar programs as supplementary examples in prompts to provide relevant content [113]. To effectively utilize user experience information, RepoCoder optimizes the code completion process at the repository level by integrating a similarity-based retrieval mechanism and a pre-trained code language model within an iterative retrieval-generation framework [255]. It adopts available retrievers to locate pertinent knowledge within a repository, thereby improving the contextual foundation for the LLM.
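Persona-driven synthesis is naturally expressed as a two-stage prompt pipeline. The sketch below is a schematic illustration of the text-to-persona and persona-to-data idea, with an assumed `chat` helper; it is not Persona Hub's released pipeline:

```python
def synthesize_with_personas(web_texts, task, chat, per_persona=2):
    """Text-to-persona: infer who would write or read each text.
    Persona-to-data: ask the LLM to pose the task *as* that persona,
    so one task template yields topically diverse synthetic data."""
    samples = []
    for text in web_texts:
        persona = chat("Describe, in one sentence, a person who would "
                       f"be interested in the following text:\n{text}")
        for _ in range(per_persona):
            samples.append(chat(f"You are: {persona}\n"
                                f"Write a {task} that this person might pose."))
    return samples
```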
4.4 Generation
Generation functionality aims at producing coherent and contextually relevant content for downstream tasks and applications. Based on the form of the generated content, there are two categories: content generation (e.g., text and multi-modal generation) and retrieval-augmented generation.

Text Generation. To accomplish sequence understanding and generation for machine translation, an LLM-based prompt strategy [252] leverages back-/forward-translation to augment monolingual data via zero-shot prompting. For text augmentation, ChatAug rephrases a sentence into multiple semantically similar sentences [37]. Genie [236] generates high-quality data that is natural, faithful, and diverse, and models trained on the synthetic data are more faithful to the grounding content. Guided by principles of diversity and quality, the UltraMedical [258] dataset can be used to build specialized generalists in biomedicine.

By utilizing paraphrase proximity and critic-guided distillation, Impossible Distillation [91] produces a high-quality dataset that enhances the model's capability in paraphrase generation and sentence summarization. HuaTuo [204] enhances the factual accuracy of its responses by first drawing knowledge instances from a knowledge graph and then creating instances tailored to that specific knowledge using ChatGPT. TinyStories [53] is designed to capture the essence of natural language, enabling the models trained upon it to produce fluent and consistent stories. UltraChat is created using two distinct ChatGPT Turbo APIs to craft informative and realistic multi-turn dialogues: one simulates the user to produce queries while the other crafts the responses [44]. Additionally, Baize [222] employs self-distillation techniques that incorporate feedback, enabling the model to discern nuances within the feedback and carry out fine-grained optimization. DIALOGIC [122] efficiently scales a limited dialogue dataset with minimal to zero human involvement and parameter adjustment, addressing the challenge of low-resource scenarios. DISC-MedLLM [10] leverages ChatGPT to rephrase existing medical NLP datasets to provide accurate and truthful medical responses in end-to-end conversational healthcare services. In ReST [66], multiple steps of negative log-likelihood training, with progressively increasing filtering thresholds in the improve step and human evaluation scores in the grow step, lead to continuous improvements in the model's performance. Self-translate-train [170] enhances cross-lingual performance by effectively harnessing the model's inherent translation capabilities. TinyDialogues [56] finds that a diverse blend of data sources outperforms homogeneous conversation data, indicating that high-quality synthetic conversation data can yield better performance than natural conversation data.

Multi-Modal Generation. To achieve automated language-guided image augmentation, ALIA [51] summarizes image captions into natural language descriptions via an LLM and shows significantly greater visual diversity compared to the original data.

Retrieval-Augmented Generation. Retrieval-augmented generation (RAG) integrates external knowledge to generate accurate and contextually appropriate content [18]. To address holistic questions across an entire text corpus, GraphRAG [40] utilizes LLMs to construct a graph-based text index and integrates all partial responses to generate a global response. The knowledge graph constructed by GraphRAG contains a wealth of entities and relations, which is conducive to generating more comprehensive and diversified answers. Considering the limitations of knowledge embedding models and the independent processing of query and context, RankRAG [246] instruction-tunes LLMs to concurrently assess the relevance between queries and contexts and to utilize the retrieved context to generate accurate answers. To mitigate the imbalanced burden placed on the retriever, LongRAG [88] designs long retriever and reader components to reduce the corpus size, achieving effective retrieval recall with only a few top units.
summarization. HuaTuo[204] enhances the factual accuracy of its Despite the significance of synthesizing and augmenting data, there
responses by initially drawing knowledge instances from a knowl- are critical challenges of using different synthesizing and augment-
edge graph, followed by the creation of instances tailored to that ing method in practice.
specific knowledge using ChatGPT. TinyStories[53] is designed Dependence on LLMs. The ability of LLMs to perform few-shot
to capture the essence of natural language, thereby enabling the or zero-shot tasks is leveraeged to generate data, suggesting that
models trained upon it to produce fluent and consistent stories. these models need to be sufficiently large to possess a certain level
UltraChat is created using two distinct ChatGPT Turbo APIs to of reasoning or data generation capabilities [107, 232]. Otherwise,
craft informative and realistic multi-turn dialogues. One simulates data synthesis or augmentation with smaller models may not be
the user to produce queries while the other crafts the response[44]. sufficient to enhance teh specific capabilities of LLMs. Hence, the
Additionally, Baize[222] employs self-distillation techniques that methods of data synthesis and augmentation seriously rely on the
incorporate feedback, enabling the model to discern nuance within abilities of LLMs themselves.
the feedback and carry out fine-grained optimization. DIALOGIC[122] Complicating Evaluation and Depollution in Model Train-
efficiently scales a limited dialogue dataset with minimal to zero ing. The use of synthetic data in model training can significantly
human involvement and parameter adjustments, addressing the complicate fair evaluation. Evaluation benchmarks are often created
challenge of low-resource scenarios. DISC-MedLLM[10] leverages based on public text sources (such as course websites or forums),
ChatGPT to rephrase existing medical NLP datasets to provide ac- and the introduction of synthetic data can exacerbate this issue.
curate and truthful medical responses in end-to-end conversational While the community has proposed several techniques for detecting
healthcare services. In ReST[66], multiple steps of negative log- evaluation pollution, these token-based depollution methods may
likelihood training with progressively increasing filtering thresh- prove ineffective when synthetic data is involved in model training
olds in the improve step and the human evaluation scores in the [65]. It is recommended that model developers invest in creating
grow step lead to continuous improvements in the model’s per- and maintaining internal and protected evaluation benchmarks to
formance. Self-translate-train[170] enhances cross-lingual perfor- ensure the integrity of the evaluation process.
mance by effectively harnessing the model’s inherent translation Uncertainty and Search Complexity in RLAIF. Preference
capabilities. TinyDialogues[56] discovered that a diverse blend of alignment has progressed from human feedback to AI-generated
data sources outperforms homogeneous conversation data, indi- feedback. Despite these advancements, current methodologies re-
cating that high-quality synthetic conversation data yields better main largely dependent on passive feedback mechanisms, where
performance than natural conversation data. meaningful insights are only gathered through active evaluation
19
and querying of the model’s performance [106]. This poses a signif- to unreliable outputs [153]. If a dataset used for training contains
icant challenge, as the expansive search space demands a consider- biased samples, the synthesized data will likely perpetuate those
able number of test samples to acquire valuable feedback. Moreover, biases.
a critical concern with AI-generated feedback is the uncertainty Distribution Inconsistency. The inconsistency between the
regarding its true alignment with human preferences. To date, no distributions of synthetic and human-generated data is crucial for
definitive evidence has been presented to confirm that AI feedback LLMs. When there is a significant mismatch, models may strug-
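Token-based decontamination is typically an n-gram overlap test between training documents and benchmark items, which clarifies why paraphrased synthetic data can slip through. A minimal illustration follows; the 13-gram window is a common choice in practice, but both the window size and the helpers here are assumptions:

```python
def ngrams(text: str, n: int = 13) -> set:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc: str, benchmark_items: list, n: int = 13) -> bool:
    """Flag a training document if it shares any n-gram with a benchmark
    item. Paraphrased or synthesized variants of benchmark content evade
    exactly this kind of surface-level check, which is the concern above."""
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)
```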
can reliably function as an effective alignment signal, raising doubts gle to understand and generate responses that resonate with users
about its credibility and validity. [221]. First, biases in the original data and LLMs may generate
Unstable and Inconsistent Logical Paths. In applications that inappropriate data that do not reflect the natural distribution of lan-
depend significantly on logical reasoning, researchers strive to gen- guage usage, leading to distribution discrepancies. Second, the lack
erate synthetic reasoning paths [145, 186, 248, 253, 254] to enhance of context in synthetic and augmented data causes the generated
the logical capabilities of large language models (LLMs), often ver- phrases or sentences to be out of alignment with the pragmatics of
ifying only a subset of the steps or only the answers. However, human communication.
verifying the correctness and consistency of every step in these
reasoning paths remains challenging, particularly in knowledge-
intensive disciplines. Integrating knowledge bases or formal logical 5.3 Impact of Data Synthesis and Augmentation
reasoning tools into data generation systems may facilitate the In addition to the quality and methodological factors of the data
creation of more stable and consistent logical paths. itself, data synthesis and augmentation can also have a certain
impact on the external environment, involving individuals, groups,
societies, and human values.
5.2 Data Quality Privacy. Synthetic and augmented data may contain traces or
Data quality is paramount particularly when it comes to synthetic patterns derived from real individuals, thereby compromising user
and augmented data for LLMs. Unlike the diverse, credible, high- privacy. This challenge is exacerbated in applications where the
quality real data that already exists, the nature of data synthesis generated data mimics sensitive attributes, making it difficult to en-
and augmentation may influence the quality the generated data. sure complete anonymity [209, 228]. Moreover, to address concerns
Data Diversity. Many current methods rely on predefined tem- over ownership and authenticity, data watermarking techniques
plates or algorithms that may inadvertently produce homogeneous are being explored. These techniques allow for the embedding of
outputs. This homogeneity restricts the model’s exposure to diverse identifiable information within synthetic data, enabling traceability
linguistic patterns and contexts, ultimately hindering its ability to and accountability. However, the implementation of such systems
generalize to varied real-world inputs [15, 33]. In addition, synthetic must balance the need for privacy with the ability to verify data
and augmented data often fails to capture the intricate nuances of provenance, raising questions about the feasibility of effective wa-
real-world language use, such as idiomatic expressions, cultural ref- termarking without compromising data utility.
erences, and contextual variations. This shortcoming arises because Security. While synthetic and augmented data is often seen as
synthetic and augmented data may not incorporate the richness a solution to privacy concerns, its security is not guaranteed. The
and diversity found in authentic conversations and interactions. generation process can be exploited by malicious actors to craft
Consequently, models trained predominantly on these data may adversarial examples that deceive LLMs [257]. Moreover, these
struggle with the subtleties of language that are crucial for effective generated data can sometimes be reverse-engineered, revealing pat-
communication and comprehension [155]. terns that can lead to the reconstruction of sensitive real-world data.
Long-tail Phenomenon. The most significant improvements in Another potential risk is that synthetic and augmented data can
data generated by LLMs are observed in languages that are widely also serve as a vector for introducing vulnerabilities into models.
used. Conversely, there may be only slight advancements in con- Attackers may embed backdoors into the data generation process,
texts where languages are less frequently used. As a consequence, leading to models that behave unpredictably under specific condi-
the methods for data synthesis and augmentation might struggle to tions. Such vulnerabilities can be exploited for nefarious purposes,
effectively handle rare or innovative instructions [217, 267]. This raising significant security concerns [216].
causes the long-tail problem that the distribution of language usage Social Impacts. Data synthetic and augmentation introduce a
is highly skewed. Common phrases and contexts may dominate myriad of legal challenges. For instance, the authenticity of gen-
the synthetic and augmented data, while rare or specialized cases erated data can complicate issues of intellectual property rights
receive little to no representation. This can lead to significant per- and data ownership [159]. Furthermore, the misuse of data poses a
formance drops in scenarios involving infrequent language use, significant threat to the dissemination of misleading information.
such as technical jargon or niche topics. In particular, its application in mimicking real individuals, and ma-
Reliability. The reliability of synthetic data cannot be effectively nipulating public opinion, carries grave risks. An important fact
guaranteed [62, 140]. For example, despite conforming to linguistic is that these data can inadvertently perpetuate or amplify societal
rules, is fluent, and is aligned with human preferences, there are biases. If LLMs are trained on biased data, the outputs derived from
still many problems in the fact and timeliness of the synthetic LLMs may reflect and reinforce harmful stereotypes. Moreover,
content, and it is not possible to make a reliable evaluation of cultural norms vary significantly across societies; thus, synthetic
the synthesized content. More important, synthesized data can and augmented data that is appropriate in one context may be of-
reflect biases or inaccuracies from the original datasets, leading fensive or inappropriate in another. The socio-cultural implications
20
of data should be carefully considered to avoid entrenching existing synthesis of large-scale pre-training datasets is expected to enhance
inequalities [182]. the comprehensive capabilities of pre-trained models [93].
Robust Quality Evaluation Metrics. Data synthesis and aug-
5.4 Impact on Different Applications and Tasks mentation need to ensure the quality of generated data [32]. Current
methods lack standardized metrics that can adequately evaluate
Data synthesis and augmentation techniques play a crucial role in
the diversity and relevance of synthesized datasets. Future research
enhancing the performance of LLMs across diverse tasks. However,
should focus on developing robust metrics that assess not just the
the effectiveness of these methods can vary significantly depending
linguistic quality of generated text but also its contextual appropri-
on their implementation and the specific characteristics of the
ateness, diversity, and potential biases.
synthetic and augmented data.
Generalization. While synthetic and augmented data can be useful for training models in scenarios where labeled data is scarce, their ability to support generalization to unseen examples often falls short. This limitation arises because the generated data may not fully capture the complexity and variability present in real-world data. As a result, models trained predominantly on such data may perform poorly in new, real-world situations, a phenomenon best described as overfitting to the specific patterns of the synthetic data. For instance, in the medical domain, data synthesized from one healthcare system might not be applicable to another due to differing protocols or patient demographics [230, 231].
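A simple way to make this failure mode visible is to train on synthetic data and compare accuracy on a held-out synthetic test set against a real test set. The sketch below does this with a toy TF-IDF text classifier; the datasets and the scikit-learn pipeline are illustrative stand-ins, not the evaluation protocol of any cited work.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

synthetic_train = [("the invoice total is wrong", 1), ("great product", 0),
                   ("billing error on my account", 1), ("works as expected", 0)]
synthetic_test = [("the charge is incorrect", 1), ("love it", 0)]
real_test = [("u billed me twice?? fix this", 1), ("ok i guess", 0)]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
X, y = zip(*synthetic_train)
model.fit(X, y)

def accuracy(pairs):
    X, y = zip(*pairs)
    return sum(p == t for p, t in zip(model.predict(X), y)) / len(y)

# A large positive gap signals overfitting to synthetic patterns.
gap = accuracy(synthetic_test) - accuracy(real_test)
print(f"generalization gap (synthetic - real): {gap:.2f}")
```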
Transferability and Domain Adaptation. Synthetic and augmented data can be designed to mimic various domains, but transfer performance across domains remains a challenge. Data patterns learned in one domain may fail to carry over to another because of discrepancies between the synthetic and real data distributions [127, 206]. Generated data often embodies specific characteristics of the source domain that do not fully align with the target domain, and this misalignment can hinder a model's ability to transfer its knowledge effectively. For instance, a model trained on synthetic dialogues that simulate formal language may perform poorly when faced with informal conversational contexts in real-world applications.
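A cheap proxy for such distribution discrepancy is a classifier two-sample test: train a classifier to distinguish synthetic from real text, and treat accuracy well above chance as evidence of domain shift. The toy corpora and the TF-IDF plus logistic-regression choice below are assumptions made for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

synthetic = ["Good day. I wish to report a billing discrepancy.",
             "Kindly provide the requested documentation."]
real = ["hey my bill looks weird lol", "can u send me the docs"]

texts = synthetic + real
labels = [0] * len(synthetic) + [1] * len(real)  # 0 = synthetic, 1 = real
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
score = cross_val_score(clf, texts, labels, cv=2).mean()
print(f"domain discriminability: {score:.2f} (0.5 = distributions overlap)")
```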
5.5 Future Directions
Multi-Modal Synthesis. Multi-modal data synthesis addresses the integration of diverse data types (e.g., text, images, audio, and sensor data) to create richer and more informative data [87]. It helps produce diverse training samples that improve model robustness against overfitting and input variability, and such samples can effectively alleviate data scarcity in specific scenarios like healthcare and autonomous systems.
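At its simplest, multi-modal augmentation means expanding paired samples while keeping the modalities aligned. The sketch below crosses basic image transforms with caption variants; the hard-coded caption rewrites stand in for an LLM paraphraser and are an assumption for illustration.

```python
from PIL import Image, ImageOps

def augment_pair(image, caption):
    image_variants = [image, ImageOps.mirror(image), image.rotate(15)]
    caption_variants = [caption, caption.lower(), f"A photo of {caption.lower()}"]
    # The cross-product keeps image and text aligned within each new pair.
    return [(img, cap) for img in image_variants for cap in caption_variants]

sample_image = Image.new("RGB", (64, 64), color="gray")  # placeholder image
pairs = augment_pair(sample_image, "A dog catching a frisbee")
print(f"{len(pairs)} paired samples from one original")
```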
Real-time Synthesis. Advances in real-time data synthesis will allow for dynamic generation, enabling applications such as virtual assistants and interactive gaming environments to adapt seamlessly to user inputs [93]. This capability will enhance user engagement and create more personalized experiences across various platforms.
Domain Model Distillation. Existing investigations mainly focus on synthesizing data with a general-purpose model, such as GPT. However, domain-specific models have stronger capabilities within their respective domains than general models, so leveraging them to synthesize domain data is expected to further enhance the capabilities of LLMs within that domain [124, 146, 156].
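A minimal sketch of this idea, assuming a hypothetical `domain_teacher` callable that wraps any domain-specialized LLM: the teacher generates instruction-response pairs that become fine-tuning data for a general student model. Seed topics, prompts, and the quality gate are illustrative placeholders.

```python
import json

SEED_TOPICS = ["drug interactions", "triage protocols", "lab result ranges"]

def synthesize_domain_data(domain_teacher, n_per_topic=2):
    records = []
    for topic in SEED_TOPICS:
        for _ in range(n_per_topic):
            question = domain_teacher(f"Write one exam-style question about {topic}.")
            answer = domain_teacher(f"Answer concisely: {question}")
            if answer and len(answer.split()) > 3:  # crude quality gate
                records.append({"instruction": question, "response": answer})
    return records

# Stand-in teacher for demonstration; replace with a real domain model call.
fake_teacher = lambda prompt: f"[domain model output for: {prompt[:40]}...]"
with open("domain_distilled.jsonl", "w") as f:
    for rec in synthesize_domain_data(fake_teacher):
        f.write(json.dumps(rec) + "\n")
```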
Large-Scale Synthesis. Pre-training a large model requires a vast amount of high-quality data, yet existing data synthesis and augmentation methods offer limited scalability in generating it. The synthesis of large-scale pre-training datasets is therefore expected to enhance the comprehensive capabilities of pre-trained models [93].
Robust Quality Evaluation Metrics. Data synthesis and augmentation need to ensure the quality of generated data [32]. Current methods lack standardized metrics that can adequately evaluate the diversity and relevance of synthesized datasets. Future research should focus on developing robust metrics that assess not just the linguistic quality of generated text but also its contextual appropriateness, diversity, and potential biases; one simple diversity ingredient is sketched below.
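The sketch computes distinct-n, the ratio of unique n-grams to total n-grams, as a cheap diversity probe for a synthetic corpus. The tokenization and n are illustrative choices; a robust metric would combine this with relevance and bias measurements.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across a corpus."""
    total, unique = 0, set()
    for text in texts:
        tokens = text.lower().split()
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

corpus = ["The model answers the question.",
          "The model answers the question.",  # duplicate lowers diversity
          "A verifier re-checks every generated proof."]
print(f"distinct-2 = {distinct_n(corpus):.2f}")
```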
Ethical Considerations and Responsible Data Synthesis and Augmentation. Ethical considerations and responsible practices in data synthesis and augmentation for LLMs are critical to ensuring the integrity and fairness of AI systems. As these methods increasingly influence model training and outcomes, key issues such as bias propagation, misinformation, and data misuse become pressing. It is essential to establish ethical guidelines that govern the generation and application of synthetic and augmented data, promoting transparency, accountability, and inclusivity [89].

6 Conclusion
Data synthesis and augmentation are essential for advancing LLMs, particularly in meeting their need for large-scale, high-quality data. This survey provides a comprehensive review of LLM-oriented data synthesis and augmentation techniques, systematically exploring their applications across the entire lifecycle and core functions of LLMs, while building a framework that connects existing research, highlights key methods, and clarifies strengths and limitations. We envision that advancements in LLM-oriented data synthesis and augmentation methods will unlock new possibilities to enhance data efficiency, improve generalization across tasks, and drive the evolution of data-centric AI. We hope this survey serves as a foundation for future research, inspiring innovation and progress in the field of LLM-oriented data synthesis and augmentation.

Acknowledgement
This research was supported by the "Pioneer" and "Leading Goose" R&D Program of Zhejiang (No. 2024C01020), the National Key R&D Program of China (No. 2023YFF0725600), and the National Science Foundation of China (No. 62406015).

References
[1] Yelaman Abdullin, Diego Mollá Aliod, Bahadorreza Ofoghi, John Yearwood, and Qingyang Li. 2024. Synthetic Dialogue Dataset Generation using LLM Agents. arXiv abs/2401.17461 (2024).
[2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
[3] Meysam Alizadeh, Maël Kubli, Zeynab Samei, Shirin Dehghani, Juan Diego Bermeo, Maria Korobeynikova, and Fabrizio Gilardi. 2023. Open-source large language models outperform crowd workers and approach ChatGPT in text-annotation tasks. arXiv preprint arXiv:2307.02179 101 (2023).
[4] Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou, and Weizhu Chen. 2023. Learning from mistakes makes llm better reasoner. arXiv preprint arXiv:2310.20689 (2023).
[5] Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. 2021. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861 (2021).
[6] Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q Jiang, Jia Deng, Stella Biderman, and Sean Welleck. 2023. Llemma: An open language model for mathematics. arXiv preprint arXiv:2310.10631 (2023).
[7] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 (2022).
[8] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073 (2022).
[9] Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 675–718.
[10] Zhijie Bao, Wei Chen, Shengze Xiao, Kuang Ren, Jiaao Wu, Cheng Zhong, Jiajie Peng, Xuanjing Huang, and Zhongyu Wei. 2023. DISC-MedLLM: Bridging General Large Language Models and Real-World Medical Consultation. arXiv abs/2308.14346 (2023).
[11] Thales Bertaglia, Stefan Huber, Catalina Goanta, Gerasimos Spanakis, and Adriana Iamnitchi. 2023. Closing the loop: Testing chatgpt to generate model explanations to improve human labelling of sponsored content on social media. In World Conference on Explainable Artificial Intelligence. Springer, 198–213.
[12] Gagan Bhatia, El Moatez Billah Nagoudi, Hasan Cavusoglu, and Muhammad Abdul-Mageed. 2024. FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models. arXiv:2402.10986 [cs]
[13] Tom B Brown. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020).
[14] Daniele Calandriello, Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, et al. 2024. Human alignment of large language models through online preference optimisation. arXiv preprint arXiv:2403.08635 (2024).
[15] Jan Cegin, Branislav Pecher, Jakub Simko, Ivan Srba, Maria Bielikova, and Peter Brusilovsky. 2024. Effects of diversity incentives on sample diversity and downstream model performance in LLM-based text augmentation. arXiv preprint arXiv:2401.06643 (2024).
[16] Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. 2024. Scaling synthetic data creation with 1,000,000,000 personas. arXiv preprint arXiv:2406.20094 (2024).
[17] Sahil Chaudhary. 2023. Code alpaca: An instruction-following llama model for code generation. GitHub repository (2023).
[18] Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. 2024. Benchmarking large language models in retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 17754–17762.
[19] Jiuhai Chen, Rifaa Qadri, Yuxin Wen, Neel Jain, John Kirchenbauer, Tianyi Zhou, and Tom Goldstein. 2024. GenQA: Generating Millions of Instructions from a Handful of Prompts. arXiv preprint arXiv:2406.10323 (2024).
[20] Junying Chen, Xidong Wang, Anningzhe Gao, Feng Jiang, Shunian Chen, Hongbo Zhang, Dingjie Song, Wenya Xie, Chuyi Kong, Jianquan Li, Xiang Wan, Haizhou Li, and Benyou Wang. 2023. HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs. arXiv:2311.09774 [cs] http://arxiv.org/abs/2311.09774
[21] Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. 2023. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793 (2023).
[22] Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, et al. 2023. Alpagasus: Training a better alpaca with fewer data. arXiv preprint arXiv:2307.08701 (2023).
[23] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
[24] Wei-Ge Chen, Irina Spiridonova, Jianwei Yang, Jianfeng Gao, and Chunyuan Li. 2023. Llava-interactive: An all-in-one demo for image chat, segmentation, generation and editing. arXiv preprint arXiv:2311.00571 (2023).
[25] Yirong Chen, Zhenyu Wang, Xiaofen Xing, Zhipei Xu, Kai Fang, Junhong Wang, Sihang Li, Jieling Wu, Qi Liu, Xiangmin Xu, et al. 2023. Bianque: Balancing the questioning and suggestion ability of health llms with multi-turn health conversations polished by chatgpt. arXiv preprint arXiv:2310.15896 (2023).
[26] Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. 2024. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335 (2024).
[27] Zeming Chen, Qiyue Gao, Antoine Bosselut, Ashish Sabharwal, and Kyle Richardson. 2022. DISCO: Distilling counterfactuals with large language models. arXiv preprint arXiv:2212.10534 (2022).
[28] Ethan Chern, Haoyang Zou, Xuefeng Li, Jiewen Hu, Kuhua Feng, Junlong Li, and Pengfei Liu. 2023. Generative AI for Math: Abel. (2023).
[29] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023) 2, 3 (2023), 6.
[30] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. 2024. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132 (2024).
[31] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1–113.
[32] Vikram S Chundawat, Ayush K Tarun, Murari Mandal, Mukund Lahoti, and Pratik Narang. 2022. A universal metric for robust evaluation of synthetic tabular data. IEEE Transactions on Artificial Intelligence 5, 1 (2022), 300–309.
[33] John Joon Young Chung, Ece Kamar, and Saleema Amershi. 2023. Increasing diversity while maintaining accuracy: Text data generation with large language models and human interventions. arXiv preprint arXiv:2306.04140 (2023).
[34] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. 2019. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 113–123.
[35] Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. 2023. Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377 (2023).
[36] Jiaxi Cui, Zongjian Li, Yang Yan, Bohua Chen, and Li Yuan. 2023. Chatlaw: Open-source legal large language model with integrated external knowledge bases. arXiv preprint arXiv:2306.16092 (2023).
[37] Haixing Dai, Zhengliang Liu, Wenxiong Liao, Xiaoke Huang, Zihao Wu, Lin Zhao, Wei Liu, Ninghao Liu, Sheng Li, Dajiang Zhu, et al. 2023. Chataug: Leveraging chatgpt for text data augmentation. arXiv preprint arXiv:2302.13007 1, 2 (2023).
[38] Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. 2023. Safe rlhf: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773 (2023).
[39] Dhairya Dalal, Marco Valentino, André Freitas, and Paul Buitelaar. 2024. Inference to the Best Explanation in Large Language Models. arXiv preprint arXiv:2402.10767 (2024).
[40] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. 2024. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv preprint arXiv:2404.16130 (2024).
[41] Jacob Devlin. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[42] Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34 (2021), 8780–8794.
[43] Bosheng Ding, Chengwei Qin, Ruochen Zhao, Tianze Luo, Xinze Li, Guizhen Chen, Wenhan Xia, Junjie Hu, Anh Tuan Luu, and Shafiq Joty. 2024. Data augmentation using llms: Data perspectives, learning paradigms and challenges. arXiv preprint arXiv:2403.02990 (2024).
[44] Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2023. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233 (2023).
[45] Tanay Dixit, Bhargavi Paranjape, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. CORE: A retrieve-then-edit framework for counterfactual data generation. arXiv preprint arXiv:2210.04873 (2022).
[46] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey on in-context learning. arXiv preprint arXiv:2301.00234 (2022).
[47] Yi Dong, Zhilin Wang, Makesh Narsimhan Sreedhar, Xianchao Wu, and Oleksii Kuchaiev. 2023. Steerlm: Attribute conditioned sft as an (user-steerable) alternative to rlhf. arXiv preprint arXiv:2310.05344 (2023).
[48] Jacob Doughty, Zipiao Wan, Anishka Bompelli, Jubahed Qayum, Taozhi Wang, Juran Zhang, Yujia Zheng, Aidan Doyle, Pragnya Sridhar, Arav Agarwal, Christopher Bogart, Eric Keylor, Can Kultur, Jaromir Savelka, and Majd Sakr. 2024. A Comparative Study of AI-Generated (GPT-4) and Human-crafted MCQs in Programming Education. In Proceedings of the 26th Australasian Computing Education Conference. Association for Computing Machinery, 114–123.
[49] Yifan Du, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Jinpeng Wang, Chuyuan Wang, Mingchen Cai, Ruihua Song, and Ji-Rong Wen. 2023. What makes for good visual instructions? synthesizing complex visual reasoning instructions for visual instruction tuning. arXiv preprint arXiv:2311.01487 (2023).
[50] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024).
[51] Lisa Dunlap, Alyssa Umino, Han Zhang, Jiezhi Yang, Joseph E Gonzalez, and Trevor Darrell. 2023. Diversify your vision datasets with automatic diffusion-based augmentation. Advances in neural information processing systems 36 (2023), 79024–79034.
[52] Aryaz Eghbali and Michael Pradel. 2024. De-hallucinator: Iterative grounding for llm-based code completion. arXiv preprint arXiv:2401.01701 (2024).
[53] Ronen Eldan and Yuanzhi Li. 2023. Tinystories: How small can language models be and still speak coherent english? arXiv preprint arXiv:2305.07759 (2023).
[54] Yunhao Fang, Ligeng Zhu, Yao Lu, Yan Wang, Pavlo Molchanov, Jang Hyun Cho, Marco Pavone, Song Han, and Hongxu Yin. 2024. VILA^2: VILA Augmented VILA. arXiv preprint arXiv:2407.17453 (2024).
[55] Steven Y Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. 2021. A Survey of Data Augmentation Approaches for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 968–988.
[56] Steven Y Feng, Noah D Goodman, and Michael C Frank. 2024. Is Child-Directed Speech Effective Training Data for Language Models? arXiv preprint arXiv:2408.03617 (2024).
[57] Luciano Floridi and Massimo Chiriatti. 2020. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines 30 (2020), 681–694.
[58] Saumya Gandhi, Ritu Gala, Vijay Viswanathan, Tongshuang Wu, and Graham Neubig. 2024. Better Synthetic Data by Retrieving and Transforming Existing Datasets. arXiv preprint arXiv:2404.14361 (2024).
[59] Bofei Gao, Feifan Song, Yibo Miao, Zefan Cai, Zhe Yang, Liang Chen, Helan Hu, Runxin Xu, Qingxiu Dong, Ce Zheng, et al. 2024. Towards a Unified View of Preference Learning for Large Language Models: A Survey. arXiv preprint arXiv:2409.02795 (2024).
[60] Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. 2020. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462 (2020).
[61] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. 2021. Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2918–2928.
[62] Hamed Babaei Giglou, Jennifer D'Souza, and Sören Auer. 2024. LLMs4Synthesis: Leveraging Large Language Models for Scientific Synthesis. arXiv preprint arXiv:2409.18812 (2024).
[63] Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. 2023. ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences 120, 30 (2023), e2305016120.
[64] Alan R Gillespie, Anne B Kahle, and Richard E Walker. 1987. Color enhancement of highly correlated images. II. Channel ratio and "chromaticity" transformation techniques. Remote Sensing of Environment 22, 3 (1987), 343–365.
[65] Shahriar Golchin and Mihai Surdeanu. 2023. Time travel in llms: Tracing data contamination in large language models. arXiv preprint arXiv:2308.08493 (2023).
[66] Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. 2023. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998 (2023).
[67] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. 2023. Textbooks are all you need. arXiv preprint arXiv:2306.11644 (2023).
[68] Biyang Guo, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan Ding, Jianwei Yue, and Yupeng Wu. 2023. How close is chatgpt to human experts? comparison corpus, evaluation, and detection. arXiv preprint arXiv:2301.07597 (2023).
[69] Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, et al. 2024. Direct language model alignment from online ai feedback. arXiv preprint arXiv:2402.04792 (2024).
[70] Patrick Haluptzok, Matthew Bowers, and Adam Tauman Kalai. 2022. Generating Programming Puzzles to Train Language Models. In Deep Learning for Code Workshop.
[71] Patrick Haluptzok, Matthew Bowers, and Adam Tauman Kalai. 2022. Language models can teach themselves to program better. arXiv preprint arXiv:2207.14502 (2022).
[72] Patrick Haluptzok, Matthew Bowers, and Adam Tauman Kalai. 2023. Language Models Can Teach Themselves to Program Better. In International Conference on Learning Representations (ICLR).
[73] Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, and Hanwang Zhang. 2023. Chartllama: A multimodal llm for chart understanding and generation. arXiv preprint arXiv:2311.16483 (2023).
[74] Jing Hao, Yuxiang Zhao, Song Chen, Yanpeng Sun, Qiang Chen, Gang Zhang, Kun Yao, Errui Ding, and Jingdong Wang. 2024. FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs. arXiv preprint arXiv:2409.13540 (2024).
[75] Xiaoshuai Hao, Yi Zhu, Srikar Appalaraju, Aston Zhang, Wanqian Zhang, Bo Li, and Mu Li. 2023. Mixgen: A new multi-modal data augmentation. In Proceedings of the IEEE/CVF winter conference on applications of computer vision. 379–389.
[76] Ruifei He, Shuyang Sun, Xin Yu, Chuhui Xue, Wenqing Zhang, Philip Torr, Song Bai, and Xiaojuan Qi. 2022. Is synthetic data from generative models ready for image recognition? arXiv preprint arXiv:2210.07574 (2022).
[77] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874 (2021).
[78] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020), 6840–6851.
[79] Jixiang Hong, Quan Tu, Changyu Chen, Xing Gao, Ji Zhang, and Rui Yan. 2023. Cyclealign: Iterative distillation from black-box llm to white-box models for better human alignment. arXiv preprint arXiv:2310.16271 (2023).
[80] Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. 2022. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689 (2022).
[81] Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2022. Large language models can self-improve. arXiv preprint arXiv:2210.11610 (2022).
[82] Quzhe Huang, Mingxu Tao, Zhenwei An, Chen Zhang, Cong Jiang, Zhibin Chen, Zirui Wu, and Yansong Feng. 2023. Lawyer LLaMA Technical Report. arXiv abs/2305.15062 (2023).
[83] Yanxian Huang, Wanjun Zhong, Ensheng Shi, Min Yang, Jiachi Chen, Hui Li, Yuchi Ma, Qianxiang Wang, Zibin Zheng, and Yanlin Wang. 2024. Agents in Software Engineering: Survey, Landscape, and Vision. arXiv preprint arXiv:2409.09030 (2024).
[84] Hiroshi Inoue. 2018. Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929 (2018).
[85] Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. 2024. Beavertails: Towards improved safety alignment of llm via a human-preference dataset. Advances in Neural Information Processing Systems 36 (2024).
[86] Albert Q. Jiang, Wenda Li, and Mateja Jamnik. 2023. Multilingual Mathematical Autoformalization. arXiv abs/2311.03755 (2023).
[87] Lan Jiang, Ye Mao, Xiangfeng Wang, Xi Chen, and Chao Li. 2023. Cola-diff: Conditional latent diffusion model for multi-modal mri synthesis. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 398–408.
[88] Ziyan Jiang. 2024. LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs. arXiv preprint arXiv:2406.15319 (2024).
[89] Junfeng Jiao, Saleh Afroogh, Yiming Xu, and Connor Phillips. 2024. Navigating llm ethics: Advancements, challenges, and future directions. arXiv preprint arXiv:2406.18841 (2024).
[90] Glenn Jocher. 2020. YOLOv5 by Ultralytics. https://doi.org/10.5281/zenodo.3908559
[91] Jaehun Jung, Peter West, Liwei Jiang, Faeze Brahman, Ximing Lu, Jillian Fisher, Taylor Sorensen, and Yejin Choi. 2024. Impossible Distillation for Paraphrasing and Summarization: How to Make High-quality Lemonade out of Small, Low-quality Model. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 4439–4454.
[92] Agastya Kalra, Guy Stoppi, Bradley Brown, Rishav Agarwal, and Achuta Kadambi. 2021. Towards rotation invariance in object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3530–3540.
[93] S Karunya, M Jalakandeshwaran, Thanuja Babu, and R Uma. 2023. AI-Powered Real-Time Speech-to-Speech Translation for Virtual Meetings Using Machine Learning Models. In 2023 Intelligent Computing and Control for Engineering and Business Systems (ICCEBS). IEEE, 1–6.
[94] Zaid Khan, Vijay Kumar BG, Samuel Schulter, Xiang Yu, Yun Fu, and Manmohan Chandraker. 2023. Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15005–15015.
[95] Sosuke Kobayashi. 2018. Contextual augmentation: Data augmentation by words with paradigmatic relations. arXiv preprint arXiv:1805.06201 (2018).
[96] Abdullatif Köksal, Timo Schick, Anna Korhonen, and Hinrich Schütze. 2023. Longform: Optimizing instruction tuning for long text generation with corpus extraction. arXiv preprint arXiv:2304.08460 (2023).
[97] Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Nguyen, Oliver Stanley, Richárd Nagyfi, et al. 2024. Openassistant conversations-democratizing large language model alignment. Advances in Neural Information Processing Systems 36 (2024).
[98] Mario Michael Krell and Su Kyoung Kim. 2017. Rotational data augmentation for electroencephalographic data. In 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 471–474.
[99] Tiziano Labruna, Sofia Brenna, Andrea Zaninello, and Bernardo Magnini. 2023. Unraveling chatgpt: A critical analysis of ai-generated goal-oriented dialogues
and annotations. In International Conference of the Italian Association for Artificial Intelligence. Springer, 151–171.
[100] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. 2024. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9579–9589.
[101] Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. 2024. Rewardbench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787 (2024).
[102] Ehsan Latif, Luyang Fang, Ping Ma, and Xiaoming Zhai. 2024. Knowledge Distillation of LLMs for Automatic Scoring of Science Assessments. Springer Nature Switzerland. 166–174 pages.
[103] Siddique Latif, Muhammad Usama, Mohammad Ibrahim Malik, and Björn W Schuller. 2023. Can large language models aid in annotating speech emotional data? uncovering new frontiers. arXiv preprint arXiv:2307.06090 (2023).
[104] Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. 2022. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems 35 (2022), 21314–21328.
[105] Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2023. Bloom: A 176b-parameter open-access multilingual language model. (2023).
[106] Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. 2023. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267 (2023).
[107] Nicholas Lee, Thanakul Wattanawong, Sehoon Kim, Karttikeya Mangalam, Sheng Shen, Gopala Anumanchipali, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. 2024. Llm2llm: Boosting llms with novel iterative data enhancement. arXiv preprint arXiv:2403.15042 (2024).
[108] Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. 2022. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems 35 (2022), 3843–3857.
[109] Chen Li, Weiqi Wang, Jingcheng Hu, Yixuan Wei, Nanning Zheng, Han Hu, Zheng Zhang, and Houwen Peng. 2024. Common 7b language models already possess strong math capabilities. arXiv preprint arXiv:2403.04706 (2024).
[110] Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. 2024. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems 36 (2024).
[111] Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, et al. 2024. Synthetic data (almost) from scratch: Generalized instruction tuning for language models. arXiv preprint arXiv:2402.13064 (2024).
[112] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning. PMLR, 12888–12900.
[113] Jia Li, Yunfei Zhao, Yongmin Li, Ge Li, and Zhi Jin. 2023. AceCoder: Utilizing Existing Code to Enhance Code Generation. CoRR (2023).
[114] Ming Li, Lichang Chen, Jiuhai Chen, Shwai He, Jiuxiang Gu, and Tianyi Zhou. 2024. Selective reflection-tuning: Student-selected data recycling for llm instruction-tuning. arXiv preprint arXiv:2402.10110 (2024).
[115] Ming Li, Lichang Chen, Jiuhai Chen, Shwai He, Heng Huang, Jiuxiang Gu, and Tianyi Zhou. 2023. Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning. https://arxiv.org/abs/2310.11716
[116] Minzhi Li, Taiwei Shi, Caleb Ziems, Min-Yen Kan, Nancy F Chen, Zhengyuan Liu, and Diyi Yang. 2023. Coannotating: Uncertainty-guided work allocation between human and large language models for data annotation. arXiv preprint arXiv:2310.15638 (2023).
[117] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
[118] Sihang Li, Jian Huang, Jiaxi Zhuang, Yaorui Shi, Xiaochen Cai, Mingjun Xu, Xiang Wang, Linfeng Zhang, Guolin Ke, and Hengxing Cai. 2024. SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding. arXiv preprint arXiv:2408.15545 (2024).
[119] Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Luke Zettlemoyer, Omer Levy, Jason Weston, and Mike Lewis. 2023. Self-alignment with instruction backtranslation. arXiv preprint arXiv:2308.06259 (2023).
[120] Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. 2023. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463 (2023).
[121] Yanda Li, Chi Zhang, Gang Yu, Zhibin Wang, Bin Fu, Guosheng Lin, Chunhua Shen, Ling Chen, and Yunchao Wei. 2023. Stablellava: Enhanced visual instruction tuning with synthesized image-dialogue data. arXiv preprint arXiv:2308.10253 (2023).
[122] Zekun Li, Wenhu Chen, Shiyang Li, Hong Wang, Jing Qian, and Xifeng Yan. 2022. Controllable dialogue simulation with in-context learning. arXiv preprint arXiv:2210.04185 (2022).
[123] Xiao Liang, Xinyu Hu, Simiao Zuo, Yeyun Gong, Qiang Lou, Yi Liu, Shao-Lun Huang, and Jian Jiao. 2024. Task Oriented In-Domain Data Augmentation. arXiv preprint arXiv:2406.16694 (2024).
[124] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let's verify step by step. arXiv preprint arXiv:2305.20050 (2023).
[125] Bingqian Lin, Yunshuang Nie, Ziming Wei, Jiaqi Chen, Shikui Ma, Jianhua Han, Hang Xu, Xiaojun Chang, and Xiaodan Liang. 2024. NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning. arXiv preprint arXiv:2403.07376 (2024).
[126] Zicheng Lin, Zhibin Gou, Tian Liang, Ruilin Luo, Haowei Liu, and Yujiu Yang. 2024. CriticBench: Benchmarking LLMs for Critique-Correct Reasoning. arXiv preprint arXiv:2402.14809 (2024).
[127] An Liu, Zonghan Yang, Zhenhe Zhang, Qingyuan Hu, Peng Li, Ming Yan, Ji Zhang, Fei Huang, and Yang Liu. 2024. PANDA: Preference Adaptation for Enhancing Domain-Specific Abilities of LLMs. arXiv preprint arXiv:2402.12835 (2024).
[128] Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, and Yi Zhang. 2023. Tinygsm: achieving >80% on gsm8k with small language models. arXiv preprint arXiv:2312.09241 (2023).
[129] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 26296–26306.
[130] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2024. Visual instruction tuning. Advances in neural information processing systems 36 (2024).
[131] Haoxiong Liu, Yifan Zhang, Yifan Luo, and Andrew C Yao. 2024. Augmenting Math Word Problems via Iterative Question Composing. In ICLR 2024 Workshop on Navigating and Addressing Data Problems for Foundation Models.
[132] June M. Liu, Donghao Li, He Cao, Tianhe Ren, Zeyi Liao, and Jiamin Wu. 2023. ChatCounselor: A Large Language Models for Mental Health Support. arXiv:2309.15461 [cs] http://arxiv.org/abs/2309.15461
[133] Linlin Liu, Xin Li, Ruidan He, Lidong Bing, Shafiq Joty, and Luo Si. 2021. Enhancing multilingual language model with massive multilingual knowledge triples. arXiv preprint arXiv:2111.10962 (2021).
[134] Ning Liu, Siavash Jafarzadeh, Brian Y Lattimer, Shuna Ni, Jim Lua, and Yue Yu. 2024. Large language models, physics-based modeling, experimental measurements: the trinity of data-scarce learning of polymer properties. arXiv preprint arXiv:2407.02770 (2024).
[135] Pei Liu, Xuemin Wang, Chao Xiang, and Weiye Meng. 2020. A survey of text data augmentation. In 2020 International Conference on Computer Communication and Network Security (CCNS). IEEE, 191–195.
[136] Ruibo Liu, Jason Wei, Shixiang Shane Gu, Te-Yen Wu, Soroush Vosoughi, Claire Cui, Denny Zhou, and Andrew M Dai. 2022. Mind's eye: Grounded language model reasoning through simulation. arXiv preprint arXiv:2210.05359 (2022).
[137] Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, and Andrew M. Dai. 2024. Best Practices and Lessons Learned on Synthetic Data for Language Models. arXiv:2404.07503 [cs]
[138] Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. 2023. Llava-plus: Learning to use tools for creating multimodal agents. arXiv preprint arXiv:2311.05437 (2023).
[139] Yinhan Liu. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[140] Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, and Haobo Wang. 2024. On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey. In Findings of the Association for Computational Linguistics ACL 2024. 11065–11082.
[141] Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. 2023. The flan collection: Designing data and methods for effective instruction tuning. In International Conference on Machine Learning. PMLR, 22631–22648.
[142] Bo-Ru Lu, Nikita Haduong, Chia-Hsuan Lee, Zeqiu Wu, Hao Cheng, Paul Koester, Jean Utke, Tao Yu, Noah A Smith, and Mari Ostendorf. 2023. DIALGEN: collaborative human-lm generated dialogues for improved understanding of human-human conversations. arXiv preprint arXiv:2307.07047 (2023).
[143] Yinquan Lu, Wenhao Zhu, Lei Li, Yu Qiao, and Fei Yuan. 2024. Llamax: Scaling linguistic horizons of llm by enhancing translation capabilities beyond 100 languages. arXiv preprint arXiv:2407.05975 (2024).
[144] Zimu Lu, Aojun Zhou, Houxing Ren, Ke Wang, Weikang Shi, Junting Pan, Mingjie Zhan, and Hongsheng Li. 2024. Mathgenie: Generating synthetic data with question back-translation for enhancing mathematical reasoning of llms.
arXiv preprint arXiv:2402.16352 (2024).
[145] Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. 2023. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583 (2023).
[146] Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2024. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. In International Conference on Learning Representations (ICLR).
[147] Runyuan Ma, Wei Li, and Fukai Shang. 2024. Investigating Public Fine-Tuning Datasets: A Complex Review of Current Practices from a Construction Perspective. arXiv preprint arXiv:2407.08475 (2024).
[148] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2024. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems 36 (2024).
[149] Kiran Maharana, Surajit Mondal, and Bhushankumar Nemade. 2022. A review: Data pre-processing and data augmentation techniques. Global Transitions Proceedings 3, 1 (2022), 91–99.
[150] Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, and Navdeep Jaitly. 2024. Rephrasing the web: A recipe for compute and data-efficient language modeling. arXiv preprint arXiv:2401.16380 (2024).
[151] Oscar Mañas, Benno Krojer, and Aishwarya Agrawal. 2024. Improving automatic vqa evaluation using large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 4171–4179.
[152] Arindam Mitra, Luciano Del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, Sahaj Agarwal, Xuxi Chen, Anastasia Razdaibiedina, Erik Jones, Kriti Aggarwal, et al. 2023. Orca 2: Teaching small language models how to reason. arXiv preprint arXiv:2311.11045 (2023).
[153] Devam Mondal and Carlo Lipizzi. 2024. Mitigating Large Language Model Bias: Automated Dataset Augmentation and Prejudice Quantification. Computers 13, 6 (2024), 141.
[154] Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. 2023. Orca: Progressive learning from complex explanation traces of gpt-4. arXiv preprint arXiv:2306.02707 (2023).
[155] Manish Nagireddy, Bernat Guillén Pegueroles, and Ioana Baldini. 2024. DARE to Diversify: DAta Driven and Diverse LLM REd Teaming. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 6420–6421.
[156] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. 2021. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332 (2021).
[157] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. A conversational paradigm for program synthesis. arXiv preprint arXiv:2203.13474 30 (2022).
[158] Övgü Özdemir and Erdem Akagündüz. 2024. Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1562–1571.
[159] Ajay Patel, Colin Raffel, and Chris Callison-Burch. 2024. DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows. arXiv preprint arXiv:2402.10379 (2024).
[160] Arkil Patel, Siva Reddy, Dzmitry Bahdanau, and Pradeep Dasigi. 2023. Evaluating In-Context Learning of Libraries for Code Generation. arXiv preprint arXiv:2311.09635 (2023).
[161] Walter HL Pinaya, Mark S Graham, Eric Kerfoot, Petru-Daniel Tudosiu, Jessica Dafflon, Virginia Fernandez, Pedro Sanchez, Julia Wolleb, Pedro F Da Costa, Ashay Patel, et al. 2023. Generative ai for medical imaging: extending the monai framework. arXiv preprint arXiv:2307.15208 (2023).
[162] Han Qiu, Yi Zeng, Tianwei Zhang, Yong Jiang, and Meikang Qiu. 2020. Fencebox: A platform for defeating adversarial examples with data augmentation techniques. arXiv preprint arXiv:2012.01701 (2020).
[163] Jianing Qiu, Lin Li, Jiankai Sun, Jiachuan Peng, Peilun Shi, Ruiyang Zhang, Yinzhao Dong, Kyle Lam, Frank P-W Lo, Bo Xiao, et al. 2023. Large ai models in health informatics: Applications, challenges, and the future. IEEE Journal of Biomedical and Health Informatics (2023).
[164] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
[165] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36 (2024).
[166] Steve Rathje, Dan-Mircea Mirea, Ilia Sucholutsky, Raja Marjieh, Claire E Robertson, and Jay J Van Bavel. 2024. GPT is an effective tool for multilingual psychological text analysis. Proceedings of the National Academy of Sciences 121, 34 (2024), e2308950121.
[167] Xiaoxue Ren, Xinyuan Ye, Dehai Zhao, Zhenchang Xing, and Xiaohu Yang. 2023. From Misuse to Mastery: Enhancing Code Generation with Knowledge-Driven AI Chaining. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 976–987.
[168] Xiaozhe Ren, Pingyi Zhou, Xinfan Meng, Xinjing Huang, Yadao Wang, Weichao Wang, Pengfei Li, Xiaoda Zhang, Alexander Podolskiy, Grigory Arshinov, et al. 2023. Pangu-Σ: Towards trillion parameter language model with sparse heterogeneous computing. arXiv preprint arXiv:2303.10845 (2023).
[169] Danilo Rezende and Shakir Mohamed. 2015. Variational inference with normalizing flows. In International conference on machine learning. PMLR, 1530–1538.
[170] Ryokan Ri, Shun Kiyono, and Sho Takase. 2024. Self-Translate-Train: A Simple but Strong Baseline for Cross-lingual Transfer of Large Language Models. arXiv preprint arXiv:2407.00454 (2024).
[171] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695.
[172] Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, David Atanasov, Robie Gonzales, Subhabrata Majumdar, Carsten Maple, Hassan Sajjad, and Frank Rudzicz. 2024. Representation noising effectively prevents harmful fine-tuning on LLMs. arXiv preprint arXiv:2405.14577 (2024).
[173] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton-Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2023. Code Llama: Open Foundation Models for Code. arXiv abs/2308.12950 (2023).
[174] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2024. Code Llama: Open Foundation Models for Code. https://doi.org/10.48550/arXiv.2308.12950 arXiv:2308.12950 [cs]
[175] Vinay Samuel, Houda Aynaou, Arijit Ghosh Chowdhury, Karthik Venkat Ramanan, and Aman Chadha. 2023. Can llms augment low-resource reading comprehension datasets? opportunities and challenges. arXiv preprint arXiv:2309.12426 (2023).
[176] David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. 2019. Analysing mathematical reasoning abilities of neural models. arXiv preprint arXiv:1904.01557 (2019).
[177] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2024. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems 36 (2024).
[178] Raphael Schumann, Wanrong Zhu, Weixi Feng, Tsu-Jui Fu, Stefan Riezler, and William Yang Wang. 2024. Velma: Verbalization embodiment of llm agents for vision and language navigation in street view. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 18924–18933.
[179] Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368 (2017).
[180] Yunfan Shao, Linyang Li, Yichuan Ma, Peiji Li, Demin Song, Qinyuan Cheng, Shimin Li, Xiaonan Li, Pengyu Wang, Qipeng Guo, et al. 2024. Case2Code: Learning Inductive Reasoning with Synthetic Data. arXiv preprint arXiv:2407.12504 (2024).
[181] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv abs/2402.03300 (2024).
[182] Hong Shen, Tianshi Li, Toby Jia-Jun Li, Joon Sung Park, and Diyi Yang. 2023. Shaping the emerging norms of using large language models in social computing research. In Companion Publication of the 2023 Conference on Computer Supported Cooperative Work and Social Computing. 569–571.
[183] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (Eds.). http://papers.nips.cc/paper_files/paper/2023/hash/1b44b878bb782e6954cd888628510e90-Abstract-Conference.html
[184] Connor Shorten and Taghi M Khoshgoftaar. 2019. A survey on image data augmentation for deep learning. Journal of big data 6, 1 (2019), 1–48.
[185] Connor Shorten, Taghi M Khoshgoftaar, and Borko Furht. 2021. Text data augmentation for deep learning. Journal of big Data 8, 1 (2021), 101.
[186] Alexander G Shypula, Aman Madaan, Yimeng Zeng, Uri Alon, Jacob R. Gardner, [208] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan
Yiming Yang, Milad Hashemi, Graham Neubig, Parthasarathy Ranganathan, Os- Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency
bert Bastani, and Amir Yazdanbakhsh. 2024. Learning Performance-Improving Improves Chain of Thought Reasoning in Language Models. In The Eleventh
Code Edits. In The Twelfth International Conference on Learning Representations. International Conference on Learning Representations.
[187] Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, [209] Yuxin Wang, Duanyu Feng, Yongfu Dai, Zhengyu Chen, Jimin Huang, Sophia
Peter J Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, et al. 2023. Ananiadou, Qianqian Xie, and Hao Wang. 2024. HARMONIC: Harnessing
Beyond human data: Scaling self-training for problem-solving with language LLMs for Tabular Data Synthesis and Privacy Protection. arXiv preprint
models. arXiv preprint arXiv:2312.06585 (2023). arXiv:2408.02927 (2024).
[188] Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. 2022. [210] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel
Defining and characterizing reward gaming. Advances in Neural Information Khashabi, and Hannaneh Hajishirzi. 2023. Self-Instruct: Aligning Language
Processing Systems 35 (2022), 9460–9471. Models with Self-Generated Instructions. In The 61st Annual Meeting Of The
[189] Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Association For Computational Linguistics.
Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to [211] Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert,
summarize with human feedback. Advances in Neural Information Processing Jimmy J Zhang, Makesh Narsimhan Sreedhar, and Oleksii Kuchaiev. 2024. Help-
Systems 33 (2020), 3008–3021. Steer2: Open-source dataset for training top-performing reward models. arXiv
[190] Chang Su, Yu Hou, Manqi Zhou, Suraj Rajendran, Jacqueline R. M. A. Maasch, preprint arXiv:2406.08673 (2024).
Zehra Abedi, Haotan Zhang, Zilong Bai, Anthony Cuturrufo, Winston Guo, [212] Zhilin Wang, Yi Dong, Jiaqi Zeng, Virginia Adams, Makesh Narsimhan Sreedhar,
Fayzan F. Chaudhry, Gregory Ghahramani, Jian Tang, Feixiong Cheng, Yue Li, Daniel Egert, Olivier Delalleau, Jane Polak Scowcroft, Neel Kant, Aidan Swope,
Rui Zhang, Steven T. DeKosky, Jiang Bian, and Fei Wang. 2023. Biomedical et al. 2023. Helpsteer: Multi-attribute helpfulness dataset for steerlm. arXiv
discovery through the integrative biomedical knowledge hub (iBKH). iScience preprint arXiv:2311.09528 (2023).
26, 4 (2023). [213] Jerry Wei, Le Hou, Andrew Lampinen, Xiangning Chen, Da Huang, Yi Tay,
[191] Shivchander Sudalairaj, Abhishek Bhandwaldar, Aldo Pareja, Kai Xu, David D Xinyun Chen, Yifeng Lu, Denny Zhou, Tengyu Ma, et al. 2023. Symbol tuning im-
Cox, and Akash Srivastava. 2024. Lab: Large-scale alignment for chatbots. arXiv proves in-context learning in language models. arXiv preprint arXiv:2305.08298
preprint arXiv:2403.01081 (2024). (2023).
[192] Shichao Sun, Junlong Li, Weizhe Yuan, Ruifeng Yuan, Wenjie Li, and Pengfei [214] Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, and Quoc V Le. 2023. Simple
Liu. 2024. The Critique of Critique. arXiv:2401.04518 [cs.CL] https://arxiv.org/ synthetic data reduces sycophancy in large language models. arXiv preprint
abs/2401.04518 arXiv:2308.03958 (2023).
[193] Zhiqing Sun, Yikang Shen, Hongxin Zhang, Qinhong Zhou, Zhenfang Chen, [215] Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and LINGMING ZHANG.
David Daniel Cox, Yiming Yang, and Chuang Gan. 2024. SALMON: Self- 2024. Magicoder: Empowering Code Generation with OSS-Instruct. In Forty-
Alignment with Instructable Reward Models. In The Twelfth International Con- first International Conference on Machine Learning.
ference on Learning Representations. [216] Fangzhou Wu, Ning Zhang, Somesh Jha, Patrick McDaniel, and Chaowei Xiao.
[194] Ryo Takahashi, Takashi Matsubara, and Kuniaki Uehara. 2019. Data augmenta- 2024. A new era in llm security: Exploring security concerns in real-world
tion using random image cropping and patching for deep CNNs. IEEE Transac- llm-based systems. arXiv preprint arXiv:2402.18649 (2024).
tions on Circuits and Systems for Video Technology 30, 9 (2019), 2917–2931. [217] Junda Wu, Cheng-Chun Chang, Tong Yu, Zhankui He, Jianing Wang, Yupeng
[195] Ruixiang Tang, Xiaotian Han, Xiaoqian Jiang, and Xia Hu. 2023. Does synthetic Hou, and Julian McAuley. 2024. CoRAL: Collaborative Retrieval-Augmented
data generation of llms help clinical text mining? arXiv preprint arXiv:2303.04360 Large Language Models Improve Long-tail Recommendation. In Proceedings
(2023). of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
[196] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos 3391–3401.
Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An [218] Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, and
instruction-following llama model. https://github.com/tatsu-lab/stanford_ Alham Fikri Aji. 2023. Lamini-lm: A diverse herd of distilled models from
alpaca large-scale instructions. arXiv preprint arXiv:2304.14402 (2023).
[197] Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony [219] Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. 2023. Next-gpt:
Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519 (2023).
2022. Galactica: A Large Language Model for Science. arXiv abs/2211.09085 [220] Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong
(2022).
[198] Petter Törnberg. 2023. ChatGPT-4 outperforms experts and crowd workers in annotating political twitter messages with zero-shot learning. arXiv preprint arXiv:2304.06588 (2023).
[199] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
[200] Yun-Da Tsai, Mingjie Liu, and Haoxing Ren. 2024. Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning. arXiv preprint arXiv:2407.05040 (2024).
[201] Endel Tulving. 1985. Memory and consciousness. Canadian Psychology/Psychologie canadienne 26, 1 (1985), 1.
[202] Michael Völske, Martin Potthast, Shahbaz Syed, and Benno Stein. 2017. TL;DR: Mining reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization. 59–63.
[203] Ben Wang. 2024. kingoflolz/mesh-transformer-jax. https://github.com/kingoflolz/mesh-transformer-jax
[204] Haochun Wang, Chi Liu, Nuwa Xi, Zewen Qiang, Sendong Zhao, Bing Qin, and Ting Liu. 2023. Huatuo: Tuning llama model with chinese medical knowledge. arXiv preprint arXiv:2304.06975 (2023).
[205] Lei Wang, Yi Hu, Jiabang He, Xing Xu, Ning Liu, Hui Liu, and Heng Tao Shen. 2024. T-SciQ: Teaching multimodal chain-of-thought reasoning via large language model signals for science question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 19162–19170.
[206] Rui Wang, Fei Mi, Yi Chen, Boyang Xue, Hongru Wang, Qi Zhu, Kam-Fai Wong, and Ruifeng Xu. 2024. Role Prompting Guided Domain Adaptation with General Capability Preserve for Large Language Models. arXiv preprint arXiv:2403.02756 (2024).
[207] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. 2024. VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. Advances in Neural Information Processing Systems 36 (2024).
[220] Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang. 2024. DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data. arXiv preprint arXiv:2405.14333 (2024).
[221] Kai Xiong, Xiao Ding, Yixin Cao, Ting Liu, and Bing Qin. 2023. Examining inter-consistency of large language models collaboration: An in-depth analysis via debate. arXiv preprint arXiv:2305.11595 (2023).
[222] Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. 2023. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. arXiv preprint arXiv:2304.01196 (2023).
[223] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023).
[224] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang. 2024. WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions. In International Conference on Learning Representations (ICLR).
[225] Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, and Emily Dinan. 2021. Bot-adversarial dialogue for safe conversational agents. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2950–2968.
[226] Ran Xu, Hejie Cui, Yue Yu, Xuan Kan, Wenqi Shi, Yuchen Zhuang, May Dongmei Wang, Wei Jin, Joyce Ho, and Carl Yang. 2024. Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models. In Conference on Semantics in Text Processing (STEP). 15496–15523.
[227] Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, and Bill Yuchen Lin. 2024. Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. arXiv preprint arXiv:2406.08464 (2024).
[228] Biwei Yan, Kun Li, Minghui Xu, Yueyan Dong, Yue Zhang, Zhaochun Ren, and Xiuzheng Cheng. 2024. On protecting the data privacy of large language models (LLMs): A survey. arXiv preprint arXiv:2403.05156 (2024).
[229] Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. 2023. FinGPT: Open-source financial large language models. arXiv preprint arXiv:2306.06031 (2023).
[230] Haoran Yang, Hongyuan Lu, Wai Lam, and Deng Cai. 2024. Exploring Compositional Generalization of Large Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop). 16–24.
[231] Haoran Yang, Yumeng Zhang, Jiaqi Xu, Hongyuan Lu, Pheng Ann Heng, and Wai Lam. 2024. Unveiling the generalization power of fine-tuned large language models. arXiv preprint arXiv:2403.09162 (2024).
[232] Jiaxi Yang, Binyuan Hui, Min Yang, Jian Yang, Junyang Lin, and Chang Zhou. 2024. Synthesizing text-to-SQL data from weak and strong LLMs. arXiv preprint arXiv:2408.03256 (2024).
[233] Suorong Yang, Weikang Xiao, Mengchen Zhang, Suhan Guo, Jian Zhao, and Furao Shen. 2022. Image data augmentation for deep learning: A survey. arXiv preprint arXiv:2204.08610 (2022).
[234] Wen Yang, Chong Li, Jiajun Zhang, and Chengqing Zong. 2023. BigTranslate: Augmenting large language models with multilingual translation capability over 100 languages. arXiv preprint arXiv:2305.18098 (2023).
[235] Hai Ye and Hwee Tou Ng. 2024. Self-Judge: Selective Instruction Following with Alignment Self-Evaluation. arXiv preprint arXiv:2409.00935 (2024).
[236] Asaf Yehudai, Boaz Carmeli, Yosi Mass, Ofir Arviv, Nathaniel Mills, Assaf Toledo, Eyal Shnarch, and Leshem Choshen. 2024. Genie: Achieving human parity in content-grounded datasets generation. arXiv preprint arXiv:2401.14367 (2024).
[237] Kang Min Yoo, Dongju Park, Jaewook Kang, Sang-Woo Lee, and Woomyeong Park. 2021. GPT3Mix: Leveraging large-scale language models for text augmentation. arXiv preprint arXiv:2104.08826 (2021).
[238] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2024. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models. In International Conference on Learning Representations (ICLR).
[239] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. 2021. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432 (2021).
[240] Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, et al. 2024. Advancing LLM reasoning generalists with preference trees. arXiv preprint arXiv:2404.02078 (2024).
[241] Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. 2024. Self-rewarding language models. arXiv preprint arXiv:2401.10020 (2024).
[242] Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. 2023. Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825 (2023).
[243] Shengbin Yue, Wei Chen, Siyuan Wang, Bingxuan Li, Chenchen Shen, Shujun Liu, Yuxuan Zhou, Yao Xiao, Song Yun, Xuanjing Huang, and Zhongyu Wei. 2023. DISC-LawLLM: Fine-tuning Large Language Models for Intelligent Legal Services. arXiv preprint arXiv:2309.11325 (2023).
[244] Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. 2023. MAmmoTH: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653 (2023).
[245] Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. 2024. MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning. In International Conference on Learning Representations (ICLR).
[246] Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi, and Bryan Catanzaro. 2024. RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs. arXiv preprint arXiv:2407.02485 (2024).
[247] Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D Goodman. 2024. Quiet-STaR: Language models can teach themselves to think before speaking. arXiv preprint arXiv:2403.09629 (2024).
[248] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. 2022. STaR: Bootstrapping Reasoning With Reasoning. Advances in Neural Information Processing Systems 35 (2022), 15476–15488.
[249] Guangtao Zeng, Wenmian Yang, Zeqian Ju, Yue Yang, Sicheng Wang, Ruisi Zhang, Meng Zhou, Jiaqi Zeng, Xiangyu Dong, Ruoyu Zhang, Hongchao Fang, Penghui Zhu, Shu Chen, and Pengtao Xie. 2020. MedDialog: Large-scale Medical Dialogue Datasets. In EMNLP 2020 (Online), Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, 9241–9250. https://aclanthology.org/2020.emnlp-main.743
[250] Weihao Zeng, Can Xu, Yingxiu Zhao, Jian-Guang Lou, and Weizhu Chen. 2024. Automatic Instruction Evolving for Large Language Models. arXiv preprint arXiv:2406.00770 (2024).
[251] Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, et al. 2024. AnyGPT: Unified multimodal LLM with discrete sequence modeling. arXiv preprint arXiv:2402.12226 (2024).
[252] Biao Zhang, Barry Haddow, and Alexandra Birch. 2023. Prompting large language model for machine translation: A case study. In International Conference on Machine Learning. PMLR, 41092–41110.
[253] Dan Zhang, Ziniu Hu, Sining Zhoubian, Zhengxiao Du, Kaiyu Yang, Zihan Wang, Yisong Yue, Yuxiao Dong, and Jie Tang. 2024. SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning. arXiv preprint arXiv:2401.07950 (2024).
[254] Di Zhang, Wei Liu, Qian Tan, Jingdan Chen, Hang Yan, Yuliang Yan, Jiatong Li, Weiran Huang, Xiangyu Yue, Dongzhan Zhou, Shufei Zhang, Mao Su, Hansen Zhong, Yuqiang Li, and Wanli Ouyang. 2024. ChemLLM: A Chemical Large Language Model. arXiv preprint arXiv:2402.06852 (2024).
[255] Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. 2023. RepoCoder: Repository-level code completion through iterative retrieval and generation. arXiv preprint arXiv:2303.12570 (2023).
[256] Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Guiming Chen, Jianquan Li, Xiangbo Wu, Zhiyi Zhang, Qingying Xiao, Xiang Wan, Benyou Wang, and Haizhou Li. 2023. HuatuoGPT, Towards Taming Language Model to Be a Doctor. In Conference on Empirical Methods in Natural Language Processing (EMNLP). 10859–10885.
[257] Jie Zhang, Haoyu Bu, Hui Wen, Yu Chen, Lun Li, and Hongsong Zhu. 2024. When LLMs meet cybersecurity: A systematic literature review. arXiv preprint arXiv:2405.03644 (2024).
[258] Kaiyan Zhang, Sihang Zeng, Ermo Hua, Ning Ding, Zhang-Ren Chen, Zhiyuan Ma, Haoxin Li, Ganqu Cui, Biqing Qi, Xuekai Zhu, et al. 2024. Ultramedical: Building specialized generalists in biomedicine. arXiv preprint arXiv:2406.03949 (2024).
[259] Kechi Zhang, Huangzhao Zhang, Ge Li, Jia Li, Zhuo Li, and Zhi Jin. 2023. ToolCoder: Teach code generation models to use API search tools. arXiv preprint arXiv:2305.04032 (2023).
[260] Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B Tenenbaum, and Chuang Gan. 2023. Planning with large language models for code generation. arXiv preprint arXiv:2303.05510 (2023).
[261] Sheng Zhang, Xin Zhang, Hui Wang, Lixiang Guo, and Shanshan Liu. 2018. Multi-Scale Attentive Interaction Networks for Chinese Medical Question Answer Selection. IEEE Access 6 (2018), 74061–74071. https://ieeexplore.ieee.org/document/8548603/
[262] Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B Hashimoto. 2024. Benchmarking large language models for news summarization. Transactions of the Association for Computational Linguistics 12 (2024), 39–57.
[263] Wenqi Zhang, Zhenglin Cheng, Yuanyu He, Mengna Wang, Yongliang Shen, Zeqi Tan, Guiyang Hou, Mingqian He, Yanna Ma, Weiming Lu, et al. 2024. Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model. arXiv preprint arXiv:2407.07053 (2024).
[264] Wenxuan Zhang, Yue Deng, Bing Liu, Sinno Pan, and Lidong Bing. 2024. Sentiment Analysis in the Era of Large Language Models: A Reality Check. In Findings of the Association for Computational Linguistics: NAACL 2024. 3881–3906.
[265] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. 2024. ExpeL: LLM agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 19632–19642.
[266] Henry Hengyuan Zhao, Pan Zhou, and Mike Zheng Shou. 2023. Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator. arXiv preprint arXiv:2312.06731 (2023).
[267] Qihao Zhao, Yalun Dai, Hao Li, Wei Hu, Fan Zhang, and Jun Liu. 2024. LTGC: Long-tail Recognition via Leveraging LLMs-driven Generated Content. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19510–19520.
[268] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023).
[269] Z. Zheng, K. Ning, J. Chen, Y. Wang, W. Chen, L. Guo, and W. Wang. 2023. Towards an understanding of large language models in software engineering tasks. arXiv preprint arXiv:2308.11396 (2023).
[270] zhihaiLLM. 2024. zhihaiLLM/wisdomInterrogatory. https://github.com/zhihaiLLM/wisdomInterrogatory
[271] Yue Zhou, Chenlu Guo, Xu Wang, Yi Chang, and Yuan Wu. 2024. A survey on data augmentation in large model era. arXiv preprint arXiv:2401.15422 (2024).
[272] Zhi Zhou, Jiang-Xin Shi, Peng-Xiao Song, Xiaowen Yang, Yi-Xuan Jin, Lan-Zhe Guo, and Yu-Feng Li. 2024. LawGPT: A Chinese Legal Knowledge-Enhanced Large Language Model. arXiv preprint arXiv:2406.04614 (2024).
[273] Banghua Zhu, Evan Frick, Tianhao Wu, Hanlin Zhu, and Jiantao Jiao. 2023. Starling-7B: Improving LLM helpfulness & harmlessness with RLAIF.
[274] Yuqi Zhu, Xiaohan Wang, Jing Chen, Shuofei Qiao, Yixin Ou, Yunzhi Yao, Shumin Deng, Huajun Chen, and Ningyu Zhang. 2024. LLMs for knowledge graph construction and reasoning: Recent capabilities and future opportunities. World Wide Web 27, 5 (2024), 58.
[275] Yiming Zhu, Peixian Zhang, Ehsan-Ul Haq, Pan Hui, and Gareth Tyson. 2023. Can ChatGPT reproduce human-generated labels? A study of social computing tasks. arXiv preprint arXiv:2304.10145 (2023).
[276] Zeyuan Allen Zhu and Yuanzhi Li. 2023. Physics of language models: Part 3.1, knowledge storage and extraction. arXiv preprint arXiv:2309.14316 (2023).