Knowledge Distillation of LLMs
Abstract—In the era of Large Language Models (LLMs), Knowledge Distillation (KD) emerges as a pivotal methodology for transferring advanced capabilities from leading proprietary LLMs, such as GPT-4, to their open-source counterparts like LLaMA and Mistral. Additionally, as open-source LLMs flourish, KD plays a crucial role both in compressing these models and in facilitating their self-improvement by employing themselves as teachers. This paper presents a comprehensive survey of KD's role within the realm of LLMs, highlighting its critical function in imparting advanced knowledge to smaller models and its utility in model compression and self-improvement. Our survey is meticulously structured around three foundational pillars: algorithm, skill, and verticalization, providing a comprehensive examination of KD mechanisms, the enhancement of specific cognitive abilities, and their practical implications across diverse fields. Crucially, the survey navigates the interaction between data augmentation (DA) and KD, illustrating how DA emerges as a powerful paradigm within the KD framework to bolster LLMs' performance. By leveraging DA to generate context-rich, skill-specific training data, KD transcends traditional boundaries, enabling open-source models to approximate the contextual adeptness, ethical alignment, and deep semantic insights characteristic of their proprietary counterparts. This work aims to provide an insightful guide for researchers and practitioners, offering a detailed overview of current methodologies in knowledge distillation and proposing future research directions. By bridging the gap between proprietary and open-source LLMs, this survey underscores the potential for more accessible, efficient, and powerful AI solutions. Most importantly, we firmly advocate compliance with the legal terms that regulate the use of LLMs, ensuring the ethical and lawful application of KD of LLMs. An associated GitHub repository is available at https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs.
Index Terms—Large language models, knowledge distillation, data augmentation, skill distillation, supervised fine-tuning
Fig. 2: An overview of this survey on knowledge distillation of large language models. Note that 'Section' is abbreviated as 'Sec.' in this figure. RM_S(·) denotes the student reward model. ①, ②, ③, ④ denote the steps in KD of LLMs.
of applications and users. A survey in this field is vital for synthesizing the current methodologies, challenges, and breakthroughs in knowledge distillation. It may serve as a beacon for researchers and practitioners alike, guiding them to distill complex AI capabilities into more manageable and accessible forms. Moreover, such a survey can illuminate the path forward, identifying gaps in current techniques and proposing directions for future research.

Survey Organization. The remainder of this survey is organized into several comprehensive sections, each designed to offer a deep dive into the multifaceted aspects of knowledge distillation within the realm of LLMs. Following this introduction, §2 provides a foundational overview of knowledge distillation, comparing traditional techniques with those emerging in the era of LLMs and highlighting the role of data augmentation (DA) in this context. §3 delves into the approaches to elicit knowledge from teacher LLMs and core distillation algorithms, examining methods from supervised fine-tuning to more complex strategies involving divergence and similarity, reinforcement learning, and ranking optimization. Then, §4 focuses on skill distillation, exploring how student models can be enhanced to improve context understanding, alignment with user intentions, and performance across a variety of NLP tasks. This includes discussions on natural language understanding (NLU), generation (NLG), information retrieval, recommendation systems, and the evaluation of text generation. In §5, we venture into domain-specific vertical distillation, showcasing how knowledge distillation techniques are applied within specialized fields such as law, healthcare, finance, and science, illustrating the practical implications and transformative impact of these approaches. The survey suggests open problems in §6, identifying current challenges and gaps in knowledge distillation research that offer opportunities for future work. Finally, the conclusion and discussion in §7 synthesize the insights gained, reflecting on the implications for the broader AI and NLP research community and proposing directions for future research. Figure 2 shows an overview of this survey.

2 OVERVIEW

2.1 Comparing Traditional Recipe

The concept of knowledge distillation in the field of AI and deep learning (DL) refers to the process of transferring knowledge from a large, complex model (teacher) to a smaller, more efficient model (student) (Gou et al., 2021). This technique is pivotal in mitigating the challenges posed by the computational demands and resource constraints of deploying large-scale models in practical applications.

Historically, knowledge distillation techniques, prior to the era of LLMs, primarily concentrated on transferring knowledge from complex, often cumbersome neural networks to more compact and efficient architectures (Sanh et al., 2019; Kim and Rush, 2016). This process was largely driven by the need to deploy machine learning models in resource-constrained environments, such as mobile devices or edge computing platforms, where computational power and memory are limited. The focus was predominantly on ad-hoc neural architecture selection and training objectives tailored for single tasks.
Fig. 3: Taxonomy of Knowledge Distillation of Large Language Models, organized into KD Algorithms (Knowledge: Labeling, Expansion, Data Curation, Feature, Feedback, Self-Knowledge; Distillation: Supervised Fine-Tuning, Divergence and Similarity, Reinforcement Learning, Rank Optimization), Skill Distillation (Context Following, Alignment, Agent, NLP Task Specialization, Multi-Modality), and Verticalization Distillation (Law, Medical & Healthcare, Finance, Science, and Misc.), with representative works listed for each leaf. The detailed taxonomy of Verticalization Distillation is shown in Figure 7.
These earlier methods involved training a smaller student network to mimic the output of a larger teacher network, often through techniques like soft-target training, where the student learns from the softened softmax output of the teacher. Please refer to the survey by Gou et al. (2021) for more details on general knowledge distillation techniques in AI and DL.

In contrast, the advent of LLMs has revolutionized the knowledge distillation landscape. The current era of knowledge distillation in LLMs shifts the focus from mere architecture compression to knowledge elicitation and transfer (Taori et al., 2023; Chaudhary, 2023; Tunstall et al., 2023). This paradigm change is largely due to the expansive and deep-seated knowledge that LLMs like GPT-4 and Gemini possess. Moreover, the inaccessible parameters of such LLMs make it hard to compress them using pruning (Han et al., 2016) or quantization (Liu et al., 2023a) techniques. Unlike the earlier era, where the goal was to replicate the output behavior of the teacher model or reduce the model size, the current focus in LLM-based knowledge distillation is to elicit the specific knowledge these models have.

The key to this modern approach lies in heuristic and carefully designed prompts, which are used to elicit specific knowledge (Ding et al., 2023b) or capabilities (Chaudhary, 2023) from the LLMs. These prompts are crafted to tap into the LLM's understanding and capabilities in various domains, ranging from natural language understanding (He et al., 2023a) to more complex cognitive tasks like reasoning (Hsieh et al., 2023) and problem-solving (Qiao et al., 2024). The use of prompts as a means of knowledge elicitation offers a more flexible and dynamic approach to distillation, allowing for a more targeted extraction of knowledge focused on specific skills or domains of interest. This method is particularly effective in harnessing the emergent abilities of LLMs, where the models exhibit capabilities beyond their explicit training objectives.

Furthermore, this era of knowledge distillation also emphasizes the transfer of more abstract qualities such as reasoning patterns (Mitra et al., 2023), preference alignment (Cui et al., 2023a), and value alignment (Sun et al., 2024b). This is in stark contrast to the earlier focus on output replication (Taori et al., 2023), indicating a shift towards a more holistic and comprehensive transfer of cognitive capabilities. The current techniques involve not just the replication of outputs but also the emulation of the thought processes (Mitra et al., 2023) and decision-making (Asai et al., 2023) patterns of the teacher model. This involves complex strategies like chain-of-thought prompting, where the student model is trained to learn the reasoning process of the teacher, thereby enhancing its problem-solving and decision-making capabilities.

2.2 Relation to Data Augmentation (DA)

In the era of LLMs, Data Augmentation (DA) (Wang et al., 2022a; Ye et al., 2022) emerges as a critical paradigm integral to the process of knowledge distillation. Unlike traditional DA techniques such as paraphrasing (Gangal et al., 2022) or back-translation (Longpre et al., 2019), which primarily aim at expanding the training dataset in a somewhat mechanical manner, DA within the context of LLMs focuses on the generation of novel, context-rich training data tailored to specific domains and skills.

The relationship between DA and KD in LLMs is both symbiotic and foundational. By leveraging a set of seed knowledge, KD employs DA to prompt LLMs to produce explicit data that encapsulates specific skills or domain expertise (Chaudhary, 2023; West et al., 2022). This method stands out as a potent mechanism for bridging the knowledge and capability gap between proprietary and open-source models. Through DA, LLMs are prompted to create targeted, high-quality datasets that are not merely larger in volume but are also rich in diversity and specificity. This approach enables the distillation process to be more effective, ensuring that the distilled models not only replicate the teacher model's output behavior but also embody its deep-seated understanding and cognitive strategies.

DA acts as a force multiplier, enabling the distilled models to acquire and refine capabilities that would otherwise require exponentially larger datasets and computational resources. It facilitates a more effective transfer of knowledge, focusing on the qualitative aspects of learning rather than quantitative expansion. This strategic use of DA within KD processes underscores a pivotal shift towards a more efficient, sustainable, and accessible approach to harnessing the power of LLMs. It empowers open-source models to approximate the contextual adeptness, ethical alignment, and deep semantic insights characteristic of their proprietary counterparts, thereby democratizing access to advanced AI capabilities and fostering innovation across a broader spectrum of applications and users.

2.3 Survey Scope

Building on the discussions introduced earlier, this survey aims to comprehensively explore the landscape of knowledge distillation within the context of LLMs, following the meticulously structured taxonomy in Figure 3. The survey's scope is delineated through three primary facets: KD Algorithms, Skill Distillation, and Verticalization Distillation. Each facet encapsulates a range of subtopics and methodologies. It is important to note that KD algorithms provide the technical foundations for skill distillation and verticalization distillation.

KD Algorithms. This segment focuses on the technical foundations and methodologies of knowledge distillation. It includes an in-depth exploration of the processes involved in constructing knowledge from teacher models (e.g., proprietary LLMs) and integrating this knowledge into student models (e.g., open-source LLMs). Under the umbrella of 'knowledge', we delve into strategies such as labeling (Hsieh et al., 2023), expansion (Taori et al., 2023), curation (Gunasekar et al., 2023), feature understanding (Agarwal et al., 2024), feedback mechanisms (Tunstall et al., 2023), and self-knowledge generation (Wang et al., 2022a). This exploration seeks to uncover the various ways in which knowledge can be identified, expanded, and curated for effective distillation. The 'distillation' subsection examines learning approaches like supervised fine-tuning (SFT) (Wang et al., 2022a), divergence minimization (Agarwal et al., 2024), reinforcement learning techniques (Cui et al., 2023a), and rank optimization strategies (Tunstall et al., 2023). Together, these techniques demonstrate how KD enables open-source models to obtain knowledge from proprietary ones.
Skill Distillation. This facet examines the specific competencies and capabilities enhanced through KD. It encompasses detailed discussions on context following (Taori et al., 2023; Luo et al., 2023c), with subtopics like instruction following and retrieval-augmented generation (RAG) capability. In the realm of alignment (Mitra et al., 2023; Tunstall et al., 2023), the survey investigates thinking patterns, persona/preference modeling, and value alignment. The 'agent' category delves into skills such as Tool Using and Planning. NLP task specialization (Dai et al., 2023a; Jung et al., 2023; Chaudhary, 2023) is scrutinized through lenses like natural language understanding (NLU), natural language generation (NLG), information retrieval, recommendation systems, text generation evaluation, and code generation. Finally, the survey addresses multi-modality (Liu et al., 2023e; Zhao et al., 2023b), exploring how KD enhances LLMs' ability to integrate multiple forms of input.

Verticalization Distillation. This section assesses the application of KD across diverse vertical domains, offering insights into how distilled LLMs can be tailored for specialized fields such as Law (LAW, 2023), Medical & Healthcare (Wang et al., 2023a), Finance (Zhang and Yang, 2023), and Science (Zhang et al., 2024), among others. This exploration not only showcases the practical implications of KD techniques but also highlights their transformative impact on domain-specific AI solutions.

Through these facets, this survey provides a comprehensive analysis of KD in LLMs, guiding researchers and practitioners through methodologies, challenges, and opportunities in this rapidly evolving domain.

Declaration. This survey represents our earnest effort to provide a comprehensive and insightful overview of knowledge distillation techniques applied to LLMs, focusing on algorithms, skill enhancement, and domain-specific applications. Given the vast and rapidly evolving nature of this field, especially with the prevalent practice of eliciting knowledge from training data across academia, we acknowledge that this manuscript may not encompass every pertinent study or development. Nonetheless, it endeavors to introduce the foundational paradigms of knowledge distillation, highlighting key methodologies and their impacts across a range of applications.

2.4 Distillation Pipeline in the LLM Era

Fig. 4: An illustration of a general pipeline to distill knowledge from a large language model to a student model. A target skill or domain steers the teacher LLM; seed knowledge drives the generation of distilled knowledge (Knowledge Elicitation); and a learning objective is used to train the student model on the generated knowledge (Distillation Algorithm).

The general distillation pipeline of LLMs is a structured and methodical process aimed at transferring knowledge from a sophisticated teacher model to a less complex student model. This pipeline is integral for leveraging the advanced capabilities of models like GPT-4 or Gemini in more accessible and efficient open-source counterparts. The outline of this pipeline can be broadly categorized into four distinct stages, each playing a crucial role in the successful distillation of knowledge. An illustration is shown in Figure 4. The detailed pipeline can also be seen in Figure 2.

I. Target Skill or Domain Steering Teacher LLM. The first stage involves directing the teacher LLM towards a specific target skill or domain. This is achieved through carefully crafted instructions or templates that guide the LLM's focus. These instructions are designed to elicit responses that demonstrate the LLM's proficiency in a particular area, be it a specialized domain like healthcare or law, or a skill such as reasoning or language understanding.

II. Seed Knowledge as Input. Once the target area is defined, the next step is to feed the teacher LLM with seed knowledge. This seed knowledge typically comprises a small dataset or specific data clues used to elicit skill- or domain-specific knowledge from the teacher LLM. It acts as a catalyst, prompting the teacher LLM to generate more elaborate and detailed outputs based on this initial information. The seed knowledge is crucial, as it provides a foundation upon which the teacher model can build and expand, thereby creating more comprehensive and in-depth knowledge examples.

III. Generation of Distillation Knowledge. In response to the seed knowledge and steering instructions, the teacher LLM generates knowledge examples. These examples are predominantly in the form of question-and-answer (QA) dialogues or narrative explanations, aligning with the natural language processing/understanding capabilities of the LLM. In certain specialized cases, the outputs may also include logits or hidden features, although this is less common due to the complexity and specific requirements of such data forms. The generated knowledge examples constitute the core of the distillation knowledge, encapsulating the advanced understanding and skills of the teacher LLM.

IV. Training the Student Model with a Specific Learning Objective. The final stage involves the utilization of the generated knowledge examples to train the student model. This training is guided by a loss function that aligns with the learning objectives. The loss function quantifies the student model's performance in replicating or adapting the knowledge from the teacher model. By minimizing this loss, the student model learns to emulate the target skills or domain knowledge of the teacher, thereby acquiring similar capabilities. The process involves iteratively adjusting the student model's parameters to reduce the discrepancy between its outputs and those of the teacher model, ensuring the effective transfer of knowledge.

In essence, the above four stages can be abstracted into two formulations. The first formulation represents the process of eliciting knowledge:

D_I^(kd) = {Parse(o, s) | o ∼ p_T(o | I ⊕ s), ∀s ∼ S},  (1)
where ⊕ denotes fusing two pieces of text, I denotes an instruction or a template for a task, skill, or domain that steers the LLM and elicits knowledge, s ∼ S denotes an example of the seed knowledge upon which the LLM can build to generate novel knowledge, Parse(o, s) stands for parsing the distillation example (e.g., (x, y)) from the teacher LLM's output o (plus the input s in some cases), and p_T represents the teacher LLM with parameters θ_T. Given the datasets D_I^(kd) built for distillation, we then define a learning objective as

L = Σ_I L_I(D_I^(kd); θ_S),  (2)

where Σ_I denotes that there could be multiple tasks or skills being distilled into one student model, L_I(·; ·) stands for a specific learning objective, and θ_S parameterizes the student model.
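For illustration, the following minimal Python sketch shows how Eq. (1) and Eq. (2) decompose into an elicitation stage and a training stage. The helpers teacher_generate, parse, and fine_tune are hypothetical placeholders for a teacher LLM API call, the Parse(o, s) operation, and a student training routine; they are not drawn from any specific framework.

    def teacher_generate(prompt):
        """Placeholder for sampling o ~ p_T(o | prompt) from the teacher LLM."""
        return f"[teacher output for: {prompt[:40]}]"

    def parse(output, seed):
        """Placeholder for Parse(o, s): extract an (x, y) distillation example."""
        return (seed, output)

    def fine_tune(student, dataset):
        """Placeholder for one training pass minimizing L_I(D_I^(kd); theta_S)."""
        pass

    def elicit_knowledge(instruction, seeds):
        # Eq. (1): D_I^(kd) = {Parse(o, s) | o ~ p_T(o | I (+) s), for all s ~ S}
        return [parse(teacher_generate(instruction + "\n" + s), s) for s in seeds]

    def distill(student, skills):
        # Eq. (2): accumulate the per-skill objectives L_I over every skill I.
        for instruction, seeds in skills.items():
            fine_tune(student, elicit_knowledge(instruction, seeds))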
Following our exploration of the distillation pipeline and the foundational concepts underlying knowledge distillation in the LLM era, we now turn our focus to the specific algorithms that have gained prominence in this era.

3 KNOWLEDGE DISTILLATION ALGORITHMS

This section navigates through the process of knowledge distillation. Following Section 2.4, it is categorized into two principal steps: 'Knowledge,' focusing on eliciting knowledge from teacher LLMs (Eq. 1), and 'Distillation,' centered on injecting this knowledge into student models (Eq. 2). We elaborate on these two processes in the subsequent sections.

3.1 Knowledge

This section focuses on the approaches used to elicit knowledge from teacher LLMs. According to the manner in which knowledge is acquired, we divide them into Labeling, Expansion, Data Curation, Feature, Feedback, and Self-Knowledge. Figure 5 shows an illustration of these knowledge elicitation methods.

3.1.1 Labeling

Labeling knowledge refers to using a teacher LLM to label the output y for a given input x as the seed knowledge, according to an instruction I or demonstrations c, where c = (x1, y1), . . . , (xn, yn). This method of eliciting knowledge from teacher LLMs is straightforward yet effective and has been widely applied across various tasks and applications. It requires only the collection of an input dataset, which is fed into LLMs to obtain the desired generations. Moreover, the generation of y is controllable through the predefined I and c. This process can be formulated as follows:

D^(lab) = {(x, y) | x ∼ X, y ∼ p_T(y | I ⊕ c ⊕ x)}.  (3)

The input x can be sourced from existing NLP task datasets, which serve as typical reservoirs for distillation efforts. Numerous works have sought to harness the capabilities of powerful LLMs as teachers for annotating dataset samples across a range of tasks. For instance, efforts in natural language understanding involve using LLMs to categorize text (Gilardi et al., 2023; Ding et al., 2023a; He et al., 2023a), while in natural language generation, LLMs assist in generating output sequences (Hsieh et al., 2023; Jung et al., 2023; Wang et al., 2021b). Text generation evaluation tasks leverage LLMs to label evaluation results (Li et al., 2024b; Wang et al., 2023b), and reasoning tasks utilize LLMs to label chain-of-thought (CoT) explanations (Hsieh et al., 2023; Li et al., 2022; Ho et al., 2023; Magister et al., 2023; Fu et al., 2023; Ramnath et al., 2023; Li et al., 2023d; Liu et al., 2023g), among others. Rather than concentrating on specific tasks, many current works focus on labeling outputs based on instructions, thereby teaching student models to solve tasks in a more flexible way by following instructions. Collections of various NLP tasks, complemented by instructional templates, serve as valuable input sources for x. For instance, the FLAN-v2 collection (Longpre et al., 2023) offers extensive publicly available sets of tasks with instructions, which are labeled with responses generated by teacher LLMs in Orca (Mukherjee et al., 2023; Mitra et al., 2023). The instructions from these NLP tasks are built from predefined templates, which lack diversity and may diverge from natural human queries. Real conversations between humans and chat models, such as those shared on ShareGPT, provide large-scale data with real queries and generations labeled by powerful LLMs. Additionally, Xu et al. (2023b) and Anand et al. (2023) label real questions sampled from forums like Quora and Stack Overflow.

Moreover, the process of labeling can be guided by instructions I or demonstrations c. A commonly used instruction type for guiding labeling is the chain-of-thought (CoT) prompt (Hsieh et al., 2023; Fu et al., 2023; Magister et al., 2023). Mukherjee et al. (2023) add multiple system messages (e.g., "You must generate a detailed and long answer." or "explain like I'm five, think step-by-step") to elicit rich signals. Yue et al. (2023a) and Chenglin et al. (2023) label a hybrid of chain-of-thought (CoT) and program-of-thought (PoT) rationales. Xu et al. (2023b) propose a self-chat technique in which two teacher LLMs simulate a real conversation to generate a multi-turn dialogue for a question sourced from Quora or Stack Overflow.
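As a concrete illustration of Eq. (3), the sketch below labels a pool of existing inputs under a CoT-style instruction with demonstrations, in the spirit of the works cited above. The teacher_generate callable is an assumed wrapper around an arbitrary teacher LLM API, and the instruction and demonstration are illustrative rather than taken from any specific paper.

    INSTRUCTION = "Answer the question. Think step by step."         # I
    DEMOS = [("What is 2 + 3?", "2 + 3 = 5. So the answer is 5.")]   # c

    def build_prompt(x):
        demo_text = "\n".join(f"Q: {q}\nA: {a}" for q, a in DEMOS)
        return f"{INSTRUCTION}\n{demo_text}\nQ: {x}\nA:"             # I (+) c (+) x

    def label_dataset(inputs, teacher_generate):
        # Eq. (3): D^(lab) = {(x, y) | x ~ X, y ~ p_T(y | I (+) c (+) x)}
        return [(x, teacher_generate(build_prompt(x))) for x in inputs]

The resulting (x, y) pairs, with y carrying the teacher's rationale, can be used directly for supervised fine-tuning of the student.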
3.1.2 Expansion

While the labeling approach is simple and effective, it faces certain limitations. Primarily, it is constrained by the scale and variety of the input data. In real-world applications, especially those involving user conversations, there are also concerns regarding the privacy of the data involved. To address these limitations, various expansion methods have been proposed (Wang et al., 2022a; Taori et al., 2023; Chaudhary, 2023; Si et al., 2023; Ji et al., 2023a; Luo et al., 2023b,a; Wu et al., 2023c; Sun et al., 2024b; Xu et al., 2023a; Guo et al., 2023c; Rozière et al., 2023; West et al., 2022). These methods take the demonstrations as seed knowledge and aim to expand them into large-scale and varied data by in-context learning.

A key characteristic of these expansion methods is the utilization of the in-context learning ability of LLMs to generate data similar to the provided demonstrations c. Unlike the labeling approach, where the input x is sampled from an existing dataset, in the expansion approach both x and y are generated by teacher LLMs. This process can be formulated as follows:

D^(exp) = {(x, y) | x ∼ p_T(x | I ⊕ c), y ∼ p_T(y | I ⊕ x)}.  (4)
Fig. 5: An illustration of different knowledge elicitation methods from teacher LLMs. Labeling: the teacher generates the output from the input; Expansion: the teacher generates samples similar to the given demonstrations through in-context learning; Data Curation: the teacher synthesizes data according to meta-information, such as a topic or an entity; Feature: the data is fed into the teacher to extract its internal knowledge, such as logits and features; Feedback: the teacher provides feedback on the student's generations, such as preferences, corrections, or expansions of challenging samples; Self-Knowledge: the student first generates outputs, which are then filtered for high quality or evaluated by the student itself.
In this formulation, x and y represent the new input-output pairs generated by the teacher LLM. The input x is generated based on a set of input-output demonstrations c, and the output y is then generated in response to the new input x under the guidance of an instruction I. Note that the demonstrations can be predefined or dynamically updated by adding the newly generated samples.
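A minimal sketch of this expansion loop is given below, assuming teacher_generate wraps a teacher LLM API; a crude Jaccard overlap stands in for the ROUGE-L filter used by Self-Instruct, and the prompt wording is illustrative.

    import random

    def similarity(a, b):
        # Crude lexical overlap; Self-Instruct filters near-duplicates with ROUGE-L.
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / max(1, len(ta | tb))

    def expand(seed_pool, teacher_generate, rounds=100, threshold=0.7):
        pool = list(seed_pool)                    # [(instruction, response), ...]
        for _ in range(rounds):
            demos = random.sample(pool, k=min(4, len(pool)))        # c
            demo_text = "\n".join(f"Instruction: {x}" for x, _ in demos)
            x_new = teacher_generate(             # x ~ p_T(x | I (+) c)
                f"Here are some tasks:\n{demo_text}\n"
                "Write one new, different task instruction:")
            if any(similarity(x_new, x) > threshold for x, _ in pool):
                continue                          # drop near-duplicate instructions
            y_new = teacher_generate(x_new)       # y ~ p_T(y | I (+) x)
            pool.append((x_new, y_new))           # dynamically grow the pool
        return pool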
Expansion techniques have been widely utilized to extract extensive instruction-following knowledge from teacher LLMs. Wang et al. (2022a) first introduce an iterative bootstrapping method, Self-Instruct, which utilizes LLMs to generate a wide array of instructions based on several demonstrations sampled from 175 manually written instructions. The newly generated instructions are then added back to the initial pool, benefiting subsequent expansion iterations. Subsequently, Taori et al. (2023) apply this expansion method to a more powerful teacher LLM, text-davinci-003, to distill 52K high-quality samples. To improve diversity and coverage during expansion, Wu et al. (2023c) and Sun et al. (2024b) prompt the teacher LLM to generate instructions corresponding to specific topics. Xu et al. (2023a) propose the Evol-Instruct method to expand instructions along two dimensions: difficulty (e.g., rewriting a question to be more complex) and diversity (e.g., generating more long-tailed instructions). Evol-Instruct is domain-agnostic and has been used to expand the distillation of coding (Luo et al., 2023a) and math (Luo et al., 2023b). Additionally, expansion methods can significantly augment NLP task datasets with similar samples, thereby enhancing task performance. For instance, AugGPT (Dai et al., 2023a) leverages a teacher LLM to rephrase each sentence in the training samples into multiple conceptually similar but semantically varied samples to improve classification performance. Similarly, He et al. (2023b) propose the Targeted Data Generation (TDG) framework, which automatically identifies challenging subgroups within data and generates new samples for these subgroups using LLMs through in-context learning.

In summary, the expansion method leverages the in-context learning strengths of LLMs to produce more varied and extensive datasets with both inputs and outputs. However, the quality and diversity of the generated data are heavily reliant on the teacher LLMs and the initial seed demonstrations. This dependence can lead to datasets that inherit the biases of the LLMs (Yu et al., 2023a; Wei et al., 2023) and to a homogeneity issue in which the generations grow increasingly similar, ultimately limiting the diversity this method seeks to achieve (Ding et al., 2023b). Moreover, the expansion process may inadvertently amplify any biases present in the seed data.

3.1.3 Data Curation

The pursuit of high-quality and scalable data generation in knowledge distillation from LLMs has led to the emergence of the Data Curation approach. This method arises in response to the limitations observed in both the Labeling and Expansion approaches, which often yield data of variable quality and face constraints in quantity. In Labeling, the seed knowledge is sourced from task datasets, leading to potential noise and dirty data; meanwhile, in Expansion, the input x is derived from seed demonstrations, which can result in homogeneous data when generated in large quantities. To overcome these challenges, the Data Curation method curates high-quality or large-scale data by using extensive meta-information as seed knowledge (Ding et al., 2023b; Gunasekar et al., 2023; Li et al., 2023a; Mar, 2023; Liu et al., 2023d; Wei et al., 2023; Yu et al., 2024; Ye et al., 2022; Gao et al., 2023a; Yang and Nicolai, 2023).
A distinct feature of Data Curation is its approach of synthesizing data from scratch. Numerous and diverse kinds of meta-information, such as topics or knowledge points, can be incorporated into this process to generate controllable x and y. Thus, the process can be meticulously controlled to yield datasets that are not only large in scale but also of high quality. The formulation for Data Curation can be represented as:

D^(cur) = {(x, y) | x ∼ p_T(x | I ⊕ m), y ∼ p_T(y | I ⊕ x)}.  (5)

In this formulation, m represents the diverse meta-information used to guide the synthesis of x, and I is the instruction guiding teacher LLMs to generate x or y.
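The sketch below gives a minimal instance of this process, assuming a teacher_generate wrapper and a toy list of meta-topics in the style of UltraChat's meta-information; note that, unlike Labeling and Expansion, no existing inputs or demonstrations are required.

    TOPICS = ["Technology", "Food and Drink", "Climate"]    # meta-information m

    def curate(teacher_generate, per_topic=5):
        data = []
        for m in TOPICS:
            for _ in range(per_topic):
                x = teacher_generate(             # x ~ p_T(x | I (+) m)
                    f"Write a challenging, specific user question about: {m}")
                y = teacher_generate(x)           # y ~ p_T(y | I (+) x)
                data.append((x, y))
        return data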
Different studies primarily vary in their source and method of leveraging meta-information. UltraChat (Ding et al., 2023b) effectively demonstrates the process of curating data that is both high-quality and diverse via distilled knowledge. They collect extensive meta-information across three domains: Questions about the World, Creation and Generation, and Assistance on Existing Materials. For example, under Questions about the World, they explore 30 meta-topics like "Technology" and "Food and Drink." The teacher LLMs then use this meta-information to distill a broad array of instructions and conversations, achieving a substantial scale of 1.5 million instances. UltraChat stands out for its lexical and topical diversity, and the UltraLLaMA model, fine-tuned on this data, consistently surpasses other open-source models. Another notable series, Phi (Gunasekar et al., 2023; Li et al., 2023a; Mar, 2023), focuses on distilling smaller, high-quality datasets akin to "textbooks." Phi-1 (Gunasekar et al., 2023) experiments with synthesizing "textbook quality" data in the coding domain. Their approach involves distilling clear, self-contained, instructive, and balanced content from LLMs, guided by random topics or function names to enhance diversity. The distilled data comprises 1 billion tokens of Python textbooks, complete with natural language explanations and code snippets, as well as 180 million tokens of Python exercises with solutions. Remarkably, the phi-1 model, despite its smaller size, outperforms nearly all open-source models on coding benchmarks like HumanEval and MBPP while being 10 times smaller in model size and trained on 100 times less data. MFTCoder (Liu et al., 2023d) utilizes hundreds of Python knowledge points as meta-information to create a CodeExercise dataset. In contrast, Magicoder (Wei et al., 2023) and WaveCoder (Yu et al., 2024) draw raw code collections from open-source code datasets, using them as meta-information for generating instructional data. In the context of NLU tasks, certain studies (Ye et al., 2022; Gao et al., 2023a; Wang et al., 2021a) explore the use of labels as meta-information to synthesize corresponding samples for data augmentation. Similarly, in information retrieval tasks, there are efforts to utilize documents as meta-information for generating potential queries, thereby constructing large-scale retrieval pairs (Bonifacio et al., 2022; Meng et al., 2023).

In conclusion, Data Curation through teacher LLMs has emerged as a promising technique for synthesizing datasets that are not only high-quality and diverse but also large in scale. The success of models like phi-1 in specialized domains underscores the efficacy of this method, and the ability to create synthetic datasets will become a crucial technical skill and a key area of focus in AI (Li et al., 2023a).

3.1.4 Feature

The previously discussed knowledge elicitation methods are typically applied to powerful black-box models, which are expensive and somewhat unreproducible due to API access. In contrast, white-box distillation offers a more transparent and accessible approach for researchers. It involves leveraging the output distributions, intermediate features, or activations of teacher LLMs, which we collectively refer to as feature knowledge. White-box KD approaches have predominantly been studied for smaller encoder-based LMs, typically those with fewer than 1 billion parameters (cf. Gou et al. (2021) for details). However, recent research has begun to explore white-box distillation in the context of generative LLMs (Timiryasov and Tastet, 2023; Liang et al., 2023a; Gu et al., 2024; Agarwal et al., 2024; Liu et al., 2023a; Wen et al., 2023; Wan et al., 2024a; Zhao and Zhu, 2023; Qin et al., 2023b; Boizard et al., 2024; Zhong et al., 2024).

The typical method for acquiring this feature knowledge involves teacher LLMs annotating the output sequence y with their internal representations. These annotations are then distilled into the student model using methods such as the Kullback-Leibler divergence (KLD). The process of eliciting feature knowledge can be formulated as follows:

D^(feat) = {(x, y, ϕ_feat(x, y; θ_T)) | x ∼ X, y ∼ Y}.  (6)

In this formulation, Y is the output set, which can be generated by teacher LLMs, the student model, or directly sourced from the dataset, and ϕ_feat(·; θ_T) represents the operation of extracting feature knowledge (such as the output distribution) from the teacher LLM.
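For instance, when ϕ_feat is the teacher's token-level output distribution, a distillation loss can be computed as in the generic PyTorch sketch below, which assumes the teacher and student share a vocabulary; it is not the exact objective of any single cited method.

    import torch.nn.functional as F

    def feature_kd_loss(student_logits, teacher_logits, temperature=1.0):
        # logits: (batch, seq_len, vocab); the teacher logits carry the elicited
        # feature knowledge phi_feat(x, y; theta_T) of Eq. (6).
        t = temperature
        s_logp = F.log_softmax(student_logits / t, dim=-1).flatten(0, 1)
        t_prob = F.softmax(teacher_logits / t, dim=-1).flatten(0, 1)
        # Token-averaged forward KLD, KL(p_T || p_S).
        return F.kl_div(s_logp, t_prob, reduction="batchmean") * (t * t)

In practice the teacher's logits are computed with gradients disabled, e.g., teacher_logits = teacher(input_ids).logits.detach() for a HuggingFace-style causal LM.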
The most straightforward way to elicit feature knowledge from the teacher is to label a fixed dataset of sequences with token-level probability distributions (Sanh et al., 2019; Wen et al., 2023). To leverage the rich semantic and syntactic knowledge in the intermediate layers of the teacher model, TED (Liang et al., 2023a) designs task-aware layer-wise distillation, aligning the student's hidden representations with those of the teacher at each layer and selectively extracting knowledge pertinent to the target task. Gu et al. (2024) and Agarwal et al. (2024) introduce a novel approach in which the student model first generates sequences, termed 'self-generated sequences'; the student then learns from feedback (i.e., the output distribution) of the teacher on these sequences. This method is particularly beneficial when the student model lacks the capacity to mimic the teacher's distribution. Moreover, various LLM quantization methods that distill feature knowledge from teacher LLMs have been proposed (Tao et al., 2022a; Liu et al., 2023a; Kim et al., 2023b). These methods aim to preserve the original output distribution when quantizing the LLMs, ensuring minimal loss of performance. Additionally, feature knowledge can serve as a potent source for multi-teacher knowledge distillation. Timiryasov and Tastet (2023) leverage an ensemble of GPT-2 and LLaMA as teacher models to extract output distributions. Similarly, FuseLLM (Wan et al., 2024a) innovatively combines the capabilities of various LLMs through a weighted fusion of their output distributions, integrating them into a single LLM. This approach has the potential
to significantly enhance the student model's capabilities, surpassing those of any individual teacher LLM.

In summary, feature knowledge offers a more transparent alternative to black-box methods, allowing for deeper insight into, and control over, the distillation process. By utilizing feature knowledge from teacher LLMs, such as output distributions and intermediate-layer features, white-box approaches enable richer knowledge transfer. While showing promise, especially for smaller models, this approach is not applicable to black-box LLMs whose internal parameters are inaccessible. Furthermore, student models distilled from white-box LLMs may underperform compared to their black-box-distilled counterparts, as the black-box teacher LLMs (e.g., GPT-4) tend to be more powerful.

3.1.5 Feedback

Most previous works predominantly focus on one-way knowledge transfer from the teacher to the student for imitation, without considering feedback from the teacher on the student's generations. Such feedback typically offers guidance on student-generated outputs by providing preferences, assessments, or corrective information. For example, a common form of feedback involves the teacher ranking the student's generations and distilling this preference into the student model through Reinforcement Learning from AI Feedback (RLAIF) (Bai et al., 2022a). Here is a generalized formulation for eliciting feedback knowledge:

D^(fb) = {(x, y, ϕ_fb(x, y; θ_T)) | x ∼ X, y ∼ p_S(y | x)},  (7)

where y denotes the output generated by the student model in response to x, and ϕ_fb(·; θ_T) represents the teacher LLM providing feedback. This operation evaluates the student's output y given the input x by offering assessment, corrective information, or other forms of guidance. This feedback knowledge can not only be distilled into the student so that it, too, can generate feedback (such as by creating a student preference model) but, more importantly, enables the student to refine its responses based on the feedback. Various methods have been explored to elicit this advanced knowledge (Bai et al., 2022a; Luo et al., 2023b; Cui et al., 2023a; Kwon et al., 2023; Jiang et al., 2023b; Chen et al., 2023a; Gu et al., 2024; Agarwal et al., 2024; Chen et al., 2024b; Guo et al., 2024; Ye et al., 2023; Hong et al., 2023; Lee et al., 2023a).
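A minimal sketch of one common instantiation of Eq. (7) follows: the student samples candidate responses and the teacher expresses a pairwise preference, yielding triples usable for RLAIF-style reward modeling or preference optimization. Here student_sample and teacher_judge are assumed wrappers, and the judging prompt is illustrative.

    def collect_preferences(prompts, student_sample, teacher_judge):
        prefs = []
        for x in prompts:
            y1, y2 = student_sample(x), student_sample(x)   # y ~ p_S(y | x)
            verdict = teacher_judge(                        # phi_fb(x, y; theta_T)
                f"Question: {x}\nAnswer A: {y1}\nAnswer B: {y2}\n"
                "Which answer is more helpful and harmless? Reply with 'A' or 'B'.")
            chosen, rejected = (y1, y2) if verdict.strip().startswith("A") else (y2, y1)
            prefs.append({"prompt": x, "chosen": chosen, "rejected": rejected})
        return prefs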
Preference, as previously discussed, represents a notable form of feedback knowledge from teacher models. Various kinds of preference knowledge can be distilled from teachers by prompting them with specific criteria. Bai et al. (2022a) introduce RLAIF for distilling harmlessness preferences from LLMs. This involves using an SFT-trained LLM to generate response pairs for each prompt, then ranking them for harmlessness to create a preference dataset. This dataset is distilled into a preference model (PM), which then guides the RL training of a more harmless LLM policy. WizardMath (Luo et al., 2023b) places emphasis on mathematical reasoning, employing ChatGPT as the teacher to directly provide process supervision and evaluate the correctness of each step in the generated solutions. To scale up high-quality distilled preference data, Cui et al. (2023a) develop UltraFeedback, a large-scale preference dataset for distilling better preference models. It compiles various instructions and models to produce comparative data; GPT-4 is then used to score candidates on various aspects of preference, including instruction following, truthfulness, honesty, and helpfulness.

Beyond merely assessing student generations, teachers can also furnish extensive feedback on instances where students underperform. In Lion (Jiang et al., 2023b), the teacher model pinpoints instructions that pose challenges to the student model, generating new, more difficult instructions aimed at bolstering the student's abilities. PERsD (Chen et al., 2023a) showcases a method in which the teacher offers tailored refinement feedback on incorrect code snippets generated by students, guided by the specific execution errors encountered. Similarly, SelFee (Ye et al., 2023) leverages ChatGPT to generate feedback and revise the student's answer based on that feedback. In contrast, FIGA (Guo et al., 2024) revises the student's response by comparing it to the ground-truth response. Furthermore, the teacher model's distribution over the student's generations can itself act as a form of feedback: MiniLLM (Gu et al., 2024) and GKD (Agarwal et al., 2024) present an innovative strategy wherein the student model first generates sequences and the teacher model then produces an output distribution over them as feedback. This method leverages the teacher's insight to directly inform and refine the student model's learning process.

3.1.6 Self-Knowledge

Knowledge can also be elicited from the student itself, which we refer to as self-knowledge. In this setting, the same model acts as both the teacher and the student, iteratively improving itself by distilling and refining its own previously generated outputs. This setting uniquely circumvents the need for an external, potentially proprietary, powerful teacher model, such as a GPT-series LLM. Furthermore, it allows the model to surpass the limitations, or "ceiling," inherent in traditional teacher-student methods. Eliciting self-knowledge can be formulated as:

D^(sk) = {(x, y, ϕ_sk(x, y)) | x ∼ S, y ∼ p_S(y | I ⊕ x)},  (8)

where ϕ_sk(·) is a generalized function that represents an additional process applied to the self-generated outputs y, which can include but is not limited to filtering, rewarding, or any other mechanism for enhancing or evaluating y. It can be governed by external tools or by the student itself, θ_S. Recent research in this area has proposed various innovative methodologies to elicit self-knowledge, demonstrating its potential for creating more efficient and autonomous learning systems (Allen-Zhu and Li, 2020; Wang et al., 2022a; Sun et al., 2024b; Yang et al., 2024; Jung et al., 2023; Huang et al., 2023a; Gulcehre et al., 2023; Yuan et al., 2024a; Xu et al., 2023b; Zelikman et al., 2022; Chen et al., 2024a; Zheng et al., 2024; Li et al., 2024c; Zhao et al., 2024; Singh et al., 2023; Chen et al., 2024c; Hosseini et al., 2024).
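Below is a minimal sketch of one round of self-knowledge elicitation, in the spirit of STaR (Zelikman et al., 2022) or ReST (Gulcehre et al., 2023): the model generates its own outputs, ϕ_sk filters them (here via an assumed external correctness check), and the survivors are used for further fine-tuning. All callables are hypothetical placeholders.

    def self_distill_round(model_generate, fine_tune, prompts, is_correct):
        kept = []
        for x in prompts:
            y = model_generate(x)        # y ~ p_S(y | I (+) x)
            if is_correct(x, y):         # phi_sk: filter, reward, or self-evaluate
                kept.append((x, y))
        fine_tune(kept)                  # the model learns from its own outputs
        return kept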
A notable example of this methodology is Self-Instruct (Wang et al., 2022a), which utilizes GPT-3 for data augmentation through the Expansion approach, generating additional data samples to enhance the dataset. This enriched dataset subsequently fine-tunes the original model. Other methods aim to elicit targeted knowledge
from student models by modifying prompts and leveraging these data for further refinement. In Self-Align (Sun et al., 2024b), the authors find that models fine-tuned on Self-Instruct data tend to generate short or indirect responses.

Divergence Type    D(p, q) Function
Forward KLD        Σ_t p(t) log(p(t) / q(t))
Reverse KLD        Σ_t q(t) log(q(t) / p(t))
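In the table, p denotes the teacher distribution and q the student distribution. The PyTorch sketch below computes both divergences over batched distributions: forward KLD is mode-covering, whereas reverse KLD, adopted for example by MiniLLM (Gu et al., 2024), is mode-seeking and lets the student concentrate on the teacher's high-probability regions.

    import torch

    def forward_kld(p, q, eps=1e-9):
        # sum_t p(t) * log(p(t) / q(t)); penalizes q for missing mass where p > 0.
        return torch.sum(p * (torch.log(p + eps) - torch.log(q + eps)), dim=-1)

    def reverse_kld(p, q, eps=1e-9):
        # sum_t q(t) * log(q(t) / p(t)); lets q focus on the major modes of p.
        return torch.sum(q * (torch.log(q + eps) - torch.log(p + eps)), dim=-1)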
TABLE 3: A summary of skill distillation works. IF: Instruction Following, MD: Multi-turn Dialogue, TP: Thinking Pattern, RAG: Retrieval-Augmented Generation, NLU: Natural Language Understanding, NLG: Natural Language Generation, IR: Information Retrieval, SFT: Supervised Fine-Tuning, D&S: Divergence and Similarity, RL: Reinforcement Learning, RO: Ranking Optimization.
formats with templates, such as prefacing machine translation data with "Translate this sentence to Spanish:". However, these approaches have limitations. Manual data creation is labor-intensive, while template-based transformation lacks diversity in instructions and may not align well with natural human input. LLMs like GPT-4 offer an efficient alternative for creating diverse and controlled SFT data through their capabilities of in-context learning and instruction following. Most relevant works use OpenAI's GPT-series models to generate prompt-response data pairs and then train the student LLMs by supervised fine-tuning (Wang et al., 2022a; Taori et al., 2023; Chiang et al., 2023; Wu et al., 2023c; Xu et al., 2023a; Mukherjee et al., 2023; Mitra et al., 2023; Luo et al., 2023b; Peng et al., 2023a).
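Concretely, once prompt-response pairs have been distilled, the student is typically trained with a next-token cross-entropy loss masked to the response tokens, as in the sketch below. It assumes a HuggingFace-style causal LM whose forward pass returns logits; details vary across the cited works.

    import torch.nn.functional as F

    def sft_loss(model, input_ids, prompt_lens):
        # input_ids: (batch, seq) = prompt tokens followed by response tokens.
        logits = model(input_ids).logits           # assumed HF-style model output
        shift_logits = logits[:, :-1, :]
        shift_labels = input_ids[:, 1:].clone()
        for i, plen in enumerate(prompt_lens):
            shift_labels[i, : plen - 1] = -100     # no loss on prompt positions
        return F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            ignore_index=-100,
        )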
Basic Instructions. Self-Instruct (Wang et al., 2022a) leverages the in-context learning capability of GPT-3 to expand a seed pool of 175 tasks to 52K task-agnostic instructions, ensuring a broad spectrum of general instructions. Additionally, a filtering and post-processing stage is introduced to eliminate redundant or similar instructions. Notably, through training with this enriched dataset, GPT-3 acquires the ability to follow instructions, enabling it to perform comparably to InstructGPT on zero-shot instruction tasks and when provided with expert-written instructions for novel tasks. Based on the Self-Instruct method, Taori et al. (2023) train the Alpaca model from the LLaMA 7B model on 52K instruction-following demonstrations, generated in a similar style to Self-Instruct but utilizing the more robust text-davinci-003 model. To enhance the diversity of instructional data, Wu et al. (2023c) introduce a technique known as Topic-Guided Instruction Generation, which involves gathering 3.5K common topics from Wikipedia to serve as guidance during the generation process.
Complex Instructions. Some works push student models toward solving more complex instructions (Xu et al., 2023a; Luo et al., 2023b,a; Guo et al., 2023c). According to Xu et al. (2023a), instruction datasets derived from human-written seeds often exhibit low to moderate complexity. To enhance the complex instruction-following capabilities of smaller models, WizardLM (Xu et al., 2023a) introduces Evol-Instruct. This method gradually transforms instructions into more complex forms through a multi-step evolution process, focusing both on increasing difficulty levels and on expanding the diversity of topics. The authors conducted four rounds of evolution using the OpenAI ChatGPT API, resulting in a dataset of 250K complex instructions, and then trained the LLaMA 7B model, referred to as WizardLM, on this dataset. On the high-difficulty section of test instructions, WizardLM even outperformed ChatGPT, achieving a win rate 7.9% higher than ChatGPT. Zhao et al. (2023e) further conduct preliminary studies revealing the effectiveness of increasing instruction complexity. Instruction Fusion (Guo et al., 2023c) further uses teacher LLMs to increase complexity by fusing two distinct evolved instructions. Furthermore, this concept of "evolving" instructions has been extended to distill specific skills such as coding (Luo et al., 2023a) and mathematics (Luo et al., 2023b).
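The sketch below conveys the flavor of such an evolution step; the listed operations paraphrase the difficulty- and diversity-oriented directions described above, while the exact prompts used by WizardLM differ. teacher_generate is an assumed teacher API wrapper.

    import random

    EVOLUTIONS = [
        "Add one more constraint or requirement to the instruction.",       # difficulty
        "Replace a general concept in the instruction with a more specific one.",
        "Rewrite the instruction so that it requires multi-step reasoning.",
        "Write a rarer, more long-tailed instruction on the same topic.",    # diversity
    ]

    def evolve(instruction, teacher_generate, rounds=4):
        for _ in range(rounds):
            op = random.choice(EVOLUTIONS)
            instruction = teacher_generate(
                f"{op}\nOriginal instruction: {instruction}\nEvolved instruction:")
        return instruction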
Human Instructions. In contrast to works that rely on generating instructions with ChatGPT, which may lack diversity and diverge from real human instructions, Vicuna (Chiang et al., 2023) and Koala (Geng et al., 2023) showcase impressive performance by using human conversations and natural instructions from community-contributed conversations. These conversations, found on platforms like ShareGPT, provide a forum for users to share their interactions with ChatGPT. It is important to note, however, that models trained on such natural conversations might mimic the style but may not fully capture the reasoning process of the original teacher (Gudibande et al., 2023; Mukherjee et al., 2023).

System Instructions. To encourage student models to learn the reasoning process, Orca and Orca 2 (Mukherjee et al., 2023; Mitra et al., 2023) enhance the (prompt, response) data pairs by introducing a system message (e.g., "explain like I'm five, think step-by-step") that prompts GPT-4 to provide explanation traces elucidating the teacher's reasoning process. Orca 2 (Mitra et al., 2023) further trains the student model to identify the most effective solution strategy for each task, guided by Orca's performance. This approach significantly improves the ability of smaller models to follow instructions that involve reasoning.

High-Quality Instructions. As demonstrated by Zhou et al. (2023a) and Li et al. (2024f), data quality is crucial for instruction-following training. UltraChat (Ding et al., 2023b) distills large-scale data with high-quality and diverse instructions from teacher LLMs using various kinds of meta-information. The UltraLLaMA model, fine-tuned on this data, consistently surpasses other open-source models. The Phi series models (Gunasekar et al., 2023; Li et al., 2023a; Mar, 2023) prioritize data quality and employ synthetic methods to generate data of "textbook quality" to enhance the learning experience for smaller models. Notably, Phi exhibits the ability to follow instructions effectively even without specific instruction fine-tuning. What is particularly remarkable is that Phi-2, with just 2.7 billion parameters, outperforms Mistral and Llama-2 models with 7B and 13B parameters across various benchmark evaluations.

Improved Instructions. Another line of work focuses on improving the quality of existing instruction data, covering the improvement of both instructions and their corresponding responses. SelFee (Ye et al., 2023) utilizes ChatGPT to iteratively improve the quality of responses. ExpertLLaMA (Xu et al., 2023f) improves response quality by augmenting vanilla instructions with specialized Expert Identity descriptions. Reflection-Tuning (Li et al., 2023e) improves both the instruction and the response sequentially by reflecting on specific criteria. DEITA (Liu et al., 2023h) proposes to enhance and score instructions in three directions, including complexity, quality, and diversity, to obtain high-quality distillation data. MUFFIN (Lou et al., 2023) proposes to scale instructions with respect to the input by diversifying tasks across various input facets. Selective Reflection-Tuning (Li et al., 2024d) brings the student model into the data improvement pipeline with a novel student-selection module, in which the student model decides which data to learn from.

In summary, distilling instruction data from teachers presents a promising avenue for training cheap and reproducible instruction-following language models. Current small models have made strides in enhancing various aspects of instruction-following ability, like diversity, complexity, and explanation. However, student models trained on instruction data expanded by ChatGPT often mimic ChatGPT's style without replicating its factual accuracy (Gudibande et al., 2023). Achieving a more capable instruction-following ability requires a stronger teacher LLM (Gudibande et al., 2023) and access to diverse, high-quality instruction data, such as that used in Orca (Mukherjee et al., 2023; Mitra et al., 2023), which incorporates extensive task instructions from the Flan 2022 Collection (Longpre et al., 2023).
16
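To make the evolution recipe described in this subsection concrete, the sketch below shows a minimal Evol-Instruct-style loop. It is an illustrative reconstruction rather than WizardLM's released pipeline: `teacher_generate` is a stub standing in for a chat-completion call, and both prompt templates are paraphrased assumptions.

```python
import random

def teacher_generate(prompt: str) -> str:
    """Stub for one chat-completion call to the teacher LLM
    (e.g., the ChatGPT API used by WizardLM); replace in practice."""
    return "[evolved] " + prompt.splitlines()[-1]

DEEPEN = ("Rewrite the following instruction so that it demands "
          "deeper, multi-step reasoning, keeping the topic fixed:\n{inst}")
BREADTH = ("Write a brand-new instruction that is rarer and more "
           "diverse than, but in the same domain as:\n{inst}")

def evolve(seed_instructions, rounds=4):
    """Evol-Instruct-style loop: each round rewrites every instruction
    either in depth (harder) or in breadth (more diverse)."""
    pool = list(seed_instructions)
    for _ in range(rounds):  # four rounds, as reported for WizardLM
        evolved = []
        for inst in pool:
            template = random.choice([DEEPEN, BREADTH])
            candidate = teacher_generate(template.format(inst=inst))
            # Real pipelines also ask the teacher to reject failed
            # evolutions (copies, truncations, degenerate rewrites).
            evolved.append(candidate if candidate.strip() else inst)
        pool.extend(evolved)
    return pool

print(evolve(["Explain binary search."], rounds=1))
```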
4.1.2 Multi-turn Dialogue

While instruction following focuses on single-instance command execution, multi-turn dialogue extends this to comprehending and maintaining context through ongoing interactions. This skill is vital for models to engage meaningfully in human-like conversations and respond coherently over successive dialogue turns. Several works have been dedicated to training small chat models by distilling multi-turn knowledge from teacher LLMs (Chiang et al., 2023; Xu et al., 2023b; Ding et al., 2023b; Li et al., 2023b; Wang et al., 2023c; Tunstall et al., 2023).

ShareGPT serves as a platform for users to share their conversations with ChatGPT, offering a vast repository of readily available multi-turn conversations. Some small chat models are trained using this data to acquire the capability for engaging in multi-turn dialogues (Chiang et al., 2023; Ye et al., 2023; Wang et al., 2023c). For example, Vicuna (Chiang et al., 2023) is a chat model exclusively trained on ShareGPT data. Despite its sole training source being ShareGPT, Vicuna achieves a high MT-Bench (Zheng et al., 2023a) score assigned by GPT-4 (MT-Bench is a multi-turn question set in which model generations are evaluated by an LLM such as GPT-4). In the study conducted by Wang et al. (2023c), GPT-3.5 and GPT-4 are employed to generate mixed responses using ShareGPT data. They assign higher rewards to responses generated by GPT-4, aiming to incentivize student models to produce high-quality responses. Additionally, Ye et al. (2023) enhance the quality of multi-turn data from ShareGPT by generating self-feedback on model responses and iteratively refining the responses based on the received feedback.

To enhance the multi-turn capabilities of student models, another line of research focuses on expanding conversational datasets through self-chat and using them to train smaller models (Xu et al., 2023b; Ding et al., 2023b; Tunstall et al., 2023). For instance, Xu et al. (2023b) initiate their work by using questions sourced from Quora and Stack Overflow as seeds, collecting 111.5k dialogues through self-chat. Subsequently, they employ parameter-efficient tuning to train a chat model named Baize. Ding et al. (2023b) first construct a significantly larger dataset called UltraChat, comprising 1.5 million high-quality multi-turn dialogues, by distilling instructions and dialogues from ChatGPT. Notably, UltraChat encompasses a wide range of topics and instructions. Building upon the UltraChat dataset, they fine-tune a LLaMA model, resulting in a powerful chat model known as UltraLLaMA, which consistently outperforms other open-source chat models, including Vicuna and Baize. Furthermore, UltraChat is employed in conjunction with an AI-preference-aligned chat model named Zephyr (Tunstall et al., 2023), which enhances intent alignment through the application of distilled direct preference optimization (dDPO).
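A minimal sketch of the self-chat idea behind Baize and UltraChat follows. The `chat` function is a hypothetical stand-in for a single teacher completion call; real pipelines add stopping criteria and quality filters.

```python
def chat(history, role):
    """Stub for one teacher chat-completion call that continues the
    dialogue in the given role; replace with a real API client."""
    return f"<{role} turn following: {history[-1][1][:40]}>"

def self_chat(seed_question, turns=3):
    """Baize-style self-chat: the same teacher model alternately plays
    assistant and user, growing a multi-turn dialogue from one seed."""
    history = [("user", seed_question)]
    for _ in range(turns):
        history.append(("assistant", chat(history, "assistant")))
        history.append(("user", chat(history, "user")))  # follow-up
    return history

# Seeds are scraped questions, e.g. from Quora or Stack Overflow.
print(self_chat("How do I profile a slow Python function?", turns=1))
```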
4.1.3 RAG Capability

LLMs are known to lack the ability to utilize up-to-date knowledge and often produce responses containing factual inaccuracies due to their sole reliance on parametric knowledge. Retrieval-Augmented Generation (RAG) is a promising technique for mitigating this issue. Handling the augmented context of retrieved information is also a non-trivial skill of LLMs. Several approaches to distill RAG capabilities have been proposed (Kang et al., 2023a; Luo et al., 2023c; Asai et al., 2023).

SAIL (Luo et al., 2023c) starts by retrieving search results for each training case using search APIs, creating search-augmented instructions that include both the instruction and grounding information. To encourage the language model to prioritize informative retrieval results, they input each retrieved passage along with the ground-truth response into an entailment model to label each retrieval result for relevance. Subsequently, the search-augmented instructions and relevance labels are fed into teacher LLMs (like GPT-4) to generate responses. Following fine-tuning on this training set, the student model becomes proficient at denoising search results and generating accurate responses.

KARD (Kang et al., 2023b) distills rationales r from the teacher LLM in response to questions x. These rationales are then utilized to train two models: a student LM and a Reranker. For training the student LM, the rationales serve as a means to retrieve relevant knowledge d, and the student LM is subsequently fine-tuned using the rationales alongside questions and knowledge. However, during inference, only questions are available. To address this, the Reranker is trained to mimic how the retriever scores passages with the rationale, by minimizing the KL divergence between Retriever(d|r) and Reranker(d|x).
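Concretely, KARD's reranker objective can be read as a KL term that pushes the question-conditioned reranker toward the rationale-conditioned retriever. The PyTorch sketch below is our paraphrase of that objective, under the simplifying assumption that each model emits one score per candidate passage over a shared, fixed candidate set.

```python
import torch
import torch.nn.functional as F

def kard_reranker_loss(retriever_logits: torch.Tensor,
                       reranker_logits: torch.Tensor) -> torch.Tensor:
    """KL( Retriever(d|r) || Reranker(d|x) ) over shared candidates.

    retriever_logits: passage scores given the teacher rationale r,
                      treated as a fixed target (hence detach()).
    reranker_logits:  scores of the same passages given only question x.
    Shapes: (batch, num_passages).
    """
    target = F.softmax(retriever_logits, dim=-1).detach()
    log_pred = F.log_softmax(reranker_logits, dim=-1)
    # 'batchmean' gives the mean per-example KL divergence.
    return F.kl_div(log_pred, target, reduction="batchmean")

loss = kard_reranker_loss(torch.randn(2, 8), torch.randn(2, 8))
```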
However, integrating a fixed number of passages into the language model, without considering their necessity or relevance, can reduce versatility and lead to the generation of unhelpful responses. To equip student LMs with adaptive RAG capabilities, Self-RAG (Asai et al., 2023) distills this adaptive ability from teacher LLMs into a small critic model. This critic model determines whether retrieval is necessary and evaluates the quality of the retrieved results by generating 'reflection tokens.' For instance, Self-RAG initiates the retrieval operation when generating the reflection token 'Retrieve'. To distill this critic data, GPT-4 is prompted to assess the need for retrieval using few-shot demonstrations I, the task input x, and output y to predict a reflection token r, i.e., p(r | I, x, y).

4.2 Alignment

4.2.1 Thinking Pattern

Most existing methods focus on directly aligning the responses of student models to those of teacher models (Taori et al., 2023). Though effective, such students tend to imitate the response style of the teacher models without learning the underlying reasoning process (Mukherjee et al., 2023). Thus, to better distill from the teacher models, methods have been proposed that imitate not only the pure responses but also novel thinking patterns (Ye et al., 2023; Mukherjee et al., 2023; Mitra et al., 2023; Wang et al., 2023d; Cheng et al., 2023; Zhang et al., 2023a).
Motivated by the effectiveness of LLMs in generating their own feedback without relying on external models (Schick et al., 2022; Madaan et al., 2023; Saunders et al., 2022), SelFee (Ye et al., 2023) proposes to train a model that has been fine-tuned to continuously revise its own answer until it provides a high-quality response in a single inference. During training, it utilizes both the final response and the feedback chain as the fitting target. This pattern, response with the revision process, shows a promising performance gain. Following SelFee, Reflection-Tuning (Li et al., 2023e, 2024d) also utilizes the reflection process as the learning pattern. Noticing the lack of reasoning imitation in previous methods, Orca (Mukherjee et al., 2023) first proposes Explanation Tuning, which aims to learn the reasoning steps, including explanation traces, step-by-step thought processes, and other complex instructions, from the teacher model, rather than just the vanilla styles. Extensive experiments verify the effectiveness of distilling with this thinking pattern. The follow-up Orca 2 (Mitra et al., 2023) further equips the student models with the ability to utilize different solution strategies for different tasks, motivated by the capability discrepancies between smaller and larger models. By employing this training pattern, the student models gain better reasoning ability. Besides learning with the corresponding revision or reflection process, another recently emerged thinking pattern is generating both responses and preferences. Zhang et al. (2023a) propose to learn both the knowledge and the corresponding preference for domain-specific QA with LLMs. Recently, DEBATunE (Li et al., 2024e) proposes to improve the controllability of LLMs in generating statements on controversial topics. By engaging two agents in a structured multi-round debate on controversial topics, salient and in-depth statements can be obtained and further distilled into the student models.

4.2.2 Preference

The previously mentioned methods primarily focus on the basic capability of student models to produce outcomes that are strictly accurate but may not align with human preferences. While alignment at this level enables these models to aid in various tasks, it does not meet higher-level demands. Early methods mainly utilize human feedback for the alignment with human preferences (Ziegler et al., 2019; Stiennon et al., 2020; Wu et al., 2021; Ouyang et al., 2022; Bai et al., 2022b; Köpf et al., 2023; Yuan et al., 2023b). However, obtaining human feedback is costly and labor-intensive; thus, methods that learn from AI feedback have also been proposed to align with human preferences (Bai et al., 2022a; Kwon et al., 2023; Scheurer et al., 2023; Kim et al., 2023a; Roit et al., 2023; Yang et al., 2024; Lee et al., 2023a; Tunstall et al., 2023; Cui et al., 2023a; Wang et al., 2023f).

The concept of RLAIF, introduced by Bai et al. (2022a), involves the integration of preferences labeled by LLMs with those labeled by humans. This approach is designed to simultaneously optimize two key objectives: ensuring the helpfulness of the output and minimizing any potential harm, making the responses of LLMs more aligned with human preferences. Kwon et al. (2023) develop a proxy reward function using LLMs like GPT-3, created by first providing the LLM with a description of the behaviors desired by the user, along with a small number of examples. The LLM then produces rewards by evaluating how closely the outputs of a model align with the provided descriptions, essentially measuring their relevance to the established ground truth. Scheurer et al. (2023) propose Imitation Learning from Language Feedback, in which a language model is utilized to improve various outputs generated by a model, based on a reference provided by a human. Following this process, the most effectively refined output is chosen for further supervised fine-tuning. As outlined by Kim et al. (2023a), ALMoST involves condensing human preferences into a set of heuristic guidelines; one such rule is the idea that larger LLMs utilizing more comprehensive and higher-quality prompts are likely to yield superior responses. Based on these established guidelines, comparison data is generated using responses from LLMs of different sizes and with varying prompts, and this data is then used to train a reward model. Yang et al. (2024) propose Reinforcement Learning from Contrast Distillation, which aims to align language models without relying on human feedback. This approach trains a preference model using simulated preference pairs, including both high-quality and low-quality examples, generated through contrasting (positive and negative) prompts.

Lee et al. (2023a) further highlight the effectiveness of RLAIF, showing that it not only matches but in some cases surpasses RLHF, and, interestingly, that RLAIF can also enhance the performance of supervised fine-tuning. Another notable discovery is that directly prompting the LLM for reward scores during reinforcement learning can be more effective than the conventional approach of training a reward model based on LLM preferences. Wang et al. (2023f) propose Conditioned-RLFT, a reinforcement-learning-free supervised learning approach that treats different data sources as coarse-grained reward labels and develops a class-conditioned policy to effectively utilize data of varying quality. Cui et al. (2023a) propose UltraFeedback, a large-scale, high-quality, and diversified preference dataset labeled by GPT-4 for comprehensive feedback. Tunstall et al. (2023) apply distilled Direct Preference Optimization (Rafailov et al., 2023) on UltraFeedback, obtaining a small but powerful LLM.
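To make the dDPO step concrete: Zephyr applies the DPO objective of Rafailov et al. (2023), with the chosen/rejected labels produced by an AI annotator over UltraFeedback rather than by humans. Below is a minimal sketch of that loss, assuming summed per-sequence log-probabilities have already been computed for the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each argument holds summed token log-probabilities for the chosen
    (teacher-preferred) and rejected responses; in dDPO the preference
    split comes from an AI annotator such as GPT-4."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.9]),
                torch.tensor([-13.0]), torch.tensor([-14.2]))
```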
4.2.3 Value

Attaining alignment with human preferences allows large models to optimize human satisfaction by operating in a manner consistent with those preferences. However, to establish trustworthy LLMs, the notion of 'aligning LLMs with human values' has been proposed, and the key principles of alignment are often summarized as the "HHH" criteria: helpful, harmless, honest (Weidinger et al., 2021; Askell et al., 2021). Numerous methods have been undertaken for building trustworthy LLMs. However, due to the intrinsic difficulty of this aim, which remains an unsolved problem even for proprietary models (Sun et al., 2024a), most existing methods rely on constructing high-quality human preference datasets (Ji et al., 2023b; Solaiman and Dennison, 2021; Bai et al., 2022b; Qiu et al., 2022; Kiesel et al., 2022; Liu et al., 2022a), utilizing human-written rules as constraints (Glaese et al., 2022; Sun et al., 2023b, 2024b), etc. For detailed progress on trustworthy LLMs, please refer to Yao et al. (2023a); Liu et al. (2023i); Sun et al. (2024a).

Though slightly under-explored, aligning LLMs with human values by distillation is still possible (Bai et al., 2022a; Cui et al., 2023a; Yang et al., 2024; Sun et al., 2024b). For instance, Bai et al. (2022a) propose RLAIF, utilizing AI-generated labels to interactively improve both helpfulness and harmlessness. Sun et al. (2024b) prompt the student model with 16 principles as guidelines for generating helpful, ethical, and reliable responses. Similarly, both harmless and harmful generations can be elicited by modifying the prompts and are then used to train the preference model (Yang et al., 2024). Cui et al. (2023a) utilize GPT-4 to rank generations regarding helpfulness, truthfulness, and honesty. Liu et al. (2023b) advance the alignment of LLMs with societal values by incorporating simulated social interactions into the training process. This approach encompasses a range of elements, including demonstrations that are both in alignment and in conflict with social norms, as well as collective ratings, in-depth feedback, and responses that are revised iteratively.

4.3 Agent

4.3.1 Tool Using

While recent LLMs have shown proficiency in solving various tasks, they still tend to make mistakes when handling large numerical values or executing intricate mathematical calculations (Qian et al., 2022; She et al., 2023; Manikandan et al., 2023; Liang et al., 2023b; Mialon et al., 2023). Thus, equipping LLM agents with the capability to utilize tools has received increasing attention. Commonly used methods mainly rely on human-curated data for training (Parisi et al., 2022; Nakano et al., 2022; Qin et al., 2023c; Song et al., 2023b) or on prompt design (Cai et al., 2023; Shen et al., 2023a; Hao et al., 2024). Recently, distillation-based methods have also been proposed (Schick et al., 2023; Zhang, 2023; Patil et al., 2023; Tang et al., 2023a; Qin et al., 2023a; Yuan et al., 2023a; Gao et al., 2023b; Wang et al., 2024; Shen et al., 2024; Yuan et al., 2024b).

Toolformer (Schick et al., 2023) adopts a self-supervised approach, avoiding large-scale human annotation, to identify the most useful APIs to call and to further distill this capability into the model itself. The GPT-J-based Toolformer greatly surpasses OPT (66B) (Zhang et al., 2022) and GPT-3 (175B) (Brown et al., 2020). Graph-ToolFormer (Zhang, 2023) aims to equip LLMs with the ability to process and reason over complex graph data, enhancing LLMs with graph reasoning skills via external graph reasoning API tools; it adopts ChatGPT to annotate and augment a larger graph reasoning statement dataset for training. Gorilla (Patil et al., 2023) addresses the limitations of current LLMs in generating accurate input arguments and their tendency to "hallucinate" incorrect API usage; it collects thousands of models from platforms like HuggingFace and Torch Hub as API calls and utilizes GPT-4 to generate synthetic instruction data for training. GPT4Tools (Yang et al., 2023b) enables open-source LLMs like LLaMA and OPT to use multimodal tools, a capability previously limited to advanced proprietary models like ChatGPT and GPT-4; the approach generates an instruction-following dataset by prompting an advanced teacher model with multimodal contexts, and trains with Low-Rank Adaptation. ToolAlpaca (Tang et al., 2023a) proposes a framework aimed at enhancing the tool-use capabilities of compact language models for embodied intelligence. It creates a dataset of 3,938 instances from over 400 real-world tool APIs across 50 categories and utilizes ChatGPT to generate documentation for each prompt for later training. ToolLLM (Qin et al., 2023a) proposes a comprehensive framework for enhancing LLMs with tool-use proficiency, focusing on data creation, model training, and evaluation by distilling from ChatGPT. Their ToolLLaMA shows impressive performance in executing complex instructions and handling new APIs, rivaling ChatGPT. CRAFT (Yuan et al., 2023a) builds a general tool creation and retrieval framework, which utilizes GPT-4 to generate code snippets as the created tools. During inference, other small LLMs can select and retrieve the generated code snippets to execute, or generate other methods conditioned on the given snippets. Confucius (Gao et al., 2023b) introduces a tiered training strategy for LLMs to master tool usage through a graduated curriculum, along with a method called Iterative Self-instruction from Introspective Feedback (ISIF) that dynamically enhances the dataset to handle complex tools. MLLM-Tool (Wang et al., 2024) is a multi-modal tool agent capable of interpreting instructions embedded in visual or audio content by integrating multi-modal encoders with open-source large language models; as a trainable method, its initial instruction-answer pairs are generated using GPT-4. Shen et al. (2024) demonstrate that small LLMs are weak tool learners and propose a multi-LLM framework that decomposes the tool-use ability of a single model into a planner, a caller, and a summarizer, leading to superior performance. The two-stage training strategy introduced by this work is powered by ChatGPT and GPT-4, which collect execution trajectories for the training set. Yuan et al. (2024b) observe that lengthy tool documentation hinders LLMs from understanding how to utilize a tool and thus propose EASYTOOL to distill the essential information from extensive documentation; the ground-truth summarization of the training documents is obtained using ChatGPT.

4.3.2 Planning

Another important ability for LLM agents is decomposing high-level tasks into a set of actionable steps (Huang et al., 2022b), which is especially useful when acting in interactive environments. Huang et al. (2022b) first demonstrate that LLMs can generate plausible goal-driven action plans without training, introduce non-invasive tools to enhance model executability, and assess these methods through human evaluation to balance executability and semantic accuracy. Most existing methods utilize prompting strategies for task planning (Singh et al., 2022; Zhou et al., 2023b; Song et al., 2023c; Wang et al., 2023g; Yao et al., 2023b; Liu et al., 2023j; Hao et al., 2023; Hu et al., 2023a) or build human-curated data for training (Lin et al., 2023a; Valmeekam et al., 2023). Recently, some distillation methods have also emerged (Chen et al., 2023b; Zeng et al., 2023a; Yin et al., 2023a; Qiao et al., 2024; Kong et al., 2023).
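The distillation-based planning methods surveyed next largely share one recipe: sample trajectories of thought-action-observation steps from a strong teacher, keep the successful ones, and fine-tune a smaller model on the flattened text. The sketch below is a generic rendition of that recipe, not any single paper's pipeline; `teacher_step` and `ToyEnv` are illustrative stubs.

```python
def teacher_step(task, history):
    """Stub: one teacher call producing a (thought, action) pair."""
    return "The task looks done.", "finish[done]"

class ToyEnv:
    def step(self, action):
        # Returns (observation, done, success); stubbed for illustration.
        return "ok", True, True

def collect_trajectory(task, env, max_steps=8):
    history, done, success = [], False, False
    while not done and len(history) < max_steps:
        thought, action = teacher_step(task, history)
        obs, done, success = env.step(action)
        history.append((thought, action, obs))
    return history, success

def build_sft_corpus(tasks):
    """Keep only successful teacher trajectories, flattened to text."""
    corpus = []
    for task in tasks:
        traj, ok = collect_trajectory(task, ToyEnv())
        if ok:  # success filtering, as in FireAct-style pipelines
            text = "\n".join(f"Thought: {t}\nAction: {a}\nObservation: {o}"
                             for t, a, o in traj)
            corpus.append({"prompt": task, "completion": text})
    return corpus
```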
FireAct (Chen et al., 2023b) introduces an innovative approach for refining LLMs by fine-tuning smaller-scale LLMs with agent trajectories derived from a variety of tasks and prompting techniques. Applying this method with trajectories generated by GPT-4 has been shown to consistently enhance performance. AgentTuning (Zeng et al., 2023a) aims to enhance the performance of LLMs in executing agent tasks without sacrificing their wide-ranging capabilities. Utilizing a new dataset called AgentInstruct, which includes high-quality interaction trajectories, it applies a hybrid instruction-tuning approach that merges these trajectories with general-domain instructions. Lumos (Yin et al., 2023a) is a novel framework designed to train agents using a unified data format and a modular architecture based on open-source LLMs. This system comprises three key modules: planning, grounding, and execution, enabling the decomposition of tasks into subgoals and actionable steps. TPTU-v2 (Kong et al., 2023) focuses on improving the task planning and tool-usage abilities of LLMs in real-world scenarios by utilizing data generated by human experts or LLMs. It introduces a framework comprising three components: an API Retriever, an LLM Finetuner, and a Demo Selector. AUTOACT (Qiao et al., 2024) proposes an agent learning framework that requires neither large-scale annotated data nor synthetic trajectories from high-resource models like GPT-4. Instead, it uses a self-instruct method to generate its own planning trajectories from limited initial data. It then applies a division-of-labor strategy, creating sub-agents specialized in different aspects of the task completion process.

Distillation also works for training embodied multi-modal agents (Sumers et al., 2023; Yang et al., 2023c; Ma et al., 2023a; Du et al., 2023a). For instance, Sumers et al. (2023) aim to enhance the ability of AI agents to follow instructions by using pretrained vision-language models to provide supervision for understanding and acting upon language within their operational environment, leveraging model distillation and hindsight experience replay to teach contextually relevant interactions in a simulated 3D setting. EMMA (Yang et al., 2023c) examines the challenges and inefficiency of training an embodied agent in a noisy visual world without expert guidance, and proposes training the agent in a simulated environment using imitation learning, guided by an expert language model (like ChatGPT) that operates on the same tasks in a corresponding text-based simulation.

4.4 NLP Task Specialization

NLP tasks often grapple with challenges like data scarcity, interpretability issues, privacy concerns, and noisy data. The "Knowledge" section of our survey illustrates various methods for distilling knowledge from LLMs, effectively setting the stage for student models to adapt to a range of NLP tasks. This knowledge provides supervision for the training of student models through information augmentation (e.g., CoT and explanation), data augmentation, and semantic representation. By transferring the distilled knowledge from LLMs, student models can better handle diverse NLP challenges, improving task performance and addressing data limitations more robustly.

4.4.1 Natural Language Understanding

Natural Language Understanding (NLU) is a fundamental NLP task that involves comprehending and interpreting human language. The knowledge distilled from LLMs, such as through data labeling or augmentation, is typically transferred into encoder-based language models like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). Regarding the task of classification, several studies are noteworthy (Dai et al., 2023a; Gilardi et al., 2023; He et al., 2023b; Gao et al., 2023a; Chenglin et al., 2023; Li et al., 2023g). AugGPT (Dai et al., 2023a) focuses on text classification in both the general and clinical domains. To address the limitations of small-scale clinical datasets, which often lack expert annotation and are subject to stringent privacy regulations, AugGPT utilizes knowledge from teacher LLMs to rephrase each sentence in the training samples. This process creates multiple conceptually similar but semantically distinct samples, enhancing the dataset's richness and diversity. Another approach is demonstrated by Gilardi et al. (2023), who employ ChatGPT as an annotator to categorize inputs; this method has been shown to outperform crowd-workers on several tasks, including relevance, stance, topic, and frame detection. Furthermore, He et al. (2023b) propose Targeted Data Generation (TDG), a novel approach for identifying challenging subgroups within a dataset. TDG leverages LLMs, along with a human in the loop, to generate new data specifically tailored to these subgroups, thereby enriching the dataset and improving model performance on sentiment analysis and natural language inference tasks. To facilitate clinical information extraction, Tang et al. (2023b) elicit diverse samples from LLMs by providing examples and different seeds of clinical entities, i.e., the Curation manner.

Several studies have also focused on multiple NLU tasks (Ding et al., 2023a; He et al., 2023a; Wang et al., 2021a; He et al., 2022; Ye et al., 2022; Meng et al., 2022). For example, He et al. (2023a) utilize the knowledge in GPT-3.5 to annotate inputs with labels and explanations for various NLU tasks, including user-input and keyword relevance assessment, BoolQ, and WiC. Wang et al. (2021a) employ few-shot prompts to expand high-quality training data using GPT-3, i.e., the Expansion manner. Beyond employing a single approach to elicit NLP task knowledge, Ding et al. (2023a) explore a combination of the Labeling, Expansion, and Curation methods to extract knowledge from GPT-3, distilling data for both sequence- and token-level NLP tasks.
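As a concrete picture of the Labeling and Expansion manners above, the sketch below prompts a teacher either to annotate an unlabeled input or to rephrase one example into several variants (AugGPT-style augmentation). The `teacher` function, the label set, and the prompts are illustrative placeholders, not any paper's exact setup.

```python
LABELS = ["positive", "negative", "neutral"]  # example task: sentiment

def teacher(prompt: str) -> str:
    """Stub for one chat-completion call to the teacher LLM."""
    return "neutral"

def label(texts):
    """'Labeling' manner: the teacher annotates unlabeled inputs."""
    out = []
    for t in texts:
        y = teacher(f"Classify the sentiment of: {t}\n"
                    f"Answer with one of {LABELS}.")
        out.append((t, y if y in LABELS else None))  # reject bad labels
    return out

def expand(text, k=3):
    """'Expansion'/AugGPT manner: rephrase one example into k
    conceptually similar but differently worded training samples."""
    return [teacher(f"Rewrite, preserving meaning (variant {i}): {text}")
            for i in range(k)]
```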
4.4.2 Natural Language Generation

Natural Language Generation (NLG) is a key aspect of evaluating the capabilities of LLMs, encompassing tasks such as summarization, machine translation, and other open-ended text generation tasks. Known for their potent generative abilities and creativity, LLMs excel in these areas, making them prime sources for distilling knowledge into student models tailored for NLG tasks (Xu et al., 2023c, 2024b; Ramnath et al., 2023; Agarwal et al., 2024). Additionally, the knowledge distilled from LLMs can be effectively used for NLG task-specific data augmentation (Jung et al., 2023; Wang et al., 2021b; Guo et al., 2023a; Yang and Nicolai, 2023; Wang et al., 2023h; Yang et al., 2023d). While the previous sections have focused on works about open-ended generation and multi-turn dialogue, this part specifically highlights the distillation techniques relevant to other NLG tasks.

Although automatic metrics often favor smaller, fine-tuned models in summarization tasks, human evaluators tend to prefer the summaries generated by LLMs. Addressing this discrepancy, Xu et al. (2023c) develop a student summarization model by distilling a GPTSUMM dataset, which comprises over 4 million paragraph-summary pairs generated by querying GPT-3.5. In a different approach, Jung et al. (2023) introduce 'Impossible Distillation,' a method that creates a high-quality summarization-specific dataset from weak teacher LLMs. This method involves training a student model on the generated dataset and enhancing its capabilities through Self-Knowledge. Turning to machine translation, where creating parallel corpora is traditionally expensive and time-consuming, Yang and Nicolai (2023) propose a three-step distillation process: generating seeds of verbs and nouns, forming sentences, and then translating these sentences. Their findings suggest that while the distilled dataset may lack diversity, it effectively improves the translation signal for training student translation models. To distill high-quality content-grounded data automatically, Genie (Yehudai et al., 2024) proposes a general methodology with three key steps: (a) preparation of the content, (b) distillation of responses from a teacher LLM corresponding to the content, and (c) a filtering mechanism to ensure the quality and faithfulness of the generated data. Genie demonstrates that student models trained on this distilled data can match or even surpass models trained on human-generated data.

4.4.3 Information Retrieval

Information Retrieval (IR) represents a crucial branch of computer science, focused on efficiently retrieving information relevant to user queries from extensive repositories (Cai et al., 2022; Liu et al., 2022b; Feng et al., 2023; Shen et al., 2023b). A typical IR system encompasses three main components: the query rewriter, the retriever, and the reranker. Recent studies have highlighted the effectiveness of employing LLMs in IR systems, e.g., in enhancing the reranking stage through both pointwise and listwise ranking methods (Ma et al., 2023b; Sun et al., 2023a; Qin et al., 2023d). However, the practical application of LLMs in IR systems faces challenges, primarily due to their slower generation speed, which conflicts with the low-latency requirements of IR tasks (Sun et al., 2023a). As a result, KD of LLMs emerges as a more promising approach for IR, offering a way to infuse the distilled knowledge from LLMs into various stages of the IR pipeline without compromising on speed. A significant body of work demonstrates how knowledge distilled from LLMs can benefit each component of the IR system, including the Query Rewriter (Srinivasan et al., 2022; Ma et al., 2023c), the Retriever (Dai et al., 2023b; Sachan et al., 2022, 2023; Schick and Schütze, 2021; Meng et al., 2023; Peng et al., 2023b), and the Reranker (Bonifacio et al., 2022; Sun et al., 2023a; Pradeep et al., 2023a,b; Saad-Falcon et al., 2023; Ferraretto et al., 2023; Jeronymo et al., 2023; Sun et al., 2023c).

Query Rewriter. The Query Rewriter (QR) is a pivotal component in IR systems, tasked with enhancing the precision and expressiveness of user queries by refining or modifying the initial query to more accurately align with the user's information needs. One notable approach is QUILL (Srinivasan et al., 2022), which introduces a two-stage distillation method for query intent understanding. Initially, a retrieval-augmented LLM, serving as the 'professor,' is distilled into a non-retrieval-augmented teacher LLM, aiming to bolster its understanding capabilities. Subsequently, this enhanced teacher LLM is distilled into a final student model using a large dataset, further refining the process. Incorporating QR into IR systems, Ma et al. (2023c) develop a 'Rewrite-Retrieve-Read' framework. This process begins with an LLM rewriting the queries via prompting, followed by a retrieval-augmented reading stage. To integrate the rewritten queries effectively into the IR system, the knowledge gleaned from the LLM is distilled into a compact student rewriter, which is then fine-tuned using feedback from the LLM reader through reinforcement learning.

Retriever and Reranker. In IR systems, the Retriever is designed to efficiently locate the top-k relevant texts from a large corpus. It encodes both queries and documents into vector representations and performs retrieval by computing the dot product between these vectors. The Reranker further refines the order of the retrieved documents to improve the overall quality of the output, in one of two primary ways: a Pointwise Reranker takes both the query and a single candidate document as input and directly generates a relevance score, whereas a Listwise Reranker directly reorders a list of input documents in terms of their relevance.

Retriever and Pointwise Reranker. For the retriever and the pointwise reranker, a common application of KD from LLMs is the generation of pseudo-queries for given documents. This approach expands the pairwise data, enhancing the training of dense retrievers or rerankers. For example, InPars (Bonifacio et al., 2022) utilizes GPT-3 to generate multiple pseudo-queries for an unlabeled document. To ensure the relevance of these queries, the system filters them based on the highest log-probabilities of generating a query conditioned on the document. Subsequently, InPars fine-tunes a reranker based on monoT5 (Raffel et al., 2020). A similar approach, Promptagator (Dai et al., 2023b), introduces a few-shot dense retrieval method that leverages a small number of demonstrations from the target domain for pseudo-query generation. Diverging from the reliance on unlabeled documents, Sachan et al. (2022) distill knowledge from GPT-4 to curate diverse synthetic data for text embedding tasks across nearly 100 languages, fine-tuning powerful decoder-only LLMs, such as Mistral-7B (Jiang et al., 2023a), on this synthetic data using a standard contrastive loss. Remarkably, this method demonstrates strong performance on text embedding and multilingual retrieval benchmarks without any labeled data. Beyond generating pseudo-queries, teacher LLMs can also be employed to generate relevance scores as soft labels; these scores are used to train the retriever by minimizing the KL-divergence between the teacher and student distributions, as explored by Sachan et al. (2023).
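A minimal sketch of the InPars-style pseudo-query recipe just described: generate one query per unlabeled document, then keep only the pairs whose queries received the highest generation log-probabilities. The generation function is a stub; a real client would request token log-probs from the completion API.

```python
def teacher_generate_with_logprobs(prompt):
    """Stub: returns (query_text, [token log-probs])."""
    return "what is dense retrieval?", [-0.3, -0.9, -0.2]

def make_pseudo_queries(documents, keep_ratio=0.5):
    """InPars-style augmentation: one query per document, filtered by
    the mean token log-probability the teacher assigned to it."""
    scored = []
    for doc in documents:
        q, lps = teacher_generate_with_logprobs(
            f"Write a search query answered by this passage:\n{doc}")
        scored.append((sum(lps) / len(lps), q, doc))
    scored.sort(reverse=True)            # most confident queries first
    keep = scored[:max(1, int(len(scored) * keep_ratio))]
    return [(q, doc) for _, q, doc in keep]
```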
Listwise Reranker. A distinct set of studies focuses on listwise reranking, whose advantage lies in comparing multiple documents simultaneously to determine the optimal reordering. RankGPT (Sun et al., 2023a) leverages GPT-4 to generate permutations for a group of candidate passages. To distill this listwise ranking knowledge into a pointwise student reranker, various training loss functions are employed, such as listwise cross-entropy (Bruch et al., 2019), RankNet (Burges et al., 2005), and LambdaLoss (Wang et al., 2018). Building upon RankGPT's framework, RankVicuna (Pradeep et al., 2023a) and RankZephyr (Pradeep et al., 2023b) further refine this approach by directly fine-tuning a listwise reranker using teacher-generated textual permutations. This enables the student reranker to produce sequences of ranked results directly, bypassing the intermediate step of calculating individual relevance scores.
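To illustrate how a teacher permutation is distilled into a pointwise scorer, the sketch below implements the RankNet option mentioned above: every ordered pair in the teacher's ranking becomes a pairwise preference. It assumes the student emits one scalar score per passage; batching and weighting (as in LambdaLoss) are omitted.

```python
import torch
import torch.nn.functional as F

def ranknet_from_permutation(student_scores: torch.Tensor,
                             teacher_order: list) -> torch.Tensor:
    """student_scores: (num_passages,) relevance scores from the student.
    teacher_order: passage indices ranked by the teacher (best first).
    Each pair (i ranked above j) contributes -log sigmoid(s_i - s_j)."""
    losses = []
    for a in range(len(teacher_order)):
        for b in range(a + 1, len(teacher_order)):
            s_hi = student_scores[teacher_order[a]]
            s_lo = student_scores[teacher_order[b]]
            losses.append(-F.logsigmoid(s_hi - s_lo))
    return torch.stack(losses).mean()

loss = ranknet_from_permutation(torch.randn(4), [2, 0, 3, 1])
```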
4.4.4 Recommendation

Recommender systems are integral to enhancing user experience in various online services, providing personalized content based on user preferences and behaviors. Many works have demonstrated that LLMs can be directly used as recommenders without fine-tuning (Wang et al., 2023i; Dai et al., 2023c) or can generate auxiliary textual features to benefit recommender systems (Xi et al., 2023; Wang et al., 2023j; Ren et al., 2023; Wei et al., 2024). However, the real-time nature of online recommender systems demands rapid response times, posing a challenge given the inherent inference latency of LLMs. To address this, several studies have explored ways to distill and integrate the knowledge from LLMs into recommender systems, thereby leveraging their advanced capabilities while mitigating latency issues for efficient real-time recommendations (Mysore et al., 2023; Zhang et al., 2023b; Liu et al., 2023c).

Mysore et al. (2023) tackle data scarcity in narrative-driven recommendation (NDR), where users provide detailed descriptions of their preferences. They utilize GPT-3 to create synthetic narrative queries from user-item interactions via few-shot prompting, then distill this data into retrieval models for NDR. Similarly, GENRE (Liu et al., 2023c) employs GPT-3.5 to augment datasets with new knowledge about news summarization, user profiles, and personalized content, aiding the training of content-based recommendation models. To bridge the gap between language models and recommender systems, some research views behavior modeling as an extension of language modeling (Cui et al., 2022; Liu et al., 2023k). InstructRec (Zhang et al., 2023b), for instance, interprets recommendation as instruction following. They use ChatGPT to distill a wealth of user-personalized instruction data reflecting diverse preferences and intentions based on real historical interactions. This data is then used to fine-tune a 3B student language model specifically for recommendation purposes.

4.4.5 Text Generation Evaluation

Text generation evaluation, i.e., NLG evaluation, focuses on assessing the quality of generated content. Unlike traditional NLG evaluation metrics like BLEU (Papineni et al., 2002) or ROUGE (Lin, 2004), which primarily rely on surface-level text comparisons, LLMs, trained on extensive corpora and refined through techniques like RLHF, offer a more human-aligned assessment. This sophistication has led to the increasing use of LLMs in NLG evaluation (detailed further in Li et al. (2024b)). Through KD of LLMs, student evaluators can achieve better inference efficiency and more flexible, highly customized evaluation (Wang et al., 2023b; Kim et al., 2024; Xu et al., 2023d; Jiang et al., 2023c; Li et al., 2024a).

PandaLM (Wang et al., 2023b) concentrates on a pairwise evaluator designed to compare two pieces of generated content. It utilizes a teacher LLM (GPT-3.5) to judge which response is better for a given instruction and input, providing reasons for its decision. Addressing the need for customized and flexible criteria to meet realistic user demands, Prometheus (Kim et al., 2024) distills GPT-4 to construct a training dataset that includes reference answers and a variety of customized scoring rubrics; this dataset is then used to tune LLaMA for evaluating model-generated responses. InstructScore (Xu et al., 2023d) takes a more fine-grained approach, using GPT-4 to create detailed analysis data that is employed to tune LLaMA to perform error analysis on generated texts against reference texts; the system further refines its evaluation capabilities through self-training on real model-generated response-reference pairs. For reference-free evaluation across diverse domains, TigerScore (Jiang et al., 2023c) samples data from a variety of text generation datasets, such as summarization, translation, and data-to-text; it distills error-analysis knowledge from GPT-4 and uses this to fine-tune LLaMA. Lastly, to adapt evaluation to real-world scenarios beyond conventional NLP tasks, Auto-J (Li et al., 2024a) collects real-world user queries and their evaluations from a teacher LLM. This massive dataset of real-world scenarios is then used to distill evaluation knowledge into LLaMA through fine-tuning, enhancing its practical applicability.

4.4.6 Code

LLMs, trained on extensive corpora containing code, are noted for their proficiency in code-related tasks. Their capabilities extend beyond direct code generation to the provision of external knowledge and data, which is crucial for distilling their expertise into smaller, more efficient models. Several works have successfully distilled code knowledge from LLMs into compact, specialized code models (Chaudhary, 2023; Rozière et al., 2023; Gunasekar et al., 2023; Wei et al., 2023; Chen et al., 2023a; Liu et al., 2023d; Yu et al., 2024; Jain et al., 2023; Su and McMillan, 2023; Guo et al., 2023d).

A primary focus of these student code models is code generation, a task of both common utility and practical significance. For instance, Code Alpaca (Chaudhary, 2023) fine-tunes LLaMA using self-instruct with ChatGPT-distilled instructions specifically for code generation tasks. Similarly, Code Llama-Instruct (Rozière et al., 2023) is fine-tuned via self-instruct by prompting Llama 2 (Touvron et al., 2023) with coding problems, and is further refined with unit tests. Phi-1 (Gunasekar et al., 2023) aims to enhance the quality of distilled code data by extracting "textbook quality" data from a teacher LLM, incorporating Python textbook and exercise data. Magicoder (Wei et al., 2023) addresses potential biases in teacher LLMs by referencing a wealth of open-source code, yielding more diverse and grounded data for code generation.
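A recurring ingredient in these code pipelines is execution-based filtering, e.g., the unit-test refinement in Code Llama-Instruct: a teacher-written solution is kept as training data only if it passes the tests. A toy sketch follows; the `teacher` call is a stub, and real pipelines execute candidates in a sandbox, never with a bare `exec`.

```python
def teacher(prompt):  # stub for the teacher LLM
    return "def add(a, b):\n    return a + b"

def passes_tests(code: str, tests: str) -> bool:
    """Run candidate code against unit tests in a scratch namespace.
    NOTE: for illustration only; use a sandbox in any real setting."""
    ns = {}
    try:
        exec(code, ns)          # define the candidate solution
        exec(tests, ns)         # assertions raise on failure
        return True
    except Exception:
        return False

problem = "Write a function add(a, b) returning their sum."
tests = "assert add(1, 2) == 3\nassert add(-1, 1) == 0"
candidate = teacher(problem)
if passes_tests(candidate, tests):
    sample = {"instruction": problem, "response": candidate}
```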
To account for the capability of the student model and leverage the feedback of the teacher, PERsD (Chen et al., 2023a) introduces a personalized distillation method in which the teacher LLM refines the student's generated code based on execution feedback from the executor.

However, these models primarily target the code generation task, lacking generalizability across a broader range of code-related tasks. To address this issue, MFTCoder (Liu et al., 2023d) utilizes self-instruct to distill diverse code data from teacher LLMs for various tasks, such as code completion and text-to-code generation, training a student model via multi-task learning. WaveCoder (Yu et al., 2024), in contrast, creates a comprehensive instruction-tuning dataset covering four universal code-related tasks distilled from GPT-3.5-turbo. WaveCoder first selects a diverse coreset of raw data using the KCenterGreedy (Sener and Savarese, 2018) clustering method, then employs the teacher LLM to generate task definitions and outputs; the teacher model also plays a role in evaluating and filtering this data. Notably, WaveCoder demonstrates superior generalization across different code-related tasks compared to other open-source models.

4.5 Multi-Modality

Multimodal Large Language Models (MLLMs) surpass traditional language-only LLMs by understanding and processing information across multiple modalities, more closely mirroring human perception and enabling a broader range of real-world applications. There is a growing trend towards developing MLLMs that follow multimodal instructions, facilitating tasks with enhanced levels of interactivity. To address the scarcity of multimodal instruction-following data and to harness the commonsense and world knowledge embedded in teacher LLMs, numerous studies have focused on multimodal knowledge distillation from LLMs (Liu et al., 2023e; Zhao et al., 2023b; Wang et al., 2023e; Chen et al., 2023c; Park et al., 2023; Pi et al., 2023; Zhao et al., 2023c; Liu et al., 2023f; Wu et al., 2023b; Luo et al., 2023d; Jiang et al., 2023d; Li et al., 2023c; Xu et al., 2023e).

Vision-Language. In the vision-language domain, LLaVA (Liu et al., 2023e) pioneers the extension of the self-instruct approach from the language field to the multimodal field. It translates images into textual descriptions, including captions and bounding boxes, and distills GPT-4 to generate new data in the context of seed examples. This approach creates the LLaVA-Instruct-150k dataset, which serves as the foundation for further developments like LLaVA-1.5 (Liu et al., 2023l) and GPT4RoI (Zhang et al., 2023e), enhancing the instruction-following capabilities of MLLMs. To expand the dataset scale, SVIT (Zhao et al., 2023b) introduces a 4.2-million-image dataset distilled from GPT-4 by leveraging manual image annotations, employing a novel data recipe to select an informative, diverse, and balanced subset of training data. LVIS-Instruct4V (Wang et al., 2023e) leverages GPT-4V (OpenAI, 2023), a powerful large multimodal model, as a teacher to distill a more accurate and context-aware instruction-following dataset, focusing on fine-grained understanding. Further advancements include integrating specific region referencing into image-based instruction following. For instance, Shikra (Chen et al., 2023c) uses GPT-4 to distill referential question-answer pairs from the Flickr30K (Plummer et al., 2015) dataset, enhancing the understanding of referential regions within images. LSKD (Park et al., 2023) introduces localized references to specific image regions, prompting the teacher LLM to generate commonsense inferences about these areas. To enhance the visual instruction-tuning pipeline with text-rich images, LLaVAR (Zhang et al., 2023d) employs the text-only GPT-4 as a teacher, using recognized texts and image captions to generate 16K conversation pairs for text-rich images. The resulting student MLLM demonstrates enhanced interaction skills on content that combines text and imagery.
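The common thread in these vision-language pipelines is that images are first verbalized (captions, boxes, recognized text) so that a text-only teacher can be prompted for new instruction-response pairs. The sketch below is an illustrative reduction of that recipe with stubbed inputs and prompts, not LLaVA's actual implementation.

```python
def verbalize(image_meta):
    """Turn image annotations into text a text-only teacher can read;
    `image_meta` is assumed to carry captions and bounding boxes."""
    caps = "; ".join(image_meta["captions"])
    boxes = ", ".join(f"{o['label']}@{o['box']}"
                      for o in image_meta["objects"])
    return f"Captions: {caps}\nObjects: {boxes}"

def teacher(prompt):
    return "Q: What is on the table? A: A red mug."  # stub completion

def make_visual_instructions(dataset):
    """LLaVA-style: prompt the teacher with the verbalized image to
    obtain new instruction-response pairs for that image."""
    pairs = []
    for meta in dataset:
        qa = teacher("Given this image description, write a question a "
                     "user might ask and a detailed answer.\n"
                     + verbalize(meta))
        pairs.append({"image_id": meta["id"], "conversation": qa})
    return pairs

demo = [{"id": 1, "captions": ["a red mug on a table"],
         "objects": [{"label": "mug", "box": (10, 20, 50, 60)}]}]
print(make_visual_instructions(demo))
```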
Multiple Modalities. To extend knowledge distillation of LLMs to more modalities, such as audio and video, several innovative approaches have been introduced. These methods typically involve transforming these modalities into a textual format comprehensible to teacher LLMs, followed by distillation from the teacher. Macaw-LLM (Lyu et al., 2023) leverages GPT-4 to generate instruction-response pairs corresponding to the content of images or videos. MIMIC-IT (Li et al., 2023f) aims to broaden the scope to language, image, and video understanding, creating a substantial dataset with 2.8 million multimodal instruction-response pairs distilled from ChatGPT. ChatBridge (Zhao et al., 2023d), on the other hand, represents a novel approach in multimodal language modeling: it translates various non-textual modalities into text, combining fine-grained and global descriptions, and uses this information to distill responses from ChatGPT or GPT-4 through an in-context learning process, effectively bridging the gap between different modalities.

Others. Beyond distilling instruction-following data, several methods concentrate on harnessing different aspects of knowledge from LLMs. For instance, EMMA (Yang et al., 2023c) trains an MLLM to act as an embodied reflex agent within a visual environment. It achieves this by distilling GPT-4's skills in a parallel textual world, generating actions and providing reflective feedback. Silkie (Li et al., 2023h) takes a unique approach by distilling preferences from GPT-4V, focusing on criteria like helpfulness and visual faithfulness. Ha et al. (2023) represent another innovative direction, generating, labeling, and distilling diverse robot-centric exploration experiences from LLMs into a multi-task visuo-linguo-motor policy.

5 DOMAIN-SPECIFIED VERTICAL DISTILLATION

This section shifts from skill distillation to examine KD of LLMs in various vertical domains, including Law, Medical & Healthcare, Finance, and Science. It delves into customizing distilled LLMs for these fields, showing their significant role in enhancing domain-specific AI applications. The taxonomy of these works is shown in Figure 7.
[Fig. 7 (excerpt): taxonomy of domain-specific vertical distillation. Law: LawyerLLaMA (Huang et al., 2023b), LawGPT (Cui et al., 2023b), Fuzi (Wu et al., 2023d). Medical and Healthcare: HuatuoGPT (Zhang et al., 2023c), HuatuoGPT-II (Chen et al., 2023d), DoctorGLM (Xiong et al., 2023), AlpaCare (Zhang et al., 2023f), HuaTuo (Wang et al., 2023a), ChatDoctor (Li et al., 2023i), MedAlpaca (Han et al., 2023), PMC-LLaMA (Wu et al., 2023e), DISC-MedLLM (Bao et al., 2023a).]

5.1 Law

Law holds a crucial position in molding societies, overseeing human interactions, and ensuring justice prevails. Informed decision-making, legal interpretation, and the provision of legal advice by professionals hinge on precise and current information. Legal intelligent applications in different scenarios usually require combinations of multiple fundamental capabilities: legal text retrieval, understanding, reasoning, and generation (Zhang et al., 2023g; Sun, 2023; Lai et al., 2023). Legal terminology, subtle interpretations, and the constant evolution of legislation present distinctive challenges that demand customized solutions. To handle these challenges, several studies have investigated the customization of LLMs for intelligent legal services (Cui et al., 2023b; Yue et al., 2023b; Huang et al., 2023b; Wu et al., 2023d). This typically involves continued pre-training on extensive legal corpora, followed by fine-tuning with self-constructed instructions or data augmented by advanced LLMs.

Huang et al. (2023b) unveil a Chinese legal large model named LawyerLLaMA. The model undergoes an initial pre-training phase on an extensive legal corpus, systematically assimilating knowledge of the Chinese legal system. Fine-tuning then proceeds through the analysis of objective questions from the Chinese National Judicial Examination (Zhong et al., 2020) and the gathering of responses to legal consultations using ChatGPT, equipping the model with the ability to apply legal knowledge to specific scenarios. Cui et al. (2023b) present LawGPT, built upon the foundation of OpenLLaMA. The model is trained using a construction process that incorporates real-world legal text, legal regulations, judicial interpretations, and actual legal consultation data; additionally, the authors utilize the ChatGPT API for assisted construction, enabling the generation of supplementary data derived from the existing dataset. Wu et al. (2023d) develop a large-scale Chinese legal model (named Fuzi) with ChatGLM as its foundation. This model is trained on an extensive Chinese legal corpus, incorporating unsupervised judicial language data, including diverse judgment documents and legal regulations. Additionally, it undergoes supervised judicial fine-tuning with data encompassing legal QA and case retrieval. Fuzi's training also involves both general instruction fine-tuning datasets, such as Alpaca, and domain-specific instruction fine-tuning datasets from LawyerLLaMA (Huang et al., 2023b) and LawGPT (Cui et al., 2023b).

5.2 Medical and Healthcare

The integration of LLMs holds great potential for transforming medicine and healthcare. Extensive research has focused on adapting general-purpose LLMs to the medical domain (Singhal et al., 2023), such as electronic health records, and to healthcare applications like patient care (Zhu et al., 2023). Recent work has focused on enhancing medical instruction-following data with advanced teacher LLMs to better align with complex user instructions. Given the abundance of medical data, most studies combine real-world data with distilled instruction data from teacher LLMs (Zhang et al., 2023c; Xiong et al., 2023; Zhang et al., 2023f; Wang et al., 2023a; Li et al., 2023i; Han et al., 2023; Wu et al., 2023f; Bao et al., 2023a; Chen et al., 2023d).

While existing studies predominantly concentrate on training with dedicated medical dialogue datasets comprising medical textbooks (Wu et al., 2023e), biomedical papers (Luo et al., 2023e), medical knowledge graphs (Bao et al., 2023b), or authentic doctor-patient interactions (Bao et al., 2023b), an expanding body of research is delving into the augmentation of medical instruction-following data with advanced LLMs to enhance alignment with practical user instructions. Zhang et al. (2023c) introduce HuatuoGPT, specifically tailored for medical consultations. The model leverages both distilled data from ChatGPT and real-world data from doctors during the supervised fine-tuning stage. In a parallel effort, Xiong et al. (2023) construct a dataset of medical dialogues in Chinese with ChatGPT's assistance and employ various techniques to train DoctorGLM, an easily deployable LLM designed for tasks such as diagnosis, drug recommendation, and other medical advice. Zhang et al. (2023f) fine-tune LLaMA-series models using 52k diverse, machine-generated, medical instruction-following data named MedInstruct-52k. This effort results in AlpaCare, a model demonstrating robust medical proficiency and generalizability across both general and medical-specific free-form instruction evaluations. In a different vein, Wang et al. (2023a) propose HuaTuo, a LLaMA-based model that undergoes supervised fine-tuning with generated QA instances, endowing the model with more reliable medical knowledge. Li et al. (2023i) introduce ChatDoctor, which is first trained as a generic conversation model based on LLaMA.
It utilizes 52K instruction-following data from Stanford University's Alpaca project (Taori et al., 2023). Subsequently, the conversation model undergoes fine-tuning on a dataset of 100K patient-physician conversations collected from an online medical consultation website. This two-step training process underscores the model's adaptability to diverse conversational contexts, particularly those specific to patient-physician interactions.

Built upon existing datasets, MedAlpaca (Han et al., 2023) proposes to reconstruct the data with GPT-3.5-Turbo, which is then used to fine-tune LLMs for effective medical applications. Furthermore, PMC-LLaMA (Wu et al., 2023f) proposes a training framework (i.e., continual pre-training and domain-specific multi-task supervised fine-tuning) to adapt a general LLM to the medical domain, where GPT-4 is leveraged to write synonymous sentences for data augmentation in the SFT stage. To adapt LLMs to real-world medical consultation, DISC-MedLLM (Bao et al., 2023a) leverages GPT-3.5 to 1) construct 50K QA pairs in a few-shot manner and 2) regenerate 420k dialogues based on real cases, which are then used to train LLMs in a supervised fine-tuning manner. More recently, HuatuoGPT-II (Chen et al., 2023d) proposes one-stage training with instruction-format unification of the collected domain data for medical adaptation of LLMs, where GPT-4 is used to reformulate medical questions into fine-tuning instructions.

These diverse studies collectively contribute to the advancing field of medical-domain distillation, facilitated by knowledge transfer from advanced LLMs. Through the exploration of various methodologies, these approaches provide valuable insights into the challenges and potential breakthroughs at the intersection of cutting-edge language models and medical applications.

5.3 Finance

The application of LLMs to the finance domain (Xue et al., 2023) significantly transforms how financial data is analyzed, decisions are made, and customer interactions are managed. In finance, LLMs offer unprecedented capabilities in understanding complex financial documents, predicting market trends, and automating risk assessment, thus enabling more informed and faster decision-making processes. By processing and analyzing vast amounts of unstructured financial data, such as news articles, reports, and real-time market feeds, LLMs can identify patterns and insights that were previously inaccessible, leading to more accurate forecasts and strategic financial planning. Furthermore, LLMs enhance customer experiences through personalized financial advice, automated customer service, and sophisticated chatbots that can handle complex queries. This level of automation and insight has the potential to increase efficiency, reduce operational costs, and improve compliance and risk-management practices in financial institutions, making LLMs a transformative force in the finance sector. Knowledge distillation from proprietary LLMs is still under-explored here, and most existing works focus on adapting LLMs to finance applications by continual pre-training on finance-specific corpora (Wu et al., 2023g; Lu et al., 2023) or by supervised fine-tuning on multi-task finance-specific instructions (Yang et al., 2023e; Xie et al., 2023b; Wang et al., 2023k). Specifically, XuanYuan (Zhang and Yang, 2023) leverages self-instruct over seed data and self-QA over structured and unstructured data to generate instruction data in the finance domain, which is used to train a finance LLM.

5.4 Science

The integration of LLMs into the science domain (Taylor et al., 2022; Yin et al., 2023b) represents a paradigm shift in research, knowledge discovery, and the dissemination of scientific information. In science, LLMs are leveraged to digest and synthesize vast amounts of literature, aiding in the identification of new research opportunities and the acceleration of scientific breakthroughs. They facilitate the understanding of complex scientific concepts by summarizing research papers, generating hypotheses, and even drafting research proposals and manuscripts, thus significantly reducing the time researchers spend on literature review and enabling them to focus more on experimental work. LLMs also democratize access to scientific knowledge by providing layperson summaries of complex research findings, making science more accessible to non-experts and fostering a broader public understanding of scientific advancements. By enhancing the efficiency of research workflows and fostering interdisciplinary collaborations, LLMs are poised to accelerate the pace of scientific discovery and innovation across various fields. To distill knowledge from an LLM, the DARWIN series (Xie et al., 2023a) utilizes semi-self-instruct instruction generation over scientific papers, which is then used to fine-tune an LLM. SciGLM (Zhang et al., 2024) trains a scientific LLM by prompting a teacher LLM to generate detailed answers for unlabelled scientific questions, together with a self-reflective critic-and-revise step to improve data quality. Beyond these knowledge distillation methods for adapting LLMs to science in general, we also delve into how distillation happens in sub-domains, e.g., mathematics, astronautics, and chemistry.

Mathematics. The application of LLMs within the sub-domain of mathematics heralds a transformative era in mathematical research, education, and problem-solving (Azerbayev et al., 2023; Yu et al., 2023b). LLMs in mathematics facilitate the exploration and understanding of complex mathematical theories and problems by providing intuitive explanations, proofs, and solutions that can bridge the gap between advanced mathematical concepts and learners at various levels. These models have shown potential in conjecturing new mathematical theorems and patterns, thus opening new avenues for research and discovery that might not have been readily accessible to humans alone. In education, they serve as personalized tutors, offering students step-by-step guidance through mathematical problems and adapting explanations to the learner's level of understanding. This democratizes access to high-quality mathematical education and fosters a deeper appreciation and understanding of mathematics among a broader audience. By enhancing collaborative efforts through the generation of new ideas and the simplification of complex concepts, LLMs are poised to significantly advance the field of mathematics, making it more accessible, efficient, and innovative.
Mathematics. The application of LLMs within the sub-domain of mathematics heralds a transformative era in mathematical research, education, and problem-solving (Azerbayev et al., 2023; Yu et al., 2023b). LLMs in mathematics facilitate the exploration and understanding of complex mathematical theories and problems by providing intuitive explanations, proofs, and solutions that can bridge the gap between advanced mathematical concepts and learners at various levels. These models have shown potential in conjecturing new mathematical theorems and patterns, thus opening new avenues for research and discovery that might not have been readily accessible to humans alone. In education, they serve as personalized tutors, offering students step-by-step guidance through mathematical problems and adapting explanations to the learner’s level of understanding. This democratizes access to high-quality mathematical education and fosters a deeper appreciation and understanding of mathematics among a broader audience. By enhancing collaborative efforts through the generation of new ideas and the simplification of complex concepts, LLMs are poised to significantly advance the field of mathematics, making it more accessible, efficient, and innovative. WizardMath (Luo et al., 2023b) enhances the mathematical reasoning capabilities of Llama-2 by applying the novel Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method, significantly outperforming other open-source LLMs on the GSM8k and MATH benchmarks, as well as surpassing several closed-source LLMs including ChatGPT-3.5 and Minerva. MAmmoTH (Yue et al., 2023a) is a series of open-source LLMs specifically developed for general math problem-solving, achieving superior performance on nine mathematical reasoning datasets. Utilizing a novel instruction tuning dataset called MathInstruct, which combines chain-of-thought and program-of-thought rationales, MAmmoTH models demonstrate substantial improvements over existing models. TORA (Gou et al., 2024), a series of Tool-integrated Reasoning Agents, significantly advances mathematical problem-solving by combining natural language reasoning with the use of external computational tools. It markedly outperforms existing open-source models on 10 mathematical reasoning datasets, showcasing notable improvements over both rationale-based and program-based approaches, and introduces innovative training techniques such as output space shaping to enhance model reasoning capabilities. G-LLaVA (Gao et al., 2023c) introduces a significant advancement in geometric problem-solving for LLMs by leveraging a multimodal approach that combines text and image data. This model, utilizing the Geo170K dataset comprising over 170,000 geometric image-caption and question-answer pairs, demonstrates remarkable improvements over GPT-4V on the MathVista benchmark.
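To illustrate the program-of-thought rationale format that MathInstruct pairs with chain-of-thought, the reasoning for a word problem is written as executable code whose output is the answer. The example below is our own illustration, not an item drawn from the dataset itself.

# A program-of-thought rationale: the reasoning is expressed as executable
# code, and running it yields the final answer.
# Problem: A store sells pens at $1.50 each. If Ana buys 12 pens and pays
# with a $20 bill, how much change does she receive?

pen_price = 1.50
num_pens = 12
paid = 20.00

total_cost = pen_price * num_pens      # 18.00
change = paid - total_cost             # 2.00
print(change)                          # -> 2.0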
Astronautics. The application of LLMs in astronautics (Nguyen et al., 2023) propels the field forward. AstroLLaMA-Chat (Perkowski et al., 2024) is an advancement of the AstroLLaMA model, leveraging a 7B-parameter LLaMA-2 model and targeted continual pre-training on a curated astronomy corpus to enhance performance in astronomy-focused question-answering. This model demonstrates significant improvements in specialized topic comprehension and introduces a chat-enabled version for the astronomy community, highlighting the effectiveness of domain-specific knowledge distillation in achieving superior performance on specialized topics.
Chemistry and Materials Science. The integration of LLMs into Chemistry and Materials Science has revolutionized the way researchers approach the discovery and development of new compounds and materials. By analyzing vast datasets and scientific literature, LLMs can predict the properties and behaviors of substances, significantly accelerating the innovation cycle.

GIMLET (Zhao et al., 2023f), Graph Instruction based MolecuLe zEro-shoT learning, is a novel approach to molecule property prediction that integrates graph and text data within a single language model framework, aiming to improve instruction-based zero-shot learning for molecular tasks. By leveraging a transformer mechanism with generalized position embedding and decoupled attention, GIMLET significantly outperforms traditional molecule-text baselines in zero-shot learning scenarios, demonstrating the model’s effectiveness in generalizing from instructions to a broad range of molecule-related tasks without prior explicit task-specific training. LLM-Prop (Rubungo et al., 2023), leveraging the T5 model, showcases how LLMs can outperform SoTA graph neural networks in predicting the physical and electronic properties of crystalline solids from text descriptions. This approach underscores the potential of text-based methods in materials science, offering significant improvements in prediction accuracy while also contributing a benchmark dataset, TextEdge, to foster further research in this emerging field. InstructMol (Cao et al., 2023a) integrates multi-modal data, aligning molecular structures with natural language instructions for drug discovery tasks. Through a novel two-stage instruction-tuning approach, it significantly enhances performance in molecule-related tasks, establishing a reliable molecular assistant that outperforms existing LLMs and reduces the performance gap with specialized models. This demonstrates the value of multi-modal integration in developing versatile tools for complex domains like drug discovery.
Biology. In the field of Biology, particularly in the study of proteins, DNA, and RNA, LLMs are revolutionizing our understanding of the fundamental molecules of life. By analyzing vast datasets of biological sequences and structures, LLMs can predict the three-dimensional shapes of proteins, potential functions, and interactions at a scale and speed beyond traditional computational methods. This capability is critical for unraveling the complexities of biological systems, advancing drug discovery by identifying targets and designing molecules with high precision, and understanding genetic diseases through the interpretation of genomic variations.

Prot2Text (Abdine et al., 2023) introduces a novel multimodal framework for generating protein function descriptions in free text by combining GNNs and LLMs. This approach, which integrates structural and sequential protein information, highlights the transformative impact of knowledge distillation through the fusion of GNNs and LLMs for accurate protein function prediction, potentially revolutionizing research in bioinformatics and biological sciences. BioMedGPT (Luo et al., 2023e) introduces a multimodal generative pre-trained transformer specifically designed for the biomedicine domain, emphasizing the significance of aligning molecular, protein, and natural language modalities to enhance biomedical question-answering, molecule, and protein QA tasks. This framework showcases the critical role of knowledge distillation in bridging the gap between complex biological data and human language, thereby facilitating groundbreaking advancements in drug discovery and therapeutic target identification. xTrimoPGLM (Chen et al., 2024e), a unified 100B-scale pre-trained transformer model, addresses both protein understanding and generation tasks by integrating autoencoding and autoregressive pre-training objectives. Its significant advancements over existing models in 18 protein understanding benchmarks and its capability in de novo protein sequence generation highlight the model’s importance in advancing the field of protein science through knowledge distillation.
Geography, Geology, and Environmental Science. The integration of LLMs into Geography, Geology, and Environmental Science is revolutionizing these fields by enhancing data analysis, predictive modeling, and interdisciplinary research (Roberts et al., 2023; Lin et al., 2023b; Wang et al., 2023l). K2 (Deng et al., 2023), the first-ever LLM specialized in the geoscience domain, demonstrates the significant impact of knowledge distillation in vertical domain specialization. By adapting the general-domain LLaMA-7B model with a 5.5B-token geoscience corpus and introducing the GeoSignal instruction tuning dataset, K2 showcases enhanced performance in geoscience knowledge understanding and utilization. The model’s development highlights a novel approach to efficiently gathering domain-specific data and aligning model responses with specialized user queries. OceanGPT (Bi et al., 2023), introduced as the first LLM for ocean science tasks, underscores the vital role of knowledge distillation in the vertical domain of oceanography. It leverages DOINSTRUCT, a novel framework for generating domain-specific instruction data through multi-agent collaboration, and establishes OCEANBENCH, a benchmark for evaluating LLMs in the ocean domain. MarineGPT (Zheng et al., 2023b) showcases the transformative potential of knowledge distillation in the marine domain by leveraging a novel vision-language model tailored for marine science. Utilizing the Marine-5M dataset, which includes over 5 million marine image-text pairs, MarineGPT excels in providing detailed, accurate, and domain-specific responses. GeoGalactica (Lin et al., 2024) represents a pioneering step in specializing LLMs for geoscience, leveraging a 30-billion-parameter model pre-trained on a vast geoscience corpus; it is notable for being the largest model of its kind within the geoscience domain.

5.5 Miscellaneous

Knowledge distillation of LLMs has vast potential across various verticals beyond the ones previously discussed, highlighting its versatility and transformative impact across different industries. For instance, in the education sector, EduChat (Dan et al., 2023) exemplifies a chatbot system that provides tailored support to teachers, students, and parents. KD is central to its design, leveraging pre-training on educational data followed by fine-tuning with custom instructions to deliver capabilities such as essay evaluation and emotional support. Similarly, Owl (Guo et al., 2023b), an LLM designed for IT operations, boosts operational efficiency using the Owl-Instruct dataset, which is distilled from ChatGPT. By applying a mixture-of-adapter strategy for domain-specific tuning, it enhances analysis and performance in IT-related tasks.

6 OPEN PROBLEMS

Further Data Selection. How much data is required for LLM distillation, and how to filter out low-quality data, remain open questions. In the field of instruction tuning, one of the most commonly used methods for distillation, Zhou et al. (2023a) propose that only 1,000 human-curated high-quality examples are enough for the alignment of LLMs, hypothesizing that LLMs have already acquired the required knowledge during pretraining and that only a small amount of data is needed for alignment. This finding raises a further question: how can we automatically select data for better distillation? Chen et al. (2023e) directly apply ChatGPT to rate each data sample together with explanations, and the data is then selected based on the rating. Cao et al. (2023b) split existing instruction-tuning datasets and train a linear function to select the most effective data based on their statistical properties. Li et al. (2023j) propose a data selection pipeline similar to self-distillation, in which the LLM first learns from a small subset of the data to acquire basic ability and then uses this learned model to rate the original dataset. Du et al. (2023b) propose to consider three aspects, namely quality, coverage, and necessity, in the filtering process. Li et al. (2023k) select instruction data by evaluating the one-shot improvement each sample yields on a hold-out set. Li et al. (2024f) recently propose Superfiltering, which utilizes small language models like GPT-2 to filter out a high-quality subset from a given dataset, as sketched below. Despite the emergence of these works on data filtering, how to efficiently select the optimal distillation data for LLMs, and how much data is required for distillation, remain unsolved.
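The sketch below scores instruction-response pairs with GPT-2, in the spirit of Superfiltering. It approximates the paper's instruction-following difficulty (IFD) idea as the ratio of the response loss with versus without the instruction as context; the details here are simplified assumptions, not the authors' exact criterion.

# Sketch: small-model scoring of instruction-tuning data (Superfiltering-style).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def response_loss(context: str, response: str) -> float:
    """Mean token loss of `response`, conditioned on `context` (may be empty)."""
    ctx_ids = tok(context, return_tensors="pt").input_ids if context else None
    resp_ids = tok(response, return_tensors="pt").input_ids
    ids = resp_ids if ctx_ids is None else torch.cat([ctx_ids, resp_ids], dim=1)
    labels = ids.clone()
    if ctx_ids is not None:
        labels[:, : ctx_ids.shape[1]] = -100  # only score the response tokens
    return lm(ids, labels=labels).loss.item()

def ifd_score(instruction: str, response: str) -> float:
    # A low ratio means the instruction genuinely helps predict the response;
    # ratios near or above 1 carry little instruction-following signal.
    return response_loss(instruction, response) / response_loss("", response)

# Rank pairs by this score to pick a compact training subset.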
Reduce the Distillation Cost (Lightweight Methods). Despite the remarkable abilities of the latest LLMs, their significant resource requirements underscore the urgent need for efficient solutions. Common ways to further reduce the distillation cost include Model Compression and Efficient Fine-Tuning. In the realm of Model Compression, Quantization (Frantar et al., 2023; Dettmers et al., 2022; Kim et al., 2023c; Tao et al., 2022b; Yao et al., 2022; Xiao et al., 2023), Parameter Pruning (Ma et al., 2023d; Zhang et al., 2023h; Frantar and Alistarh, 2023), and Low-Rank Approximation (Xu et al., 2023g; Li et al., 2023l) are commonly utilized. In the realm of Efficient Fine-Tuning, Parameter-Efficient Fine-Tuning (Hu et al., 2023b; Liu et al., 2022c; Wang et al., 2022b; Hu et al., 2021; Li and Liang, 2021; Liu et al., 2022d) and Memory-Efficient Fine-Tuning (Dettmers et al., 2023; Kim et al., 2023d; Malladi et al., 2024) are widely used; a representative parameter-efficient recipe is sketched below. A detailed survey on efficient LLMs can be found in Wan et al. (2024b). The remaining problem is how to compress the model further while building effective distillation algorithms.
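The following is a minimal sketch of parameter-efficient fine-tuning of a student on distilled data with LoRA, using the Hugging Face peft library. The model name and hyperparameters are illustrative assumptions, not a recommendation from the surveyed works.

# Sketch: LoRA-based student fine-tuning (parameter-efficient).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt only attention projections
    task_type="CAUSAL_LM",
)
student = get_peft_model(base, config)
student.print_trainable_parameters()  # typically well under 1% of the model

# `student` can now be trained with any standard SFT loop; only the small
# LoRA adapter matrices receive gradients, cutting memory and compute.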
Multi-Teacher Distillation. Most existing distilled models are distilled from a single teacher model; however, it is widely accepted that models trained on different sources of data have different capabilities. A question thus arises: is it possible to distill knowledge from different teacher models into one student model? BabyLlama (Timiryasov and Tastet, 2023) proposes to distill knowledge from both GPT-2 and LLaMA into small-size student models. Ensemble-Instruct (Lee et al., 2023b) generates both instructions and responses from an ensemble of several different LLMs, using ROUGE-L as the selection indicator. FuseLLM (Wan et al., 2024a) externalizes the collective knowledge and unique strengths of different LLMs by leveraging their generative distributions, aiming to train a student model that surpasses any individual source LLM. Despite this recent progress, multi-teacher distillation remains under-explored; a minimal form of logit-level teacher ensembling is sketched below.
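The sketch below distills from two teachers at once by averaging their next-token distributions into a single soft target for the student. It assumes all models share one tokenizer; FuseLLM's actual fusion of generative distributions is more elaborate than this.

# Sketch: multi-teacher distillation via averaged soft targets.
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, T=2.0):
    # Average the teachers' temperature-softened distributions.
    teacher_probs = torch.stack(
        [F.softmax(t / T, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # KL(teacher mixture || student), scaled by T^2 as in standard KD.
    return F.kl_div(log_student, teacher_probs, reduction="batchmean") * T * T

# Usage: loss = multi_teacher_kd_loss(s_logits, [t1_logits, t2_logits])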
Explore Richer Knowledge from Teacher LLMs. As indicated in Table 3, the majority of teacher LLMs are closed-source due to their advanced capabilities. Consequently, current methodologies primarily focus on using the generations from these models as hard labels, training student models through simple supervised fine-tuning. However, beyond the straightforward imitation of output behaviors via hard labels, there is growing interest in harnessing richer knowledge from teacher LLMs, including feedback and feature knowledge, as well as in exploring diverse combinations of knowledge elicitation methods. As highlighted in the Feedback section, teachers can provide various types of feedback on the student's outputs (Lee et al., 2023a; Jiang et al., 2023b; Chen et al., 2023a). Similarly, the Feature section discusses how feature-based knowledge, such as logits serving as soft labels, can offer deeper, intrinsic insights into the teacher model (Gu et al., 2024; Agarwal et al., 2024). These explorations have demonstrated promising outcomes, suggesting that access to a broader spectrum of knowledge can significantly enhance student performance beyond what is achievable through simple SFT distillation alone, and highlighting the critical need for further research into varied knowledge extraction methods from teacher LLMs to augment the effectiveness of KD processes.
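One concrete way to use such feature knowledge, when white-box access to the teacher is available, is to combine the usual hard-label term with a soft-label term over the teacher's token distribution, as sketched below. The weighting and temperature are illustrative assumptions.

# Sketch: hard-label SFT plus soft-label (logit) distillation.
import torch.nn.functional as F

def sft_plus_soft_label_loss(student_logits, teacher_logits, target_ids,
                             alpha=0.5, T=1.0):
    # Hard-label term: ordinary next-token cross-entropy on teacher text.
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         target_ids.view(-1), ignore_index=-100)
    # Soft-label term: match the teacher's full token distribution.
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    return (1 - alpha) * ce + alpha * kd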
Overcoming Catastrophic Forgetting During Distillation. Previous research has delved into fine-tuning LLMs to follow instructions or to transfer knowledge to forthcoming tasks, skills, or domains. Nevertheless, investigations have revealed that continual fine-tuning of LLMs on particular datasets (skills, domains) can lead to a phenomenon known as catastrophic forgetting, wherein previously acquired knowledge and problem-solving abilities for earlier tasks are compromised (Chen et al., 2023f; Kotha et al., 2023; Koloski et al., 2023; Wu et al., 2024; Luo et al., 2023f). Earlier studies in machine learning and deep learning have investigated various techniques to mitigate forgetting during fine-tuning or continual learning, such as rehearsal, which entails periodically revisiting and training on past data (Kirkpatrick et al., 2017; Rostami et al., 2019; Rolnick et al., 2019), regularization methods like elastic weight consolidation (Lee et al., 2017), and dynamic architecture methods (Mallya et al., 2018; Wang et al., 2022c; Hu et al., 2023c; Chen et al., 2023f). To address catastrophic forgetting and enhance the diversity of generated instructions in knowledge distillation for LLMs, Jiang et al. (2023b) randomly sample an instruction from the easy instructions and also prompt the generator to produce a new instruction belonging to the same domain as the sampled one. In a similar vein, Li et al. (2023m) study instruction tuning in the knowledge distillation of multi-modal LLMs and introduce a competitive distillation framework: in the multi-modal augmentation phase, the model produces new instructions that differ in content from, but match the difficulty of, the original pictures, so as to alleviate catastrophic forgetting and enhance the diversity of the instruction-tuning pool. Chen et al. (2023f) propose the Lifelong-MoE (Mixture-of-Experts) architecture based on general language models, which dynamically adds model capacity by adding experts with regularized pretraining; the model also introduces implicit regularization via distillation of the knowledge from old experts and gatings to effectively preserve old knowledge. Zeng et al. (2023b) propose a new generative rehearsal method, Dirichlet Continual Learning (DCL), which combines task distribution modeling and knowledge distillation to mitigate catastrophic forgetting without requiring access to the old data. To evaluate the effectiveness of instruction tuning in continual learning settings, Zhang et al. (2023i) introduce a more challenging yet practical problem called Continual Instruction Tuning (CIT) and establish a benchmark suite consisting of learning and evaluation protocols. Although current research has explored some simple methods to alleviate knowledge forgetting during fine-tuning or knowledge distillation, effectively avoiding catastrophic forgetting across domains and skills, and retaining the original model's capabilities during distillation or transfer, remain challenging problems. The simplest baseline, rehearsal, is sketched below.
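The following is a minimal sketch of rehearsal during distillation: a fraction of general-domain data is mixed back into each batch so the student does not overwrite old skills. The 10% replay ratio is an illustrative assumption, not a result from the surveyed works.

# Sketch: replay-based batch construction for distillation.
import random

def mixed_batches(distill_data, replay_data, batch_size=32, replay_frac=0.1):
    n_replay = int(batch_size * replay_frac)
    while True:
        batch = random.sample(distill_data, batch_size - n_replay)
        batch += random.sample(replay_data, n_replay)  # revisit past data
        random.shuffle(batch)
        yield batch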
Trustworthy Knowledge Distillation. Trustworthiness in LLMs is paramount, encompassing attributes such as truthfulness, safety, fairness, robustness, privacy, and adherence to machine ethics (Sun et al., 2024a). The rapid advancement of LLMs brings concerns regarding their trustworthiness to the forefront, stemming from their complex outputs, the biases present in vast training datasets, and the potential inclusion of private information. Current efforts in KD of LLMs primarily focus on distilling various skills, with relatively little attention paid to trustworthiness. Existing studies tend to concentrate on a subset of trustworthiness aspects, such as helpfulness, honesty, and harmlessness (Bai et al., 2022a; Yang et al., 2024; Cui et al., 2023a). Consequently, in the distillation process, student models may inherit trustworthiness issues from their teacher LLMs. As assessed in Sun et al. (2024a), smaller open-source LLMs generally fall short of their proprietary counterparts on trustworthiness metrics. Therefore, considering trustworthiness alongside capability during distillation is crucial: future research on KD should not only enhance the capabilities of student models but also ensure that broader aspects of trustworthiness are meticulously addressed.

Weak-to-strong Distillation. The concept of “weak-to-strong generalization” in LLMs (Burns et al., 2023) emphasizes the potential of weak supervision to elicit the advanced capabilities of more powerful models. This approach challenges the traditional distillation paradigm by suggesting that even with limited or imperfect supervision, it is possible to enhance the performance of LLMs significantly. It calls for innovative strategies that enable weaker models to guide the learning of stronger ones effectively, and for methods that can bridge the gap between these models. Such research could unlock new avenues for improving LLMs’ efficiency and effectiveness, making the pursuit of “weak-to-strong distillation” a crucial area for future investigation in this LLM era. Initially, Burns et al. (2023) investigate whether weak model supervision can unlock the full capabilities of much stronger models. Through experiments with pre-trained language models from the GPT-4 family across NLP, chess, and reward modeling tasks, they find that fine-tuning strong models on weak labels yields performance above that of the weak supervisors, demonstrating weak-to-strong generalization. Then, Li et al. (2024g) introduce Superfiltering, a method that employs smaller, weaker models like GPT-2 to select high-quality data for fine-tuning larger, more capable models such as LLaMA2, an approach rooted in the discovery of a strong consistency across model sizes in how the difficulty of instruction-tuning data is evaluated. More recently, Ji et al. (2024) introduce Aligner, a novel approach for aligning LLMs with human values and intentions by utilizing weak supervisory signals from smaller models to improve the performance of larger models. However, Burns et al. (2023) find that achieving the full capabilities of strong models requires more than naive fine-tuning, suggesting the need for further research. Open questions therefore remain: 1) What are the theoretical and practical limits of weak-to-strong distillation? Can weak supervision reliably extract and enhance the full spectrum of capabilities in stronger models across all domains, or are there inherent limitations based on model architecture or task specificity? 2) How do we identify or design the optimal weak supervisors for distilling knowledge into stronger models? Is there a framework or set of criteria to predict which weak models would be most effective at guiding the learning of more complex models on specific tasks? 3) To what extent are weak-to-strong distillation techniques transferable and scalable across different sizes and types of models? How can these methods be adapted to ensure efficacy and efficiency in distilling knowledge from very large models into significantly smaller ones, especially in resource-constrained environments?
Self-Alignment. Aligning LLMs traditionally relies heavily on humans or teacher LLMs to supply extensive preference data. Consequently, the alignment of the student model is limited by the quantity of distilled preference data and by the teacher's capabilities. Self-alignment offers a promising alternative, aiming to push alignment beyond the constraints of teacher-provided preferences. In self-alignment, the student model endeavors to autonomously improve and align its responses with desired behaviors, including generating model-written feedback, critiques, and explanations. Several studies have explored utilizing the student model's inherent capabilities to generate knowledge for alignment (Bai et al., 2022a; Sun et al., 2024b; Li et al., 2024c; Yuan et al., 2024a). Beyond merely producing improved responses (Bai et al., 2022a; Sun et al., 2024b), implementations of self-alignment include employing the student as its own reward model to offer feedback (Yuan et al., 2024a), a strategy that merges Self-Knowledge with Feedback methods of eliciting knowledge. We advocate for increasingly leveraging the student model itself to provide feedback, thereby enhancing self-alignment capabilities. This approach not only facilitates moving beyond traditional human/teacher preference-based rewards but also opens avenues for continual self-improvement and alignment.
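A schematic self-alignment loop in the spirit of the self-rewarding approach (Yuan et al., 2024a) is sketched below: the student samples several candidate responses, judges them itself, and turns the best and worst into a preference pair, e.g., for DPO training. `generate` and `judge` stand for the student's own sampling and LLM-as-a-judge scoring; both are assumed helpers, not a specific API.

# Sketch: building self-generated preference pairs for self-alignment.
def self_preference_pair(student, prompt, n_candidates=4):
    candidates = [student.generate(prompt, temperature=1.0)
                  for _ in range(n_candidates)]
    # The same model scores its own outputs, e.g., with a 1-5 rubric prompt.
    scored = sorted(candidates, key=lambda resp: student.judge(prompt, resp))
    rejected, chosen = scored[0], scored[-1]
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Preference pairs collected this way can train the next iteration of the
# student with a preference-optimization objective, closing the loop.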
Kondraciuk, A. Kondrich, A. Konstantinidis, K. Kosic,
7 C ONCLUSION AND D ISCUSSION G. Krueger, V. Kuo, M. Lampe, I. Lan, T. Lee, J. Leike,
J. Leung, D. Levy, C. M. Li, R. Lim, M. Lin, S. Lin,
This survey has explored the diverse landscape of knowl-
M. Litwin, T. Lopez, R. Lowe, P. Lue, A. Makanju, K. Mal-
edge distillation for LLMs, highlighting key techniques,
facini, S. Manning, T. Markov, Y. Markovski, B. Mar-
applications, and challenges. KD plays a crucial role in
democratizing access to advanced LLM capabilities, pro- 4. OpenAI Business Terms: https://openai.com/policies/business-
viding cutting-edge advancements without the high costs terms
29
guage models, December 2023. [Online]. Available: https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
Y. Wei, Z. Wang, J. Liu, Y. Ding, and L. Zhang, “Magicoder: Source code is all you need,” 2023.
Z. Yu, X. Zhang, N. Shang, Y. Huang, C. Xu, Y. Zhao, W. Hu, and Q. Yin, “Wavecoder: Widespread and versatile enhanced instruction tuning with refined data generation,” 2024.
J. Ye, J. Gao, Q. Li, H. Xu, J. Feng, Z. Wu, T. Yu, and L. Kong, “Zerogen: Efficient zero-shot learning via dataset generation,” in EMNLP. Association for Computational Linguistics, 2022, pp. 11653–11669.
J. Gao, R. Pi, Y. Lin, H. Xu, J. Ye, Z. Wu, W. Zhang, X. Liang, Z. Li, and L. Kong, “Self-guided noise-free data generation for efficient zero-shot learning,” in The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023, 2023. [Online]. Available: https://openreview.net/pdf?id=h5OpjGd_lo6
L. H. Bonifacio, H. Q. Abonizio, M. Fadaee, and R. F. Nogueira, “Inpars: Data augmentation for information retrieval using large language models,” CoRR, vol. abs/2202.05144, 2022. [Online]. Available: https://arxiv.org/abs/2202.05144
I. Timiryasov and J.-L. Tastet, “Baby llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty,” in Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, A. Warstadt, A. Mueller, L. Choshen, E. Wilcox, C. Zhuang, J. Ciro, R. Mosquera, B. Paranjabe, A. Williams, T. Linzen, and R. Cotterell, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 279–289. [Online]. Available: https://aclanthology.org/2023.conll-babylm.24
C. Tao, L. Hou, W. Zhang, L. Shang, X. Jiang, Q. Liu, P. Luo, and N. Wong, “Compression of generative pre-trained language models via quantization,” arXiv preprint arXiv:2203.10705, 2022.
Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, Y. Mehdad, Y. Shi, R. Krishnamoorthi, and V. Chandra, “Llm-qat: Data-free quantization aware training for large language models,” arXiv preprint arXiv:2305.17888, 2023.
Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran-Johnson, E. Perez, J. Kerr, J. Mueller, J. Ladish, J. Landau, K. Ndousse, K. Lukosuite, L. Lovitt, M. Sellitto, N. Elhage, N. Schiefer, N. Mercado, N. DasSarma, R. Lasenby, R. Larson, S. Ringer, S. Johnston, S. Kravec, S. E. Showk, S. Fort, T. Lanham, T. Telleen-Lawton, T. Conerly, T. Henighan, T. Hume, S. R. Bowman, Z. Hatfield-Dodds, B. Mann, D. Amodei, N. Joseph, S. McCandlish, T. Brown, and J. Kaplan, “Constitutional ai: Harmlessness from ai feedback,” 2022.
L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y. Belkada, S. Huang, L. von Werra, C. Fourrier, N. Habib et al., “Zephyr: Direct distillation of lm alignment,” arXiv preprint arXiv:2310.16944, 2023.
J. Hong, Q. Tu, C. Chen, X. Gao, J. Zhang, and R. Yan, “Cyclealign: Iterative distillation from black-box llm to white-box models for better human alignment,” arXiv preprint arXiv:2310.16271, 2023.
H. Lee, S. Phatale, H. Mansoor, K. Lu, T. Mesnard, C. Bishop, V. Carbune, and A. Rastogi, “Rlaif: Scaling reinforcement learning from human feedback with ai feedback,” arXiv preprint arXiv:2309.00267, 2023.
Y. Jiang, C. Chan, M. Chen, and W. Wang, “Lion: Adversarial distillation of closed-source large language model,” arXiv preprint arXiv:2305.12870, 2023.
H. Chen, A. Saha, S. Hoi, and S. Joty, “Personalized distillation: Empowering open-sourced LLMs with adaptive learning for code generation,” in The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. [Online]. Available: https://openreview.net/forum?id=alxWMBcNVN
K. Yang, D. Klein, A. Celikyilmaz, N. Peng, and Y. Tian, “RLCD: Reinforcement learning from contrastive distillation for LM alignment,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=v3XXtxWKi6
J. Jung, P. West, L. Jiang, F. Brahman, X. Lu, J. Fisher, T. Sorensen, and Y. Choi, “Impossible distillation: from low-quality model to high-quality dataset & model for summarization and paraphrasing,” 2023.
J. Huang, S. Gu, L. Hou, Y. Wu, X. Wang, H. Yu, and J. Han, “Large language models can self-improve,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 1051–1068. [Online]. Available: https://aclanthology.org/2023.emnlp-main.67
C. Gulcehre, T. L. Paine, S. Srinivasan, K. Konyushkova, L. Weerts, A. Sharma, A. Siddhant, A. Ahern, M. Wang, C. Gu, W. Macherey, A. Doucet, O. Firat, and N. de Freitas, “Reinforced self-training (rest) for language modeling,” 2023.
E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman, “Star: Bootstrapping reasoning with reasoning,” in NeurIPS, 2022.
V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter,” arXiv preprint arXiv:1910.01108, 2019.
Y. Wen, Z. Li, W. Du, and L. Mou, “f-divergence minimization for sequence-level knowledge distillation,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 10817–10834. [Online]. Available: https://aclanthology.org/2023.acl-long.605
C. Liang, S. Zuo, Q. Zhang, P. He, W. Chen, and T. Zhao, “Less is more: Task-aware layer-wise distillation for language model compression,” in International Conference on Machine Learning. PMLR, 2023, pp. 20852–20867.
M. Kwon, S. M. Xie, K. Bullard, and D. Sadigh, “Reward design with language models,” in ICLR. OpenReview.net, 2023.
B. Peng, C. Li, P. He, M. Galley, and J. Gao, “Instruction tuning with gpt-4,” 2023.
G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem, “Camel: Communicative agents for “mind” exploration of large scale language model society,” arXiv
L. Zhao, E. Yu, Z. Ge, J. Yang, H. Wei, H. Zhou, J. Sun, Y. Peng, R. Dong, C. Han, and X. Zhang, “Chatspot: Bootstrapping multimodal llms via precise referring instruction tuning,” 2023.
F. Liu, K. Lin, L. Li, J. Wang, Y. Yacoob, and L. Wang, “Mitigating hallucination in large multi-modal models via robust instruction tuning,” 2023.
S. Wu, H. Fei, L. Qu, W. Ji, and T.-S. Chua, “Next-gpt: Any-to-any multimodal llm,” 2023.
R. Luo, Z. Zhao, M. Yang, J. Dong, D. Li, P. Lu, T. Wang, L. Hu, M. Qiu, and Z. Wei, “Valley: Video assistant with large language model enhanced ability,” 2023.
Y. Jiang, E. Schoop, A. Swearngin, and J. Nichols, “Iluvui: Instruction-tuned language-vision modeling of uis from machine conversations,” 2023.
Y. Li, C. Zhang, G. Yu, Z. Wang, B. Fu, G. Lin, C. Shen, L. Chen, and Y. Wei, “Stablellava: Enhanced visual instruction tuning with synthesized image-dialogue data,” 2023.
R. Xu, X. Wang, T. Wang, Y. Chen, J. Pang, and D. Lin, “Pointllm: Empowering large language models to understand point clouds,” 2023.
Q. Huang, M. Tao, Z. An, C. Zhang, C. Jiang, Z. Chen, Z. Wu, and Y. Feng, “Lawyer llama technical report,” arXiv preprint arXiv:2305.15062, 2023.
J. Cui, Z. Li, Y. Yan, B. Chen, and L. Yuan, “Chatlaw: Open-source legal large language model with integrated external knowledge bases,” arXiv preprint arXiv:2306.16092, 2023.
H. Zhang, J. Chen, F. Jiang, F. Yu, Z. Chen, G. Chen, J. Li, X. Wu, Z. Zhiyi, Q. Xiao, X. Wan, B. Wang, and H. Li, “HuatuoGPT, towards taming language model to be a doctor,” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 10859–10885. [Online]. Available: https://aclanthology.org/2023.findings-emnlp.725
J. Chen, X. Wang, A. Gao, F. Jiang, S. Chen, H. Zhang, D. Song, W. Xie, C. Kong, J. Li, X. Wan, H. Li, and B. Wang, “Huatuogpt-ii, one-stage training for medical adaption of llms,” CoRR, vol. abs/2311.09774, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2311.09774
X. Zhang and Q. Yang, “Xuanyuan 2.0: A large chinese financial chat model with hundreds of billions parameters,” in Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023, Birmingham, United Kingdom, October 21-25, 2023, I. Frommholz, F. Hopfgartner, M. Lee, M. Oakes, M. Lalmas, M. Zhang, and R. L. T. Santos, Eds. ACM, 2023, pp. 4435–4439. [Online]. Available: https://doi.org/10.1145/3583780.3615285
T. Xie, Y. Wan, W. Huang, Z. Yin, Y. Liu, S. Wang, Q. Linghu, C. Kit, C. Grazian, W. Zhang, I. Razzak, and B. Hoex, “DARWIN series: Domain specific large language models for natural science,” CoRR, vol. abs/2308.13565, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2308.13565
Y. Dan, Z. Lei, Y. Gu, Y. Li, J. Yin, J. Lin, L. Ye, Z. Tie, Y. Zhou, Y. Wang, A. Zhou, Z. Zhou, Q. Chen, J. Zhou, L. He, and X. Qiu, “Educhat: A large-scale language model-based chatbot system for intelligent education,” CoRR, vol. abs/2308.02773, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2308.02773
H. Guo, J. Yang, J. Liu, L. Yang, L. Chai, J. Bai, J. Peng, X. Hu, C. Chen, D. Zhang, X. Shi, T. Zheng, L. Zheng, B. Zhang, K. Xu, and Z. Li, “OWL: A large language model for IT operations,” CoRR, vol. abs/2309.09298, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2309.09298
Y. Kim and A. M. Rush, “Sequence-level knowledge distillation,” arXiv preprint arXiv:1606.07947, 2016.
S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” International Conference on Learning Representations (ICLR), 2016.
V. Gangal, S. Y. Feng, M. Alikhani, T. Mitamura, and E. Hovy, “Nareor: The narrative reordering problem,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 10, 2022, pp. 10645–10653.
S. Longpre, Y. Lu, Z. Tu, and C. DuBois, “An exploration of data augmentation and sampling techniques for domain-agnostic question answering,” in Proceedings of the 2nd Workshop on Machine Reading for Question Answering, A. Fisch, A. Talmor, R. Jia, M. Seo, E. Choi, and D. Chen, Eds. Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 220–227. [Online]. Available: https://aclanthology.org/D19-5829
P. West, C. Bhagavatula, J. Hessel, J. Hwang, L. Jiang, R. Le Bras, X. Lu, S. Welleck, and Y. Choi, “Symbolic knowledge distillation: from general language models to commonsense models,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, M. Carpuat, M.-C. de Marneffe, and I. V. Meza Ruiz, Eds. Seattle, United States: Association for Computational Linguistics, Jul. 2022, pp. 4602–4625. [Online]. Available: https://aclanthology.org/2022.naacl-main.341
Z. Li, X. Xu, T. Shen, C. Xu, J.-C. Gu, and C. Tao, “Leveraging large language models for nlg evaluation: A survey,” 2024.
S. Li, J. Chen, Y. Shen, Z. Chen, X. Zhang, Z. Li, H. Wang, J. Qian, B. Peng, Y. Mao, W. Chen, and X. Yan, “Explanations from large language models make small reasoners better,” 2022.
N. Ho, L. Schmid, and S. Yun, “Large language models are reasoning teachers,” in ACL (1). Association for Computational Linguistics, 2023, pp. 14852–14882.
L. C. Magister, J. Mallinson, J. Adamek, E. Malmi, and A. Severyn, “Teaching small language models to reason,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 1773–1781. [Online]. Available: https://aclanthology.org/2023.acl-short.151
Y. Fu, H. Peng, L. Ou, A. Sabharwal, and T. Khot, “Specializing smaller language models towards multi-step reasoning,” 2023.
L. H. Li, J. Hessel, Y. Yu, X. Ren, K.-W. Chang, and Y. Choi, “Symbolic chain-of-thought distillation: Small models can also “think” step-by-step,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber,
and N. Okazaki, Eds. Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 2665–2679. [Online]. Available: https://aclanthology.org/2023.acl-long.150
W. Liu, G. Li, K. Zhang, B. Du, Q. Chen, X. Hu, H. Xu, J. Chen, and J. Wu, “Mind’s mirror: Distilling self-evaluation capability and comprehensive thinking from large language models,” 2023.
S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei et al., “The flan collection: Designing data and methods for effective instruction tuning,” arXiv preprint arXiv:2301.13688, 2023.
Y. Anand, Z. Nussbaum, B. Duderstadt, B. Schmidt, and A. Mulyar, “Gpt4all: Training an assistant-style chatbot with large scale data distillation from gpt-3.5-turbo,” GitHub, 2023.
Q. Si, T. Wang, Z. Lin, X. Zhang, Y. Cao, and W. Wang, “An empirical study of instruction-tuning large language models in chinese,” in EMNLP (Findings). Association for Computational Linguistics, 2023, pp. 4086–4107.
Y. Ji, Y. Deng, Y. Gong, Y. Peng, Q. Niu, L. Zhang, B. Ma, and X. Li, “Exploring the impact of instruction data scaling on large language models: An empirical study on real-world use cases,” 2023.
M. Wu, A. Waheed, C. Zhang, M. Abdul-Mageed, and A. F. Aji, “Lamini-lm: A diverse herd of distilled models from large-scale instructions,” 2023.
W. Guo, J. Yang, K. Yang, X. Li, Z. Rao, Y. Xu, and D. Niu, “Instruction fusion: Advancing prompt evolution through hybridization,” 2023.
Y. Yu, Y. Zhuang, J. Zhang, Y. Meng, A. Ratner, R. Krishna, J. Shen, and C. Zhang, “Large language model as attributed training data generator: A tale of diversity and bias,” 2023.
F. Wan, X. Huang, D. Cai, X. Quan, W. Bi, and S. Shi, “Knowledge fusion of large language models,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=jiDsk12qcz
Q. Zhao and B. Zhu, “Towards the fundamental limits of knowledge transfer over finite domains,” in NeurIPS 2023 Workshop on Mathematics of Modern Machine Learning, 2023. [Online]. Available: https://openreview.net/forum?id=9qxoXqxa0N
C. Qin, W. Xia, F. Jiao, and S. Joty, “Improving in-context learning via bidirectional alignment,” 2023.
N. Boizard, K. El-Haddad, C. Hudelot, and P. Colombo, “Towards cross-tokenizer distillation: the universal logit distillation loss for llms,” arXiv preprint arXiv:2402.12030, 2024.
Q. Zhong, L. Ding, L. Shen, J. Liu, B. Du, and D. Tao, “Revisiting knowledge distillation for autoregressive language models,” 2024.
M. Kim, S. Lee, J. Lee, S. Hong, D.-S. Chang, W. Sung, and J. Choi, “Token-scaled logit distillation for ternary weight generative language models,” arXiv preprint arXiv:2308.06744, 2023.
Z. Chen, K. Zhou, W. X. Zhao, J. Wan, F. Zhang, D. Zhang, and J.-R. Wen, “Improving large language models via fine-grained reinforcement learning with minimum editing constraint,” 2024.
G. Guo, R. Zhao, T. Tang, X. Zhao, and J.-R. Wen, “Beyond imitation: Leveraging fine-grained quality signals for alignment,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=LNLjU5C5dK
Z. Allen-Zhu and Y. Li, “Towards understanding ensemble, knowledge distillation and self-distillation in deep learning,” arXiv preprint arXiv:2012.09816, 2020.
T. Zheng, S. Guo, X. Qu, J. Guo, W. Zhang, X. Du, C. Lin, W. Huang, W. Chen, J. Fu et al., “Kun: Answer polishment for chinese self-alignment with instruction back-translation,” arXiv preprint arXiv:2401.06477, 2024.
X. Li, P. Yu, C. Zhou, T. Schick, O. Levy, L. Zettlemoyer, J. E. Weston, and M. Lewis, “Self-alignment with instruction backtranslation,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=1oijHJBRsT
B. Zhao, H. Hajishirzi, and Q. Cao, “Apt: Adaptive pruning and tuning pretrained language models for efficient training and inference,” arXiv preprint arXiv:2401.12200, 2024.
A. Singh, J. D. Co-Reyes, R. Agarwal, A. Anand, P. Patil, P. J. Liu, J. Harrison, J. Lee, K. Xu, A. Parisi et al., “Beyond human data: Scaling self-training for problem-solving with language models,” arXiv preprint arXiv:2312.06585, 2023.
W. Chen, D. Song, and B. Li, “Grath: Gradual self-truthifying for large language models,” 2024.
A. Hosseini, X. Yuan, N. Malkin, A. Courville, A. Sordoni, and R. Agarwal, “V-star: Training verifiers for self-taught reasoners,” 2024.
A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, J. Kernion, K. Ndousse, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, and J. Kaplan, “A general language assistant as a laboratory for alignment,” 2021.
H. Chen, X. Quan, H. Chen, M. Yan, and J. Zhang, “Knowledge distillation for closed-source language models,” arXiv preprint arXiv:2401.07013, 2024.
I. Sason and S. Verdú, “f-divergence inequalities,” IEEE Transactions on Information Theory, vol. 62, no. 11, pp. 5973–6006, 2016.
S. Sun, Y. Cheng, Z. Gan, and J. Liu, “Patient knowledge distillation for bert model compression,” 2019.
Z. Sun, H. Yu, X. Song, R. Liu, Y. Yang, and D. Zhou, “MobileBERT: a compact task-agnostic BERT for resource-limited devices,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, Eds. Online: Association for Computational Linguistics, Jul. 2020, pp. 2158–2170. [Online]. Available: https://aclanthology.org/2020.acl-main.195
X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu, “TinyBERT: Distilling BERT for natural language understanding,” in Findings of the Association for
Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu, Eds. Online: Association for Computational Linguistics, Nov. 2020, pp. 4163–4174. [Online]. Available: https://aclanthology.org/2020.findings-emnlp.372
L. Hou, Z. Huang, L. Shang, X. Jiang, X. Chen, and Q. Liu, “Dynabert: Dynamic bert with adaptive width and depth,” Advances in Neural Information Processing Systems, vol. 33, pp. 9782–9793, 2020.
S. Zuo, Q. Zhang, C. Liang, P. He, T. Zhao, and W. Chen, “Moebert: from bert to mixture-of-experts via importance-guided adaptation,” arXiv preprint arXiv:2204.07675, 2022.
K. J. Liang, W. Hao, D. Shen, Y. Zhou, W. Chen, C. Chen, and L. Carin, “Mixkd: Towards efficient distillation of large-scale language models,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. [Online]. Available: https://openreview.net/forum?id=UFGEelJkLu5
Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar, “Eureka: Human-level reward design via coding large language models,” 2023.
J.-C. Pang, P. Wang, K. Li, X.-H. Chen, J. Xu, Z. Zhang, and Y. Yu, “Language model self-improvement by reinforcement learning contemplation,” 2023.
Y. Du, O. Watkins, Z. Wang, C. Colas, T. Darrell, P. Abbeel, A. Gupta, and J. Andreas, “Guiding pretraining in reinforcement learning with large language models,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, 23–29 Jul 2023, pp. 8657–8677. [Online]. Available: https://proceedings.mlr.press/v202/du23f.html
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” 2017.
R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” 2023.
F. Song, B. Yu, M. Li, H. Yu, F. Huang, Y. Li, and H. Wang, “Preference ranking optimization for human alignment,” arXiv preprint arXiv:2306.17492, 2023.
Z. Yuan, H. Yuan, C. Tan, W. Wang, S. Huang, and F. Huang, “Rrhf: Rank responses to align language models with human feedback without tears,” arXiv preprint arXiv:2304.05302, 2023.
M. Li, L. Chen, J. Chen, S. He, and T. Zhou, “Reflection-tuning: Recycling data for better instruction-tuning,” in NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023. [Online]. Available: https://openreview.net/forum?id=xaqoZZqkPU
M. Li, L. Chen, J. Chen, S. He, J. Gu, and T. Zhou, “Selective reflection-tuning: Student-selected data recycling for llm instruction-tuning,” 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:267682220
X. Geng, A. Gudibande, H. Liu, E. Wallace, P. Abbeel, S. Levine, and D. Song, “Koala: A dialogue model for academic research,” Blog post, April 2023. [Online]. Available: https://bair.berkeley.edu/blog/2023/04/03/koala/
M. Li, J. Chen, L. Chen, and T. Zhou, “Can llms speak for diverse people? tuning llms via debate to generate controllable controversial statements,” 2024.
M. Kang, S. Lee, J. Baek, K. Kawaguchi, and S. J. Hwang, “Knowledge-augmented reasoning distillation for small language models in knowledge-intensive tasks,” 2023.
R. Yang, L. Song, Y. Li, S. Zhao, Y. Ge, X. Li, and Y. Shan, “Gpt4tools: Teaching large language model to use tools via self-instruction,” 2023.
A. Yehudai, B. Carmeli, Y. Mass, O. Arviv, N. Mills, A. Toledo, E. Shnarch, and L. Choshen, “Genie: Achieving human parity in content-grounded datasets generation,” 2024.
Y. Zhang, R. Zhang, J. Gu, Y. Zhou, N. Lipka, D. Yang, and T. Sun, “Llavar: Enhanced visual instruction tuning for text-rich image understanding,” 2023.
C. Lyu, M. Wu, L. Wang, X. Huang, B. Liu, Z. Du, S. Shi, and Z. Tu, “Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration,” arXiv preprint arXiv:2306.09093, 2023.
B. Li, Y. Zhang, L. Chen, J. Wang, F. Pu, J. Yang, C. Li, and Z. Liu, “Mimic-it: Multi-modal in-context instruction tuning,” 2023.
Z. Zhao, L. Guo, T. Yue, S. Chen, S. Shao, X. Zhu, Z. Yuan, and J. Liu, “Chatbridge: Bridging modalities with large language model as a language catalyst,” 2023.
Y. Zhao, B. Yu, B. Hui, H. Yu, F. Huang, Y. Li, and N. L. Zhang, “A preliminary study of the intrinsic relationship between complexity and alignment,” 2023.
A. Gudibande, E. Wallace, C. Snell, X. Geng, H. Liu, P. Abbeel, S. Levine, and D. Song, “The false promise of imitating proprietary llms,” arXiv preprint arXiv:2305.15717, 2023.
C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh, M. Lewis, L. Zettlemoyer, and O. Levy, “LIMA: Less is more for alignment,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023. [Online]. Available: https://openreview.net/forum?id=KBMOKmX2he
M. Li, Y. Zhang, S. He, Z. Li, H. Zhao, J. Wang, N. Cheng, and T. Zhou, “Superfiltering: Weak-to-strong data filtering for fast instruction-tuning,” 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:267365346
B. Xu, A. Yang, J. Lin, Q. Wang, C. Zhou, Y. Zhang, and Z. Mao, “Expertprompting: Instructing large language models to be distinguished experts,” 2023.
W. Liu, W. Zeng, K. He, Y. Jiang, and J. He, “What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning,” 2023.
R. Lou, K. Zhang, J. Xie, Y. Sun, J. Ahn, H. Xu, Y. Su, and W. Yin, “Muffin: Curating multi-faceted instructions for improving instruction-following,” 2023.
T. Schick, J. Dwivedi-Yu, Z. Jiang, F. Petroni, P. Lewis, G. Izacard, Q. You, C. Nalmpantis, E. Grave, and S. Riedel, “Peer: A collaborative language model,” 2022.
A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P. Clark, “Self-refine: Iterative refinement with self-feedback,” 2023.
W. Saunders, C. Yeh, J. Wu, S. Bills, L. Ouyang, J. Ward, and J. Leike, “Self-critiquing models for assisting human
Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang, “Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face,” 2023.
S. Hao, T. Liu, Z. Wang, and Z. Hu, “Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings,” 2024.
S. Yuan, K. Song, J. Chen, X. Tan, Y. Shen, R. Kan, D. Li, and D. Yang, “Easytool: Enhancing llm-based agents with concise tool instruction,” 2024.
S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer, “Opt: Open pre-trained transformer language models,” 2022.
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,” in International Conference on Machine Learning. PMLR, 2022, pp. 9118–9147.
I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg, “Progprompt: Generating situated robot task plans using large language models,” 2022.
D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le, and E. Chi, “Least-to-most prompting enables complex reasoning in large language models,” 2023.
C. H. Song, J. Wu, C. Washington, B. M. Sadler, W.-L. Chao, and Y. Su, “Llm-planner: Few-shot grounded planning for embodied agents with large language models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2998–3009.
Z. Wang, S. Cai, A. Liu, X. Ma, and Y. Liang, “Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents,” arXiv preprint arXiv:2302.01560, 2023.
S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” arXiv preprint arXiv:2305.10601, 2023.
B. Liu, Y. Jiang, X. Zhang, Q. Liu, S. Zhang, J. Biswas, and P. Stone, “Llm+p: Empowering large language models with optimal planning proficiency,” arXiv preprint arXiv:2304.11477, 2023.
S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, and Z. Hu, “Reasoning with language model is planning with world model,” arXiv preprint arXiv:2305.14992, 2023.
M. Hu, Y. Mu, X. Yu, M. Ding, S. Wu, W. Shao, Q. Chen, B. Wang, Y. Qiao, and P. Luo, “Tree-planner: Efficient close-loop task planning with large language models,” arXiv preprint arXiv:2310.08582, 2023.
B. Y. Lin, C. Huang, Q. Liu, W. Gu, S. Sommerer, and X. Ren, “On grounded planning for embodied tasks with language models,” in Proceedings of the AAAI Conference
S. Kambhampati, “On the planning abilities of large language models - a critical investigation,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023. [Online]. Available: https://openreview.net/forum?id=X6dEqXIsEW
T. Sumers, K. Marino, A. Ahuja, R. Fergus, and I. Dasgupta, “Distilling internet-scale vision-language models into embodied agents,” in Proceedings of the 40th International Conference on Machine Learning, ser. ICML’23. JMLR.org, 2023.
Y. Yang, T. Zhou, K. Li, D. Tao, L. Li, L. Shen, X. He, J. Jiang, and Y. Shi, “Embodied multi-modal agent trained by an llm from a parallel textworld,” 2023.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” 2019.
J. Li, L. Gui, Y. Zhou, D. West, C. Aloisi, and Y. He, “Distilling chatgpt for explainable automated student answer assessment,” in EMNLP (Findings). Association for Computational Linguistics, 2023, pp. 6007–6026.
R. Tang, X. Han, X. Jiang, and X. Hu, “Does synthetic data generation of llms help clinical text mining?” arXiv preprint arXiv:2303.04360, 2023.
X. He, I. Nassar, J. Kiros, G. Haffari, and M. Norouzi, “Generate, annotate, and learn: NLP with synthetic text,” Trans. Assoc. Comput. Linguistics, vol. 10, pp. 826–842, 2022. [Online]. Available: https://transacl.org/ojs/index.php/tacl/article/view/3811
Y. Meng, J. Huang, Y. Zhang, and J. Han, “Generating training data with language models: Towards zero-shot language understanding,” in Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. [Online]. Available: http://papers.nips.cc/paper_files/paper/2022/hash/0346c148ba1c21c6b4780a961ea141dc-Abstract-Conference.html
J. Wang, Z. Yao, A. Mitra, S. Osebe, Z. Yang, and H. Yu, “UMASS BioNLP at MEDIQA-chat 2023: Can LLMs generate high-quality synthetic note-oriented doctor-patient conversations?” in Proceedings of the 5th Clinical Natural Language Processing Workshop, T. Naumann, A. Ben Abacha, S. Bethard, K. Roberts, and A. Rumshisky, Eds. Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 460–471. [Online]. Available: https://aclanthology.org/2023.clinicalnlp-1.49
Z. Yang, S. Cherian, and S. Vucetic, “Data augmentation for radiology report simplification,” in Findings of the Association for Computational Linguistics: EACL 2023, A. Vlachos and I. Augenstein, Eds. Dubrovnik, Croatia: Association for Computational Linguistics, May 2023, pp. 1922–1932. [Online]. Available: https://aclanthology.org/2023.findings-eacl.144
on Artificial Intelligence, vol. 37, no. 11, 2023, pp. 13 192– Z. Cai, C. Tao, T. Shen, C. Xu, X. Geng, X. A. Lin, L. He, and
13 200. D. Jiang, “Hyper: Multitask hyper-prompted training en-
K. Valmeekam, M. Marquez, S. Sreedharan, and ables large-scale retrieval generalization,” in The Eleventh
39
International Conference on Learning Representations, 2022. language models as efficient dataset generators for infor-
C. Liu, C. Tao, X. Geng, T. Shen, D. Zhao, C. Xu, B. Jiao, mation retrieval,” arXiv preprint arXiv:2301.01820, 2023.
and D. Jiang, “Adam: Dense retrieval distillation with W. Sun, Z. Chen, X. Ma, L. Yan, S. Wang, P. Ren, Z. Chen,
adaptive dark examples,” arXiv preprint arXiv:2212.10192, D. Yin, and Z. Ren, “Instruction distillation makes large
2022. language models efficient zero-shot rankers,” 2023.
J. Feng, C. Tao, X. Geng, T. Shen, C. Xu, G. Long, D. Zhao, C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
and D. Jiang, “Knowledge refinement via interaction be- M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring
tween search engines and large language models,” arXiv the limits of transfer learning with a unified text-to-text
preprint arXiv:2305.07402, 2023. transformer,” J. Mach. Learn. Res., vol. 21, no. 1, jan 2020.
T. Shen, G. Long, X. Geng, C. Tao, T. Zhou, and D. Jiang, S. Bruch, X. Wang, M. Bendersky, and M. Najork, “An
“Large language models are strong zero-shot retriever,” analysis of the softmax cross entropy loss for learning-
arXiv preprint arXiv:2304.14233, 2023. to-rank with binary relevance,” in Proceedings of the
X. Ma, X. Zhang, R. Pradeep, and J. Lin, “Zero-shot listwise 2019 ACM SIGIR International Conference on Theory of
document reranking with a large language model,” 2023. Information Retrieval, ICTIR 2019, Santa Clara, CA, USA,
Z. Qin, R. Jagerman, K. Hui, H. Zhuang, J. Wu, J. Shen, October 2-5, 2019, 2019, pp. 75–78. [Online]. Available:
T. Liu, J. Liu, D. Metzler, X. Wang, and M. Bendersky, https://doi.org/10.1145/3341981.3344221
“Large language models are effective text rankers with C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds,
pairwise ranking prompting,” 2023. N. Hamilton, and G. Hullender, “Learning to rank
X. Ma, Y. Gong, P. He, H. Zhao, and N. Duan, “Query using gradient descent,” in Proceedings of the 22nd
rewriting in retrieval-augmented large language models,” International Conference on Machine Learning, ser. ICML
in Proceedings of the 2023 Conference on Empirical Methods ’05. New York, NY, USA: Association for Computing
in Natural Language Processing, H. Bouamor, J. Pino, and Machinery, 2005, p. 89–96. [Online]. Available: https:
K. Bali, Eds. Singapore: Association for Computational //doi.org/10.1145/1102351.1102363
Linguistics, Dec. 2023, pp. 5303–5315. [Online]. Available: X. Wang, C. Li, N. Golbandi, M. Bendersky, and M. Najork,
https://aclanthology.org/2023.emnlp-main.322 “The lambdaloss framework for ranking metric
D. Sachan, M. Lewis, M. Joshi, A. Aghajanyan, W.- optimization,” in Proceedings of the 27th ACM International
t. Yih, J. Pineau, and L. Zettlemoyer, “Improving Conference on Information and Knowledge Management,
passage retrieval with zero-shot question generation,” ser. CIKM ’18. New York, NY, USA: Association for
in Proceedings of the 2022 Conference on Empirical Computing Machinery, 2018, p. 1313–1322. [Online].
Methods in Natural Language Processing, Y. Goldberg, Available: https://doi.org/10.1145/3269206.3271784
Z. Kozareva, and Y. Zhang, Eds. Abu Dhabi, W. Wang, X. Lin, F. Feng, X. He, and T.-S. Chua, “Generative
United Arab Emirates: Association for Computational recommendation: Towards next-generation recommender
Linguistics, Dec. 2022, pp. 3781–3797. [Online]. Available: paradigm,” 2023.
https://aclanthology.org/2022.emnlp-main.249 S. Dai, N. Shao, H. Zhao, W. Yu, Z. Si, C. Xu, Z. Sun,
D. S. Sachan, M. Lewis, D. Yogatama, L. Zettlemoyer, X. Zhang, and J. Xu, “Uncovering chatgpt’s capabilities
J. Pineau, and M. Zaheer, “Questions are all you need in recommender systems,” in Proceedings of the 17th
to train a dense passage retriever,” Transactions of the ACM Conference on Recommender Systems, ser. RecSys
Association for Computational Linguistics, vol. 11, pp. ’23. New York, NY, USA: Association for Computing
600–616, 2023. [Online]. Available: https://aclanthology. Machinery, 2023, p. 1126–1132. [Online]. Available:
org/2023.tacl-1.35 https://doi.org/10.1145/3604915.3610646
T. Schick and H. Schütze, “Generating datasets with Y. Xi, W. Liu, J. Lin, X. Cai, H. Zhu, J. Zhu, B. Chen,
pretrained language models,” in Proceedings of the 2021 R. Tang, W. Zhang, R. Zhang, and Y. Yu, “Towards open-
Conference on Empirical Methods in Natural Language world recommendation with knowledge augmentation
Processing, M.-F. Moens, X. Huang, L. Specia, and from large language models,” 2023.
S. W.-t. Yih, Eds. Online and Punta Cana, Dominican X. Ren, W. Wei, L. Xia, L. Su, S. Cheng, J. Wang, D. Yin, and
Republic: Association for Computational Linguistics, C. Huang, “Representation learning with large language
Nov. 2021, pp. 6943–6951. [Online]. Available: https: models for recommendation,” 2023.
//aclanthology.org/2021.emnlp-main.555 W. Wei, X. Ren, J. Tang, Q. Wang, L. Su, S. Cheng, J. Wang,
Z. Peng, X. Wu, and Y. Fang, “Soft prompt tuning for D. Yin, and C. Huang, “Llmrec: Large language models
augmenting dense retrieval with large language models,” with graph augmentation for recommendation,” 2024.
arXiv preprint arXiv:2307.08303, 2023. L. Wang, S. Zhang, Y. Wang, E.-P. Lim, and Y. Wang,
J. Saad-Falcon, O. Khattab, K. Santhanam, R. Florian, “LLM4Vis: Explainable visualization recommendation
M. Franz, S. Roukos, A. Sil, M. A. Sultan, and C. Potts, using ChatGPT,” in Proceedings of the 2023 Conference
“UDAPDR: unsupervised domain adaptation via LLM on Empirical Methods in Natural Language Processing:
prompting and distillation of rerankers,” in Proceedings Industry Track, M. Wang and I. Zitouni, Eds. Singapore:
of the 2023 Conference on Empirical Methods in Natural Association for Computational Linguistics, Dec. 2023, pp.
Language Processing, EMNLP 2023, Singapore, December 675–692. [Online]. Available: https://aclanthology.org/
6-10, 2023, 2023, pp. 11 265–11 279. [Online]. Available: 2023.emnlp-industry.64
https://aclanthology.org/2023.emnlp-main.693 Z. Cui, J. Ma, C. Zhou, J. Zhou, and H. Yang, “M6-rec:
V. Jeronymo, L. Bonifacio, H. Abonizio, M. Fadaee, Generative pretrained language models are open-ended
R. Lotufo, J. Zavrel, and R. Nogueira, “Inpars-v2: Large recommender systems,” 2022.
40
and Z. Nie, “Biomedgpt: Open multimodal generative C. Wu, X. Zhang, Y. Zhang, Y. Wang, and W. Xie, “Pmc-
pre-trained transformer for biomedicine,” arXiv preprint llama: Further finetuning llama on medical papers,”
arXiv:2308.09442, 2023. CoRR, vol. abs/2304.14454, 2023. [Online]. Available:
B. Chen, X. Cheng, P. Li, Y. Geng, J. Gong, S. Li, Z. Bei, https://doi.org/10.48550/arXiv.2304.14454
X. Tan, B. Wang, X. Zeng, C. Liu, A. Zeng, Y. Dong, Z. Bao, W. Chen, S. Xiao, K. Ren, J. Wu, C. Zhong, J. Peng,
J. Tang, and L. Song, “xtrimopglm: Unified 100b-scale X. Huang, and Z. Wei, “Disc-medllm: Bridging general
pre-trained transformer for deciphering the language large language models and real-world medical consulta-
of protein,” CoRR, vol. abs/2401.06199, 2024. [Online]. tion,” arXiv preprint arXiv:2308.14346, 2023.
Available: https://doi.org/10.48550/arXiv.2401.06199 S. Xue, F. Zhou, Y. Xu, H. Zhao, S. Xie, Q. Dai,
C. Deng, T. Zhang, Z. He, Y. Xu, Q. Chen, Y. Shi, L. Fu, C. Jiang, J. Zhang, J. Zhou, D. Xiu, and H. Mei,
W. Zhang, X. Wang, C. Zhou, Z. Lin, and J. He, “K2: “Weaverbird: Empowering financial decision-making
A foundation language model for geoscience knowledge with large language model, knowledge base, and search
understanding and utilization,” 2023. engine,” CoRR, vol. abs/2308.05361, 2023. [Online].
Z. Bi, N. Zhang, Y. Xue, Y. Ou, D. Ji, G. Zheng, and Available: https://doi.org/10.48550/arXiv.2308.05361
H. Chen, “Oceangpt: A large language model for ocean S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze,
science tasks,” CoRR, vol. abs/2310.02031, 2023. [Online]. S. Gehrmann, P. Kambadur, D. S. Rosenberg, and
Available: https://doi.org/10.48550/arXiv.2310.02031 G. Mann, “Bloomberggpt: A large language model
Z. Zheng, J. Zhang, T. Vu, S. Diao, Y. H. W. Tim, and for finance,” CoRR, vol. abs/2303.17564, 2023. [Online].
S. Yeung, “Marinegpt: Unlocking secrets of ocean to Available: https://doi.org/10.48550/arXiv.2303.17564
the public,” CoRR, vol. abs/2310.13596, 2023. [Online]. D. Lu, H. Wu, J. Liang, Y. Xu, Q. He, Y. Geng, M. Han,
Available: https://doi.org/10.48550/arXiv.2310.13596 Y. Xin, and Y. Xiao, “Bbt-fin: Comprehensive construction
Z. Lin, C. Deng, L. Zhou, T. Zhang, Y. Xu, Y. Xu, Z. He, of chinese financial domain pre-trained language model,
Y. Shi, B. Dai, Y. Song, B. Zeng, Q. Chen, T. Shi, corpus and benchmark,” CoRR, vol. abs/2302.09432, 2023.
T. Huang, Y. Xu, S. Wang, L. Fu, W. Zhang, J. He, [Online]. Available: https://doi.org/10.48550/arXiv.2302.
C. Ma, Y. Zhu, X. Wang, and C. Zhou, “Geogalactica: 09432
A scientific large language model in geoscience,” Y. Yang, Y. Tang, and K. Y. Tam, “Investlm: A large language
CoRR, vol. abs/2401.00434, 2024. [Online]. Available: model for investment using financial domain instruction
https://doi.org/10.48550/arXiv.2401.00434 tuning,” CoRR, vol. abs/2309.13064, 2023. [Online].
D. Zhang, A. Petrova, D. Trautmann, and F. Schilder, “Un- Available: https://doi.org/10.48550/arXiv.2309.13064
leashing the power of large language models for legal Q. Xie, W. Han, X. Zhang, Y. Lai, M. Peng, A. Lopez-
applications,” in Proceedings of the 32nd ACM International Lira, and J. Huang, “PIXIU: A large language model,
Conference on Information and Knowledge Management, 2023, instruction data and evaluation benchmark for finance,”
pp. 5257–5258. CoRR, vol. abs/2306.05443, 2023. [Online]. Available:
Z. Sun, “A short survey of viewing large language models https://doi.org/10.48550/arXiv.2306.05443
in legal aspect,” arXiv preprint arXiv:2303.09136, 2023. N. Wang, H. Yang, and C. D. Wang, “Fingpt: Instruction
J. Lai, W. Gan, J. Wu, Z. Qi, and P. S. Yu, “Large language tuning benchmark for open-source large language models
models in law: A survey,” arXiv preprint arXiv:2312.03718, in financial datasets,” CoRR, vol. abs/2310.04793, 2023.
2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.
S. Yue, W. Chen, S. Wang, B. Li, C. Shen, S. Liu, Y. Zhou, 04793
Y. Xiao, S. Yun, W. Lin et al., “Disc-lawllm: Fine-tuning R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn,
large language models for intelligent legal services,” arXiv E. Saravia, A. Poulton, V. Kerkez, and R. Stojnic,
preprint arXiv:2309.11325, 2023. “Galactica: A large language model for science,”
H. Zhong, C. Xiao, C. Tu, T. Zhang, Z. Liu, and M. Sun, CoRR, vol. abs/2211.09085, 2022. [Online]. Available:
“Jec-qa: a legal-domain question answering dataset,” in https://doi.org/10.48550/arXiv.2211.09085
Proceedings of the AAAI Conference on Artificial Intelligence, J. Yin, S. Dash, F. Wang, and M. Shankar, “FORGE:
vol. 34, no. 05, 2020, pp. 9701–9708. pre-training open foundation models for science,”
K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, in Proceedings of the International Conference for High
L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis, D. Neal, Performance Computing, Networking, Storage and Analysis,
M. Schaekermann, A. Wang, M. Amin, S. Lachgar, P. A. SC 2023, Denver, CO, USA, November 12-17, 2023,
Mansfield, S. Prakash, B. Green, E. Dominowska, B. A. D. Arnold, R. M. Badia, and K. M. Mohror, Eds.
y Arcas, N. Tomasev, Y. Liu, R. Wong, C. Semturs, ACM, 2023, pp. 81:1–81:13. [Online]. Available: https:
S. S. Mahdavi, J. K. Barral, D. R. Webster, G. S. //doi.org/10.1145/3581784.3613215
Corrado, Y. Matias, S. Azizi, A. Karthikesalingam, and Z. Azerbayev, H. Schoelkopf, K. Paster, M. D. Santos,
V. Natarajan, “Towards expert-level medical question S. McAleer, A. Q. Jiang, J. Deng, S. Biderman, and
answering with large language models,” CoRR, vol. S. Welleck, “Llemma: An open language model for
abs/2305.09617, 2023. [Online]. Available: https://doi. mathematics,” CoRR, vol. abs/2310.10631, 2023. [Online].
org/10.48550/arXiv.2305.09617 Available: https://doi.org/10.48550/arXiv.2310.10631
W. Zhu, X. Wang, H. Zheng, M. Chen, and B. Tang, F. Yu, A. Gao, and B. Wang, “Outcome-supervised
“Promptcblue: A chinese prompt tuning benchmark for verifiers for planning in mathematical reasoning,”
the medical domain,” arXiv preprint arXiv:2310.14151, CoRR, vol. abs/2311.09724, 2023. [Online]. Available:
2023. https://doi.org/10.48550/arXiv.2311.09724
42
T. D. Nguyen, Y. Ting, I. Ciuca, C. O’Neill, Z. Sun, Z. Yao, R. Yazdani Aminabadi, M. Zhang, X. Wu, C. Li, and
M. Jablonska, S. Kruk, E. Perkowski, J. W. Miller, Y. He, “Zeroquant: Efficient and affordable post-training
J. Li, J. Peek, K. Iyer, T. Rózanski, P. Khetarpal, quantization for large-scale transformers,” Advances in
S. Zaman, D. Brodrick, S. J. R. Méndez, T. Bui, Neural Information Processing Systems, vol. 35, pp. 27 168–
A. Goodman, A. Accomazzi, J. P. Naiman, J. Cranney, 27 183, 2022.
K. Schawinski, and UniverseTBD, “Astrollama: Towards G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han,
specialized foundation models in astronomy,” CoRR, “Smoothquant: Accurate and efficient post-training quan-
vol. abs/2309.06126, 2023. [Online]. Available: https: tization for large language models,” 2023.
//doi.org/10.48550/arXiv.2309.06126 X. Ma, G. Fang, and X. Wang, “Llm-pruner: On the struc-
J. Roberts, T. Lüddecke, S. Das, K. Han, and S. Albanie, tural pruning of large language models,” 2023.
“Gpt4geo: How a language model sees the world’s ge- M. Zhang, H. Chen, C. Shen, Z. Yang, L. Ou, X. Yu,
ography,” 2023. and B. Zhuang, “Loraprune: Pruning meets low-rank
Z. Lin, C. Deng, L. Zhou, T. Zhang, Y. Xu, Y. Xu, Z. He, parameter-efficient fine-tuning,” 2023.
Y. Shi, B. Dai, Y. Song, B. Zeng, Q. Chen, T. Shi, T. Huang, E. Frantar and D. Alistarh, “Sparsegpt: Massive language
Y. Xu, S. Wang, L. Fu, W. Zhang, J. He, C. Ma, Y. Zhu, models can be accurately pruned in one-shot,” 2023.
X. Wang, and C. Zhou, “Geogalactica: A scientific large M. Xu, Y. L. Xu, and D. P. Mandic, “Tensorgpt: Efficient
language model in geoscience,” 2023. compression of the embedding layer in llms based on the
C. Wang, D. Engler, X. Li, J. Hou, D. J. Wald, K. Jaiswal, tensor-train decomposition,” 2023.
and S. Xu, “Near-real-time earthquake-induced fatality Y. Li, Y. Yu, Q. Zhang, C. Liang, P. He, W. Chen, and
estimation using crowdsourced data and large-language T. Zhao, “Losparse: Structured compression of large lan-
models,” 2023. guage models based on low-rank and sparse approxima-
L. Chen, S. Li, J. Yan, H. Wang, K. Gunaratna, V. Yadav, tion,” 2023.
Z. Tang, V. Srinivasan, T. Zhou, H. Huang, and H. Jin, Z. Hu, L. Wang, Y. Lan, W. Xu, E.-P. Lim, L. Bing, X. Xu,
“Alpagasus: Training a better alpaca with fewer data,” S. Poria, and R. K.-W. Lee, “Llm-adapters: An adapter
2023. family for parameter-efficient fine-tuning of large lan-
Y. Cao, Y. Kang, and L. Sun, “Instruction mining: High- guage models,” 2023.
quality instruction data selection for large language mod- H. Liu, D. Tam, M. Mohammed, J. Mohta, T. Huang,
els,” 2023. M. Bansal, and C. Raffel, “Few-shot parameter-
M. Li, Y. Zhang, Z. Li, J. Chen, L. Chen, N. Cheng, efficient fine-tuning is better and cheaper than in-
J. Wang, T. Zhou, and J. Xiao, “From quantity to quality: context learning,” in Advances in Neural Information
Boosting llm performance with self-guided data selection Processing Systems, A. H. Oh, A. Agarwal, D. Belgrave,
for instruction tuning,” ArXiv, vol. abs/2308.12032, and K. Cho, Eds., 2022. [Online]. Available: https:
2023. [Online]. Available: https://api.semanticscholar. //openreview.net/forum?id=rBCvMG-JsPd
org/CorpusID:261076515 Y. Wang, S. Agarwal, S. Mukherjee, X. Liu, J. Gao,
Q. Du, C. Zong, and J. Zhang, “Mods: Model-oriented data A. H. Awadallah, and J. Gao, “AdaMix: Mixture-
selection for instruction tuning,” 2023. of-adaptations for parameter-efficient model tuning,”
Y. Li, B. Hui, X. Xia, J. Yang, M. Yang, L. Zhang, S. Si, in Proceedings of the 2022 Conference on Empirical
J. Liu, T. Liu, F. Huang, and Y. Li, “One shot learning as Methods in Natural Language Processing, Y. Goldberg,
instruction data prospector for large language models,” Z. Kozareva, and Y. Zhang, Eds. Abu Dhabi,
2023. United Arab Emirates: Association for Computational
E. Frantar, S. P. Singh, and D. Alistarh, “Optimal brain com- Linguistics, Dec. 2022, pp. 5744–5760. [Online]. Available:
pression: A framework for accurate post-training quanti- https://aclanthology.org/2022.emnlp-main.388
zation and pruning,” 2023. E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang,
T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, L. Wang, and W. Chen, “Lora: Low-rank adaptation of
“GPT3.int8(): 8-bit matrix multiplication for transformers large language models,” 2021.
at scale,” in Advances in Neural Information Processing X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous
Systems, A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, prompts for generation,” in Proceedings of the 59th Annual
Eds., 2022. [Online]. Available: https://openreview.net/ Meeting of the Association for Computational Linguistics and
forum?id=dXiGWqBoxaD the 11th International Joint Conference on Natural Language
Y. J. Kim, R. Henry, R. Fahim, and H. H. Awadalla, Processing (Volume 1: Long Papers), C. Zong, F. Xia,
“Finequant: Unlocking efficiency with fine-grained W. Li, and R. Navigli, Eds. Online: Association for
weight-only quantization for llms,” 2023. Computational Linguistics, Aug. 2021, pp. 4582–4597.
C. Tao, L. Hou, W. Zhang, L. Shang, X. Jiang, Q. Liu, P. Luo, [Online]. Available: https://aclanthology.org/2021.acl-
and N. Wong, “Compression of generative pre-trained long.353
language models via quantization,” in Proceedings of the X. Liu, K. Ji, Y. Fu, W. Tam, Z. Du, Z. Yang, and J. Tang, “P-
60th Annual Meeting of the Association for Computational tuning: Prompt tuning can be comparable to fine-tuning
Linguistics (Volume 1: Long Papers), S. Muresan, P. Nakov, across scales and tasks,” in Proceedings of the 60th Annual
and A. Villavicencio, Eds. Dublin, Ireland: Association Meeting of the Association for Computational Linguistics
for Computational Linguistics, May 2022, pp. 4821–4836. (Volume 2: Short Papers), S. Muresan, P. Nakov, and
[Online]. Available: https://aclanthology.org/2022.acl- A. Villavicencio, Eds. Dublin, Ireland: Association for
long.331 Computational Linguistics, May 2022, pp. 61–68. [Online].
43