0% found this document useful (0 votes)
146 views43 pages

Knowledge Distillation of LLM

Uploaded by

zyl1418267959
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
146 views43 pages

Knowledge Distillation of LLM

Uploaded by

zyl1418267959
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 43

1

A Survey on Knowledge Distillation of Large


Language Models
Xiaohan Xu1 , Ming Li2 , Chongyang Tao3 , Tao Shen4 , Reynold Cheng1 , Jinyang Li1 ,
Can Xu5 , Dacheng Tao6 , Tianyi Zhou2
1 The
University of Hong Kong 2 University of Maryland 3 Microsoft
4 University
of Technology Sydney 5 Peking University 6 The University of Sydney
{shawnxxh,chongyangtao,hishentao}@gmail.com {minglii,tianyi}@umd.edu
[email protected] [email protected]
arXiv:2402.13116v4 [cs.CL] 21 Oct 2024

Abstract—In the era of Large Language Models (LLMs), Knowledge Distillation (KD) emerges as a pivotal methodology for transferring
advanced capabilities from leading proprietary LLMs, such as GPT-4, to their open-source counterparts like LLaMA and Mistral.
Additionally, as open-source LLMs flourish, KD plays a crucial role in both compressing these models, and facilitating their self-
improvement by employing themselves as teachers. This paper presents a comprehensive survey of KD’s role within the realm of
LLM, highlighting its critical function in imparting advanced knowledge to smaller models and its utility in model compression and self-
improvement. Our survey is meticulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a
comprehensive examination of KD mechanisms, the enhancement of specific cognitive abilities, and their practical implications across
diverse fields. Crucially, the survey navigates the interaction between data augmentation (DA) and KD, illustrating how DA emerges
as a powerful paradigm within the KD framework to bolster LLMs’ performance. By leveraging DA to generate context-rich, skill-
specific training data, KD transcends traditional boundaries, enabling open-source models to approximate the contextual adeptness,
ethical alignment, and deep semantic insights characteristic of their proprietary counterparts. This work aims to provide an insightful
guide for researchers and practitioners, offering a detailed overview of current methodologies in knowledge distillation and proposing
future research directions. By bridging the gap between proprietary and open-source LLMs, this survey underscores the potential
for more accessible, efficient, and powerful AI solutions. Most importantly, we firmly advocate for compliance with the legal terms
that regulate the use of LLMs, ensuring ethical and lawful application of KD of LLMs. An associated Github repository is available at
https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs.

Index Terms—Large language models, knowledge distillation, data augmentation, skill distillation, supervised fine-tuning

1 I NTRODUCTION extends far beyond current applications, promising to revo-


lutionize industries, augment human creativity, and redefine
In the evolving landscape of artificial intelligence (AI), our interaction with technology.
proprietary1 Large Language Models (LLMs) such as GPT- Despite the remarkable capabilities of proprietary LLMs
3.5 (Ouyang et al., 2022), GPT-4 (OpenAI et al., 2023), like GPT-4 and Gemini, they are not without their shortcom-
Gemini (Team et al., 2023) and Claude2 have emerged as ings, particularly when viewed in light of the advantages
groundbreaking technologies, reshaping our understanding offered by open-source models. A significant drawback is
of natural language processing (NLP). These models, char- their limited accessibility and higher cost (OpenAI et al.,
acterized by their vast scale and complexity, have unlocked 2023). These proprietary models often come with substantial
new realms of possibility, from generating human-like text usage fees and restricted access, making them less attain-
to offering sophisticated problem-solving capabilities. The able for individuals and smaller organizations. In terms of
core significance of these LLMs lies in their emergent abil- data privacy and security (Wu et al., 2023a), using these
ities (Wei et al., 2022a,b; Xu et al., 2024a), a phenomenon proprietary LLMs frequently entails sending sensitive data
where the models display capabilities beyond their explicit to external servers, which raises concerns about data pri-
training objectives, enabling them to tackle a diverse array vacy and security. This aspect is especially critical for users
of tasks with remarkable proficiency. These models excel handling confidential information. Moreover, the general-
in understanding and generation, driving applications from purpose design of proprietary LLMs, while powerful, may
creative generation to complex problem-solving (OpenAI not always align with the specific needs of niche applica-
et al., 2023; Liang et al., 2022). The potential of these models tions. The constraints of accessibility, cost, and adaptability
thus present significant challenges in leveraging the full
1. For simplicity, we use ‘proprietary’ to represent both versatile yet potential of proprietary LLMs.
close-source LLMs like GPT-4 and open-source yet huge LLMs like
LLaMA-2-70B, which encapsulate rich knowledge with a large number
In contrast to proprietary LLMs, open-source models
of parameters. like LLaMA (Touvron et al., 2023) and Mistral (Jiang et al.,
2. https://www.anthropic.com/claude-in-slack 2023a) bring several notable advantages. One of the primary
2

benefits of open-source models is their accessibility and ③


adaptability. Without the constraints of licensing fees or Self-Improvement
restrictive usage policies, these models are more readily
① ②
available to a broader range of users, from individual re- Compress
Advance
searchers to smaller organizations. This openness fosters a
more collaborative and inclusive AI research environment,
encouraging innovation and diverse applications. Addition-
ally, the customizable nature of open-source LLMs allows Closed-Source LLMs Open-Source LLMs Smaller LMs
for more tailored solutions, addressing specific needs that Direction of KD
generic, large-scale models may not meet.
However, the open-source LLMs also have their own Fig. 1: KD plays three key roles in LLMs: 1) Primarily
set of drawbacks, primarily stemming from their relatively enhancing capabilities, 2) offering traditional compression
limited scale and resources compared to their proprietary for efficiency, and 3) an emerging trend of self-improvement
counterparts. One of the most significant limitations is via self-generated knowledge.
the smaller model scale, which often results in lower per-
formance on real-world tasks with a bunch of instruc-
tions (Zheng et al., 2023a). These models, with fewer pa-
rameters, may struggle to capture the depth and breadth (e.g., in-context learning (Huang et al., 2022a) and in-
of knowledge embodied in larger models like GPT-4. Ad- struction following (Taori et al., 2023)), improved align-
ditionally, the pre-training investment in these open-source ment with user intents (e.g., human values/principles (Cui
models is typically less substantial. This reduced investment et al., 2023a), and thinking patterns like chain-of-thought
can lead to a narrower range of pre-training data, poten- (CoT) (Mukherjee et al., 2023)), and NLP task specialization
tially limiting the models’ understanding and handling of (e.g., semantic understanding (Ding et al., 2023a), and code
diverse or specialized topics (Liang et al., 2022; Sun et al., generation (Chaudhary, 2023)). These skills are crucial for
2024a). Moreover, open-source models often undergo fewer the wide array of applications that LLMs are expected
fine-tuning steps due to resource constraints. Fine-tuning to perform, ranging from casual conversations to com-
is crucial for optimizing a model’s performance for spe- plex problem-solving in specialized domains. For instance,
cific tasks or industries, and the lack thereof can hinder in vertical domains like healthcare (Wang et al., 2023a),
the model’s effectiveness in specialized applications. This law (LAW, 2023), or science (Zhang et al., 2024), where
limitation becomes particularly evident when these models accuracy and context-specific knowledge are paramount,
are compared to the highly fine-tuned proprietary LLMs, knowledge distillation allows open-source models to sig-
which are often tailored to excel in a wide array of complex nificantly improve their performance by learning from the
scenarios (OpenAI et al., 2023). proprietary models that have been extensively trained and
Primarily, recognizing the disparities between propri- fine-tuned in these areas.
etary and open-source LLMs, KD techniques have surged The benefits of knowledge distillation in the era of
as a means to bridge the performance gap between these LLMs are multifaceted and transformative (Gu et al., 2024).
models (Gou et al., 2021; Gupta and Agrawal, 2022). Knowl- Through a suite of distillation techniques, the gap between
edge distillation, in this context, involves leveraging the proprietary and open-source models is significantly nar-
more advanced capabilities of leading proprietary models rowed (Chiang et al., 2023; Xu et al., 2023a) and even
like GPT-4 or Gemini as a guiding framework to enhance filled (Zhao et al., 2023a). This process not only streamlines
the competencies of open-source LLMs. This process is computational requirements but also enhances the environ-
akin to transferring the ‘knowledge’ of a highly skilled mental sustainability of AI operations, as open-source mod-
teacher to a student, wherein the student (e.g., open-source els become more proficient with lesser computational over-
LLM) learns to mimic the performance characteristics of head. Furthermore, knowledge distillation fosters a more
the teacher (e.g., proprietary LLM). Compared to traditional accessible and equitable AI landscape, where smaller enti-
knowledge distillation algorithms (Gou et al., 2021), data ties and individual researchers gain access to state-of-the-art
augmentation (DA) (Feng et al., 2021) has emerged as a capabilities, encouraging wider participation and diversity
prevalent paradigm to achieve knowledge distillation of in AI advancements. This democratization of technology
LLMs, where a small seed of knowledge is used to prompt leads to more robust, versatile, and accessible AI solutions,
the LLM to generate more data with respect to a specific catalyzing innovation and growth across various industries
skill or domain (Taori et al., 2023). Secondly, KD still retains and research domains.
its fundamental role in compressing LLMs, making them The escalating need for a comprehensive survey on the
more efficient without significant loss in performance. (Gu knowledge distillation of LLMs stems from the rapidly
et al., 2024; Agarwal et al., 2024). More recently, the strategy evolving landscape of AI (OpenAI et al., 2023; Team et al.,
of employing open-source LLMs as teachers for their own 2023) and the increasing complexity of these models. As AI
self-improvement has emerged as a promising approach, continues to penetrate various sectors, the ability to effi-
enhancing their capabilities significantly (Yuan et al., 2024a; ciently and effectively distill knowledge from proprietary
Chen et al., 2024a). Figure 1 provides an illustration of these LLMs to open-source ones becomes not just a technical
three key roles played by KD in the context of LLMs. aspiration but a practical necessity. This need is driven by
A key aspect of the knowledge distillation is the en- the growing demand for more accessible, cost-effective, and
hancement of skills such as advanced context following adaptable AI solutions that can cater to a diverse range
3

Sec. 4 Skills Sec. 5 Vertical Domains


Supervised Fine-tuning
Context Following Alignment Agent Law Medical & Healthcare
X,Y
NLP Task Specialization Multi-Modality Finance Science Misc.

Skill Domain Divergence and Similarity


Seed ②
Knowledge Instructions
steer ① Train ④ guide
Input Set feature feature
Teacher LLM Student Model
Generated
Dataset drive GPT-4 Knowledge
③ Llama
Claude GPT Reinforcement Learning
Demonstrations Vicuna
reward
Llama OPT
Raw data
Gemini ……
…… outputs distill
Sec. 3.1 Knowledge Elicitation Sec. 3.2 Distillation Algorithm RM! (·)

Labeling Expansion Data Curation Feature Feedback Self-Knowledge Rank Optimization


input demonstrations raw data input, output input feedback input preference

y2≻ y3≻ y1
rank
label expand synthesize extract y,1 y,2y3
Y X,Y X,Y feature output output

Fig. 2: An overview of this survey on knowledge distillation of large language models. Note that ‘Section’ is abbreviated
as ‘Sec.’ in this figure. RMS (·) denotes the student reward model. ⃝
1⃝2⃝
3⃝4 denote the steps in KD of LLMs.

of applications and users. A survey in this field is vital illustrating the practical implications and transformative
for synthesizing the current methodologies, challenges, and impact of these approaches. The survey suggests open
breakthroughs in knowledge distillation. It may serve as a problems in §6, identifying current challenges and gaps in
beacon for researchers and practitioners alike, guiding them knowledge distillation research that offer opportunities for
to distill complex AI capabilities into more manageable and future work. Finally, the conclusion and discussion in §7
accessible forms. Moreover, such a survey can illuminate the synthesize the insights gained, reflecting on the implica-
path forward, identifying gaps in current techniques and tions for the broader AI and NLP research community and
proposing directions for future research. proposing directions for future research. Figure 2 shows an
overview of this survey.
Survey Organization. The remainder of this survey is orga-
nized into several comprehensive sections, each designed to 2 OVERVIEW
offer a deep dive into the multifaceted aspects of knowledge
distillation within the realm ofLLMs. Following this intro- 2.1 Comparing Traditional Recipe
duction, §2 provides a foundational overview of knowledge The concept of knowledge distillation in the field of AI
distillation, comparing traditional techniques with those and deep learning (DL) refers to the process of transferring
emerging in the era of LLMs and highlighting the role of knowledge from a large, complex model (teacher) to a
data augmentation (DA) in this context. §3 delves into the smaller, more efficient model (student) (Gou et al., 2021).
approaches to elicit knowledge from teacher LLMs and core This technique is pivotal in mitigating the challenges posed
distillation algorithms, examining methods from supervised by the computational demands and resource constraints of
fine-tuning to more complex strategies involving divergence deploying large-scale models in practical applications.
and similarity, reinforcement learning, and ranking opti- Historically, knowledge distillation techniques, prior to
mization. Then, §4 focuses on skill distillation, exploring the era of LLMs, primarily concentrated on transferring
how student models can be enhanced to improve context knowledge from complex, often cumbersome neural net-
understanding, alignment with user intentions, and perfor- works to more compact and efficient architectures (Sanh
mance across a variety of NLP tasks. This includes discus- et al., 2019; Kim and Rush, 2016). This process was largely
sions on natural language understanding (NLU), genera- driven by the need to deploy machine learning models in
tion (NLG), information retrieval, recommendation systems, resource-constrained environments, such as mobile devices
and the evaluation of text generation. In §5, we venture or edge computing platforms, where the computational
into domain-specific vertical distillation, showcasing how power and memory are limited. The focus was predomi-
knowledge distillation techniques are applied within spe- nantly on ad-hoc neural architecture selection and training
cialized fields such as law, healthcare, finance, and science, objectives tailored for single tasks. These earlier methods
4

AnnoLLM (He et al., 2023a), PandaLM (Wang et al., 2023b), CoT-Distill (Hsieh et al., 2023)
Labeling Orca (Mukherjee et al., 2023), Orca 2 (Mitra et al., 2023), Baize (Xu et al., 2023b),
Mammoth (Yue et al., 2023a), Mixed Distill (Chenglin et al., 2023)

Self-Instruct (Wang et al., 2022a), Alpaca (Taori et al., 2023), Code Alpaca (Chaudhary, 2023)
Expansion Self-Align (Sun et al., 2024b), WizardLM (Xu et al., 2023a), WizardCoder (Luo et al., 2023a),
WizardMath (Luo et al., 2023b), AugGPT (Dai et al., 2023a), TDG (He et al., 2023b)

UltraChat (Ding et al., 2023b), Phi-1 (Gunasekar et al., 2023), Phi-1.5 (Li et al., 2023a),
Curation Phi-2 (Mar, 2023), Magicoder (Wei et al., 2023), WaveCoder (Yu et al., 2024)
ZeroGen (Ye et al., 2022), SunGen (Gao et al., 2023a), InPars (Bonifacio et al., 2022)
Knowledge
BabyLlama (Timiryasov and Tastet, 2023), MiniLLM (Gu et al., 2024),
Feature
GKD (Agarwal et al., 2024), QuantGPT (Tao et al., 2022a), LLM-QAT (Liu et al., 2023a),

CAI (Bai et al., 2022a), WizardMath (Luo et al., 2023b), UltraFeedback (Cui et al., 2023a),
Feedback Zephyr (Tunstall et al., 2023), CycleAlign (Hong et al., 2023), RLAIF (Lee et al., 2023a),
Lion (Jiang et al., 2023b), PERsD (Chen et al., 2023a), GKD (Agarwal et al., 2024)

KD Algorithms Self-Instruct (Wang et al., 2022a), Self-Align (Sun et al., 2024b), RLCD (Yang et al., 2024),
Self-Knowledge ImpDistill (Jung et al., 2023), LMSI (Huang et al., 2023a), ReST (Gulcehre et al., 2023),
Self-Rewarding (Yuan et al., 2024a), Baize (Xu et al., 2023b), STaR (Zelikman et al., 2022)

Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), WizardLM (Xu et al., 2023a),
Supervised Fine-Tuning
Self-Instruct (Wang et al., 2022a), Baize (Xu et al., 2023b), STaR (Zelikman et al., 2022),

DistilGPT (Sanh et al., 2019), f-Distill (Wen et al., 2023), MiniLLM (Gu et al., 2024)
Divergence and Similarity
TED (Liang et al., 2023a), GKD (Agarwal et al., 2024),BabyLlama(Timiryasov and Tastet, 2023)
Distillation
CAI (Bai et al., 2022a), UltraFeedback (Cui et al., 2023a), WizardMath (Luo et al., 2023b),
Reinforcement Learning
MiniLLM (Gu et al., 2024), GKD (Agarwal et al., 2024), GPT3 Reward (Kwon et al., 2023)

Rank Optimization Zephyr (Tunstall et al., 2023), CycleAlign (Hong et al., 2023),

Self-Instruct (Wang et al., 2022a), Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023),
Instruction Following WizardLM (Xu et al., 2023a), Orca (Mukherjee et al., 2023), Orca 2 (Mitra et al., 2023),
WizardMath (Luo et al., 2023b), Llama-GPT4 (Peng et al., 2023a),

Context Following Vicuna (Chiang et al., 2023), Baize (Xu et al., 2023b), UltraLLaMA (Ding et al., 2023b),
Multi-turn Dialogue
CAMEL (Li et al., 2023b), OpenChat (Wang et al., 2023c), Zephyr (Tunstall et al., 2023),

RAG Capbility KARD (Kang et al., 2023a), SAIL (Luo et al., 2023c), Self-RAG (Asai et al., 2023),

Selfee (Ye et al., 2023), Orca (Mukherjee et al., 2023), Orca 2 (Mitra et al., 2023),
Thinking Pattern
AFT (Wang et al., 2023d), AdaptLLM (Cheng et al., 2023), KnowPAT (Zhang et al., 2023a),
Knowledge Distillation of LLMs

CAI (Bai et al., 2022a), GPT-3 Reward (Kwon et al., 2023), ILF (Scheurer et al., 2023),
Alignment Preference ALMoST (Kim et al., 2023a), RLEF (Roit et al., 2023), RLAIF (Lee et al., 2023a),
Zephy (Tunstall et al., 2023), UltraFeedback (Cui et al., 2023a),

CAI (Bai et al., 2022a), Align Honesty (Yang et al., 2023a), SANDBOX (Liu et al., 2023b),
Value
Self-Align (Sun et al., 2024b), UltraFeedback (Cui et al., 2023a), RLCD (Yang et al., 2024)

Toolformer (Schick et al., 2023), Graph-ToolFormer (Zhang, 2023), Gorilla (Patil et al., 2023),
Tool Using ToolAlpaca (Tang et al., 2023a), ToolLLM (Qin et al., 2023a), CRAFT (Yuan et al., 2023a),
Confucius (Gao et al., 2023b), MLLM-Tool (Wang et al., 2024), α-UMi (Shen et al., 2024),
Agent
FireAct (Chen et al., 2023b), AgentTuning (Zeng et al., 2023a), Lumos (Yin et al., 2023a),
Planning
AUTOACT (Qiao et al., 2024), TPTU-v2 (Kong et al., 2023),
Skill
Distillation AugGPT (Dai et al., 2023a), GPT Annotation (Gilardi et al., 2023), (Ding et al., 2023a),
NLU TDG (He et al., 2023b), SunGen (Gao et al., 2023a), Mix Distill (Chenglin et al., 2023),
Annollm (He et al., 2023a), UDG (Wang et al., 2021a), ZeroGen (Ye et al., 2022),

InheritSumm (Xu et al., 2023c), RECOMP (Xu et al., 2024b), MaRio (Ramnath et al., 2023),
NLG ID (Jung et al., 2023), GPT-3 Labeling (Wang et al., 2021b), BioGPT (Guo et al., 2023a),
ChatGPT NMT (Yang and Nicolai, 2023),

QUILL (Srinivasan et al., 2022), Promptgator (Dai et al., 2023b), InPars (Bonifacio et al., 2022),
NLP Task Information Retrieval AugTriever (Meng et al., 2023), (Sun et al., 2023a), RankVicuna (Pradeep et al., 2023a),
Specialization RankZephyr (Pradeep et al., 2023b), ExaRanker (Ferraretto et al., 2023),

Recommendation NDR (Mysore et al., 2023), InstrcutRec (Zhang et al., 2023b), ONCE (Liu et al., 2023c),

PandaLM (Wang et al., 2023b), Prometheus (Kim et al., 2024), InstructScore (Xu et al., 2023d),
Text Generation Evaluation
TigerScore (Jiang et al., 2023c), Auto-J (Li et al., 2024a),

CodeAlpaca (Chaudhary, 2023), CodeLlama (Rozière et al., 2023), Magicoder (Wei et al., 2023)
Code Phi-1 (Gunasekar et al., 2023), PERsD (Chen et al., 2023a), MFTCoder (Liu et al., 2023d),
WaveCoder (Yu et al., 2024), Code Clean (Jain et al., 2023),

LLaVA (Liu et al., 2023e), SVIT (Zhao et al., 2023b), LVIS-Instruct4V (Wang et al., 2023e), Shikra (Chen et al., 2023c),
Multi-Modality LSKD (Park et al., 2023), DetGPT (Pi et al., 2023; Zhao et al., 2023c), LRV (Liu et al., 2023f), NExT-GPT (Wu et al., 2023b),
Valley (Luo et al., 2023d), ILuvUI (Jiang et al., 2023d), StableLLaVA (Li et al., 2023c), PointLLM (Xu et al., 2023e),

Verticalization Law (Huang et al., 2023b; Cui et al., 2023b); Medical & Healthcare (Zhang et al., 2023c; Chen et al., 2023d); Finance (Zhang and Yang, 2023);
Distillation Science (Xie et al., 2023a; Zhang et al., 2024) and Misc. (Dan et al., 2023; Guo et al., 2023b)

Fig. 3: Taxonomy of Knowledge Distillation of Large Language Models. The detailed taxonomy of Verticalization
Distillation is shown in Figure 7.
5

involved training a smaller student network to mimic the The relationship between DA and KD in LLMs is both
output of a larger teacher network, often through techniques symbiotic and foundational. By leveraging a set of seed
like soft target training, where the student learns from knowledge, KD employs DA to prompt LLMs to produce
the softened softmax output of the teacher. Please refer to explicit data that encapsulates specific skills or domain
the survey (Gou et al., 2021) for more details on general expertise (Chaudhary, 2023; West et al., 2022). This method
knowledge distillation techniques in AI and DL. stands out as a potent mechanism for bridging the knowl-
In contrast, the advent of LLMs has revolutionized edge and capability gap between proprietary and open-
the knowledge distillation landscape. The current era of source models. Through DA, LLMs are prompted to create
knowledge distillation in LLMs shifts the focus from mere targeted, high-quality datasets that are not merely larger in
architecture compression to knowledge elicitation and trans- volume but are also rich in diversity and specificity. This
fer (Taori et al., 2023; Chaudhary, 2023; Tunstall et al., 2023). approach enables the distillation process to be more effec-
This paradigm change is largely due to the expansive and tive, ensuring that the distilled models not only replicate
deep-seated knowledge that LLMs like GPT-4 and Gemini the teacher model’s output behavior but also embody its
possess. And the inaccessible parameters of LLMs make it deep-seated understanding and cognitive strategies.
hard to compress them by using pruning (Han et al., 2016) or DA acts as a force multiplier, enabling the distilled mod-
quantization (Liu et al., 2023a) techniques. Unlike the earlier els to acquire and refine capabilities that would otherwise
era, where the goal was to replicate the output behavior of require exponentially larger datasets and computational re-
the teacher model or reduce the model size, the current focus sources. It facilitates a more effective transfer of knowledge,
in LLM-based knowledge distillation is to elicit the specific focusing on the qualitative aspects of learning rather than
knowledge these models have. quantitative expansion. This strategic use of DA within
The key to this modern approach lies in heuristic and KD processes underscores a pivotal shift towards a more
carefully designed prompts, which are used to elicit specific efficient, sustainable, and accessible approach to harnessing
knowledge (Ding et al., 2023b) or capabilities (Chaudhary, the power of LLMs. It empowers open-source models with
2023) from the LLMs. These prompts are crafted to tap the ability to approximate the contextual adeptness, ethical
into the LLM’s understanding and capabilities in various alignment, and deep semantic insights characteristic of their
domains, ranging from natural language understanding (He proprietary counterparts, thereby democratizing access to
et al., 2023a) to more complex cognitive tasks like reason- advanced AI capabilities and fostering innovation across a
ing (Hsieh et al., 2023) and problem-solving (Qiao et al., broader spectrum of applications and users.
2024). The use of prompts as a means of knowledge elici-
tation offers a more flexible and dynamic approach to dis- 2.3 Survey Scope
tillation. It allows for a more targeted extraction of knowl-
Building on the discussions introduced earlier, this survey
edge, focusing on specific skills or domains of interest. This
aims to comprehensively explore the landscape of knowl-
method is particularly effective in harnessing the emergent
edge distillation within the context of LLMs, following
abilities of LLMs, where the models exhibit capabilities
a meticulously structured taxonomy as in Figure 3. The
beyond their explicit training objectives.
Furthermore, this era of knowledge distillation also em- survey’s scope is delineated through three primary facets:
phasizes the transfer of more abstract qualities such as KD Algorithms, Skill Distillation, and Verticalization Dis-
reasoning patterns (Mitra et al., 2023), preference align- tillation. Each facet encapsulates a range of subtopics and
ment (Cui et al., 2023a), and value alignment (Sun et al., methodologies. It’s important to note that KD algorithms
2024b). This is in stark contrast to the earlier focus on output provide the technical foundations for skill distillation and
replication (Taori et al., 2023), indicating a shift towards verticalization distillation.
a more holistic and comprehensive transfer of cognitive KD Algorithms. This segment focuses on the technical
capabilities. The current techniques involve not just the foundations and methodologies of knowledge distillation. It
replication of outputs, but also the emulation of the thought includes an in-depth exploration of the processes involved
processes (Mitra et al., 2023) and decision-making (Asai in constructing knowledge from teacher models (e.g., pro-
et al., 2023) patterns of the teacher model. This involves prietary LLMs) and integrating this knowledge into student
complex strategies like chain-of-thought prompting, where models (e.g., open-source LLMs). Under the umbrella of
the student model is trained to learn the reasoning process ‘knowledge’, we delve into strategies such as labeling (Hsieh
of the teacher, thereby enhancing its problem-solving and et al., 2023), expansion (Taori et al., 2023), curation (Gu-
decision-making capabilities. nasekar et al., 2023), feature understanding (Agarwal et al.,
2024), feedback mechanisms (Tunstall et al., 2023), and self-
2.2 Relation to Data Augmentation (DA) knowledge generation (Wang et al., 2022a). This exploration
In the era of LLMs, Data Augmentation (DA) (Wang et al., seeks to uncover the various ways in which knowledge
2022a; Ye et al., 2022) emerges as a critical paradigm integral can be identified, expanded, and curated for effective dis-
to the process of knowledge distillation. Unlike traditional tillation. The ‘distillation’ subsection examines learning ap-
DA techniques such as paraphrasing (Gangal et al., 2022) or proaches like supervised fine-tuning (SFT) (Wang et al.,
back-translation (Longpre et al., 2019), which primarily aim 2022a), divergence minimization (Agarwal et al., 2024),
at expanding the training dataset in a somewhat mechanical reinforcement learning techniques (Cui et al., 2023a), and
manner, DA within the context of LLMs focuses on the rank optimization strategies (Tunstall et al., 2023). Together,
generation of novel, context-rich training data tailored to these techniques demonstrate how KD enables open-source
specific domains and skills. models to obtain knowledge from proprietary ones.
6

Skill Distillation. This facet examines the specific compe- from a sophisticated teacher model to a less complex student
tencies and capabilities enhanced through KD. It encom- model. This pipeline is integral for leveraging the advanced
passes detailed discussions on context following (Taori et al., capabilities of models like GPT-4 or Gemini in more acces-
2023; Luo et al., 2023c), with subtopics like instruction sible and efficient open-source counterparts. The outline of
following and retrieval-augmented generation (RAG) Capa- this pipeline can be broadly categorized into four distinct
bility. In the realm of alignment (Mitra et al., 2023; Tun- stages, each playing a crucial role in the successful distilla-
stall et al., 2023), the survey investigates thinking patterns, tion of knowledge. An illustration is shown in Figure 4. The
persona/preference modeling, and value alignment. The detailed pipeline could also be seen in Figure 2.
‘agent’ category delves into skills such as Tool Using and I. Target Skill or Domain Steering Teacher LLM. The
Planning. NLP task specialization (Dai et al., 2023a; Jung first stage involves directing the teacher LLM towards a
et al., 2023; Chaudhary, 2023) is scrutinized through lenses specific target skill or domain. This is achieved through care-
like natural language understanding (NLU), natural lan- fully crafted instructions or templates that guide the LLM’s
guage generation (NLG), information retrieval, recommen- focus. These instructions are designed to elicit responses
dation systems, text generation evaluation, and code gen- that demonstrate the LLM’s proficiency in a particular area,
eration. Finally, the survey addresses multi-modality (Liu be it a specialized domain like healthcare or law, or a skill
et al., 2023e; Zhao et al., 2023b), exploring how KD enhances such as reasoning or language understanding.
LLMs’ ability to integrate multiple forms of input. II. Seed Knowledge as Input. Once the target area is
defined, the next step is to feed the teacher LLM with
Verticalization Distillation. This section assesses the ap-
seed knowledge. This seed knowledge typically comprises
plication of KD across diverse vertical domains, offering
a small dataset or specific data clues relevant to the elicit
insights into how distilled LLMs can be tailored for spe-
skill or domain knowledge from the teacher LLM. It acts
cialized fields such as Law (LAW, 2023), Medical & Health-
as a catalyst, prompting the teacher LLM to generate more
care (Wang et al., 2023a), Finance (Zhang and Yang, 2023),
elaborate and detailed outputs based on this initial infor-
Science (Zhang et al., 2024), among others. This exploration
mation. The seed knowledge is crucial as it provides a
not only showcases the practical implications of KD tech-
foundation upon which the teacher model can build and
niques but also highlights their transformative impact on
expand, thereby creating more comprehensive and in-depth
domain-specific AI solutions.
knowledge examples.
Through these facets, this survey provides a compre-
III. Generation of Distillation Knowledge. In response
hensive analysis of KD in LLMs, guiding researchers and
to the seed knowledge and steering instructions, the teacher
practitioners through methodologies, challenges, and op-
LLM generates knowledge examples. These examples are
portunities in this rapidly evolving domain.
predominantly in the form of question-and-answer (QA)
Declaration. This survey represents our earnest effort to dialogues or narrative explanations, aligning with the nat-
provide a comprehensive and insightful overview of knowl- ural language processing/understanding capabilities of the
edge distillation techniques applied to LLMs, focusing on LLM. In certain specialized cases, the outputs may also in-
algorithms, skill enhancement, and domain-specific appli- clude logits or hidden features, although this is less common
cations. Given the vast and rapidly evolving nature of due to the complexity and specific requirements of such
this field, especially with the prevalent practice of elic- data forms. The generated knowledge examples constitute
iting knowledge from training data across academia, we the core of the distillation knowledge, encapsulating the
acknowledge that this manuscript may not encompass every advanced understanding and skills of the teacher LLM.
pertinent study or development. Nonetheless, it endeavors IV. Training the Student Model with a Specific Learn-
to introduce the foundational paradigms of knowledge dis- ing Objective. The final stage involves the utilization of
tillation, highlighting key methodologies and their impacts the generated knowledge examples to train the student
across a range of applications. model. This training is guided by a loss function that aligns
with the learning objectives. The loss function quantifies
2.4 Distillation Pipeline in LLM Era the student model’s performance in replicating or adapting
the knowledge from the teacher model. By minimizing this
Learning loss, the student model learns to emulate the target skills or
Skill/Domain
Objective domain knowledge of the teacher, thereby acquiring similar
capabilities. The process involves iteratively adjusting the
steer train student model’s parameters to reduce the discrepancy be-
tween its outputs and those of the teacher model, ensuring
drive Generated
Knowledge the effective transfer of knowledge.
Seed
In essential, the above four stages can be abstracted
Knowledge
Teacher LLM Student Model as two formulations. The first formulation represents the
Knowledge Elicitation Distillation Algorithm process of eliciting knowledge:

Fig. 4: An illustration of a general pipeline to distill knowl- DI(kd) = {Parse(o, s)|o ∼ pT (o|I ⊕ s), ∀s ∼ S}, (1)
edge from a large language model to a student model.
where ⊕ denotes fusing two pieces of text, I denotes an
The general distillation pipeline of LLMs is a structured instruction or a template for a task, skill, or domain to
and methodical process aimed at transferring knowledge steer the LLM and elicit knowledge, s ∼ S denotes an
7

example of the seed knowledge, upon which the LLM can tasks leverage LLMs to label evaluated results (Li et al.,
explore to generate novel knowledge, Parse(o, s) stands for 2024b; Wang et al., 2023b), and reasoning tasks utilize LLMs
to parse the distillation example ( e.g., (x, y)) from the for labeling Chains of Thought (CoT) explanations (Hsieh
teacher LLM’s output o (plus the input s in some cases), et al., 2023; Li et al., 2022; Ho et al., 2023; Magister et al.,
and pT represents the teacher LLM with parameters θT . 2023; Fu et al., 2023; Ramnath et al., 2023; Li et al., 2023d;
(kd)
Given the datasets DI built for distillation, we then define Liu et al., 2023g), among others. Rather than concentrating
a learning objective as on specific tasks, many current works focus on labeling
outputs based on instructions, thereby teaching student
LI (DI(kd) ; θS ),
X
L= (2) models to solve tasks in a more flexible way by following in-
I
P structions. Collections of various NLP tasks, complemented
where I denotes there could be multiple tasks or skills
by instructional templates, serve as valuable input sources
being distilled into one student model, LI (·; ·) stands for a
for x. For instance, FLAN-v2 collections (Longpre et al.,
specific learning objective, and θS parameterizes the student
2023) offers extensive publicly available sets of tasks with
model.
instructions, which are labeled with responses generated
Following our exploration of the distillation pipeline and
by teacher LLMs in Orca (Mukherjee et al., 2023; Mitra
the foundational concepts underlying knowledge distilla-
et al., 2023). The instructions from these NLP tasks are
tion in the LLM era, we now turn our focus to the specific
built from predefined templates, which lack diversity and
algorithms that have gained prominence in this era.
may have gaps between human’s natural query. The real
conversations between humans and chat models provide
3 K NOWLEDGE D ISTILLATION A LGORITHMS large-scale data with real queries and generations labeled
This section navigates through the process of knowledge by powerful LLMs, like ShareGPT. Additionally, Xu et al.
distillation. According to Section 2.4, it is categorized into (2023b) and Anand et al. (2023) label the real questions
two principal steps: ‘Knowledge,’ focusing on eliciting sampled from forums like Quora and Stack Overflow.
knowledge from teacher LLMs (Eq.1), and ‘Distillation,’ Moreover, the process of labeling could be guided by
centered on injecting this knowledge into student models instructions I or demonstrations c. A commonly used in-
(Eq.2). We will elaborate on these two processes in the struction type for guiding labeling is chain-of-thought (CoT)
subsequent sections. prompt (Hsieh et al., 2023; Fu et al., 2023; Magister et al.,
2023). Mukherjee et al. (2023) add multiple system messages
(e.g. “You must generate a detailed and long answer.” or
3.1 Knowledge
“explain like I’m five, think step-by-step”) to elicit rich
This section focuses on the approaches to elicit knowledge signals. Yue et al. (2023a) and Chenglin et al. (2023) la-
from teacher LLMs. According to the manners to acquire bel a hybrid of knowledge of chain-of-thought (CoT) and
knowledge, we divided them into Labeling, Expansion, Data program-of-thought (PoT) rationales. Xu et al. (2023b) pro-
Curation, Feature, Feedback, and Self-Knowledge. Figure 5 pose a self-chat technique that two teacher LLMs simulate
shows an illustration of these knowledge elicitation meth- the real conversational to generate multi-turn dialogues for
ods. a question from Quora and Stack Overflow.
3.1.1 Labeling
3.1.2 Expansion
Labeling knowledge refers to using a teacher LLM to label
the output y for a given input x as the seed knowledge, While the labeling approach is simple and effective, it faces
according to the instruction I or demonstrations c, where certain limitations. Primarily, it is constrained by the scale
c = (x1 , y1 ), . . . , (xn , yn ). This method of eliciting knowl- and variety of the input data. In real-world applications,
edge from teacher LLMs is straightforward yet effective and especially those involving user conversations, there are also
has been widely applied across various tasks and appli- concerns regarding the privacy of the data involved. To
cations. It requires only the collection of an input dataset address these limitations, various expansion methods have
and feeding it into LLMs to obtain the desired generations. been proposed (Wang et al., 2022a; Taori et al., 2023; Chaud-
Moreover, the generation of y is controllable through the hary, 2023; Si et al., 2023; Ji et al., 2023a; Luo et al., 2023b,a;
predefined I and c. This process can be formulated as Wu et al., 2023c; Sun et al., 2024b; Xu et al., 2023a; Guo
follows: et al., 2023c; Rozière et al., 2023; West et al., 2022). These
methods take the demonstrations as seed knowledge and
D(lab) = {x, y|x ∼ X , y ∼ pT (y|I ⊕ c ⊕ x)}. (3) aim to expand a large scale and various data by in-context
Input x could be sourced from existing NLP task learning.
datasets, which serve as typical reservoirs for distillation A key characteristic of these expansion methods is the
efforts. Numerous works have sought to harness the capa- utilization of the in-context learning ability of LLMs to gen-
bilities of powerful LLMs as teachers for annotating dataset erate data similar to the provided demonstrations c. Unlike
samples across a range of tasks. For instance, efforts in in the labeling approach, where the input x is sampled
natural language understanding involve using LLMs to cat- from the existing dataset, in the expansion approach, both x
egorize text (Gilardi et al., 2023; Ding et al., 2023a; He et al., and y are generated by teacher LLMs. This process can be
2023a), while in natural language generation, LLMs assist formulated as follows:
in generating sequences for outputs (Hsieh et al., 2023; Jung
et al., 2023; Wang et al., 2021b). Text generation evaluation D(exp) = {(x, y)|x ∼ pT (x|I ⊕ c), y ∼ pT (y|I ⊕ x)}. (4)
8

Input Set 𝐼 Generate Distribution


𝑐 𝑦 𝑥 Extract Feature
Labeling
𝑥 𝑦 Intermediate
Feature

Update
Expand Guide Feedback
Complete
Expansion 𝑐
𝑥 𝑦 𝑦" ≻ 𝑦# ≻ 𝑦!
𝐼
𝑥
Feedback Correct
𝑦# 𝑦#∗
Meta Sources Generate
Expand
Data Create Complete 𝑦𝑦# 𝑥 𝑥&
𝐼 𝑦"!
Curation 𝑥 𝑦
𝑚
Sample
Feedback
𝐼 Self-
𝐼 𝑦 Knowledge
Instruction 𝑥 Input 𝑦 Output Teacher 𝑥
Filter
𝑐 Demonstrations 𝑚 Meta-Information Student

Fig. 5: An illustration of different knowledge elicitation methods from teacher LLMs. Labeling: The teacher generates
the output from the input; Expansion: The teacher generates samples similar to the given demonstrations through in-
context learning; Data Curation: The teacher synthesizes data according to meta-information, such as a topic or an entity;
Feature: Feed the data into the teacher and extract its internal knowledge, such as logits and features; Feedback: The teacher
provides feedback on the student’s generations, such as preferences, corrections, expansions of challenging samples, etc;
Self-Knowledge: The student first generates outputs, which is then filtered for high quality or evaluated by the student itself.

In this formulation, x and y represent the new input- et al., 2023b) proposes the Targeted Data Generation (TDG)
output pairs generated by the teacher LLM. The input x framework, which automatically identifies challenging sub-
is generated based on a set of input-output demonstrations groups within data and generates new samples for these
c. The output y is then generated in response to the new subgroups using LLMs through in-context learning.
input x under the guidance of an instruction I . Note that In summary, the expansion method leverages the in-
the demonstrations could be predefined or dynamically context learning strengths of LLMs to produce more var-
updated by adding the newly generated samples. ied and extensive datasets with both inputs and outputs.
Expansion techniques have been widely utilized to However, the quality and diversity of the generated data
extract extensive instruction-following knowledge from are heavily reliant on the teacher LLMs and the initial seed
teacher LLMs. Wang et al. (2022a) first introduces an iter- demonstrations. This dependence can lead to a dataset with
ative bootstrapping method, Self-Instruct, to utilize LLMs inherent bias from LLMs (Yu et al., 2023a; Wei et al., 2023)
to generate a wide array of instructions based on sev- and a homogeneity issue where the generations may be
eral demonstrations sampled from 175 manually-written in- prone to similarity ultimately, limiting the diversity this
structions. The newly generated instructions are then added method seeks to achieve (Ding et al., 2023b). Moreover, the
back to the initial pool, benefiting subsequent expansion expansion process may inadvertently amplify any biases
iterations. Subsequently, Taori et al. (2023) applies this ex- present in the seed data.
pansion method to a more powerful teacher LLM, text-
davinci-003, to distill 52K high-quality data. To improve 3.1.3 Data Curation
the diversity and coverage during expansion, Wu et al. The pursuit of high-quality and scalable data generation in
(2023c) and (Sun et al., 2024b) prompt the teacher LLM to knowledge distillation from LLMs has led to the emergence
generate instructions corresponding to some specific topics. of the Data Curation approach. This method arises in re-
Xu et al. (2023a) propose an Evol-Instruct method to ex- sponse to the limitations observed in both the Labeling and
pand the instructions from two dimensions: difficulty (e.g. Expansion approaches. These methods often yield data of
rewriting the question to be more complex) and diversity variable quality and face constraints in quantity. In Labeling,
(e.g. generating more long-tailed instructions). This Evol- the seed knowledge is sourced from task datasets, leading
Instruct method is domain-agnostic and has been used to to potential noise and dirty data. Meanwhile, in Expansion,
expand the distillation of coding (Luo et al., 2023a) and the input x is derived from seed demonstrations, which
math (Luo et al., 2023b). Additionally, expansion methods can result in homogeneous data when generated in large
can significantly augment NLP task datasets with similar quantities. To overcome these challenges, the Data Curation
samples, thereby enhancing task performance. For instance, method curates high-quality or large-scale data by extensive
AugGPT (Dai et al., 2023a) leverages a teacher LLM to meta-information as seed knowledge (Ding et al., 2023b;
rephrase each sentence in the training samples into multi- Gunasekar et al., 2023; Li et al., 2023a; Mar, 2023; Liu et al.,
ple conceptually similar, but semantically varied, samples 2023d; Wei et al., 2023; Yu et al., 2024; Ye et al., 2022; Gao
to improve classification performance. Similarly, TDG (He et al., 2023a; Yang and Nicolai, 2023).
9

A distinct feature of Data Curation is its approach to create synthetic datasets will become a crucial technical
to synthesize data from scratch. Numerous diverse meta- skill and a key area of focus in AI (Li et al., 2023a).
information, such as topics or knowledge points, could be
incorporated into this process to generate controllable x 3.1.4 Feature
and y . Thus, this process can be meticulously controlled The previously discussed knowledge elicitation methods
to yield datasets that are not only large in scale but also are typically applied to powerful black-box models, which
of high quality. The formulation for Data Curation can be are expensive and somewhat unreproducible due to calling
represented as: API. In contrast, white-box distillation offers a more trans-
parent and accessible approach for researchers. It involves
D(cur) = {(x, y)|x ∼ pT (x|I ⊕ m), y ∼ pT (y|I ⊕ x)}. (5) leveraging the output distributions, intermediate features,
or activations from teacher LLMs, which we collectively
In this formulation, m represents the diverse meta- refer to as Feature knowledge. White-box KD approaches
information used to guide the synthesis of x, and I is the have predominantly been studied for smaller encoder-based
instruction guiding teacher LLMs to generate x or y . LMs, typically those with fewer than 1 billion parameters
Different studies primarily vary in their source and (cf. Gou et al. (2021) for detail). However, recent research
method of leveraging meta-information. UltraChat (Ding has begun to explore white-box distillation in the context of
et al., 2023b) effectively demonstrates the process of curating generative LLMs (Timiryasov and Tastet, 2023; Liang et al.,
both high-quality and diverse data by distilled knowledge. 2023a; Gu et al., 2024; Agarwal et al., 2024; Liu et al., 2023a;
They collect extensive meta-information across three do- Wen et al., 2023; Wan et al., 2024a; Zhao and Zhu, 2023; Qin
mains: Questions about the World, Creation and Generation, et al., 2023b; Boizard et al., 2024; Zhong et al., 2024).
and Assistance on Existing Materials. For example, under The typical method for acquiring this feature knowledge
Questions about the World, they explore 30 meta-topics like involves teacher LLMs annotating the output sequence y
”Technology” and ”Food and Drink.” the teacher LLMs with its internal representations. These annotations are then
then use this meta-information to distill a broad array distilled into the student model using methods such as
of instructions and conversations, achieving a substantial Kullback-Leibler Divergence (KLD). The process of eliciting
scale of 1.5 million instances. UltraChat stands out with its feature knowledge can be formulated as follows:
lexical and topical diversity. The UltraLLaMA model, fine-
tuned on this data, consistently surpasses other open-source D(feat) = {(x, y, ϕfeat (x, y; θT )) | x ∼ X , y ∼ Y}. (6)
models. Another notable series, phi (Gunasekar et al., 2023; In this formulation, Y is the output set, which can be
Li et al., 2023a; Mar, 2023), focuses on distilling smaller, generated by teacher LLMs, the student model, or directly
high-quality datasets akin to ”textbooks.” Phi-1(Gunasekar sourced from the dataset. ϕfeat (·; θT ) represents the opera-
et al., 2023) experiments with synthesizing ”textbook qual- tion of extracting feature knowledge (such as output distri-
ity” data in the coding domain. Their approach involves bution) from the teacher LLM.
distilling clear, self-contained, instructive, and balanced con- The most straightforward method to elicit feature knowl-
tent from LLMs, guided by random topics or function names edge of teacher is to label a fixed dataset of sequences with
to enhance diversity. The distilled data is a synthesis of 1 token-level probability distributions (Sanh et al., 2019; Wen
billion tokens of Python textbooks, complete with natural et al., 2023). To leverage the rich semantic and syntactic
language explanations and code snippets, as well as 180 mil- knowledge in intermediate layers of the teacher model,
lion tokens of Python exercises with solutions. Remarkably, TED (Liang et al., 2023a) designs task-aware layer-wise
the phi-1 model, despite its smaller size, outperforms nearly distillation. They align the student’s hidden representations
all open-source models on coding benchmarks like Hu- with those of the teacher at each layer, selectively extracting
manEval and MBPP while being 10 times smaller in model knowledge pertinent to the target task. Gu et al. (2024) and
size and 100 times smaller in dataset size. MFTCoder (Liu Agarwal et al. (2024) introduce a novel approach where
et al., 2023d) utilizes hundreds of Python knowledge points the student model first generates sequences, termed ‘self-
as meta-information to create a CodeExercise Dataset. In generated sequences.’ The student then learns by using
contrast, Magicoder (Wei et al., 2023) and WaveCoder (Yu feedback (i.e. output distribution) from teacher on these
et al., 2024) get raw code collections from open-source sequences. This method is particularly beneficial when the
code datasets, using this as meta-information for generating student model lacks the capacity to mimic teacher’s distri-
instructional data. In the context of NLU tasks, certain bution. Moreover, various LLM-quantization methods with
studies (Ye et al., 2022; Gao et al., 2023a; Wang et al., 2021a) distilling feature knowledge from teacher LLMs have been
explore the use of labels as meta-information to synthesize proposed (Tao et al., 2022a; Liu et al., 2023a; Kim et al.,
corresponding samples for data augmentation. Similarly, in 2023b). These methods aim to preserve the original output
information retrieval tasks, there are efforts to utilize docu- distribution when quantizing the LLMs, ensuring minimal
ments as meta-information for generating potential queries, loss of performance. Additionally, feature knowledge could
thereby constructing large-scale retrieval pairs (Bonifacio serve as a potent source for multi-teacher knowledge distil-
et al., 2022; Meng et al., 2023). lation. Timiryasov and Tastet (2023) leverages an ensemble
In conclusion, Data Curation through teacher LLMs has of GPT-2 and LLaMA as teacher models to extract output
emerged as a promising technique for synthesizing datasets distributions. Similarly, FuseLLM (Wan et al., 2024a) inno-
that are not only high-quality and diverse but also large vatively combines the capabilities of various LLMs through
in scale. The success of models like phi-1 in specialized a weighted fusion of their output distributions, integrating
domains underscores the efficacy of this method. The ability them into a singular LLM. This approach has the potential
10

to significantly enhance the student model’s capabilities, models, UltraFeedback. It compiles various instructions and
surpassing those of any individual teacher LLM. models to produce comparative data. Then, GPT-4 is used
In summary, feature knowledge offers a more transpar- to score candidates from various aspects of preference,
ent alternative to black-box methods, allowing for deeper including instruction-following, truthfulness, honesty and
insight into and control over the distillation process. By helpfulness.
utilizing feature knowledge from teacher LLMs, such as out- Beyond merely assessing student generations, teachers
put distributions and intermediate layer features, white-box can also furnish extensive feedback on instances where
approaches enable richer knowledge transfer. While show- students underperform. In Lion (Jiang et al., 2023b), teacher
ing promise, especially in smaller models, its application model pinpoints instructions that pose challenges to the
is not suitable for black-box LLMs where internal parame- student model, generating new, more difficult instructions
ters are inaccessible. Furthermore, student models distilled aimed at bolstering the student’s abilities. PERsD (Chen
from white-box LLMs may underperform compared to their et al., 2023a) showcases a method where teacher offers
black-box counterparts, as the black-box teacher LLMs (e.g. tailored refinement feedback on incorrect code snippets gen-
GPT-4) tend to be more powerful. erated by students, guided by the specific execution errors
encountered. Similarly, SelFee (Ye et al., 2023) leverages
3.1.5 Feedback ChatGPT to generate feedback and revise the student’s
Most previous works predominantly focus on one-way answer based on the feedback. In contrast, FIGA (Guo et al.,
knowledge transfer from the teacher to the student for 2024) revises the student’s response by comparing it to
imitation, without considering feedback from the teacher the ground-truth response. Furthermore, teacher model’s
on the student’s generation. The feedback from the teacher distribution over the student’s generations can itself act
typically offers guidance on student-generated outputs by as a form of feedback. MiniLLM (Gu et al., 2024) and
providing preferences, assessments, or corrective informa- GKD (Agarwal et al., 2024) present an innovative strategy
tion. For example, a common form of feedback involves wherein the student model initially generates sequences,
teacher ranking the student’s generations and distilling this followed by teacher model producing an output distribution
preference into the student model through Reinforcement as feedback. This method leverages the teacher’s insight
Learning from AI Feedback (RLAIF) (Bai et al., 2022a). to directly inform and refine the student model’s learning
Here is a generalized formulation for eliciting feedback process.
knowledge:
3.1.6 Self-Knowledge
D(fb) = {(x, y, ϕfb (x, y; θT ))|x ∼ X , y ∼ pS (y|x)}, (7)
The knowledge could also be elicited from the student itself,
where y denotes the output generated by the student which we refer to as Self-Knowledge. In this setting, the same
model in response to x, and ϕfb (·; θT )) represents providing model acts both as the teacher and the student, iteratively
feedback from teacher LLMs. This operation evaluates the improving itself by distilling and refining its own previously
student’s output y given the input x, by offering assess- generated outputs. This knowledge uniquely circumvents
ment, corrective information, or other forms of guidance. the need for an external, potentially proprietary, powerful
This feedback knowledge can not only be distilled into teacher model, such as GPT-series LLMs. Furthermore, it
the student to also generate feedback (such as creating a allows the model to surpass the limitations or “ceiling”
student preference model) but, more importantly, enable inherent in traditional teacher-student methods. Eliciting
the student to refine its responses based on the feedback. self-knowledge could be formulated as:
Various methods have been explored to elicit this advanced
D(sk) = {(x, y, ϕsk (x, y))|x ∼ S, y ∼ pS (y|I ⊕ x)}, (8)
knowledge (Bai et al., 2022a; Luo et al., 2023b; Cui et al.,
2023a; Kwon et al., 2023; Jiang et al., 2023b; Chen et al., where ϕsk (·) is a generalized function that represents an
2023a; Gu et al., 2024; Agarwal et al., 2024; Chen et al., 2024b; additional process to the self-generated outputs y , which
Guo et al., 2024; Ye et al., 2023; Hong et al., 2023; Lee et al., could include but is not limited to filtering, rewarding, or
2023a). any other mechanisms for enhancing or evaluating y . It
Preference, as previously discussed, represents a notable could be governed by external tools or the student itself θS .
form of feedback knowledge from teacher models. Various Recent research in this area has proposed various innovative
knowledge of preferences could be distilled from teachers methodologies to elicit self-knowledge, demonstrating its
by prompting it with specific criteria. Bai et al. (2022a) in- potential for creating more efficient and autonomous learn-
troduce RLAIF for distilling harmlessness preferences from ing systems. (Allen-Zhu and Li, 2020; Wang et al., 2022a;
LLMs. This involves using an SFT-trained LLM to generate Sun et al., 2024b; Yang et al., 2024; Jung et al., 2023; Huang
response pairs for each prompt, then ranking them for et al., 2023a; Gulcehre et al., 2023; Yuan et al., 2024a; Xu
harmlessness to create a preference dataset. This dataset is et al., 2023b; Zelikman et al., 2022; Chen et al., 2024a; Zheng
distilled into a Preference Model (PM), which then guides et al., 2024; Li et al., 2024c; Zhao et al., 2024; Singh et al.,
the RL training of a more harmless LLM policy. Wizard- 2023; Chen et al., 2024c; Hosseini et al., 2024)
Math (Luo et al., 2023b) places emphasis on mathematical A notable example of this methodology is Self-
reasoning. They employ ChatGPT as teacher to directly Instruct (Wang et al., 2022a), which utilizes GPT-3 for
provide process supervision and evaluate the correctness data augmentation through the Expansion approach, gen-
of each step in the generated solutions. To scale up high- erating additional data samples to enhance the dataset.
quality distilled preference data, Cui et al. (2023a) develop a This enriched dataset subsequently fine-tunes the original
large-scale preference dataset for distilling better preference model. Other methods aim to elicit targeted knowledge
11

from student models by modifying prompts, and leveraging Divergence Type D(p, q) Function
these data for further refinement. In Self-Align (Sun et al., Forward KLD
P p(t)
p(t) log q(t)
2024b), they find that models fine-tuned by Self-Instruct P q(t)
data tend to generate short or indirect responses. They Reverse KLD q(t) log p(t)

prompt this model with verbose instruction to produce in-


P 
1 2p(t) P 2q(t)
JS Divergence 2
p(t) log p(t)+q(t)
+ q(t) log p(t)+q(t)
depth and detailed responses. Then, they employ context-
distillation (Askell et al., 2021) to distill these responses TABLE 1: Functional forms of D for various divergence
paired with non-verbose instructions back to the model. types. p: reference
Similarly, RLCD (Yang et al., 2024) introduces the use of
contrasting prompts to generate preference pairs from an Similarity Function LF Expression
unaligned LLM, encompassing both superior and inferior
L2-Norm Distance ∥ΦT (fT (x, y)) − ΦS (fS (x, y))∥2
examples. A preference model trained on these pairs then
L1-Norm Distance ∥ΦT (fT (x, y)) − ΦS (fS (x, y))∥1
guides the enhancement of the unaligned model through P
reinforcement learning. Several other approaches employ Cross-Entropy Loss − ΦT (fT (x, y)) log(ΦS (fS (x, y)))
filtering methods to refine self-generated data. For exam- Maximum Mean Discrepancy MMD(ΦT (fT (x, y)), ΦS (fS (x, y)))
ple, Impossible Distillation (Jung et al., 2023) targets sen-
tence summarization tasks, implementing filters based on TABLE 2: Summary of similarity functions in knowledge
entailment, length, and diversity to screen self-generated distillation.
summaries. LMSI (Huang et al., 2023a) generates multiple
CoT reasoning paths and answers for each question, and
then retains only those paths that lead to the most consistent LLMs. SFT finetunes student model by maximizing the like-
answer. lihood of sequences generated by the teacher LLMs, aligning
Note that refined self-knowledge can be iteratively ac- the student’s predictions with those of the teacher. This
quired as the student model continuously improves, further process can be mathematically formulated as minimizing
enhancing the student’s capabilities. This is Gulcehre et al. the objective function:
(2023) introduces a Reinforced Self-Training (ReST) frame- LSFT = Ex∼X ,y∼pT (y|x) [− log pS (y|x)] , (9)
work that cyclically alternates between Grow and Improve
stages to progressively obtain better self-knowledge and where y is the output sequence produced by the teacher
refine the student model. During the Grow stage, the student model. This simple yet highly effective technique forms
model generates multiple output predictions. Then, in the the basis of numerous studies in the field. Numerous re-
Improve stage, these self-generated outputs are ranked searchers have successfully employed SFT to train student
and filtered using a scoring function. Subsequently, the lan- models using sequences generated by teacher LLMs (Taori
guage model undergoes fine-tuning on this curated dataset, et al., 2023; Chiang et al., 2023; Wu et al., 2023c; Xu et al.,
employing an offline RL objective. Self-Play (Chen et al., 2023a; Luo et al., 2023b). Additionally, SFT has been ex-
2024a) introduces a framework resembling iterative DPO, plored in many self-distillation works (Wang et al., 2022a;
where the language model is fine-tuned to differentiate the Huang et al., 2023c; Xu et al., 2023b; Zelikman et al., 2022).
self-generated responses from the human-annotated data. Due to the large number of KD works applying SFT, we
These self-generated responses could be seen as “negative only list representative ones here. More detailed works can
knowledge” to promote the student to better align with be found in §4.
the target distribution. Self-Rewarding (Yuan et al., 2024a)
explores a novel and promising approach by utilizing the 3.2.2 Divergence and Similarity
language model itself as a reward model. It employs LLM- This section mainly concentrates on algorithms designed for
as-a-Judge prompting to autonomously assign rewards for distilling feature knowledge from white-box teacher LLMs,
the self-generated responses. The entire process can then including distributions and hidden state features. These
be iterated, improving instruction following and reward algorithms can be broadly categorized into two groups:
modeling capabilities. those minimizing divergence in probability distributions
and those aimed at enhancing the similarity of hidden
states.
3.2 Distillation
This section focuses on the methodologies for effectively Divergence. Divergence-based methods minimize diver-
transferring the elicited knowledge from teacher LLMs into gence between the probability distributions of the teacher
student models. We explore a range of distillation tech- and student models, represented by a general divergence
niques, from the strategies that enhance imitation by Su- function D:
pervised Fine-Tuning, Divergence and Similarity, to advanced LDiv = E [D (pT (y|x), pS (y|x))] , (10)
methods like Reinforcement Learning and Rank Optimization, x∼X ,y∼Y
as shown in Figure 3. The specific form of D varies depending on the type of
divergence employed. Table 1 outlines the functional forms
3.2.1 Supervised Fine-Tuning of D for different divergence measures. The commonly-used
Supervised Fine-Tuning (SFT), or called Sequence-Level KD standard KD objectives essentially minimize the approxi-
(SeqKD) (Kim and Rush, 2016), is the simplest and one of mated forward Kullback-Leibler divergence (KLD) between
the most effective methods for distilling powerful black-box the teacher and the student distribution (Sanh et al., 2019;
12

tion functions ΦT and ΦS are applied to these feature maps


to ensure they are in the same shape, facilitating direct
comparison. The similarity function LF is used to match
these transformed feature maps. Table 2 shows common
choices for LF . Few works have employed similarity-based
methods in the KD of LLMs. Among them, Liang et al.
(2023a) propose Task-Aware Layer-Wise Distillation (TED),
a method that utilizes task-aware filters. These filters are
p argminq KL(p||q) argminq KL(q||p) designed to selectively capture the most pertinent informa-
tion for a specific task from the teacher model. The key
Fig. 6: Comparison of Forward and Reverse KL Diver- objective is to minimize the discrepancy between the filtered
gences in Approximating a Target Distribution. Forward representations in both teacher and student models. While
KL divergence approach tends to cover all modes of the similarity-based approaches are common in encoder-based
target distribution but is less precise, i.e. “mode-covering” LMs (Sun et al., 2019, 2020; Jiao et al., 2020; Hou et al.,
behavior. Reverse KL divergence method focuses predom- 2020; Zuo et al., 2022; Liang et al., 2021), their application in
inantly on the most prominent mode, thereby exhibiting a LLM knowledge distillation is not as widespread. However,
“mode-seeking” behavior. considering their effectiveness, we anticipate an increase in
research exploring these methods for LLM distillation in the
near future.
Wen et al., 2023; Timiryasov and Tastet, 2023; Liang et al.,
2023a; Chen et al., 2024d) , which forces pS to cover all the
3.2.3 Reinforcement Learning
modes of pT . However, when a student model is unable
to learn all modes of a highly complex teacher, the re- This section explores advanced methods of distilling knowl-
sultant “mode-covering” behavior might cause the student edge into student models using reinforcement learning (RL).
to assign probability mass to tokens with low probability This approach is especially relevant for leveraging the feed-
under the teacher’s distribution (cf. Figure 6 blue curve). back from teacher to train student models (Bai et al., 2022a;
This mode-covering phenomenon can potentially lead to Cui et al., 2023a; Luo et al., 2023b; Agarwal et al., 2024; Chen
hallucinations and low-quality generations. Alternatively, et al., 2024b; Ma et al., 2023a; Pang et al., 2023; Du et al.,
mode-seeking divergences like reverse KL prioritize tokens 2023a). The RL-based distillation process typically involves
where the teacher assigns high probabilities (cf. Figure 6 two main stages:
green curve). This approach can mitigate the risk of low-
quality outputs, fostering more accurate generations. How- Distilled Reward Model Training. The first stage involves
ever, it often does so at the cost of reduced diversity. training a reward model rϕ using the feedback data D(fd)
Gu et al. (2024) adopt reverse KL divergence to prevent generated by teacher LLMs. Preference data, as one of the
students from overestimating low-probability regions of the typical feedback, is employed to train the student reward
teacher’s distribution, employing Policy Gradient methods model (Bai et al., 2022a; Cui et al., 2023a; Lee et al., 2023a;
for optimization. Both Agarwal et al. (2024) and Sason and Kim et al., 2023a). They usually consist of input-output
Verdú (2016) assess the effect of different divergence func- pairs (x, yw , yl ). Here, yw and yl represent “winning” and
tions in LLM distillation, finding the optimal divergence to “losing” outputs relative to the teacher’s preferences. The
be task-dependent. For instance, forward KL divergence is loss function for the reward model is defined as:
more suitable for tasks like Machine Translation, where the
output has fewer modes or variations, while reverse KL
divergence is preferable for tasks like dialogue generation LRM (rϕ , D(fd) ) = − E [log σ (rϕ (x, yw ) − rϕ (x, yl ))]
(x,yw ,yl )∼D (fd)
and instruction tuning, which involve multiple modes and (12)
a wider range of potential responses. Thus, the nature of the
task significantly influences the selection of the divergence This formulation guides the reward model to correctly
function for optimal performance. distinguish between more and less preferable outputs based
Similarity. Similarity-based methods in knowledge distilla- on the teacher’s criteria. Instead of learning the instance-
tion aim to align the hidden states or features of the student level rewards, RLMEC (Chen et al., 2024b) adopts a dif-
model with those of the teacher. These methods use various ferent approach by training a generative reward model. It
similarity metrics to measure and optimize the congruence is trained on an erroneous solution rewriting data distilled
of internal representations between the two models. The from a teacher LLM. This distilled reward model can pro-
objective is to ensure that the student model not only duce token-level rewards for RL training.
produces similar outputs to the teacher but also processes
Reinforcement Learning Optimization. In the second stage,
information in a comparable manner. The formulation for a
the student model, represented by a policy πθ , is optimized
similarity-based objective might look like this:
to maximize the expected reward as per the trained reward
LSim = E [LF (ΦT (fT (x, y)) , ΦS (fS (x, y)))] , (11) model. Simultaneously, it minimizes the divergence from
x∼X ,y∼Y
a reference policy πref , typically the initial policy of the
where fT (x, y) and fS (x, y) are the feature maps of the student model trained by SFT, controlled by a factor β . The
teacher and student models, respectively. The transforma- RL objective is given by:
13

comparison to handle preference rankings of any length. For


a given instruction x and a sequence of responses ordered by
max E [rϕ (x, y)] − βDKL [πθ (y | x)∥πref (y | x)] teacher preference as y1 ≻ y2 ≻ ... ≻ yn , the RPO training
πθ x∼X,y∼πθ (y|x)
(13) objective is:
n−1
This RL framework not only ensures that the student model X exp (pk )
learns the explicit content from the teacher but also effec- LPRO = − log Pn , (16)
k=1 i=k exp (pi )
tively adopts the teacher’s preference patterns. The use of
RL, particularly with the PPO (Schulman et al., 2017) algo- where pk represents the conditional log probabilities for
rithm, offers a robust mechanism for aligning the student yk under the student policy πθ . By iteratively contrasting
model’s outputs with the teacher. Alternatively, the teacher the likelihood of generating responses, PRO optimizes the
LLM can also serve as the reward model to directly assign student LM to prioritize the most preferred response while
rewards during RL, circumventing the need for training a progressively ranking the rest in the order of diminishing
reward model (Lee et al., 2023a; Kwon et al., 2023). While preference.
this approach may exhibit superior performance, it comes
at a higher computational cost compared to employing a 4 S KILL D ISTILLATION
smaller distilled reward model.
Building upon the foundation laid out in Section 3 about
3.2.4 Ranking Optimization eliciting knowledge and distillation algorithms, we shift our
focus to how these techniques facilitate the distillation of
Ranking optimization presents a stable and computationally specific skills in LLMs. Our exploration will encompass
efficient alternative to RL for injecting preference feedback a diverse range of skills exhibited by LLMs, including
into language models (Rafailov et al., 2023; Song et al., Context Following, Alignment, Agent, NLP Task Specializa-
2023a; Yuan et al., 2023b). This method, diverging from tion and Multi-Modality. Context Following focuses on the
traditional RL approaches, directly incorporates ranking student’s ability to comprehend and respond effectively
information into language models from a fixed preference to input information. Alignment delves into the student’s
dataset during fine-tuning. Intuitively, it directly updates capability to align its output with the teacher’s responses.
policy to increase the relative likelihood of preferred over Moving forward, Agent underscores the autonomous nature
less favored responses. This direct optimization of prefer- of language models. NLP Task Specialization highlights the
ences, without the need for sampling outputs, makes the LLM’s versatility in specializing across various Natural
process more stable and efficient. Recently, some works have Language Processing tasks, demonstrating its adaptability.
been proposed to explore using ranking optimization to Finally, Multi-Modality encompasses the knowledge trans-
distill teacher’s preferences into student models (Tunstall fer from teacher LLMs to multi-modal models. Table 3
et al., 2023; Hong et al., 2023; Yuan et al., 2024a). summarizes the representative works, encompassing details
Zephyr (Tunstall et al., 2023) utilizes Direct Preference such as the skills involved, seed knowledge, teacher LLM,
Optimization (DPO) (Rafailov et al., 2023) to distill the student model, knowledge elicitation method, and training
preference alignment in teacher LLMs. DPO streamlines objectives.
the objective of reinforcement learning (as in Eq. 13),
which involves reward maximization with a KL-divergence
constraint, into a single-stage policy training. Specifically, 4.1 Context Following
DPO’s training goal is to maximize the following expecta- This part concentrates on the distillation of context follow-
tion: ing skills from LLMs. This process involves transferring the
 
πθ (yw |x) πθ (yl |x)
 ability of LLMs to handle a variety of complex contexts —
E log σ β log − β log , such as few-shot demonstrations, intricate instructions, dia-
(x,yw ,yl )∼D (fd) πref (yw |x) πref (yl |x)
logue history, and retrieval-augmented information — into
(14)
smaller models. Many research efforts in this domain aim
where yw is preferred over yl according to the teacher to imbue smaller models with these sophisticated, context-
LLM. Hong et al. (2023) (Hong et al., 2023) adopt two following capabilities. Our discussion here will dissect this
ranking-based optimization objectives, Rank Responses to facet of skill distillation, categorizing it based on different
align Human Feedback (RRHF) (Yuan et al., 2023b) and types of context and elaborating on how each is distilled
Preference Ranking Optimization (PRO) (Song et al., 2023a), and incorporated into smaller, efficient models.
for preference distillation. RRHF (Yuan et al., 2023b) focuses
on a ranking loss defined as: 4.1.1 Instruction Following
X Instruction-following capacity enables LLMs to understand
LRRHF = max(0, pi − pj ), (15) and follow user-given instructions. This ability significantly
ri <rj
enhances human-AI interaction, allowing for seamless un-
where ri and rj are the reward scores assigned by the derstanding and execution of tasks as directed by users. A
teacher LLM for responses yi and yj , respectively, and pi , pj primary method for acquiring this skill involves construct-
are their corresponding conditional log probabilities under ing instruction-like prompt-response pairs and employing
the policy πθ . This approach emphasizes direct comparison Supervised Fine Tuning (SFT) for model training. Data for
and ranking of responses based on the teacher’s preferences. this purpose can be manually curated by human experts
PRO (Song et al., 2023a) expands the concept of pairwise or transformed from existing NLP tasks into instructional
14
Methods Skill Seed Knowledge Teacher LLM Student Model Knowledge Elicitation Objective
Context Following
Self-Instruct (Wang et al., 2022a) IF 175 human-curated tasks GPT3 GPT3 Expansion + Self-Knowledge SFT
Alpaca (Taori et al., 2023) IF 175 human-curated tasks GPT3 LLaMA Expansion + Self-Knowledge SFT
3.5K Wikipedia Categories +
LaMini-LM (Wu et al., 2023c) IF ChatGPT Various Models Expansion SFT
Mixed Dataset
WizardLM (Xu et al., 2023a) IF Alpaca Data ChatGPT LLaMA Expansion SFT
Lion (Jiang et al., 2023b) IF Alpaca Cata ChatGPT LLaMA Labeling + Expansion + Feedback -
BabyLlama (Timiryasov and Tastet, 2023) IF 10M-word BabyLM dataset GPT-2 + small LLaMA 58M-parameter LLaMA Feature D&S
MiniLLM (Gu et al., 2024) IF Dolly Dataset GPT2 + OPT + LLaMA GPT2 + OPT + LLaMA Feature D&S
Self-Align (Sun et al., 2024b) IF Human-written Principles LLaMA LLaMA Expansion + Self-Knowledge SFT
Self-Rewarding (Yuan et al., 2024a) IF Human-written Samples LLaMA LLaMA Self-Knowledge SFT + RL
STaR (Zelikman et al., 2022) IF Arithmetic + CommonsenseQA + GSM8K GPT-J GPT-J Self-Knowledge SFT
Llama-GPT4 (Peng et al., 2023a) IF Alpaca Dataset GPT4 LLaMA Labeling SFT
Reflection-Tuning (Li et al., 2023e) IF Alpaca/WizardLM Dataset ChatGPT LLaMA Labeling SFT
Selective Reflection-Tuning (Li et al., 2024d) IF Alpaca/WizardLM Dataset ChatGPT LLaMA Labeling SFT
Vicuna (Chiang et al., 2023) IF/MD Human Conversation ChatGPT + GPT4 LLaMA Labeling SFT
Koala (Geng et al., 2023) IF/MD Human Conversation ChatGPT LLaMA Labeling SFT
Baize (Xu et al., 2023b) IF/MD Quora + Stack Overflow ChatGPT LLaMA Expansion + Self-Knowledge SFT
UltraChat (Ding et al., 2023b) IF/MD Wikidata + Text Material + C4 ChatGPT LLaMA Curation SFT
Orca (Mukherjee et al., 2023) IF/TP FLAN-v2 ChatGPT + GPT4 LLaMA Labeling SFT
Orca2 (Mitra et al., 2023) IF/TP FLAN-v2 + Few-Shot/Math/Synthetic GPT4 LLaMA Labeling SFT
SelFee (Ye et al., 2023) IF/TP Human Conv, Flan/Code/Math Collection ChatGPT LLaMA Labeling SFT
CoT-Distill (Hsieh et al., 2023) IF/TP e-SNLI + ANLI + CQA + SVAMP PaLM T5 Labeling SFT
KnowPAT (Zhang et al., 2023a) IF/TP CPKG + QA Data ChatGPT + ChatGLM + Vicuna-7B LLaMA Labeling SFT
DEBATunE (Li et al., 2024e) IF/TP Controversial Topics ChatGPT LLaMA Labeling SFT
Phi-1 (Gunasekar et al., 2023) IF/Code - GPT3.5 phi-1 Curation SFT
Phi-1.5 (Li et al., 2023a) IF/Code 20k Topics from Web GPT3.5 phi-1 Curation + Labeling SFT
SAIL (Luo et al., 2023c) IF/RAG Alpaca Data + Web Content GPT4 LLaMA Label SFT
KARD (Kang et al., 2023b) IF/RAG MedQAUSMLE ChatGPT T5 + OPT Label SFT + D&S
Self-RAG (Asai et al., 2023) IF/RAG Open-Instruct GPT4 LLaMA Labeling SFT
Alignment
OpenChat (Wang et al., 2023c) IF/Preference Human Conversation ChatGPT + GPT4 LLaMA Labeling SFT + RL
Zephyr (Tunstall et al., 2023) IF/Preference Mixed Datasets GPT4 Mistral Labeling + Feedback SFT + RO
ALMoST (Kim et al., 2023a) IF/Preference Human-written Prompts LLaMA LLaMA Expansion + Labeling SFT + RL
RLCD (Yang et al., 2024) IF/Preference Human-written Prompts LLaMA LLaMA Labeling SFT + RL
RLAIF (Lee et al., 2023a) IF/Preference Human-written Prompts PaLM 2 PaLM 2 Labeling + Feedback RL
GPT3 Reward (Kwon et al., 2023) Preference Human-written Prompts GPT3 GPT3 Labeling RL
ILF (Scheurer et al., 2023) Preference Task-specific Datasets GPT3 + FeedME GPT3 Labeling RL
ULTRAFEEDBACK (Cui et al., 2023a) Preference Mixed Datasets GPT4 LLaMA Labeling RL
Constitutional AI (Bai et al., 2022a) Preference/Value Human-written Prompts Self-defined Student Model Self-defined Model Labeling + Expansion + Feedback SFT + RL
text-davinci-002/-003 +
SANDBOX (Liu et al., 2023b) Value Simulation LLaMA Data Curation SFT + RL
GPT4 + ChatGPT
Agent
Toolformer (Schick et al., 2023) Tool CCNet GPT-J GPT-J Labeling SFT
Graph-ToolFormer (Zhang, 2023) Tool Mixed Graph Dataset ChatGPT GPT-J + LLaMA Labeling SFT
Gorilla (Patil et al., 2023) Tool Online API Documentation GPT4 LLaMA Expansion SFT
GPT4Tools (Yang et al., 2023b) Tool Image Content ChatGPT LLaMA Curation + Expansion SFT
ToolAlpaca (Tang et al., 2023a) Tool Public-apis Repository ChatGPT LLaMA Curation SFT
ToolLLM (Qin et al., 2023a) Tool Real-world APIs ChatGPT LLaMA Curation SFT
MLLM-Tool (Wang et al., 2024) Tool HuggingFace Model Cards GPT4 LLaMA Curation SFT
FireAct (Chen et al., 2023b) Planning Mixed QA Dataset GPT4 LLaMA Labeling SFT
AgentTuning (Zeng et al., 2023a) Planning 6 Agent Tasks GPT4 + ChatGPT LLaMA Labeling + Expansion SFT
Lumos (Yin et al., 2023a) Planning Mixed Interactive Tasks GPT4 LLaMA Labeling SFT
AUTOACT (Qiao et al., 2024) Planning Mixed QA Tasks LLaMA LLaMA Labeling SFT
NLP Task Specialization
AugGPT (Dai et al., 2023a) NLU Amazon/Symptoms/PubMed20k Dataset ChatGPT BERT Label SFT
TDG (He et al., 2023b) NLU SST + QQP + MNLI GPT3 BERT Expansion SFT
SunGen (Gao et al., 2023a) NLU Text Classification Tasks GPT2 DistilBERT Curation SFT
UDG (Wang et al., 2021a) NLU NLU Tasks GPT3 BERT Expansion SFT
InheritSumm (Xu et al., 2023c) NLG Pile + ArXiv + CNN/DM + WikiHow GPT3.5 ZCode++ Label SFT
DIMSUM+ (Jung et al., 2023) NLG None GPT2 + CTRL + BioGPT T5 Curation + Self-Knowledge SFT
Genie (Yehudai et al., 2024) NLG ELI5 + ASQA + NQ + CNN/DM Falcon + LLaMA FLAN + LLaMA Label SFT
GKD (Agarwal et al., 2024) NLG/NLU/IF XSum+WMT14 en-de+GSM8K+FLAN2021 T5-XL T5 Feature + Feedback D&S + RL
QUILL (Srinivasan et al., 2022) IR IR Datasets T5 4-layer Transformer Internal Knowledge D&S
RankVicuna (Pradeep et al., 2023a) IR IR Datasets ChatGPT LLaMA Labeling SFT
RankZephyr (Pradeep et al., 2023b) IR IR Datasets ChatGPT + GPT4 Mistral Labeling SFT
NDR (Mysore et al., 2023) Recommendation Recommendation Datasets GPT3 MPnet-110M Labeling SFT
InstrcutRec (Zhang et al., 2023b) Recommendation 39 instruction templates ChatGPT Flan-T5 Expansion + Self-Knowledge SFT
ONCE (Liu et al., 2023c) Recommendation Recommendation Dataset ChatGPT LLaMA Labeling SFT
PandaLM (Wang et al., 2023b) Evaluation Alpaca Data ChatGPT LLaMA Labeling SFT
Prometheus (Kim et al., 2024) Evaluation 50 Seed Rubrics GPT4 LLaMA Labeling SFT
InstructScore (Xu et al., 2023d) Evaluation Mixed Dataset GPT4 LLaMA Labeling SFT
WizardMath (Luo et al., 2023b) Math GSM8k + MATH ChatGPT LLaMA Expansion + Feedback SFT + RL
Mammoth (Yue et al., 2023a) Math/TP Mixed Math Dataset GPT4 LLaMA Labeling SFT
Mixed Distill (Chenglin et al., 2023) Math/TP SVAMP + GSM8K + ASDIV + StrategyQA ChatGPT LLaMa Labeling SFT
WizardCoder (Luo et al., 2023a) Code Code Alpaca Data ChatGPT StarCoder Expansion SFT
Magicoder (Wei et al., 2023) Code Existing Source Codes ChatGPT LLaMa Curation SFT
WaveCoder (Yu et al., 2024) Code Existing Source Codes GPT4 LLaMa Curation SFT
Code Alpaca (Chaudhary, 2023) Code Code Instructions ChatGPT LLaMA Expansion + Self-Knowledge SFT
Code Llama (Rozière et al., 2023) Code Human-written Instructions LLaMA LLaMA Expansion + Self-Knowledge SFT
Code Clean (Jain et al., 2023) Code Code Datasets ChatGPT LLaMA Labeling SFT
Multi-Modality
LLaVA (Liu et al., 2023e) Vision-Language COCO GPT4 LLaMA Labeling SFT
SVIT (Zhao et al., 2023b) Vision-Language Visual Genome + COCO GPT4 LLaMA Labeling SFT
LVIS-Instruct4V (Wang et al., 2023e) Vision-Language LVIS GPT4V LLaMA Labeling SFT
LLaVAR (Zhang et al., 2023d) Vision-Language LAION GPT4 LLaMA Labeling SFT
Macaw-LLM (Lyu et al., 2023) Multiple Modalities Image/Video with Caption ChatGPT LLaMA Labeling SFT
MIMIC-IT (Li et al., 2023f) Multiple Modalities Image/Video Dataset ChatGPT LLaMA Labeling SFT
ChatBridge (Zhao et al., 2023d) Multiple Modalities Task-Specific/Multimodal-Chat Data GPT4 + ChatGPT LLaMA Labeling SFT

TABLE 3: A summary of skill distillation works. IF: Instruction Following, MD: Multi-turn Dialoue, TP: Think Pattern,
RAG: Retrieval-Augmented Generation, NLU: Natural Language Understanding, NLG: Natural Language Generation, IR:
Information Retrieval, SFT: Supervised Fine-Tuning, D&S: Divergence and Similarity, RL: Reinforcement Learning, RO:
Ranking Optimization.

formats with templates, such as prefacing machine transla- relevant works use OpenAI’s GPT series models to generate
tion data with ”Translate this sentence to Spanish:”. However, prompt-response data pairs and then train the student LLMs
these approaches have limitations. Manual data creation is by supervised fine-tuning (Wang et al., 2022a; Taori et al.,
labor-intensive, while template-based transformation lacks 2023; Chiang et al., 2023; Wu et al., 2023c; Xu et al., 2023a;
diversity in instructions and may not align well with natural Mukherjee et al., 2023; Mitra et al., 2023; Luo et al., 2023b;
human input. LLMs like GPT-4 offer an efficient alternative Peng et al., 2023a).
for creating diverse and controlled SFT data by their capabil-
Basic Instructions. Self-Instruct (Wang et al., 2022a) lever-
ities of in-context learning and instruction following. Most
ages the in-context learning capability of GPT-3 to expand
15

a seed pool of 175 tasks to 52K task-agnostic instructions, prompts GPT-4 to provide explanation traces that eluci-
ensuring a broad spectrum of general instructions. Addi- date the teacher’s reasoning process. Orca 2 (Mitra et al.,
tionally, a filtering and post-processing stage is introduced 2023) further trains the student model to identify the most
to eliminate redundant or similar instructions. Notably, effective solution strategy for each task, guided by Orca’s
through training with this enriched dataset, GPT-3 acquires performance. This approach significantly improves the abil-
the ability to follow instructions, enabling it to perform ity of smaller models to follow instructions that involve
comparably to InstructGPT in zero-shot instruction tasks reasoning.
and when provided with expert-written instructions for
novel tasks. Based on the self-instruct method, Taori et al.
High-Quality Instructions. As demonstrated in Zhou et al.
(2023) train an Alpaca model using the Llama 7B model
(2023a) and (Li et al., 2024f), the data quality is crucial
on 52K instruction-following demonstrations, generated in
for instruction following training. UltraChat (Ding et al.,
a similar style as self-instruct but utilizing the more robust
2023b) distills large-scale data with high-quality and di-
text-davinci-003 model. To enhance the diversity of instruc-
verse instructions from teacher LLMs by various meta-
tional data, Wu et al. (2023c) introduce a technique known
information. The UltraLLaMA model, fine-tuned on this
as Topic-Guided Instruction Generation. This method involves
data, consistently surpasses other open-source models. The
gathering 3.5K common topics from Wikipedia to serve as
Phi series models (Gunasekar et al., 2023; Li et al., 2023a;
guidance during the generation process.
Mar, 2023) prioritize data quality and employ synthetic
Complex Instructions. Some works promote students to methods to generate data of “textbook quality” to enhance
solve more complex instructions (Xu et al., 2023a; Luo et al., the learning experience for smaller models. Notably, Phi
2023b,a; Guo et al., 2023c). According to Xu et al. (2023a), in- exhibits the ability to follow instructions effectively even
struction datasets derived from human-written seeds often without specific instruction fine-tuning. What’s particularly
exhibit low to moderate complexity. To enhance the com- remarkable is that Phi-2, with just 2.7 billion parameters,
plex instruction-following capabilities of smaller models, outperforms Mistral and Llama-2 models with 7B and 13B
WizardLM (Xu et al., 2023a) introduces Evol-Instruct. This parameters across various benchmark evaluations.
method gradually transforms instructions into more com-
plex forms through a multi-step evolution process, focusing Improved Instructions. Another line of work focuses on
on both increasing difficulty levels and expanding the di- improving the quality of existing instruction data, including
versity of topics. They conducted four rounds of evolution both the improvement of instruction and corresponding
using the OpenAI ChatGPT API, resulting in a dataset of response. SelFee (Ye et al., 2023) utilizes the ChatGPT to iter-
250k complex instructions. Subsequently, they trained the atively improve the quality of responses. ExpertLLaMA (Xu
LLaMA 7B model, referred to as WizardLM, on this dataset. et al., 2023f) improves the quality of responses by augment-
In the high-difficulty section of test instructions, WizardLM ing vanilla instructions with specialized Expert Identity
even outperformed ChatGPT, achieving a win rate 7.9% descriptions. Reflection-Tuning (Li et al., 2023e) improves
higher than ChatGPT. Zhao et al. (2023e) further conduct both the instruction and response sequentially by reflecting
preliminary studies revealing the effectiveness of increasing on specific criteria. DEITA (Liu et al., 2023h) proposes to
instruction complexity. Instruction Fusion (Guo et al., 2023c) enhance and score instructions in three directions includ-
further uses teacher LLMs to increase the complexity by ing complexity, quality, and diversity to get high-quality
fusing two distinct evolved instructions. Furthermore, this distillation data. MUFFIN (Lou et al., 2023) proposes to
concept of “evolving” instructions has been extended to scale the instruction according to the input by diversifying
distill specific skills such as coding (Luo et al., 2023a) and these tasks with various input facets. Selective Reflection-
mathematics (Luo et al., 2023b). Tuning (Li et al., 2024d) first involves the student model
Human Instructions. In contrast to works that rely on gener- in the data improvement pipeline with a novel student-
ating instructions from ChatGPT, which may lack diversity selection module, in which the student model is able to
and have gaps with real human instructions, Vicuna (Chiang decide the data learn from.
et al., 2023) and Koala (Geng et al., 2023) showcase impres-
sive performance by using human conversations and natu- In summary, distilling instruction data from teachers
ral instructions from community-contributed conversations. presents a promising avenue for training cheap and re-
These conversations, found in platforms like ShareGPT, pro- producible instruction-following language models. Cur-
vide a forum for users to share their interactions with Chat- rent small models have made strides in enhancing var-
GPT. It’s important to note, however, that models trained ious aspects of instruction-following ability, like diver-
on such natural conversations might mimic the style but sity, complexity and explanation. However, student mod-
may not fully capture the reasoning process of the original els trained on instruction data expanded by ChatGPT of-
teacher (Gudibande et al., 2023; Mukherjee et al., 2023). ten mimic ChatGPT’s style without replicating its factual
accuracy (Gudibande et al., 2023). Achieving a more ca-
System Instructions. To encourage student models to learn pable instruction-following capability requires a stronger
the reasoning process, Orca and Orca 2 (Mukherjee et al., teacher LLM (Gudibande et al., 2023) and access to di-
2023; Mitra et al., 2023) enhance the prompt, response data verse, high-quality instruction data, such as the one used
pairs by introducing a system message (e.g., ”explain like in Orca (Mukherjee et al., 2023; Mitra et al., 2023), which
I’m five, think step-by-step”) to encourage student mod- incorporates extensive task instructions from the Flan 2022
els to grasp the reasoning process. This system message Collection (Longpre et al., 2023).
16

4.1.2 Multi-turn Dialogue promising technique to decrease this issue. Handling the
While instruction following focuses on single-instance com- augmented context of retrieved information is also a non-
mand execution, multi-turn dialogue extends this to com- trivial skill of LLMs. Several approaches to distill RAG
prehend and maintain context through ongoing interactions. capabilities have been proposed (Kang et al., 2023a; Luo
This skill is vital for models to engage meaningfully in et al., 2023c; Asai et al., 2023).
human-like conversations and respond coherently over suc- SAIL (Luo et al., 2023c) starts by retrieving search results
cessive dialogue turns. Some works have been dedicated for each training case using search APIs, creating search-
to train to small chat models by distilling multi-turn knowl- augmented instructions that include both the instruction
edge from teacher LLMs (Chiang et al., 2023; Xu et al., 2023b; and grounding information. To encourage the language
Ding et al., 2023b; Li et al., 2023b; Wang et al., 2023c; Tunstall model to prioritize informative retrieval results, they input
et al., 2023). each retrieved passage along with the ground truth response
ShareGPT serves as a platform for users to share their into the entailment model to label each retrieval result for
conversations with ChatGPT, offering a vast repository of relevance. Subsequently, the search-augmented instructions
multi-turn conversations readily available. Some small chat and relevance labels are fed into teacher LLMs (like GPT-
models are trained using this data to acquire the capability 4) for generating responses. Following fine-tuning on this
for engaging in multi-turn dialogues (Chiang et al., 2023; Ye training set, the student model becomes proficient at de-
et al., 2023; Wang et al., 2023c). For example, Vicuna (Chiang noising search results and generating accurate responses.
et al., 2023) is a chat model exclusively trained on ShareGPT KARD (Kang et al., 2023b) distills rationales r from the
data. Despite its sole training source being ShareGPT, Vi- teacher LLM in response to questions x. These rationales
cuna achieves a high MT-Bench (Zheng et al., 2023a) score are then utilized to train two models: a student LM and a
assigned by GPT-43 . In the study conducted by Wang et al. Reranker. For training the student LM, the rationales serve
(2023c), GPT-3.5 and GPT-4 are employed to generate mixed as a means to retrieve relevant knowledge d, and the student
responses using ShareGPT data. They assign higher rewards LM is subsequently fine-tuned using the rationales along-
to responses generated by GPT-4, aiming to incentivize side questions and knowledge. However, during inference,
student models to produce high-quality responses. Addi- only questions are available. To address this, the Reranker
tionally, Ye et al. (2023) enhance the quality of multi-turn is trained to mimic how the retriever scores passages with
data from ShareGPT by generating self-feedback on model the rationale by minimizing the KL divergence between
responses and iteratively refining the responses based on Retriever(d|r) and Reranker(d|x). However, the integra-
the received feedback. tion of a fixed number of passages in language models,
To enhance the multi-turn capabilities of student models, without considering their necessity or relevance, can reduce
another line of research focuses on expanding conversa- versatility and lead to the generation of unhelpful responses.
tional datasets through self-chat and using them to train To equip student LMs with adaptive RAG capabilities, Self-
smaller models (Xu et al., 2023b; Ding et al., 2023b; Tunstall Rag (Asai et al., 2023) distills this adaptive ability from
et al., 2023). For instance, Xu et al. (2023b) initiate their work teacher LLMs into a small critic model. This critic model
by using questions sourced from Quora and Stack Overflow determines whether retrieval is necessary and evaluates the
as seeds, resulting in the collection of 111.5k dialogues quality of the retrieved results by generating ‘reflection to-
through self-chat. Subsequently, they employ parameter- kens.’ For instance, Self-Rag initiates the retrieval operation
efficient tuning to train a chat model named Baize. Ding when generating the reflection token Retrieve . To distill
et al. (2023b) first construct a significantly larger dataset this critic data, GPT-4 is prompted to assess the need for
called UltraChat, comprising 1.5 million high-quality multi- retrieval using few-shot demonstrations I , the task input
turn dialogues. They achieve this by distilling instructions x, and output y to predict a reflection token r as follows:
and dialogues from ChatGPT. Notably, UltraChat encom- p(r|I, x, y).
passes a wide range of topics and instructions. Building
upon the UltraChat dataset, they fine-tune a LLaMA model, 4.2 Alignment
resulting in the creation of a powerful chat model known as
4.2.1 Thinking Pattern
UltraLLaMA. UltraLLaMA consistently outperforms other
open-source chat models, including Vicuna and Baize. Fur- Most existing methods mainly focus on directly aligning the
thermore, UltraChat is employed in conjunction with an direct responses of the student models to the responses of
AI preference-aligned chat model named Zephyr (Tunstall teacher models (Taori et al., 2023). Though effective, these
et al., 2023). Zephyr enhances intent alignment through models might suffer the problems that they tend to learn to
the application of distilled direct preference optimization imitate the response style of the teacher models, but not the
(dDPO). reasoning process (Mukherjee et al., 2023). Thus in order to
better distill from the teacher models, methods are proposed
4.1.3 RAG Capbility that not only imitate the pure responses but some novel
thinking patterns (Ye et al., 2023; Mukherjee et al., 2023;
LLMs are known to lack the ability to utilize up-to-date
Mitra et al., 2023; Wang et al., 2023d; Cheng et al., 2023;
knowledge, and often produce responses containing factual
Zhang et al., 2023a).
inaccuracies due to their sole reliance on the parametric
Motivated by the effectiveness of LLMs in generat-
knowledge. Retrieval-Augmented Generation (RAG) is a
ing their own feedback without relying on external mod-
3. MT-Bench: a multi-turn question set, where the generations of els (Schick et al., 2022; Madaan et al., 2023; Saunders
models are evaluated by LLM, like GPT-4. et al., 2022), SelFee (Ye et al., 2023) proposes to train a
17

model that has been fine-tuned to continuously revise its lished ground truth. Scheurer et al. (2023) propose Imitation
own answer until it provides a high-quality response in a Learning from Language Feedback, in which a language
single inference. During training, it utilizes both the final model is utilized to improve various outputs generated by
response and feedback chain as the fitting target. This pat- a model. This refinement is based on a reference provided
tern, response with the revision process, shows a promising by a human. Following this process, the most effectively
performance gain. Following SelFee, Reflection-Tuning (Li refined output is chosen to be used in further supervised
et al., 2023e, 2024d) also utilizes the reflection process as the fine-tuning. As outlined by Kim et al. (2023a), ALMoST in-
learning pattern. Noticing the lack of reasoning imitation volves condensing human preferences into a set of heuristic
of the previous methods, Orca (Mukherjee et al., 2023) guidelines. An example of such a rule is the idea that larger
first proposes Explanation tuning, which aims to learn the LLMs that utilize more comprehensive and higher-quality
reasoning steps, including explanation traces, step-by-step prompts are likely to yield superior responses. Based on
thought processes, and other complex instructions, from the these established guidelines, comparison data is generated
teacher model, rather than just the vanilla styles. Extensive using responses from LLMs of different sizes and with
experiments verify the effectiveness of distilling with this varying prompts. This data is then used to train a reward
thinking pattern. The following Orca2 (Mitra et al., 2023) model. Yang et al. (2024) propose Reinforcement Learning
further presents to equip the student models with the ability from Contrast Distillation, which aims to align language
to utilize different solution strategies for different tasks, mo- models without relying on human feedback. This approach
tivated by the capability discrepancies between the smaller involves training a preference model using simulated pairs
and larger models. By employing this training pattern, the of preferences, including both high-quality and low-quality
student models are able to gain a better reasoning ability. Be- examples which are generated through contrasting prompts,
sides learning with the corresponding revision or reflection positive and negative.
process, another thinking pattern that recently appeared is Lee et al. (2023a) further highlight the effectiveness of
generating both responses and preferences. Zhang et al. RLAIF. This work proposes that RLAIF not only matches but
(2023a) propose to learn both the knowledge and corre- in some cases surpasses RLHF, and interestingly, RLAIF can
sponding preference for domain-specific QA with LLMs. also enhance the performance of Supervised Fine-Tuning.
Recently, DEBATunE (Li et al., 2024e) proposes to improve Another notable discovery is that directly prompting the
the controllability of LLMs in generating statements on LLM for reward scores during reinforcement learning can
controversial topics. By engaging two agents in a structured be more effective than the conventional approach of training
multi-round debate on controversial topics, salient and in- a reward model based on LLM preferences. Wang et al.
depth statements can be obtained and further distilled into (2023f) propose Conditioned-RLFT, which treats different
the student models. data sources as coarse-grained reward labels and develops
a class-conditioned policy to effectively utilize the varying
4.2.2 Preference qualities of data, which is a Reinforcement Learning-free
The previously mentioned methods primarily focus on the supervised learning approach. Cui et al. (2023a) propose a
basic capability of student models to produce outcomes large-scale, high-quality, and diversified preference dataset
that are strictly accurate but may not align with human labeled by GPT4 for comprehensive feedback. Tunstall et al.
preferences, reaching alignment at this level enables these (2023), by proposing distilled Direct Preference Optimiza-
models to aid in various tasks without meeting higher-level tion (Rafailov et al., 2023) on UltraFeedback, obtaining a
demands. Early methods mainly utilize human feedback for small by powerful LLM.
the alignment of human preferences (Ziegler et al., 2019;
Stiennon et al., 2020; Wu et al., 2021; Ouyang et al., 2022; Bai 4.2.3 Value
et al., 2022b; Köpf et al., 2023; Yuan et al., 2023b). However, Attaining alignment with human preferences allows large
obtaining human feedback is costly and labor-intensive, models to optimize human satisfaction by operating in a
thus methods that learn from AI feedback are also proposed manner that aligns with human preferences. However, to
to align with human preferences (Bai et al., 2022a; Kwon establish trustworthy LLMs, the notion of ’aligning LLMs
et al., 2023; Scheurer et al., 2023; Kim et al., 2023a; Roit et al., with human values’ is proposed and the key principles of
2023; Yang et al., 2024; Lee et al., 2023a; Tunstall et al., 2023; alignment are often summarized as the “HHH” criteria:
Cui et al., 2023a; Wang et al., 2023f). helpful, harmless, honest (Weidinger et al., 2021; Askell
The concept of RLAIF, introduced by Bai et al. (2022a), et al., 2021). Numerous methods have been undertaken for
involves the integration of preferences labeled by LLMs building trustworthy LLMs. However, due to the intrinsic
with those labeled by humans. This approach is designed difficulty of this aim, which is still an unsolved problem
to simultaneously optimize two key objectives: ensuring for proprietary models (Sun et al., 2024a), most existing
the helpfulness of the output and minimizing any potential methods rely on constructing high-quality human prefer-
harm, making the responses of LLMs more aligned with ence datasets (Ji et al., 2023b; Solaiman and Dennison, 2021;
Human preferences. Kwon et al. (2023) develop a proxy Bai et al., 2022b; Qiu et al., 2022; Kiesel et al., 2022; Liu et al.,
reward function using LLMs like GPT-3, which is created by 2022a), utilizing human-written rules as constrains (Glaese
first providing the LLM with a description of the behaviors et al., 2022; Sun et al., 2023b, 2024b), etc. For detailed
desired by the user, along with a small number of examples. progress on trustworthy LLMs, please further refer to Yao
The LLM then produces rewards by evaluating how closely et al. (2023a); Liu et al. (2023i); Sun et al. (2024a).
the outputs of a model align with the provided descrip- Though slightly under-explored, aligning LLMs with
tions, essentially measuring their relevance to the estab- human values by distilling is still possible (Bai et al., 2022a;
18

Cui et al., 2023a; Yang et al., 2024; Sun et al., 2024b). For aimed at enhancing the tool-use capabilities of compact
instance, Bai et al. (2022a) propose RLAIF, utilizing AI- language models for embodied intelligence. It creates a
generated labels to interactively improve both helpfulness dataset with 3938 instances from over 400 real-world tool
and harmlessness. Sun et al. (2024b) prompt the student APIs across 50 categories and utilizes ChatGPT to generate
model with 16 principles as guidelines for generating help- documentation for each prompt for later training. ToolLLM
ful, ethical, and reliable responses. Similarly, both harmless (Qin et al., 2023a) proposes a comprehensive framework for
and harmful generations could be elicited by modifying enhancing LLMs with tool-use proficiency, focusing on data
the prompts, and then are used to train the preference creation, model training, and evaluation by distilling from
model (Yang et al., 2024). Cui et al. (2023a) utilize GPT- chatGPT. Their ToolLLaMA shows impressive performance
4 to rank generations regarding helpfulness, truthfulness, in executing complex instructions and handling new APIs,
and honesty. Liu et al. (2023b) advance the alignment of rivaling ChatGPT. CRAFT (Yuan et al., 2023a) builds a
LLMs with societal values by incorporating simulated social general tool creation and retrieval framework, which uti-
interactions into the training process. This approach encom- lizes GPT4 to generate code snippets as the created tools.
passes a range of elements, including demonstrations that During the inference, other small LLMs could select and
are both in alignment and in conflict with social norms, as retrieve from the generated code snippets to execute or
well as collective ratings, in-depth feedback, and responses generate other methods conditioned on the given snippets.
that are revised iteratively. Confucius (Gao et al., 2023b) introduces a tiered training
strategy for LLMs to master tool usage through a graduated
curriculum and an innovative method called Iterative Self-
4.3 Agent
instruction from Introspective Feedback (ISIF) for dynamic
4.3.1 Tool Using dataset enhancement to handle complex tools. MLLM-Tool
While recent LLMs have shown proficiency in solving var- (Wang et al., 2024) is a multi-modal tool agent capable
ious tasks, they still tend to make mistakes when handling of interpreting instructions embedded in visual or audio
large numerical values or executing intricate mathematical content through the integration of multi-modal encoders
calculations (Qian et al., 2022; She et al., 2023; Manikandan with open-source large language models. As a trainable
et al., 2023; Liang et al., 2023b; Mialon et al., 2023). Thus method, the initial instruction-answer pairs are generated
equipping LLM agents with the capability to utilize tools by utilizing GPT4. Shen et al. (2024) demonstrate that small
has been increasingly focused on. Commonly used methods LLMs are weak tool learners and proposes a multi-LLM
mainly relied on human-curated data for training (Parisi framework that decomposes the tool-use ability of a single
et al., 2022; Nakano et al., 2022; Qin et al., 2023c; Song model into a planner, caller, and summarizer for the tool
et al., 2023b) or prompt designing(Cai et al., 2023; Shen using, leading to a supreme performance. The two-stage
et al., 2023a; Hao et al., 2024). Recently, distillation-based training strategy introduced by this work is powered by
methods are also proposed (Schick et al., 2023; Zhang, 2023; ChatGPT and GPT4 for collecting execution trajectories for
Patil et al., 2023; Tang et al., 2023a; Qin et al., 2023a; Yuan the training set. Yuan et al. (2024b) notice the potential
et al., 2023a; Gao et al., 2023b; Wang et al., 2024; Shen et al., issue of the current lengthy tool documentation, which
2024; Yuan et al., 2024b). hinders LLMs from understanding how to utilize a tool,
Toolformer (Schick et al., 2023) utilizes a self-supervised thus proposing EASYTOOL to purify the important infor-
manner, avoiding large human annotations, to obtain the mation from extensive documentation. The ground truth
most required APIs to use and further distill this capability summarization of the training documents is obtained by
to the model itself. The performance of the GPT-J-based using ChatGPT.
Toolformer surpasses OPT (66B) (Zhang et al., 2022) and
GPT3 (175B) (Brown et al., 2020) greatly. Graph-ToolFormer 4.3.2 Planning
(Zhang, 2023) aims to equip LLMs with the ability to process Another important aspect for LLM agents is the ability to
and reason over complex graph data, which is designed decompose high-level tasks to a chosen set of actionable
to enhance LLMs with graph reasoning skills using exter- steps (Huang et al., 2022b), which is especially useful when
nal graph reasoning API tools by adopting ChatGPT to acting in interactive environments. Huang et al. (2022b) first
annotate and augment a larger graph reasoning statement demonstrate that LLMs can generate plausible goal-driven
dataset for training. Gorilla (Patil et al., 2023) addresses the action plans without training, introduces non-invasive tools
limitations of current LLMs in generating accurate input to enhance model executability, and assesses these methods
arguments and reduces the problem of ”hallucination” or through human evaluation to balance executability and
generating incorrect API usage and it collects thousands of semantic accuracy. Most existing methods utilize prompting
models from platforms like HuggingFace and Torch Hub strategies for task planning (Singh et al., 2022; Zhou et al.,
as the API calls and utilizes GPT4 to generate synthetic 2023b; Song et al., 2023c; Wang et al., 2023g; Yao et al.,
instruction data for training. GPT4Tools (Yang et al., 2023b) 2023b; Liu et al., 2023j; Hao et al., 2023; Hu et al., 2023a), or
introduces to enable open-source LLMs like LLaMA and building human-curated data for training (Lin et al., 2023a;
OPT to use multimodal tools, a capability previously limited Valmeekam et al., 2023). Recently, there have also been some
to advanced proprietary models like ChatGPT and GPT-4. distilling methods emerging (Chen et al., 2023b; Zeng et al.,
The approach involves generating an instruction-following 2023a; Yin et al., 2023a; Qiao et al., 2024; Kong et al., 2023).
dataset by prompting an advanced teacher model with mul- FireAct (Chen et al., 2023b) introduces an innovative ap-
timodal contexts, using the Low-Rank Adaptation optimiza- proach for refining LLMs. This method involves fine-tuning
tion. ToolAlpaca (Tang et al., 2023a) proposes a framework smaller-scale LLMs using agent trajectories that are derived
19

from a variety of tasks and prompting techniques. Applying ing human language. The knowledge distilled from LLMs,
this method with trajectories generated by GPT4 has been such as through data labeling or augmentation, is typi-
shown to consistently enhance performance. AgentTuning cally transferred into encoder-based language models like
(Zeng et al., 2023a) aims to enhance the performance of BERT (Vaswani et al., 2017) and RoBERTa (Liu et al., 2019).
LLMs in executing agent tasks without sacrificing their Regarding the task of classification, certain studies have
wide-ranging capabilities. By utilizing a new dataset called been noteworthy (Dai et al., 2023a; Gilardi et al., 2023; He
AgentInstruct, which includes high-quality interaction tra- et al., 2023b; Gao et al., 2023a; Chenglin et al., 2023; Li
jectories, it applies a hybrid instruction-tuning approach et al., 2023g). AugGPT (Dai et al., 2023a) focuses on both
that merges these trajectories with general domain instruc- general and clinical domain text classification. To address
tions. Lumos (Yin et al., 2023a) pertains to a novel frame- the limitations of small-scale clinical datasets, which often
work designed to train agents using a unified data format lack expert annotation and are subject to stringent privacy
and modular architecture based on open-source LLMs. This regulations, AugGPT utilizes knowledge from teacher LLMs
system comprises three key modules: planning, grounding, to rephrase each sentence in the training samples. This
and execution, enabling the decomposition of tasks into process creates multiple conceptually similar but seman-
subgoals and actionable steps. TPTU-v2 (Kong et al., 2023) tically distinct samples, enhancing the dataset’s richness
focuses on improving the task planning and tool usage abili- and diversity. Another approach is demonstrated by Gilardi
ties of LLMs in real-world scenarios, by utilizing data gener- et al. (2023), who employ ChatGPT as an annotator to cate-
ated by human experts or LLMs. It introduces a framework gorize inputs. This method has been shown to outperform
comprising three components: an API Retriever, an LLM crowd-workers in several tasks, including relevance, stance,
Finetuner, and a Demo Selector. AUTOACT (Qiao et al., topics, and frame detection. Furthermore, He et al. (2023b)
2024) proposes an agent learning framework that does not propose Targeted Data Generation (TDG), a novel approach
require large-scale annotated data or synthetic trajectories for identifying challenging subgroups within a dataset. TDG
from high-resource models like GPT-4. Instead, it uses a self- leverages LLMs, along with human-in-the-loop, to generate
instruct method to generate its own planning trajectories new data specifically tailored for these subgroups, thereby
with limited initial data. It then applies a division-of-labor enriching the dataset and improving model performance
strategy, creating sub-agents specialized in different aspects in sentiment analysis and natural language inference tasks.
of the task completion process. To facilitate the clinical information extraction task, Tang
Distillation also works out for the training of embodied et al. (2023b) elicit diverse samples from LLMs by providing
multi-modal agents (Sumers et al., 2023; Yang et al., 2023c; examples and different seeds of clinical entities, i.e. the
Ma et al., 2023a; Du et al., 2023a; Sumers et al., 2023). For Curation manner.
instance, Sumers et al. (2023) aim to enhance the ability of Several studies have also focused on multiple NLU
AI agents to follow instructions by using pretrained vision- tasks (Ding et al., 2023a; He et al., 2023a; Wang et al.,
language models to provide supervision for understanding 2021a; He et al., 2022; Ye et al., 2022; Meng et al., 2022).
and acting upon language within their operational environ- For example, He et al. (2023a) utilize the knowledge in
ment, leveraging model distillation and hindsight experi- GPT-3.5 to annotate inputs with labels and explanations
ence replay to teach them contextually relevant interactions for various NLU tasks, including user input and keyword
in a simulated 3D setting. Emma (Yang et al., 2023c) evalu- relevance assessment, BoolQ, and WiC. Wang et al. (2021a)
ates the challenges and inefficiency of training an embodied employ few-shot prompts to expand high-quality training
agent in a noisy visual world without expert guidance, and data using GPT-3, i.e. the Expansion manner. Beyond merely
proposes to train them in a simulated environment using employing a single approach to elicit NLP task knowledge,
imitation learning, guided by an expert Language Model Ding et al. (2023a) explore a combination of Labeling, Ex-
(like ChatGPT), which operates in a corresponding text- pansion, and Curation methods to extract knowledge from
based simulation, focusing on the same tasks. GPT-3 for distilling data for both sequence- and token-level
NLP tasks.
4.4 NLP Task Specialization
NLP tasks often grapple with challenges like data scarcity, 4.4.2 Natural Language Generation
interpretability issues, privacy concerns, and noisy data.
The “Knowledge” section of our survey illustrates various Natural Language Generation (NLG) is a key aspect of eval-
methods for distilling knowledge from LLMs, effectively uating the capabilities of LLMs, encompassing tasks such as
setting the stage for student models to adapt to a range summarization, machine translation, and other open-ended
of NLP tasks. This knowledge provides supervision for text generation tasks. Known for their potent generative
the training of student models through information aug- abilities and creativity, LLMs excel in these areas, making
mentation (e.g., CoT and explanation), data augmentation, them prime sources for distilling knowledge into student
and semantic representation. By transferring the distilled models tailored for NLG tasks (Xu et al., 2023c, 2024b;
knowledge from LLMs, student models can better handle Ramnath et al., 2023; Agarwal et al., 2024). Additionally,
diverse NLP challenges, improving task performance and the knowledge distilled from LLMs can be effectively used
addressing data limitations more robustly. for NLG task-specific data augmentation (Jung et al., 2023;
Wang et al., 2021b; Guo et al., 2023a; Yang and Nicolai,
4.4.1 Natural Language Understanding 2023; Wang et al., 2023h; Yang et al., 2023d). While the
Natural Language Understanding (NLU) is a fundamen- previous sections have focused on the works about open-
tal NLP task that involves comprehending and interpret- ended generation and multi-turn dialogue, this part will
20

specifically highlight the distillation techniques relevant to and expressiveness of user queries by refining or modifying
other NLG tasks. the initial query to more accurately align with the user’s
Although automatic metrics often favor smaller, fine- information needs. One notable approach is QUILL (Srini-
tuned models in summarization tasks, human evaluators vasan et al., 2022), which introduces a two-stage distillation
tend to prefer the summaries generated by LLMs. Address- method for query intent understanding. Initially, a retrieval-
ing this discrepancy, Xu et al. (2023c) develop a student sum- augmented LLM, serving as the ‘professor,’ is distilled into
marization model by distilling a GPTSUMM dataset, which a non-retrieval augmented teacher LLM, aiming to bolster
comprises over 4 million paragraph-summary pairs gener- its understanding capabilities. Subsequently, this enhanced
ated by querying GPT-3.5. In a different approach, Jung et al. teacher LLM is distilled into a final student model using a
(2023) introduce ‘Impossible Distillation,’ a method that large dataset, further refining the process. Incorporating the
creates high-quality summarization-specific dataset from QR into IR systems, Ma et al. (2023c) develop a ’Rewrite-
weak teacher LLMs. This method involves training a stu- Retrieve-Read’ framework. This process begins with an
dent model on the generated dataset and enhancing its LLM rewriting the queries via prompting, followed by a
capabilities through Self-Knowledge. Turning to the task of retrieval-augmented reading stage. To integrate the rewrit-
machine translation, where creating parallel corpora is tra- ten queries effectively into the IR system, the knowledge
ditionally expensive and time-consuming, Yang and Nicolai gleaned from the LLM is distilled into a compact student
(2023) propose a three-step distillation process. This process rewriter. This rewriter is then fine-tuned using feedback
involves generating seeds of verbs and nouns, forming sen- from the LLM reader through reinforcement learning.
tences, and then translating these sentences. Their findings
suggest that while the distilled dataset may lack diversity, Retriever and Reranker. In IR systems, the Retriever is
it effectively improves the translation signal for training designed to efficiently locate the top-k relevant texts from
student translation models. To distill high-quality content- a large corpus. It encodes both queries and documents into
grounded data automatically, Genie (Yehudai et al., 2024) vector representations and performs retrieval by computing
proposes a general methodology containing three key steps: the dot product between these vectors. The Reranker further
(a) preparation of the content, (b) distillation of responses refines the order of the retrieved documents to improve
from a teacher LLM corresponding to the content, and (c) the overall quality of the output. This is achieved in two
filtering mechanism to ensure the quality and faithfulness of primary ways, including Pointwise Reranker and Listwise
the generated data. Genie demonstrates that student models Reranker. Pointwise Reranker takes both the query and a
trained through this distilled data can match or even surpass single candidate document as input to directly generate a
models trained on human-generated data. relevance score. Listwise Reranker directly reorders a list of
input documents in terms of their relevance.
4.4.3 Information Retrieval Retriever and Pointwise Reranker. For the retriever and
pointwise reranker, a common application of KD from LLMs
Information Retrieval (IR) represents a crucial branch of is the generation of pseudo-queries for given documents.
computer science, focused on efficiently retrieving infor- This approach aims to expand the pairwise data, enhancing
mation relevant to user queries from extensive reposito- the training of dense retrievers or rerankers. For example,
ries (Cai et al., 2022; Liu et al., 2022b; Feng et al., 2023; InPars (Bonifacio et al., 2022) utilizes GPT-3 to generate
Shen et al., 2023b). A typical IR system encompasses three multiple pseudo-queries for an unlabeled document. To
main components: the query rewriter, the retriever, and ensure the relevance of these queries, the system filters
the reranker. Recent studies have highlighted the effective- them based on the highest log probabilities of generating a
ness of employing LLMs in IR systems, e.g. in enhancing query conditioned on the documents. Subsequently, InPars
the reranking stage through both point-wise and list-wise fine-tunes a reranker based on monoT5 (Raffel et al., 2020).
ranking methods (Ma et al., 2023b; Sun et al., 2023a; Qin Another similar approach, Promptagator (Dai et al., 2023b),
et al., 2023d). However, the practical application of LLMs in introduces a few-shot dense retrieval method that leverages
IR systems faces challenges, primarily due to their slower a small number of demonstrations from the target domain
generation speed, which conflicts with the low-latency re- for pseudo-query generation. Diverging from the reliance
quirements of IR tasks (Sun et al., 2023a). As a result, on unlabeled documents, Sachan et al. (2022) distill knowl-
the KD of LLMs emerges as a more promising approach edge from GPT-4 to curate diverse synthetic data for text
for IR, offering a way to infuse the distilled knowledge embedding tasks across nearly 100 languages. They fine-
from LLMs into various stages of the IR pipeline without tune powerful decoder-only LLMs, such as Mistral-7b (Jiang
compromising on speed. There has been a significant body et al., 2023a), on this synthetic data using standard con-
of work demonstrating how knowledge distilled from LLMs trastive loss. Remarkably, this method demonstrates strong
can benefit each component of the IR system, including the performance on text embedding and multilingual retrieval
Query Rewriter (Srinivasan et al., 2022; Ma et al., 2023c), the benchmarks without any labeled data. Beyond generating
Retriever (Dai et al., 2023b; Sachan et al., 2022, 2023; Schick pseudo-queries, teacher LLMs can also be employed to gen-
and Schütze, 2021; Meng et al., 2023; Peng et al., 2023b), and erate relevance scores as soft labels. These scores are used
the Reranker (Bonifacio et al., 2022; Sun et al., 2023a; Pradeep to train the retriever by minimizing the KL-divergence loss
et al., 2023a,b; Saad-Falcon et al., 2023; Ferraretto et al., 2023; between the teacher and student distributions, as explored
Jeronymo et al., 2023; Sun et al., 2023c). by Sachan et al. (2023).
Query Rewriter. The Query Rewriter (QR) is a pivotal com- Listwise Reranker. A distinct set of studies focuses on
ponent in IR systems, tasked with enhancing the precision listwise reranking, where its advantage lies in compar-
21

ing multiple documents simultaneously to determine the to the increasing use of LLMs in NLG evaluation (detailed
optimal reorder. RankGPT (Sun et al., 2023a) leverages further in (Li et al., 2024b)). Through KD of LLMs, student
GPT-4 to generate permutations for a group of candidate evaluators could enhance inference efficiency and achieve
passages. To distill this listwise ranking knowledge into a more flexible and highly customized evaluation (Wang et al.,
pointwise student reranker, various training loss functions 2023b; Kim et al., 2024; Xu et al., 2023d; Jiang et al., 2023c; Li
are employed, such as Listwise Cross-Entropy (Bruch et al., et al., 2024a).
2019), RankNet (Burges et al., 2005), and LambdaLoss (Wang PandaLM (Wang et al., 2023b) concentrates on a pairwise
et al., 2018). Building upon RankGPT’s framework, RankVi- evaluator designed to compare two pieces of generated
cuna (Pradeep et al., 2023a) and RankZephyr (Pradeep content. It utilizes a teacher LLM (GPT-3.5) to judge which
et al., 2023b) further refine this approach by directly fine- response is better for a given instruction and input, provid-
tuning a listwise reranker using teacher-generated textual ing reasons for its decision. Addressing the need for cus-
permutations. This enables the student reranker to produce tomized and flexible criteria to meet realistic user demands,
sequences of ranked results directly, bypassing the interme- Prometheus (Kim et al., 2024) distills GPT-4 to construct a
diate step of calculating individual relevance scores. training dataset that includes reference answers and a vari-
ety of customized scoring rubrics. This dataset is then used
4.4.4 Recommendation to tune LLaMA for evaluating model-generated responses.
Recommender systems are integral to enhancing user ex- Instructscore (Xu et al., 2023d) takes a more fine-grained ap-
perience in various online services, providing personalized proach by using GPT-4 to create detailed analysis data. This
content based on user preferences and behaviors. Many data is employed to tune LLaMA, enabling it to perform
works have demonstrated that LLMs could be directly used error analysis on generated texts compared to reference
as recommenders without fine-tuning (Wang et al., 2023i; texts. The system further refines its evaluation capabilities
Dai et al., 2023c) or generate auxiliary textual features to through self-training with real model-generated response-
benefit recommender systems (Xi et al., 2023; Ren et al., reference pairs. For reference-free evaluation across diverse
2023; Wei et al., 2024). (Wang et al., 2023j; Ren et al., 2023; domains, TigerScore (Jiang et al., 2023c) samples data from
Wei et al., 2024). However, the real-time nature of online rec- a variety of text generation datasets, such as summariza-
ommender systems demands rapid response times, posing tion, translation, and data-to-text. It distills error analysis
a challenge with the inherent inference latency associated knowledge from GPT-4 and uses this to fine-tune LLaMA.
with LLMs. To address this, several studies have explored Lastly, to adapt evaluation to real-world scenarios beyond
ways to distill and integrate the knowledge from LLMs into conventional NLP tasks, Auto-J (Li et al., 2024a) collects
recommender systems, thereby leveraging their advanced real-world user queries and their evaluations from a teacher
capabilities while mitigating latency issues for efficient real- LLM. This massive dataset of real-world scenarios is then
time recommendations (Mysore et al., 2023; Zhang et al., used to distill evaluation knowledge into LLaMA through
2023b; Liu et al., 2023c). fine-tuning, enhancing its practical applicability.
Mysore et al. (2023) tackle data scarcity in narrative-
driven recommendation (NDR), where users provide de- 4.4.6 Code
tailed descriptions of their preferences. They utilize GPT-3 LLMs, trained on extensive corpora containing code, are
to create synthetic narrative queries from user-item interac- highlighted for their proficiency in code-related tasks. Their
tions via few-shot prompting, then distill this data into re- capabilities extend beyond direct code generation to include
trieval models for NDR. Similarly, GENRE (Liu et al., 2023c) the provision of external knowledge and data, which is
employs GPT-3.5 to augment datasets with new knowledge crucial in distilling their expertise into smaller, more effi-
about news summarization, user profiles, and personalized cient models. Several works have successfully distilled code
content, aiding the training of content-based recommenda- knowledge from LLMs into those compact and specialized
tion models. To bridge the gap between language models code models (Chaudhary, 2023; Rozière et al., 2023; Gu-
and recommender systems, some research views behavior nasekar et al., 2023; Wei et al., 2023; Chen et al., 2023a;
modeling as an extension of language modeling (Cui et al., Liu et al., 2023d; Yu et al., 2024; Jain et al., 2023; Su and
2022; Liu et al., 2023k). InstructRec (Zhang et al., 2023b), McMillan, 2023; Guo et al., 2023d).
for instance, interprets recommendation as instruction fol- A primary focus in these student code models is on
lowing. They use ChatGPT to distill a wealth of user- code generation, a task of both common utility and practical
personalized instruction data reflecting diverse preferences significance. For instance, Code Alpaca (Chaudhary, 2023)
and intentions based on real historical interactions. This fine-tunes Llama using self-instruct with ChatGPT-distilled
data is then used to fine-tune a 3B student language model instructions specifically for code generation tasks. Similarly,
specifically for recommendation purposes. Code Llama-instruct (Rozière et al., 2023) is fine-tuned via
self-instruct, prompting Llama-2 (Touvron et al., 2023) with
4.4.5 Text Generation Evaluation coding problems, and further refined with unit tests. Phi-
Text generation evaluation, i.e. NLG evaluation, focuses on 1 (Gunasekar et al., 2023) aims to enhance the quality of dis-
assessing the quality of generated content. Unlike tradi- tilled code data by extracting “textbook quality” data from
tional NLG evaluation metrics like BLEU (Papineni et al., a teacher LLM, incorporating Python textbook and exercise
2002) or ROUGE (Lin, 2004), which primarily rely on data. Magicoder (Wei et al., 2023) addresses potential biases
surface-level text comparisons, LLMs, trained on extensive in teacher LLMs by referencing a wealth of open-source
corpora and refined through techniques like RLHF, offer a code, yielding more diverse and grounded data for code
more human-aligned assessment. This sophistication has led generation. To consider the capability of the student model
22

and leverage the feedback of the teacher, PERsD (Chen et al., GPT-4 to distill referential question-answer pairs from
2023a) introduces a Personalized Distillation method where the Flickr30K (Plummer et al., 2015) dataset, enhancing
the teacher LLM refines the student’s generated code based the understanding of referential regions within images.
on the execution feedback of the executor. LSKD (Park et al., 2023) introduces localized references
However, these models primarily target the code gener- to specific image regions, prompting the teacher LLM
ation task, lacking generalizability across a broader range to generate commonsense inferences about these areas.
of code-related tasks. To address this issue, MFTCoder (Liu To enhance the visual instruction tuning pipeline with
et al., 2023d) utilizes self-instruct to distill diverse code data text-rich images, LLaVAR (Zhang et al., 2023d) employs
from teacher LLMs for various tasks, such as code comple- the text-only GPT-4 as a teacher, using recognized texts
tion and text-to-code generation, training a student model and image captions to generate 16K conversation pairs for
via multi-task learning. WaveCoder (Yu et al., 2024), in text-rich images. The resultant student MLLM demonstrates
contrast, creates a comprehensive instruction tuning dataset enhanced interaction skills in content that combines both
covering four universal code-related tasks distilled from text and imagery.
GPT-3.5-turbo. WaveCoder first selects a diverse coreset of
Multiple Modalities. To extend knowledge distillation
raw data using the KCenterGreedy (Sener and Savarese,
of LLMs to encompass more modalities, such as audio
2018) clustering method, then employs the teacher LLM
and video, several innovative approaches have been in-
for generating task definitions and outputs. The teacher
troduced. These methods typically involve transforming
model also plays a role in evaluating and filtering this data.
these modalities into a textual format comprehensible to
Notably, WaveCoder demonstrates superior generalization
teacher LLMs, followed by the distillation of the teacher.
across different code-related tasks compared to other open-
Macaw-LLM (Lyu et al., 2023) leverages GPT-4 to generate
source models.
instruction-response pairs corresponding to the content of
images or videos. MIMIC-IT (Li et al., 2023f) aims to broaden
4.5 Multi-Modality the scope to language, image, and video understanding,
Multimodal Large Language Models (MLLMs) surpass tra- creating a substantial dataset with 2.8 million multimodal
ditional language-only LLMs by understanding and pro- instruction-response pairs distilled from ChatGPT. Chat-
cessing information across multiple modalities, more closely Bridge (Zhao et al., 2023d), on the other hand, represents
mirroring human perception and enabling a broader range a novel approach in multimodal language modeling. It
of real-world applications. There is a growing trend towards translates various non-textual modalities into text, combin-
developing MLLMs that follow multimodal instructions, ing fine-grained and global descriptions. This information
facilitating tasks with enhanced levels of interactivity. To ad- is then used to distill responses from ChatGPT or GPT-4
dress the scarcity of multimodal instruction-following data through an in-context learning process, effectively bridging
and to harness the commonsense and world knowledge the gap between different modalities.
embedded in teacher LLMs, numerous studies have focused
Others. Beyond distilling instruction-following data, sev-
on multimodal knowledge distillation from LLMs (Liu et al.,
eral methods have emerged that concentrate on harnessing
2023e; Zhao et al., 2023b; Wang et al., 2023e; Chen et al.,
different aspects of knowledge from LLMs. For instance,
2023c; Park et al., 2023; Pi et al., 2023; Zhao et al., 2023c; Liu
EMMA (Yang et al., 2023c) trains an MLLM to act as
et al., 2023f; Wu et al., 2023b; Luo et al., 2023d; Jiang et al.,
an embodied reflex agent within a visual environment.
2023d; Li et al., 2023c; Xu et al., 2023e).
It achieves this by distilling GPT-4’s skills in a parallel
Vision-Language. In the vision-language domain, textual world, generating actions and providing reflective
LLaVA (Liu et al., 2023e) pioneers the extension of the feedback. Silkie (Li et al., 2023h) takes a unique approach by
Self-Instruct approach from the language to the multimodal distilling preferences from GPT-4V, focusing on criteria like
field. It translates images into textual descriptions, helpfulness and visual faithfulness. Ha et al. (2023) represent
including captions and bounding boxes, and distills another innovative direction, where it generates, labels,
GPT-4 for generating new data in the context of seed and distills diverse robot-centric exploration experiences by
examples. This approach creates a LLaVA-Instruct-150k LLMs into a multi-task visuo-linguo-motor policy.
dataset, which serves as the foundation for further
developments like LLaVA-1.5 (Liu et al., 2023l) and 5 D OMAIN - SPECIFIED V ERTICAL D ISTILLATION
GPT4ROI (Zhang et al., 2023e), enhancing the instruction-
This section shifts from skill distillation to examine KD of
following capabilities of MLLMs. To expand the dataset’s
LLMs in various vertical domains, including Law, Medical
scale, SVIT (Zhao et al., 2023b) introduces a 4.2 million
& Healthcare, Finance, and Science, etc. It delves into cus-
image dataset, distilled from GPT-4 by leveraging manual
tomizing distilled LLMs for these fields, showing its signifi-
image annotations. It employs a novel data recipe to select
cant role in enhancing domain-specific AI applications. The
an informative, diverse, and balanced subset of training
taxonomy of these works is shown in Figure 7.
data. LVIS-Instruct4V (Wang et al., 2023e) leverages GPT-
4V (OpenAI, 2023), a powerful large multimodal model,
as a teacher to distill a more accurate and context-aware 5.1 Law
instruction-following dataset, focusing on fine-grained Law holds a crucial position in molding societies, over-
understanding. Further advancements include integrating seeing human interactions, and ensuring justice prevails.
specific region referencing in image-based instruction Informed decision-making, legal interpretation, and the pro-
following. For instance, Shikra (Chen et al., 2023c) uses vision of legal advice by professionals hinge on precise
23

Law LawyerLLaMA (Huang et al., 2023b), LawGPT (Cui et al., 2023b), Fuzi (Wu et al., 2023d)

Huatuogpt (Zhang et al., 2023c), Huatuogpt-II (Chen et al., 2023d), Doctorglm (Xiong et al., 2023),
Medical and Healthcare Alpacare (Zhang et al., 2023f), Huatuo (Wang et al., 2023a), ChatDoctor (Li et al., 2023i),
MedAlpaca (Han et al., 2023), PMC-LLaMA (Wu et al., 2023e), DISC-MedLLM (Bao et al., 2023a)

Finance XuanYuan (Zhang and Yang, 2023)


Verticalization Distillation
DARWIN (Xie et al., 2023a), SciGLM (Zhang et al., 2024), WizardMath (Luo et al., 2023b),
MAmmoTH (Yue et al., 2023a), TORA (Gou et al., 2024), AstroLLaMA-Chat (Perkowski et al., 2024),
G-LLaVA (Gao et al., 2023c), GIMLET (Zhao et al., 2023f), LLM-Prop (Rubungo et al., 2023),
Science
InstructMol (Cao et al., 2023a), Prot2Text (Abdine et al., 2023), BioMedGPT (Luo et al., 2023e),
xTrimoPGLM (Chen et al., 2024e), K2 (Deng et al., 2023), OceanGPT (Bi et al., 2023),
MarineGPT (Zheng et al., 2023b), GeoGalactica (Lin et al., 2024),

Miscellaneous EduChat (Dan et al., 2023), Owl (Guo et al., 2023b)

Fig. 7: Taxonomy of Verticalization Distillation.

and current information. Legal intelligent applications in 5.2 Medical and Healthcare
different scenarios usually require combinations of multiple The integration of LLMs holds great potential for trans-
fundamental capabilities of legal text retrieval, understand- forming medicine and healthcare. Extensive research has
ing, reasoning and generating (Zhang et al., 2023g; Sun, focused on adapting general-purpose LLMs to the medical
2023; Lai et al., 2023). To address challenges like legal ter- domain (Singhal et al., 2023), such as electronic health
minology, subtle interpretations, and the constant evolution records, and healthcare applications like patient care (Zhu
of legislation presents distinctive challenges that demand et al., 2023). Recent work has focused on enhancing medi-
customized resolutions. To handle the above challenges, cal instruction-following data with advanced teacher LLMs
several studies have investigated the customization of LLMs to better align with complex user instructions. Given the
for intelligent legal services (Cui et al., 2023b; Yue et al., abundance of medical data, most studies combine real-
2023b; Huang et al., 2023b; Wu et al., 2023d). This involves world data with distilled instruction data from teacher
a continued pre-training process on extensive legal corpora, LLMs (Zhang et al., 2023c; Xiong et al., 2023; Zhang et al.,
followed by fine-tuning with self-constructed instructions or 2023f; Wang et al., 2023a; Li et al., 2023i; Han et al., 2023; Wu
augmented data using advanced LLMs. et al., 2023f; Bao et al., 2023a; Chen et al., 2023d).
While existing studies predominantly concentrate on
training using dedicated medical dialogue datasets com-
prising medical textbooks (Wu et al., 2023e), biomedical
Huang et al. (2023b) have unveiled a Chinese legal papers (Luo et al., 2023e) medical knowledge-graphs (Bao
large model named LawyerLLaMA. The model undergoes et al., 2023b), or authentic doctor-patient interactions (Bao
an initial pre-training phase on an extensive legal corpus, et al., 2023b), an expanding body of research is delv-
systematically assimilating knowledge of the Chinese legal ing into the augmentation of medical instruction-following
system. Subsequently, fine-tuning occurs through the analy- data with advanced LLMs to enhance the alignment with
sis of objective questions from the Chinese National Judicial practical user instructions. Zhang et al. (2023c) introduce
Examination (Zhong et al., 2020) and the gathering of re- HuatuoGPT specifically tailored for medical consultations.
sponses to legal consultations using ChatGPT. This process The model leverages both distilled data from ChatGPT and
equips the model with the ability to apply legal knowledge real-world data from doctors during the supervised fine-
to specific scenarios. Cui et al. (2023b) present LawGPT, tuning stage. In a parallel effort, Xiong et al. (2023) con-
built upon the foundation of OpenLLAMA. The model is struct a dataset of medical dialogues in Chinese, em-
trained using a construction process that incorporates real- ploying ChatGPT’s assistance. Their methodology encom-
world legal text, legal regulations, judicial interpretations, passed various techniques to train DoctorGLM, an easily
and actual legal consultation data. Additionally, the authors deployable LLM designed for tasks such as diagnoses,
utilize the ChatGPT API for assisted construction, enabling drug recommendations, and other medical advice. Zhang
the generation of supplementary data derived from the et al. (2023f) fine-tune LLaMA-series models using 52k
existing dataset. Wu et al. (2023d) have developed a large- diverse, machine-generated, medical instruction-following
scale Chinese legal model (named Fuzi) with ChatGLM data named MedInstruct-52k. This effort resulted in the
as its foundation. This model undergoes training on an development of AlpaCare, a model demonstrating robust
extensive Chinese legal corpus, which incorporates unsu- medical proficiency and generalizability across both general
pervised judicial language data, including diverse judgment and medical-specific domain free-form instruction evalu-
documents and legal regulations. Additionally, it undergoes ations. In a different vein, Wang et al. (2023a) propose
supervised judicial fine-tuning with data encompassing le- HuaTuo, a LLaMA-based model that undergoes supervised
gal QA and case retrieval. Fuzi’s training also involves both fine-tuning with generated QA instances. This refinement
general instruction fine-tuning datasets, such as Alpaca, process enhances the model’s possession of more reliable
and domain-specific instruction fine-tuning datasets from medical knowledge. Li et al. (2023i) introduce ChatDoctor,
LawyerLLaMA (Huang et al., 2023b) and LawGPT (Cui which was first trained as a generic conversation model
et al., 2023b). based on LLaMA. It utilized 52K instruction-following data
24

from Stanford University’s Alpaca project (Taori et al., Specifically, XuanYuan (Zhang and Yang, 2023) lever-
2023). Subsequently, the conversation model underwent ages self-instruct over seed data and self-QA over struc-
fine-tuning on a dataset of 100K patient-physician conver- tured/unstructured data to generate instruction data in the
sations collected from an online medical consultation web- finance domain, which is used to train a finance LLM.
site. This two-step training process underscores the model’s
adaptability to diverse conversational contexts, particularly 5.4 Science
those specific to patient-physician interactions.
The integration of LLMs into the science domain (Taylor
Built upon existing datasets, MedAlpaca (Han et al.,
et al., 2022; Yin et al., 2023b) represents a paradigm shift
2023) proposes to reconstruct the data with GPT-3.5-Turbo,
in research, knowledge discovery, and the dissemination
which is then used to fine-tune LLMs for effective medical
of scientific information. In science, LLMs are leveraged to
applications. Furthermore, PMC-LLaMA (Wu et al., 2023f)
digest and synthesize vast amounts of literature, aiding in
proposes a training framework (i.e., continual pre-training
the identification of new research opportunities and the ac-
and domain-specific multi-task supervised fine-tuning) to
celeration of scientific breakthroughs. They facilitate the un-
adapt a general LLM to the medicine domain, where GPT-
derstanding of complex scientific concepts by summarizing
4 is leveraged to write synonymous sentences for data
research papers, generating hypotheses, and even drafting
augmentation in the SFT. To adapt LLMs to real-world
research proposals and manuscripts, thus significantly re-
medical consultation, DISC-MedLLM (Bao et al., 2023a)
ducing the time researchers spend on literature review and
leverages GPT-3.5 to 1) construct 50K QA pairs in a few-
enabling them to focus more on experimental work. LLMs
shot manner and 2) re-generate the 420k dialogues based
also democratize access to scientific knowledge by pro-
on real cases, which are then used to train LLMs in a
viding layperson summaries of complex research findings,
supervised fine-tuning manner. More recently, HuatuoGPT-
making science more accessible to non-experts and fostering
II (Chen et al., 2023d) proposes a one-stage training with
a broader public understanding of scientific advancements.
instruction-formatting unification of domain data collection
By enhancing the efficiency of research workflows and
for medical adaption upon LLMs, where GPT-4 is used to
fostering interdisciplinary collaborations, LLMs are poised
formulate medical questions to fine-tuning instructions.
to accelerate the pace of scientific discovery and innovation
These diverse studies collectively contribute to the ad-
across various fields. To distill knowledge from an LLM,
vancing field of the medical domain, facilitated by knowl-
DARWIN Series (Xie et al., 2023a) utilizes a semi self-
edge distillation from advanced LLMs. Through the ex-
instruct for instruction generation for science papers, which
ploration of various methodologies, these approaches pro-
is then used to fine-tune an LLM. SciGLM (Zhang et al.,
vide valuable insights into the challenges and potential
2024) proposes to train a scientific LLM, which prompts a
breakthroughs at the intersection of cutting-edge language
teacher LLM to generate detailed answers for unlabelled
models and medical applications.
scientific questions, as well as a self-reflective critic-and-
revise to improve data quality. Besides the above knowledge
5.3 Finance distillation methods to adapt LLMs to science, we will also
delve into how the distillation happens in sub-domains, e.g.,
The application of LLMs to the finance domain (Xue et al.,
mathematics, astronautics, chemistry, etc.
2023) significantly transforms how financial data is ana-
lyzed, decisions are made, and customer interactions are Mathematics. The application of LLMs within the sub-
managed. In finance, LLMs offer unprecedented capabil- domain of mathematics heralds a transformative era in
ities in understanding complex financial documents, pre- mathematical research, education, and problem-solving
dicting market trends, and automating risk assessment, (Azerbayev et al., 2023; Yu et al., 2023b). LLMs in mathemat-
thus enabling more informed and faster decision-making ics facilitate the exploration and understanding of complex
processes. By processing and analyzing vast amounts of mathematical theories and problems by providing intuitive
unstructured financial data, such as news articles, reports, explanations, proofs, and solutions that can bridge the
and real-time market feeds, LLMs can identify patterns gap between advanced mathematical concepts and learn-
and insights that were previously inaccessible, leading to ers at various levels. These models have shown potential
more accurate forecasts and strategic financial planning. in conjecturing new mathematical theorems and patterns,
Furthermore, LLMs enhance customer experiences through thus opening new avenues for research and discovery that
personalized financial advice, automated customer service, might not have been readily accessible to humans alone.
and sophisticated chatbots that can handle complex queries. In education, they serve as personalized tutors, offering
This level of automation and insight has the potential to students step-by-step guidance through mathematical prob-
increase efficiency, reduce operational costs, and improve lems and adapting explanations to the learner’s level of un-
compliance and risk management practices in financial derstanding. This democratizes access to high-quality math-
institutions, making LLMs a transformative force in the ematical education and fosters a deeper appreciation and
finance sector. Knowledge distillation from a proprietary understanding of mathematics among a broader audience.
LLM is still under-explored, and most existing works focus By enhancing collaborative efforts through the generation
on adapting LLMs to finance applications by continual pre- of new ideas and the simplification of complex concepts,
training on finance-specific corpora (Wu et al., 2023g; Lu LLMs are poised to significantly advance the field of math-
et al., 2023) or fine-tuning in a supervised manner on multi- ematics, making it more accessible, efficient, and innova-
task finance-specific instructions (Yang et al., 2023e; Xie tive. WizardMath (Luo et al., 2023b) enhances the mathe-
et al., 2023b; Wang et al., 2023k). matical reasoning capabilities of Llama-2 by applying the
25

novel Reinforcement Learning from Evol-Instruct Feedback physical and electronic properties of crystalline solids from
(RLEIF) method, significantly outperforming other open- text descriptions. This approach underscores the potential of
source LLMs on the GSM8k and MATH benchmarks, as text-based methods in materials science, offering significant
well as surpassing several closed-source LLMs including improvements in prediction accuracy while also contribut-
ChatGPT-3.5 and Minerva. MAmmoTH (Yue et al., 2023a) is ing a benchmark dataset, TextEdge, to foster further re-
a series of open-source LLMS specifically developed for gen- search in this emerging field. InstructMol (Cao et al., 2023a)
eral math problem-solving, achieving superior performance integrates multi-modal data, aligning molecular structures
on nine mathematical reasoning datasets. Utilizing a novel with natural language instructions for drug discovery tasks.
instruction tuning dataset called MathInstruct, which com- Through a novel two-stage instruction-tuning approach,
bines chain-of-thought and program-of-thought rationales, it significantly enhances performance in molecule-related
MAmmoTH models demonstrate substantial improvements tasks, establishing a reliable molecular assistant that outper-
over existing models. TORA (Gou et al., 2024), a series of forms existing LLMs and reduces the performance gap with
Tool-integrated Reasoning Agents, significantly advances specialized models. This demonstrates the value of multi-
mathematical problem-solving by combining natural lan- modal integration in developing versatile tools for complex
guage reasoning with the use of external computational domains like drug discovery.
tools. It markedly outperforms existing open-source models
Biology. In the field of Biology, particularly in the study
on 10 mathematical reasoning datasets, showcasing notable
of proteins, DNA, and RNA, LLMs are revolutionizing our
improvements over both rationale-based and program-
understanding of the fundamental molecules of life. By an-
based approaches, and introduces innovative training tech-
alyzing vast datasets of biological sequences and structures,
niques such as output space shaping to enhance model rea-
LLMs can predict the three-dimensional shapes of proteins,
soning capabilities. G-LLaVA (Gao et al., 2023c) introduces
potential functions, and interactions at a scale and speed
a significant advancement in geometric problem-solving for
beyond traditional computational methods. This capability
LLMs by leveraging a multimodal approach that combines
is critical for unraveling the complexities of biological sys-
text and image data. This model, utilizing the Geo170K
tems, advancing drug discovery by identifying targets and
dataset comprising over 170,000 geometric image-caption
designing molecules with high precision, and understand-
and question-answer pairs, demonstrates remarkable im-
ing genetic diseases through the interpretation of genomic
provements over GPT-4V on the MathVista benchmark.
variations.
Astronautics. The application of LLMs in astronau- Prot2Text (Abdine et al., 2023) introduces a novel multi-
tics (Nguyen et al., 2023) propels the field forward. modal framework for generating protein function descrip-
AstroLLaMA-Chat (Perkowski et al., 2024) is an ad- tions in free text by combining GNNs and LLMs. This
vancement of the AstroLLaMA model, leveraging a 7B- approach, which integrates structural and sequential protein
parameter LLaMA-2 model and targeted continual pre- information, highlights the transformative impact of knowl-
training on a curated astronomy corpus to enhance per- edge distillation through the fusion of GNNs and LLMs
formance in astronomy-focused question-answering. This for accurate protein function prediction, potentially revolu-
model demonstrates significant improvements in special- tionizing research in bioinformatics and biological sciences.
ized topic comprehension and introduces a chat-enabled BioMedGPT (Luo et al., 2023e) introduces a multimodal
version for the astronomy community, highlighting the generative pre-trained transformer specifically designed for
effectiveness of domain-specific knowledge distillation in the biomedicine domain, emphasizing the significance of
achieving superior performance on specialized topics. aligning molecular, protein, and natural language modal-
ities to enhance biomedical question-answering, molecule,
Chemistry and Materials Science. The integration of LLMs
and protein QA tasks. This framework showcases the critical
into Chemistry and Materials Science has revolutionized
role of knowledge distillation in bridging the gap between
the way researchers approach the discovery and develop-
complex biological data and human language, thereby fa-
ment of new compounds and materials. By analyzing vast
cilitating groundbreaking advancements in drug discovery
datasets and scientific literature, LLMs can predict the prop-
and therapeutic target identification. xTrimoPGLM (Chen
erties and behaviors of substances, significantly accelerating
et al., 2024e), a unified 100B-scale pre-trained transformer
the innovation cycle.
model, addresses both protein understanding and genera-
GIMLET (Zhao et al., 2023f), Graph Instruction based
tion tasks by integrating autoencoding and autoregressive
MolecuLe zEro-shoT learning, is a novel approach to
pre-training objectives. Its significant advancements over
molecule property prediction that integrates graph and text
existing models in 18 protein understanding benchmarks
data within a single language model framework, aiming
and its capability in de novo protein sequence generation
to improve instruction-based zero-shot learning for molec-
highlight the model’s importance in advancing the field of
ular tasks. By leveraging a transformer mechanism with
protein science through knowledge distillation.
generalized position embedding and decoupled attention,
GIMLET significantly outperforms traditional molecule-text Geography, Geology, and Environmental Science. The inte-
baselines in zero-shot learning scenarios, demonstrating gration of LLMs into Geography, Geology, and Environmen-
the model’s effectiveness in generalizing from instructions tal Science is revolutionizing these fields by enhancing data
to a broad range of molecule-related tasks without prior analysis, predictive modeling, and interdisciplinary research
explicit task-specific training. LLM-Prop (Rubungo et al., (Roberts et al., 2023; Lin et al., 2023b; Wang et al., 2023l).
2023), leveraging the T5 model, showcases how LLMs can K2 (Deng et al., 2023), the first-ever LLM specialized in
outperform SoTA graph neural networks in predicting the the geoscience domain, demonstrates the significant impact
26

of knowledge distillation in vertical domain specialization. linear function to select the most effective data based on
By adapting the general-domain LLaMA-7B model with a their statistical properties. Li et al. (2023j) propose a data
5.5B token geoscience corpus and introducing the GeoSignal selection pipeline similar to self-distillation, in which the
instruction tuning dataset, K2 showcases enhanced perfor- LLM firstly learns from a small subset of the data to get the
mance in geoscience knowledge understanding and uti- basic ability, and then further uses this learned model to rate
lization. The model’s development highlights a novel ap- for the original dataset. Du et al. (2023b) propose to consider
proach to efficiently gather domain-specific data and align three aspects including quality, coverage, and necessity for
model responses to specialized user queries. OceanGPT (Bi the filtering process. Li et al. (2023k) select instruction data
et al., 2023), introduced as the first LLM for ocean sci- by evaluating their one-shot improvement on a hold-out
ence tasks, underscores the vital role of knowledge distil- set. Li et al. (2024f) recently propose Superfiltering, which is
lation in the vertical domain of oceanography. It leverages able to utilize small language models like GPT2 to filter out
DOINSTRUCT, a novel framework for generating domain- the high-quality subset from a given high-quality dataset.
specific instruction data through multi-agent collaboration, Despite the emergence of these works working on data fil-
and establishes OCEANBENCH, a benchmark for evaluat- tering, How to efficiently select the optimal distillation data
ing LLMs in the ocean domain. MarineGPT (Zheng et al., for LLMs, and How much data is required for distillation
2023b) showcases the transformative potential of knowl- are still unsolved.
edge distillation in the marine domain by leveraging a
novel vision-language model tailored for marine science. Reduce the Distillation Cost (Lightweight Methods) De-
Utilizing the Marine-5M dataset, which includes over 5 spite the remarkable abilities of the latest LLMs, their sig-
million marine image-text pairs, MarineGPT excels in pro- nificant resource requirements underscore the urgent need
viding detailed, accurate, and domain-specific responses. to find efficient solutions to overcome these challenges.
GeoGalactica (Lin et al., 2024) represents a pioneering step Common ways to further reduce the distillation cost include
in specializing LLMs for geoscience, leveraging a 30 billion Model Compression and Efficient Fine-Tuning. In the realm
parameter model pre-trained on a vast geoscience corpus. of Model Compression, Quantization (Frantar et al., 2023;
This model is notable for being the largest of its kind within Dettmers et al., 2022; Kim et al., 2023c; Tao et al., 2022b; Yao
the geoscience domain. et al., 2022; Xiao et al., 2023), Parameter Pruning (Ma et al.,
2023d; Zhang et al., 2023h; Frantar and Alistarh, 2023), and
Low-Rank Approximation (Xu et al., 2023g; Li et al., 2023l)
5.5 Miscellaneous are commonly utilized. In the realm of Efficient Fine-Tuning,
Knowledge distillation of LLMs has vast potential across Parameter Efficient Fine-Tuning (Hu et al., 2023b; Liu et al.,
various verticals beyond the ones previously discussed, 2022c; Wang et al., 2022b; Hu et al., 2021; Li and Liang,
highlighting their versatility and transformative impact 2021; Liu et al., 2022d), and Memory Efficient Fine-Tuning
across different industries. For instance, in the education (Dettmers et al., 2023; Kim et al., 2023d; Malladi et al., 2024)
sector, EduChat (Dan et al., 2023) exemplifies a chatbot are utilized. A detailed survey on Efficient Large Language
system that provides tailored support to teachers, students, Models can be found here in Wan et al. (2024b). The problem
and parents. KD is central to its design, leveraging pre- that remains is how can we further compress the model and
training on educational data followed by fine-tuning with build effective distillation algorithms.
custom instructions to deliver capabilities such as essay
evaluation and emotional support. Similarly, Owl (Guo Multi-Teacher Distillation Most of the existing distilled
et al., 2023b), an LLM designed for IT operations, boosts models are distilled from a single teacher model, how-
operational efficiency using the Owl-Instruct dataset, which ever, it is widely accepted that models trained with dif-
is distilled from ChatGPT. By applying a mixture-of-adapter ferent sources of data have various capabilities. Thus a
strategy for domain-specific tuning, it enhances analysis and question arises: Is it possible to distill knowledge from
performance in IT-related tasks. different teacher models into one student model? BabyL-
lama (Timiryasov and Tastet, 2023) proposes to distill the
knowledge from both the GPT2 and LLaMA into the small-
6 O PEN P ROBLEMS size student models. Ensemble-Instruct (Lee et al., 2023b)
tries to generate both instructions and responses ensembled
Further Data Selection How much data is required for LLM from several different LLMs with RougeL as the indicator.
distillation and how to filter out the low-quality data remain FUSELLM (Wan et al., 2024a) externalizes the collective
open-domain questions. In the field of instruction tuning, knowledge and unique strengths by leveraging the genera-
one of the most commonly used methods for distillation, tive distributions of different LLMs aiming to train a student
Zhou et al. (2023a) propose that only 1000 human-curated model beyond those of any individual source LLM. Despite
high-quality data is enough for the alignment of LLMs, the recent progress in this topic, it still remains an under-
hypothesizing that LLMs have learned the required knowl- explored topic.
edge from pretraining and only a small amount of data is
required for the alignment. Its finding further raises a new Explore Richer Knowledge from Teacher LLMs As indicated
question, how to automatically select the data for better in Table 3, the majority of teacher LLMs are closed-source
distillation? Chen et al. (2023e) directly apply ChatGPT to due to their advanced capabilities. Consequently, current
rate each data sample together with explanations, and then methodologies primarily focus on using the generations
the data is selected based on the rating. Cao et al. (2023b) from these models as hard labels, training student models
split the existing instruction-tuning datasets and trains a through simple supervised fine-tuning. However, beyond
27

the straightforward imitation of output behaviors via hard bines task distribution modeling and knowledge distillation
labels, there is a growing interest in harnessing richer to mitigate catastrophic forgetting without requiring access
knowledge from teacher LLMs, including feedback and to the old data. To evaluate the effectiveness of instruction
feature knowledge, as well as exploring diverse combina- tuning in the context of continuous learning tasks, Zhang
tions of knowledge elicitation methods. As highlighted in et al. (2023i) introduce a more challenging yet practical
the Feedback section, teachers can provide various types of problem called Continual Instruction Tuning (CIT) and also
feedback based on the student’s outputs (Lee et al., 2023a; establish a benchmark suite consisting of learning and eval-
Jiang et al., 2023b; Chen et al., 2023a). Similarly, the Feature uation protocols. Although current research has explored
section discusses how knowledge based on features, such some simple methods to alleviate knowledge forgetting dur-
as logits serving as soft labels, can offer deeper, intrinsic ing model fine-tuning or knowledge distillation processes,
insights into the teacher model (Gu et al., 2024; Agarwal effectively avoiding catastrophic forgetting across domains
et al., 2024). These explorations have demonstrated promis- and skills remains a challenging issue. How to retain the
ing outcomes, suggesting that access to a broader spectrum original model’s capabilities effectively during knowledge
of knowledge can significantly enhance student model per- distillation or transfer processes is still a challenging prob-
formance beyond what is achievable through simple SFT lem.
distillation alone. This highlights the critical need for further
Trustworthy Knowledge Distillation Trustworthiness in
research into varied knowledge extraction methods from
LLMs is paramount, encompassing attributes such as truth-
teacher LLMs to augment the effectiveness of KD processes.
fulness, safety, fairness, robustness, privacy, and adherence
to machine ethics (Sun et al., 2024a). The rapid advancement
Overcoming Catastrophic Forgetting During Distillation
of LLMs brings to the forefront concerns regarding their
Previous research has delved into the fine-tuning of LLMs
trustworthiness, stemming from their complex outputs, the
to acquire the ability to follow instructions or transfer
biases present in vast training datasets, and the potential
knowledge for forthcoming tasks, skills, or domains, lever-
inclusion of private information. Current efforts in KD
aging advancements in LLM technology. Nevertheless, in-
of LLMs primarily focus on distilling various skills from
vestigations have revealed that the continual fine-tuning of
LLMs, with relatively little attention paid to trustworthiness
LLMs on particular datasets (skills, domains) can lead to
aspects. Existing studies tend to concentrate on a subset of
a phenomenon known as catastrophic forgetting, wherein
trustworthiness aspects, such as helpfulness, honesty, and
previously acquired knowledge and problem-solving abil-
harmlessness (Bai et al., 2022a; Yang et al., 2024; Cui et al.,
ities for earlier tasks are compromised (Chen et al., 2023f;
2023a). Consequently, in the distillation process, student
Kotha et al., 2023; Koloski et al., 2023; Wu et al., 2024;
models may inherit issues related to trustworthiness from
Luo et al., 2023f). Earlier studies in machine learning and
their teacher LLMs. As assessed in Sun et al. (2024a), smaller
deep learning have investigated various techniques to help
open-source LLMs generally fall short of their proprietary
mitigate forgetting during the fine-tuning or continue learn-
counterparts in trustworthiness metrics. Therefore, consid-
ing process, such as rehearsal, which entails periodically
ering trustworthiness alongside the distillation of capabil-
revisiting and training on past data (Kirkpatrick et al., 2017;
ities into student models is crucial. It is imperative that
Rostami et al., 2019; Rolnick et al., 2019), as well as reg-
future research on KD not only enhances the capabilities
ularization methods like elastic weight consolidation (Lee
of student models but also ensures that broader aspects of
et al., 2017), or dynamic architecture methods (Mallya et al.,
trustworthiness are meticulously addressed.
2018; Wang et al., 2022c; Hu et al., 2023c; Chen et al., 2023f).
To address the challenges of catastrophic forgetting and to Weak-to-strong Distillation. The concept of “weak-to-
enhance the diversity of generated instructions in knowl- strong generalization” in LLMs (Burns et al., 2023) empha-
edge distillation for LLMs, Jiang et al. (2023b) randomly sizes the potential to leverage weak supervision to elicit
sample an instruction from the easy instructions and also the advanced capabilities of more powerful models. This
prompt the generator to generate a new instruction that approach challenges the traditional distillation paradigm by
belongs to the same domain as the sampled one. In a similar suggesting that even with limited or imperfect supervision,
vein, Li et al. (2023m) study the problem of instruction- it is possible to enhance the performance of LLMs sig-
tuning in multi-modal LLMs knowledge distillation and nificantly. This necessitates exploring innovative strategies
introduce a competitive distillation framework. The model that enable weaker models to guide the learning process
tries to produce new instructions that differ in content but of stronger ones effectively, highlighting the importance
are similar in difficulty to the original pictures in the multi- of developing methods that can bridge the gap between
modal augmentation phase, so as to alleviate catastrophic these models. Such research could unlock new avenues
forgetting of the model and enhance the diversity of the for improving LLMs’ efficiency and effectiveness, making
instruction tuning pool. Chen et al. (2023f) propose the the pursuit of “weak-to-strong distillation” a crucial area
Lifelong-MoE (Mixture-of Experts) architecture based on for future investigations in this LLM era. Initially, Burns
general language models, which dynamically adds model et al. (2023) investigates whether weak model supervision
capacity via adding experts with regularized pretraining. can unlock the full capabilities of much stronger models.
Additionally, the model also introduces implicit regulariza- Through experiments with pre-trained language models in
tion via distillation of the knowledge from old experts and the GPT-4 family across NLP, chess, and reward modeling
gatings to effectively preserve old knowledge. Zeng et al. tasks, it finds that finetuning strong models on weak labels
(2023b) propose a new generative-based rehearsal method leads to better performance than their weak supervisors,
as Dirichlet Continual Learning (DCL). This method com- demonstrating weak-to-strong generalization. Then, Li et al.
28

(2024g) introduce Superfiltering, a method that employs of training and deployment. Our review emphasizes vari-
smaller, weaker models like GPT-2 to select high-quality ous KD approaches, from algorithmic innovations to skill
data for fine-tuning larger, more capable models such as enhancement and vertical distillation. Notably, data aug-
LLaMA2. This approach is rooted in discovering a strong mentation and synthesis within KD emerge as vital tools
consistency in evaluating instruction tuning data difficulty for improving distillation, revealing the powerful synergy
across models of varying sizes. More recently, Ji et al. (2024) between enriched training data and effective model distil-
introduce Aligner, a novel approach for aligning LLMs with lation. As the AI landscape evolves, rapid advancements
human values and intentions by utilizing weak supervisory in model architectures and training methods present both
signals from smaller models to improve the performance challenges and research opportunities for KD of LLMs.
of larger models. However, Burns et al. (2023) find that Future innovation will need to focus on achieving efficiency,
achieving the full capabilities of strong models requires transparency, and ethics while maintaining model trust-
more than naive finetuning, suggesting the need for further worthiness. Furthermore, promising areas such as weak-
research in this area. Therefore, open questions still remain to-strong generalization, self-alignment, and multi-modal
about 1) What are the theoretical and practical limits of LLMs offer the potential to enhance the capabilities of
weak-to-strong distillation? Can weak supervision reliably distilled models. In conclusion, the KD of LLMs is set to play
extract and enhance the full spectrum of capabilities in a pivotal role in the future of AI research. As highlighted
stronger models across all domains, or are there inherent in this survey, sustained research efforts will be critical in
limitations based on model architecture or task specificity? developing accessible, efficient, and responsible AI for all.
2) How do we identify or design the optimal weak su- Importantly, when conducting KD of LLMs like ChatGPT
pervisors for distilling knowledge into stronger models? Is or Llama, it’s essential to comply with the model providers’
there a framework or criteria to predict which weak models terms4 , such as the restrictions on developing competitive
would be most effective in guiding the learning process of products.
more complex models for specific tasks? 3) To what extent
are weak-to-strong distillation techniques transferable and R EFERENCES
scalable across different sizes and types of models? How
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright,
can these methods be adapted to ensure efficacy and ef-
P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray
ficiency in distilling knowledge from very large models to
et al., “Training language models to follow instructions
significantly smaller ones, especially in resource-constrained
with human feedback,” Advances in Neural Information
environments?
Processing Systems, vol. 35, pp. 27 730–27 744, 2022.
Self-Alignment. Aligning LLMs traditionally relies heavily OpenAI, :, J. Achiam, S. Adler, S. Agarwal, L. Ahmad,
on human or teacher LLMs to supply extensive preference I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt,
data. Consequently, the alignment of the student model S. Altman, S. Anadkat, R. Avila, I. Babuschkin, S. Balaji,
is limited by the quantity of distilled preference data and V. Balcom, P. Baltescu, H. Bao, M. Bavarian, J. Belgum,
the teacher’s capabilities. Self-alignment offers a promising I. Bello, J. Berdine, G. Bernadett-Shapiro, C. Berner, L. Bog-
alternative, aiming to enhance alignment beyond the con- donoff, O. Boiko, M. Boyd, A.-L. Brakman, G. Brockman,
straints of teacher-provided preferences. In self-alignment, T. Brooks, M. Brundage, K. Button, T. Cai, R. Campbell,
the student model endeavors to autonomously improve A. Cann, B. Carey, C. Carlson, R. Carmichael, B. Chan,
and align its responses with desired behaviors, including C. Chang, F. Chantzis, D. Chen, S. Chen, R. Chen, J. Chen,
generating model-written feedback, critiques, and explana- M. Chen, B. Chess, C. Cho, C. Chu, H. W. Chung,
tions. Several studies have explored utilizing the student D. Cummings, J. Currier, Y. Dai, C. Decareaux, T. Degry,
model’s inherent capabilities to generate knowledge for N. Deutsch, D. Deville, A. Dhar, D. Dohan, S. Dowling,
alignment (Bai et al., 2022a; Sun et al., 2024b; Li et al., 2024c; S. Dunning, A. Ecoffet, A. Eleti, T. Eloundou, D. Farhi,
Yuan et al., 2024a). Beyond merely producing improved L. Fedus, N. Felix, S. P. Fishman, J. Forte, I. Fulford,
responses (Bai et al., 2022a; Sun et al., 2024b), implemen- L. Gao, E. Georges, C. Gibson, V. Goel, T. Gogineni,
tations of self-alignment include employing the student as G. Goh, R. Gontijo-Lopes, J. Gordon, M. Grafstein, S. Gray,
its reward model to offer feedback (Yuan et al., 2024a), a R. Greene, J. Gross, S. S. Gu, Y. Guo, C. Hallacy, J. Han,
strategy that merges Self-Knowledge with Feedback methods J. Harris, Y. He, M. Heaton, J. Heidecke, C. Hesse,
of eliciting knowledge. We advocate for increasingly lever- A. Hickey, W. Hickey, P. Hoeschele, B. Houghton, K. Hsu,
aging the student model itself to provide feedback, thereby S. Hu, X. Hu, J. Huizinga, S. Jain, S. Jain, J. Jang, A. Jiang,
enhancing self-alignment capabilities. This approach not R. Jiang, H. Jin, D. Jin, S. Jomoto, B. Jonn, H. Jun, T. Kaf-
only facilitates moving beyond traditional human/teacher tan, Łukasz Kaiser, A. Kamali, I. Kanitscheider, N. S.
preference-based rewards but also opens avenues for con- Keskar, T. Khan, L. Kilpatrick, J. W. Kim, C. Kim, Y. Kim,
tinual self-improvement and alignment. H. Kirchner, J. Kiros, M. Knight, D. Kokotajlo, Łukasz
Kondraciuk, A. Kondrich, A. Konstantinidis, K. Kosic,
7 C ONCLUSION AND D ISCUSSION G. Krueger, V. Kuo, M. Lampe, I. Lan, T. Lee, J. Leike,
J. Leung, D. Levy, C. M. Li, R. Lim, M. Lin, S. Lin,
This survey has explored the diverse landscape of knowl-
M. Litwin, T. Lopez, R. Lowe, P. Lue, A. Makanju, K. Mal-
edge distillation for LLMs, highlighting key techniques,
facini, S. Manning, T. Markov, Y. Markovski, B. Mar-
applications, and challenges. KD plays a crucial role in
democratizing access to advanced LLM capabilities, pro- 4. OpenAI Business Terms: https://openai.com/policies/business-
viding cutting-edge advancements without the high costs terms
29

tin, K. Mayer, A. Mayne, B. McGrew, S. M. McKin- Intelligence, 2023.


ney, C. McLeavey, P. McMillan, J. McNeil, D. Medina, H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi,
A. Mehta, J. Menick, L. Metz, A. Mishchenko, P. Mishkin, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale,
V. Monaco, E. Morikawa, D. Mossing, T. Mu, M. Murati, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull,
O. Murk, D. Mély, A. Nair, R. Nakano, R. Nayak, A. Nee- D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao,
lakantan, R. Ngo, H. Noh, L. Ouyang, C. O’Keefe, J. Pa- V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou,
chocki, A. Paino, J. Palermo, A. Pantuliano, G. Parascan- H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann,
dolo, J. Parish, E. Parparita, A. Passos, M. Pavlov, A. Peng, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee,
A. Perelman, F. de Avila Belbute Peres, M. Petrov, H. P. D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov,
de Oliveira Pinto, Michael, Pokorny, M. Pokrass, V. Pong, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizen-
T. Powell, A. Power, B. Power, E. Proehl, R. Puri, A. Rad- stein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M.
ford, J. Rae, A. Ramesh, C. Raymond, F. Real, K. Rimbach, Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor,
C. Ross, B. Rotsted, H. Roussez, N. Ryder, M. Saltarelli, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang,
T. Sanders, S. Santurkar, G. Sastry, H. Schmidt, D. Schnurr, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic,
J. Schulman, D. Selsam, K. Sheppard, T. Sherbakov, S. Edunov, and T. Scialom, “Llama 2: Open foundation
J. Shieh, S. Shoker, P. Shyam, S. Sidor, E. Sigler, M. Simens, and fine-tuned chat models,” 2023.
J. Sitkin, K. Slama, I. Sohl, B. Sokolowsky, Y. Song, A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S.
N. Staudacher, F. P. Such, N. Summers, I. Sutskever, Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lam-
J. Tang, N. Tezak, M. Thompson, P. Tillet, A. Tootoonchian, ple, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock,
E. Tseng, P. Tuggle, N. Turley, J. Tworek, J. F. C. Uribe, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed,
A. Vallone, A. Vijayvergiya, C. Voss, C. Wainwright, J. J. “Mistral 7b,” 2023.
Wang, A. Wang, B. Wang, J. Ward, J. Wei, C. Weinmann, L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang,
A. Welihinda, P. Welinder, J. Weng, L. Weng, M. Wiethoff, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez,
D. Willner, C. Winter, S. Wolrich, H. Wong, L. Workman, and I. Stoica, “Judging llm-as-a-judge with mt-bench and
S. Wu, J. Wu, M. Wu, K. Xiao, T. Xu, S. Yoo, K. Yu, chatbot arena,” CoRR, vol. abs/2306.05685, 2023. [Online].
Q. Yuan, W. Zaremba, R. Zellers, C. Zhang, M. Zhang, Available: https://doi.org/10.48550/arXiv.2306.05685
S. Zhao, T. Zheng, J. Zhuang, W. Zhuk, and B. Zoph, “Gpt- L. Sun, Y. Huang, H. Wang, S. Wu, Q. Zhang, C. Gao,
4 technical report,” 2023. Y. Huang, W. Lyu, Y. Zhang, X. Li, Z. Liu, Y. Liu, Y. Wang,
G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, Z. Zhang, B. Kailkhura, C. Xiong, C. Xiao, C. Li, E. Xing,
R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth et al., F. Huang, H. Liu, H. Ji, H. Wang, H. Zhang, H. Yao,
“Gemini: a family of highly capable multimodal models,” M. Kellis, M. Zitnik, M. Jiang, M. Bansal, J. Zou, J. Pei,
arXiv preprint arXiv:2312.11805, 2023. J. Liu, J. Gao, J. Han, J. Zhao, J. Tang, J. Wang, J. Mitchell,
J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, K. Shu, K. Xu, K.-W. Chang, L. He, L. Huang, M. Backes,
D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, N. Z. Gong, P. S. Yu, P.-Y. Chen, Q. Gu, R. Xu, R. Ying, S. Ji,
T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus, S. Jana, T. Chen, T. Liu, T. Zhou, W. Wang, X. Li, X. Zhang,
“Emergent abilities of large language models,” Trans. X. Wang, X. Xie, X. Chen, X. Wang, Y. Liu, Y. Ye, Y. Cao,
Mach. Learn. Res., vol. 2022, 2022. [Online]. Available: Y. Chen, and Y. Zhao, “Trustllm: Trustworthiness in large
https://openreview.net/forum?id=yzkSU5zdwD language models,” 2024.
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge
Q. V. Le, D. Zhou et al., “Chain-of-thought prompting distillation: A survey,” International Journal of Computer
elicits reasoning in large language models,” Advances in Vision, vol. 129, pp. 1789–1819, 2021.
Neural Information Processing Systems, vol. 35, pp. 24 824– M. Gupta and P. Agrawal, “Compression of deep learning
24 837, 2022. models for text: A survey,” ACM Transactions on Knowledge
X. Xu, C. Tao, T. Shen, C. Xu, H. Xu, G. Long, and J. guang Discovery from Data (TKDD), vol. 16, no. 4, pp. 1–55, 2022.
Lou, “Re-reading improves reasoning in large language S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mi-
models,” 2024. tamura, and E. Hovy, “A survey of data augmentation
P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, approaches for nlp,” arXiv preprint arXiv:2105.03075, 2021.
M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin,
B. Newman, B. Yuan, B. Yan, C. Zhang, C. Cosgrove, P. Liang, and T. B. Hashimoto, “Stanford alpaca: An
C. D. Manning, C. Ré, D. Acosta-Navas, D. A. Hudson, instruction-following llama model,” https://github.com/
E. Zelikman, E. Durmus, F. Ladhak, F. Rong, H. Ren, tatsu-lab/stanford alpaca, 2023.
H. Yao, J. Wang, K. Santhanam, L. J. Orr, L. Zheng, Y. Gu, L. Dong, F. Wei, and M. Huang, “MiniLLM:
M. Yüksekgönül, M. Suzgun, N. Kim, N. Guha, N. S. Knowledge distillation of large language models,” in The
Chatterji, O. Khattab, P. Henderson, Q. Huang, R. Chi, Twelfth International Conference on Learning Representations,
S. M. Xie, S. Santurkar, S. Ganguli, T. Hashimoto, T. Icard, 2024. [Online]. Available: https://openreview.net/forum?
T. Zhang, V. Chaudhary, W. Wang, X. Li, Y. Mai, Y. Zhang, id=5h0qf7IBZZ
and Y. Koreeda, “Holistic evaluation of language R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. R.
models,” CoRR, vol. abs/2211.09110, 2022. [Online]. Garea, M. Geist, and O. Bachem, “On-policy distillation
Available: https://doi.org/10.48550/arXiv.2211.09110 of language models: Learning from self-generated
X. Wu, R. Duan, and J. Ni, “Unveiling security, privacy, mistakes,” in The Twelfth International Conference on
and ethical concerns of chatgpt,” Journal of Information and Learning Representations, 2024. [Online]. Available: https:
30

//openreview.net/forum?id=3zKtaqxLhW ation for Computational Linguistics, 2023, pp. 8003–8017.


W. Yuan, R. Y. Pang, K. Cho, S. Sukhbaatar, J. Xu, and A. Mitra, L. D. Corro, S. Mahajan, A. Codas, C. Simoes,
J. Weston, “Self-rewarding language models,” 2024. S. Agarwal, X. Chen, A. Razdaibiedina, E. Jones, K. Aggar-
Z. Chen, Y. Deng, H. Yuan, K. Ji, and Q. Gu, “Self-play wal, H. Palangi, G. Zheng, C. Rosset, H. Khanpour, and
fine-tuning converts weak language models to strong A. Awadallah, “Orca 2: Teaching small language models
language models,” 2024. how to reason,” 2023.
Y. Huang, Y. Chen, Z. Yu, and K. McKeown, “In-context C. Xu, D. Guo, N. Duan, and J. J. McAuley, “Baize: An open-
learning distillation: Transferring few-shot learning abil- source chat model with parameter-efficient tuning on self-
ity of pre-trained language models,” 2022. chat data,” in EMNLP. Association for Computational
G. Cui, L. Yuan, N. Ding, G. Yao, W. Zhu, Y. Ni, Linguistics, 2023, pp. 6268–6278.
G. Xie, Z. Liu, and M. Sun, “Ultrafeedback: Boosting lan- X. Yue, X. Qu, G. Zhang, Y. Fu, W. Huang, H. Sun, Y. Su,
guage models with high-quality feedback,” arXiv preprint and W. Chen, “Mammoth: Building math generalist mod-
arXiv:2310.01377, 2023. els through hybrid instruction tuning,” arXiv preprint
S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, arXiv:2309.05653, 2023.
and A. Awadallah, “Orca: Progressive learning from L. Chenglin, C. Qianglong, W. Caiyu, and Z. Yin, “Mixed
complex explanation traces of gpt-4,” arXiv preprint distillation helps smaller language model better reason-
arXiv:2306.02707, 2023. ing,” 2023.
B. Ding, C. Qin, L. Liu, Y. K. Chia, B. Li, S. Joty, and L. Bing, Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith,
“Is GPT-3 a good data annotator?” in ACL (1). Asso- D. Khashabi, and H. Hajishirzi, “Self-instruct: Aligning
ciation for Computational Linguistics, 2023, pp. 11 173– language model with self generated instructions,” arXiv
11 195. preprint arXiv:2212.10560, 2022.
S. Chaudhary, “Code alpaca: An instruction-following Z. Sun, Y. Shen, Q. Zhou, H. Zhang, Z. Chen, D. Cox,
llama model for code generation,” https://github.com/ Y. Yang, and C. Gan, “Principle-driven self-alignment
sahil280114/codealpaca, 2023. of language models from scratch with minimal human
H. Wang, C. Liu, N. Xi, Z. Qiang, S. Zhao, B. Qin, and supervision,” Advances in Neural Information Processing
T. Liu, “Huatuo: Tuning llama model with chinese medi- Systems, vol. 36, 2024.
cal knowledge,” arXiv preprint arXiv:2304.06975, 2023. Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma,
LawGPT. GitHub, 2023. Q. Lin, and D. Jiang, “Wizardcoder: Empowering code
D. Zhang, Z. Hu, S. Zhoubian, Z. Du, K. Yang, Z. Wang, large language models with evol-instruct,” arXiv preprint
Y. Yue, Y. Dong, and J. Tang, “Sciglm: Training arXiv:2306.08568, 2023.
scientific language models with self-reflective instruction H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, X. Geng,
annotation and tuning,” CoRR, vol. abs/2401.07950, 2024. Q. Lin, S. Chen, and D. Zhang, “Wizardmath: Empower-
[Online]. Available: https://doi.org/10.48550/arXiv.2401. ing mathematical reasoning for large language models via
07950 reinforced evol-instruct,” arXiv preprint arXiv:2308.09583,
W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, 2023.
L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, H. Dai, Z. Liu, W. Liao, X. Huang, Y. Cao, Z. Wu, L. Zhao,
I. Stoica, and E. P. Xing, “Vicuna: An open-source chatbot S. Xu, W. Liu, N. Liu, S. Li, D. Zhu, H. Cai, L. Sun, Q. Li,
impressing gpt-4 with 90%* chatgpt quality,” March 2023. D. Shen, T. Liu, and X. Li, “Auggpt: Leveraging chatgpt
[Online]. Available: https://lmsys.org/blog/2023-03-30- for text data augmentation,” 2023.
vicuna/ Z. He, M. T. Ribeiro, and F. Khani, “Targeted data
C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, generation: Finding and fixing model weaknesses,”
and D. Jiang, “Wizardlm: Empowering large language in Proceedings of the 61st Annual Meeting of the
models to follow complex instructions,” arXiv preprint Association for Computational Linguistics (Volume 1: Long
arXiv:2304.12244, 2023. Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki,
W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, Eds. Toronto, Canada: Association for Computational
B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Linguistics, Jul. 2023, pp. 8506–8520. [Online]. Available:
Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J.-Y. https://aclanthology.org/2023.acl-long.474
Nie, and J.-R. Wen, “A survey of large language models,” N. Ding, Y. Chen, B. Xu, Y. Qin, S. Hu, Z. Liu, M. Sun,
2023. and B. Zhou, “Enhancing chat language models by scaling
X. He, Z. Lin, Y. Gong, A. Jin, H. Zhang, C. Lin, J. Jiao, S. M. high-quality instructional conversations,” in EMNLP. As-
Yiu, N. Duan, W. Chen et al., “Annollm: Making large sociation for Computational Linguistics, 2023, pp. 3029–
language models to be better crowdsourced annotators,” 3051.
arXiv preprint arXiv:2303.16854, 2023. S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. D.
Y. Wang, Z. Yu, Z. Zeng, L. Yang, C. Wang, H. Chen, C. Jiang, Giorno, S. Gopi, M. Javaheripi, P. Kauffmann, G. de Rosa,
R. Xie, J. Wang, X. Xie, W. Ye, S. Zhang, and Y. Zhang, O. Saarikivi, A. Salim, S. Shah, H. S. Behl, X. Wang,
“Pandalm: An automatic evaluation benchmark for llm S. Bubeck, R. Eldan, A. T. Kalai, Y. T. Lee, and Y. Li,
instruction tuning optimization,” 2023. “Textbooks are all you need,” 2023.
C. Hsieh, C. Li, C. Yeh, H. Nakhost, Y. Fujii, A. Ratner, Y. Li, S. Bubeck, R. Eldan, A. Del Giorno, S. Gunasekar, and
R. Krishna, C. Lee, and T. Pfister, “Distilling step-by-step! Y. T. Lee, “Textbooks are all you need ii: phi-1.5 technical
outperforming larger language models with less training report,” arXiv preprint arXiv:2309.05463, 2023.
data and smaller model sizes,” in ACL (Findings). Associ- Phi-2: The surprising power of small lan-
31

guage models, December 2023. [Online]. Avail- white-box models for better human alignment,” arXiv
able: https://www.microsoft.com/en-us/research/blog/ preprint arXiv:2310.16271, 2023.
phi-2-the-surprising-power-of-small-language-models/ H. Lee, S. Phatale, H. Mansoor, K. Lu, T. Mesnard, C. Bishop,
Y. Wei, Z. Wang, J. Liu, Y. Ding, and L. Zhang, “Magicoder: V. Carbune, and A. Rastogi, “Rlaif: Scaling reinforcement
Source code is all you need,” 2023. learning from human feedback with ai feedback,” arXiv
Z. Yu, X. Zhang, N. Shang, Y. Huang, C. Xu, Y. Zhao, W. Hu, preprint arXiv:2309.00267, 2023.
and Q. Yin, “Wavecoder: Widespread and versatile en- Y. Jiang, C. Chan, M. Chen, and W. Wang, “Lion: Adversarial
hanced instruction tuning with refined data generation,” distillation of closed-source large language model,” arXiv
2024. preprint arXiv:2305.12870, 2023.
J. Ye, J. Gao, Q. Li, H. Xu, J. Feng, Z. Wu, T. Yu, and H. Chen, A. Saha, S. Hoi, and S. Joty, “Personalized
L. Kong, “Zerogen: Efficient zero-shot learning via dataset distillation: Empowering open-sourced LLMs with
generation,” in EMNLP. Association for Computational adaptive learning for code generation,” in The 2023
Linguistics, 2022, pp. 11 653–11 669. Conference on Empirical Methods in Natural Language
J. Gao, R. Pi, Y. Lin, H. Xu, J. Ye, Z. Wu, W. Zhang, Processing, 2023. [Online]. Available: https://openreview.
X. Liang, Z. Li, and L. Kong, “Self-guided noise-free data net/forum?id=alxWMBcNVN
generation for efficient zero-shot learning,” in The Eleventh K. Yang, D. Klein, A. Celikyilmaz, N. Peng, and Y. Tian,
International Conference on Learning Representations, ICLR “RLCD: Reinforcement learning from contrastive distilla-
2023, Kigali, Rwanda, May 1-5, 2023, 2023. [Online]. tion for LM alignment,” in The Twelfth International Confer-
Available: https://openreview.net/pdf?id=h5OpjGd lo6 ence on Learning Representations, 2024. [Online]. Available:
L. H. Bonifacio, H. Q. Abonizio, M. Fadaee, and R. F. https://openreview.net/forum?id=v3XXtxWKi6
Nogueira, “Inpars: Data augmentation for information J. Jung, P. West, L. Jiang, F. Brahman, X. Lu, J. Fisher,
retrieval using large language models,” CoRR, vol. T. Sorensen, and Y. Choi, “Impossible distillation: from
abs/2202.05144, 2022. [Online]. Available: https://arxiv. low-quality model to high-quality dataset & model for
org/abs/2202.05144 summarization and paraphrasing,” 2023.
I. Timiryasov and J.-L. Tastet, “Baby llama: knowledge J. Huang, S. Gu, L. Hou, Y. Wu, X. Wang, H. Yu, and
distillation from an ensemble of teachers trained on J. Han, “Large language models can self-improve,” in
a small dataset with no performance penalty,” in Proceedings of the 2023 Conference on Empirical Methods
Proceedings of the BabyLM Challenge at the 27th Conference in Natural Language Processing, H. Bouamor, J. Pino, and
on Computational Natural Language Learning, A. Warstadt, K. Bali, Eds. Singapore: Association for Computational
A. Mueller, L. Choshen, E. Wilcox, C. Zhuang, J. Ciro, Linguistics, Dec. 2023, pp. 1051–1068. [Online]. Available:
R. Mosquera, B. Paranjabe, A. Williams, T. Linzen, https://aclanthology.org/2023.emnlp-main.67
and R. Cotterell, Eds. Singapore: Association for C. Gulcehre, T. L. Paine, S. Srinivasan, K. Konyushkova,
Computational Linguistics, Dec. 2023, pp. 279–289. L. Weerts, A. Sharma, A. Siddhant, A. Ahern, M. Wang,
[Online]. Available: https://aclanthology.org/2023.conll- C. Gu, W. Macherey, A. Doucet, O. Firat, and N. de Freitas,
babylm.24 “Reinforced self-training (rest) for language modeling,”
C. Tao, L. Hou, W. Zhang, L. Shang, X. Jiang, Q. Liu, 2023.
P. Luo, and N. Wong, “Compression of generative pre- E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman, “Star: Boot-
trained language models via quantization,” arXiv preprint strapping reasoning with reasoning,” in NeurIPS, 2022.
arXiv:2203.10705, 2022. V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “Distilbert,
Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, Y. Mehdad, a distilled version of bert: smaller, faster, cheaper and
Y. Shi, R. Krishnamoorthi, and V. Chandra, “Llm-qat: lighter,” arXiv preprint arXiv:1910.01108, 2019.
Data-free quantization aware training for large language Y. Wen, Z. Li, W. Du, and L. Mou, “f-divergence
models,” arXiv preprint arXiv:2305.17888, 2023. minimization for sequence-level knowledge distillation,”
Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, in Proceedings of the 61st Annual Meeting of the Association
A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Chen, for Computational Linguistics (Volume 1: Long Papers),
C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Gan- A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto,
guli, D. Li, E. Tran-Johnson, E. Perez, J. Kerr, J. Mueller, Canada: Association for Computational Linguistics, Jul.
J. Ladish, J. Landau, K. Ndousse, K. Lukosuite, L. Lovitt, 2023, pp. 10 817–10 834. [Online]. Available: https:
M. Sellitto, N. Elhage, N. Schiefer, N. Mercado, N. Das- //aclanthology.org/2023.acl-long.605
Sarma, R. Lasenby, R. Larson, S. Ringer, S. Johnston, C. Liang, S. Zuo, Q. Zhang, P. He, W. Chen, and T. Zhao,
S. Kravec, S. E. Showk, S. Fort, T. Lanham, T. Telleen- “Less is more: Task-aware layer-wise distillation for lan-
Lawton, T. Conerly, T. Henighan, T. Hume, S. R. Bow- guage model compression,” in International Conference on
man, Z. Hatfield-Dodds, B. Mann, D. Amodei, N. Joseph, Machine Learning. PMLR, 2023, pp. 20 852–20 867.
S. McCandlish, T. Brown, and J. Kaplan, “Constitutional M. Kwon, S. M. Xie, K. Bullard, and D. Sadigh, “Reward de-
ai: Harmlessness from ai feedback,” 2022. sign with language models,” in ICLR. OpenReview.net,
L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, 2023.
Y. Belkada, S. Huang, L. von Werra, C. Fourrier, N. Habib B. Peng, C. Li, P. He, M. Galley, and J. Gao, “Instruction
et al., “Zephyr: Direct distillation of lm alignment,” arXiv tuning with gpt-4,” 2023.
preprint arXiv:2310.16944, 2023. G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, and
J. Hong, Q. Tu, C. Chen, X. Gao, J. Zhang, and R. Yan, B. Ghanem, “Camel: Communicative agents for” mind”
“Cyclealign: Iterative distillation from black-box llm to exploration of large scale language model society,” arXiv
32

preprint arXiv:2303.17760, 2023. arXiv preprint arXiv:2304.11116, 2023.


G. Wang, S. Cheng, X. Zhan, X. Li, S. Song, and Y. Liu, S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla:
“OpenChat: Advancing Open-source Language Models Large language model connected with massive apis,”
with Mixed-Quality Data,” Sep. 2023, arXiv:2309.11235 2023.
[cs]. [Online]. Available: http://arxiv.org/abs/2309.11235 Q. Tang, Z. Deng, H. Lin, X. Han, Q. Liang, B. Cao, and
M. Kang, S. Lee, J. Baek, K. Kawaguchi, and S. J. Hwang, L. Sun, “Toolalpaca: Generalized tool learning for lan-
“Knowledge-augmented reasoning distillation for small guage models with 3000 simulated cases,” 2023.
language models in knowledge-intensive tasks,” arXiv Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong,
preprint arXiv:2305.18395, 2023. X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie,
H. Luo, Y.-S. Chuang, Y. Gong, T. Zhang, Y. Kim, X. Wu, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun, “Toolllm:
D. Fox, H. Meng, and J. Glass, “Sail: Search-augmented in- Facilitating large language models to master 16000+ real-
struction learning,” arXiv preprint arXiv:2305.15225, 2023. world apis,” 2023.
A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, “Self- L. Yuan, Y. Chen, X. Wang, Y. R. Fung, H. Peng, and H. Ji,
rag: Learning to retrieve, generate, and critique through “Craft: Customizing llms by creating and retrieving from
self-reflection,” arXiv preprint arXiv:2310.11511, 2023. specialized toolsets,” 2023.
S. Ye, Y. Jo, D. Kim, S. Kim, H. Hwang, and M. Seo, “Selfee: S. Gao, Z. Shi, M. Zhu, B. Fang, X. Xin, P. Ren, Z. Chen,
Iterative self-revising llm empowered by self-feedback J. Ma, and Z. Ren, “Confucius: Iterative tool learning from
generation,” Blog post, May 2023. [Online]. Available: introspection feedback by easy-to-difficult curriculum,”
https://kaistai.github.io/SelFee/ 2023.
P. Wang, L. Li, L. Chen, F. Song, B. Lin, Y. Cao, T. Liu, and C. Wang, W. Luo, Q. Chen, H. Mai, J. Guo, S. Dong, Xiaohua,
Z. Sui, “Making large language models better reasoners Xuan, Z. Li, L. Ma, and S. Gao, “Mllm-tool: A multimodal
with alignment,” 2023. large language model for tool agent learning,” 2024.
D. Cheng, S. Huang, and F. Wei, “Adapting large language W. Shen, C. Li, H. Chen, M. Yan, X. Quan, H. Chen, J. Zhang,
models via reading comprehension,” 2023. and F. Huang, “Small llms are weak tool learners: A multi-
Y. Zhang, Z. Chen, Y. Fang, L. Cheng, Y. Lu, F. Li, W. Zhang, llm agent,” 2024.
and H. Chen, “Knowledgeable preference alignment for B. Chen, C. Shu, E. Shareghi, N. Collier, K. Narasimhan,
llms in domain-specific question answering,” 2023. and S. Yao, “Fireact: Toward language agent fine-tuning,”
J. Scheurer, J. A. Campos, T. Korbak, J. S. Chan, A. Chen, 2023.
K. Cho, and E. Perez, “Training language models with A. Zeng, M. Liu, R. Lu, B. Wang, X. Liu, Y. Dong, and J. Tang,
language feedback at scale,” 2023. “Agenttuning: Enabling generalized agent abilities for
S. Kim, S. Bae, J. Shin, S. Kang, D. Kwak, K. Yoo, llms,” 2023.
and M. Seo, “Aligning large language models through D. Yin, F. Brahman, A. Ravichander, K. Chandu, K.-W.
synthetic feedback,” in Proceedings of the 2023 Conference Chang, Y. Choi, and B. Y. Lin, “Lumos: Learning agents
on Empirical Methods in Natural Language Processing, with unified data, modular design, and open-source
H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: llms,” 2023.
Association for Computational Linguistics, Dec. 2023, pp. S. Qiao, N. Zhang, R. Fang, Y. Luo, W. Zhou, Y. E. Jiang,
13 677–13 700. [Online]. Available: https://aclanthology. C. Lv, and H. Chen, “Autoact: Automatic agent learning
org/2023.emnlp-main.844 from scratch via self-planning,” 2024.
P. Roit, J. Ferret, L. Shani, R. Aharoni, G. Cideron, Y. Kong, J. Ruan, Y. Chen, B. Zhang, T. Bao, S. Shi, G. Du,
R. Dadashi, M. Geist, S. Girgin, L. Hussenot, O. Keller, X. Hu, H. Mao, Z. Li, X. Zeng, and R. Zhao, “Tptu-v2:
N. Momchev, S. Ramos Garea, P. Stanczyk, N. Vieillard, Boosting task planning and tool usage of large language
O. Bachem, G. Elidan, A. Hassidim, O. Pietquin, model-based agents in real-world systems,” 2023.
and I. Szpektor, “Factually consistent summarization F. Gilardi, M. Alizadeh, and M. Kubli, “Chatgpt
via reinforcement learning with textual entailment outperforms crowd workers for text-annotation tasks,”
feedback,” in Proceedings of the 61st Annual Meeting of Proceedings of the National Academy of Sciences, vol.
the Association for Computational Linguistics (Volume 1: 120, no. 30, Jul. 2023. [Online]. Available: http:
Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, //dx.doi.org/10.1073/pnas.2305016120
Eds. Toronto, Canada: Association for Computational Z. Wang, A. W. Yu, O. Firat, and Y. Cao, “Towards zero-label
Linguistics, Jul. 2023, pp. 6252–6272. [Online]. Available: language learning,” 2021.
https://aclanthology.org/2023.acl-long.344 Y. Xu, R. Xu, D. Iter, Y. Liu, S. Wang, C. Zhu,
Y. Yang, E. Chern, X. Qiu, G. Neubig, and P. Liu, “Alignment and M. Zeng, “InheritSumm: A general, versatile
for honesty,” arXiv preprint arXiv:2312.07000, 2023. and compact summarizer by distilling from GPT,” in
R. Liu, R. Yang, C. Jia, G. Zhang, D. Zhou, A. M. Dai, Findings of the Association for Computational Linguistics:
D. Yang, and S. Vosoughi, “Training socially aligned lan- EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds.
guage models on simulated social interactions,” 2023. Singapore: Association for Computational Linguistics,
T. Schick, J. Dwivedi-Yu, R. Dessı̀, R. Raileanu, M. Lomeli, Dec. 2023, pp. 13 879–13 892. [Online]. Available: https:
L. Zettlemoyer, N. Cancedda, and T. Scialom, “Tool- //aclanthology.org/2023.findings-emnlp.927
former: Language models can teach themselves to use F. Xu, W. Shi, and E. Choi, “RECOMP: Improving retrieval-
tools,” 2023. augmented LMs with context compression and selective
J. Zhang, “Graph-toolformer: To empower llms with graph augmentation,” in The Twelfth International Conference
reasoning ability via prompt augmented by chatgpt,” on Learning Representations, 2024. [Online]. Available:
33

https://openreview.net/forum?id=mlJLVigNHp Q. Liu, N. Chen, T. Sakai, and X.-M. Wu, “Once: Boost-


S. Ramnath, B. Joshi, S. Hallinan, X. Lu, L. H. Li, A. Chan, ing content-based recommendation with both open- and
J. Hessel, Y. Choi, and X. Ren, “Tailoring self-rationalizers closed-source large language models,” 2023.
with multi-reward distillation,” 2023. S. Kim, J. Shin, Y. Cho, J. Jang, S. Longpre, H. Lee,
S. Wang, Y. Liu, Y. Xu, C. Zhu, and M. Zeng, S. Yun, S. Shin, S. Kim, J. Thorne, and M. Seo,
“Want to reduce labeling cost? GPT-3 can help,” in “Prometheus: Inducing evaluation capability in language
Findings of the Association for Computational Linguistics: models,” in The Twelfth International Conference on
EMNLP 2021, M.-F. Moens, X. Huang, L. Specia, Learning Representations, 2024. [Online]. Available: https:
and S. W.-t. Yih, Eds. Punta Cana, Dominican //openreview.net/forum?id=8euJaTveKw
Republic: Association for Computational Linguistics, W. Xu, D. Wang, L. Pan, Z. Song, M. Freitag, W. Wang,
Nov. 2021, pp. 4195–4205. [Online]. Available: https: and L. Li, “INSTRUCTSCORE: Towards explainable
//aclanthology.org/2021.findings-emnlp.354 text generation evaluation with automatic feedback,” in
Z. Guo, P. Wang, Y. Wang, and S. Yu, “Improving small Proceedings of the 2023 Conference on Empirical Methods
language models on pubmedqa via generative data aug- in Natural Language Processing, H. Bouamor, J. Pino, and
mentation,” 2023. K. Bali, Eds. Singapore: Association for Computational
W. Yang and G. Nicolai, “Neural machine translation data Linguistics, Dec. 2023, pp. 5967–5994. [Online]. Available:
generation and augmentation using chatgpt,” 2023. https://aclanthology.org/2023.emnlp-main.365
K. Srinivasan, K. Raman, A. Samanta, L. Liao, L. Bertelli, D. Jiang, Y. Li, G. Zhang, W. Huang, B. Y. Lin, and W. Chen,
and M. Bendersky, “QUILL: Query intent with large “Tigerscore: Towards building explainable metric for all
language models using retrieval augmentation and text generation tasks,” 2023.
multi-stage distillation,” in Proceedings of the 2022 J. Li, S. Sun, W. Yuan, R.-Z. Fan, hai zhao, and P. Liu,
Conference on Empirical Methods in Natural Language “Generative judge for evaluating alignment,” in The
Processing: Industry Track, Y. Li and A. Lazaridou, Twelfth International Conference on Learning Representations,
Eds. Abu Dhabi, UAE: Association for Computational 2024. [Online]. Available: https://openreview.net/forum?
Linguistics, Dec. 2022, pp. 492–501. [Online]. Available: id=gtkFw6sZGS
https://aclanthology.org/2022.emnlp-industry.50 B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E.
Z. Dai, V. Y. Zhao, J. Ma, Y. Luan, J. Ni, Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, A. Kozhevnikov,
J. Lu, A. Bakalov, K. Guu, K. B. Hall, and I. Evtimov, J. Bitton, M. Bhatt, C. C. Ferrer, A. Grattafiori,
M. Chang, “Promptagator: Few-shot dense retrieval W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron,
from 8 examples,” in The Eleventh International L. Martin, N. Usunier, T. Scialom, and G. Synnaeve, “Code
Conference on Learning Representations, ICLR 2023, Kigali, llama: Open foundation models for code,” 2023.
Rwanda, May 1-5, 2023, 2023. [Online]. Available: B. Liu, C. Chen, C. Liao, Z. Gong, H. Wang, Z. Lei, M. Liang,
https://openreview.net/pdf?id=gmL46YMpu2J D. Chen, M. Shen, H. Zhou, H. Yu, and J. Li, “Mftcoder:
R. Meng, Y. Liu, S. Yavuz, D. Agarwal, L. Tu, N. Yu, J. Zhang, Boosting code llms with multitask fine-tuning,” 2023.
M. Bhat, and Y. Zhou, “Augtriever: Unsupervised dense N. Jain, T. Zhang, W. Chiang, J. E. Gonzalez, K. Sen,
retrieval by scalable data augamentation,” 2023. and I. Stoica, “Llm-assisted code cleaning for training
W. Sun, L. Yan, X. Ma, S. Wang, P. Ren, Z. Chen, D. Yin, and accurate code generators,” CoRR, vol. abs/2311.14904,
Z. Ren, “Is chatgpt good at search? investigating large 2023. [Online]. Available: https://doi.org/10.48550/
language models as re-ranking agents,” 2023. arXiv.2311.14904
R. Pradeep, S. Sharifymoghaddam, and J. Lin, “Rankvicuna: H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction
Zero-shot listwise document reranking with open-source tuning,” in NeurIPS, 2023.
large language models,” 2023. B. Zhao, B. Wu, M. He, and T. Huang, “Svit: Scaling up
——, “Rankzephyr: Effective and robust zero-shot listwise visual instruction tuning,” 2023.
reranking is a breeze!” 2023. J. Wang, L. Meng, Z. Weng, B. He, Z. Wu, and Y.-G. Jiang,
F. Ferraretto, T. Laitz, R. Lotufo, and R. Nogueira, “To see is to believe: Prompting gpt-4v for better visual
“Exaranker: Synthetic explanations improve neural instruction tuning,” 2023.
rankers,” in Proceedings of the 46th International ACM SIGIR K. Chen, Z. Zhang, W. Zeng, R. Zhang, F. Zhu, and R. Zhao,
Conference on Research and Development in Information “Shikra: Unleashing multimodal llm’s referential dialogue
Retrieval, ser. SIGIR ’23. New York, NY, USA: Association magic,” 2023.
for Computing Machinery, 2023, p. 2409–2414. [Online]. J. S. Park, J. Hessel, K. R. Chandu, P. P. Liang, X. Lu,
Available: https://doi.org/10.1145/3539618.3592067 P. West, Y. Yu, Q. Huang, J. Gao, A. Farhadi, and Y. Choi,
S. Mysore, A. Mccallum, and H. Zamani, “Large language “Localized symbolic knowledge distillation for visual
model augmented narrative driven recommendations,” commonsense models,” 2023.
in Proceedings of the 17th ACM Conference on Recommender R. Pi, J. Gao, S. Diao, R. Pan, H. Dong, J. Zhang, L. Yao,
Systems, ser. RecSys ’23. New York, NY, USA: Association J. Han, H. Xu, L. Kong, and T. Zhang, “DetGPT:
for Computing Machinery, 2023, p. 777–783. [Online]. Detect what you need via reasoning,” in Proceedings
Available: https://doi.org/10.1145/3604915.3608829 of the 2023 Conference on Empirical Methods in Natural
J. Zhang, R. Xie, Y. Hou, W. X. Zhao, L. Lin, and J.-R. Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds.
Wen, “Recommendation as instruction following: A large Singapore: Association for Computational Linguistics,
language model empowered recommendation approach,” Dec. 2023, pp. 14 172–14 189. [Online]. Available: https:
2023. //aclanthology.org/2023.emnlp-main.876
34

L. Zhao, E. Yu, Z. Ge, J. Yang, H. Wei, H. Zhou, J. Sun, model-based chatbot system for intelligent education,”
Y. Peng, R. Dong, C. Han, and X. Zhang, “Chatspot: CoRR, vol. abs/2308.02773, 2023. [Online]. Available:
Bootstrapping multimodal llms via precise referring in- https://doi.org/10.48550/arXiv.2308.02773
struction tuning,” 2023. H. Guo, J. Yang, J. Liu, L. Yang, L. Chai, J. Bai, J. Peng, X. Hu,
F. Liu, K. Lin, L. Li, J. Wang, Y. Yacoob, and L. Wang, C. Chen, D. Zhang, X. Shi, T. Zheng, L. Zheng, B. Zhang,
“Mitigating hallucination in large multi-modal models via K. Xu, and Z. Li, “OWL: A large language model for IT
robust instruction tuning,” 2023. operations,” CoRR, vol. abs/2309.09298, 2023. [Online].
S. Wu, H. Fei, L. Qu, W. Ji, and T.-S. Chua, “Next-gpt: Any- Available: https://doi.org/10.48550/arXiv.2309.09298
to-any multimodal llm,” 2023. Y. Kim and A. M. Rush, “Sequence-level knowledge distil-
R. Luo, Z. Zhao, M. Yang, J. Dong, D. Li, P. Lu, T. Wang, lation,” arXiv preprint arXiv:1606.07947, 2016.
L. Hu, M. Qiu, and Z. Wei, “Valley: Video assistant with S. Han, H. Mao, and W. J. Dally, “Deep compression:
large language model enhanced ability,” 2023. Compressing deep neural networks with pruning, trained
Y. Jiang, E. Schoop, A. Swearngin, and J. Nichols, “Iluvui: quantization and huffman coding,” International Confer-
Instruction-tuned language-vision modeling of uis from ence on Learning Representations (ICLR), 2016.
machine conversations,” 2023. V. Gangal, S. Y. Feng, M. Alikhani, T. Mitamura, and
Y. Li, C. Zhang, G. Yu, Z. Wang, B. Fu, G. Lin, C. Shen, E. Hovy, “Nareor: The narrative reordering problem,” in
L. Chen, and Y. Wei, “Stablellava: Enhanced visual in- Proceedings of the AAAI Conference on Artificial Intelligence,
struction tuning with synthesized image-dialogue data,” vol. 36, no. 10, 2022, pp. 10 645–10 653.
2023. S. Longpre, Y. Lu, Z. Tu, and C. DuBois, “An exploration of
R. Xu, X. Wang, T. Wang, Y. Chen, J. Pang, and D. Lin, data augmentation and sampling techniques for domain-
“Pointllm: Empowering large language models to under- agnostic question answering,” in Proceedings of the 2nd
stand point clouds,” 2023. Workshop on Machine Reading for Question Answering,
Q. Huang, M. Tao, Z. An, C. Zhang, C. Jiang, Z. Chen, A. Fisch, A. Talmor, R. Jia, M. Seo, E. Choi, and D. Chen,
Z. Wu, and Y. Feng, “Lawyer llama technical report,” Eds. Hong Kong, China: Association for Computational
arXiv preprint arXiv:2305.15062, 2023. Linguistics, Nov. 2019, pp. 220–227. [Online]. Available:
J. Cui, Z. Li, Y. Yan, B. Chen, and L. Yuan, “Chatlaw: Open- https://aclanthology.org/D19-5829
source legal large language model with integrated ex- P. West, C. Bhagavatula, J. Hessel, J. Hwang, L. Jiang,
ternal knowledge bases,” arXiv preprint arXiv:2306.16092, R. Le Bras, X. Lu, S. Welleck, and Y. Choi, “Symbolic
2023. knowledge distillation: from general language models
H. Zhang, J. Chen, F. Jiang, F. Yu, Z. Chen, G. Chen, to commonsense models,” in Proceedings of the 2022
J. Li, X. Wu, Z. Zhiyi, Q. Xiao, X. Wan, B. Wang, Conference of the North American Chapter of the Association
and H. Li, “HuatuoGPT, towards taming language for Computational Linguistics: Human Language Technologies,
model to be a doctor,” in Findings of the Association M. Carpuat, M.-C. de Marneffe, and I. V. Meza Ruiz, Eds.
for Computational Linguistics: EMNLP 2023, H. Bouamor, Seattle, United States: Association for Computational
J. Pino, and K. Bali, Eds. Singapore: Association Linguistics, Jul. 2022, pp. 4602–4625. [Online]. Available:
for Computational Linguistics, Dec. 2023, pp. 10 859– https://aclanthology.org/2022.naacl-main.341
10 885. [Online]. Available: https://aclanthology.org/ Z. Li, X. Xu, T. Shen, C. Xu, J.-C. Gu, and C. Tao, “Leveraging
2023.findings-emnlp.725 large language models for nlg evaluation: A survey,” 2024.
J. Chen, X. Wang, A. Gao, F. Jiang, S. Chen, H. Zhang, S. Li, J. Chen, Y. Shen, Z. Chen, X. Zhang, Z. Li, H. Wang,
D. Song, W. Xie, C. Kong, J. Li, X. Wan, H. Li, and B. Wang, J. Qian, B. Peng, Y. Mao, W. Chen, and X. Yan, “Explana-
“Huatuogpt-ii, one-stage training for medical adaption tions from large language models make small reasoners
of llms,” CoRR, vol. abs/2311.09774, 2023. [Online]. better,” 2022.
Available: https://doi.org/10.48550/arXiv.2311.09774 N. Ho, L. Schmid, and S. Yun, “Large language models
X. Zhang and Q. Yang, “Xuanyuan 2.0: A large are reasoning teachers,” in ACL (1). Association for
chinese financial chat model with hundreds of billions Computational Linguistics, 2023, pp. 14 852–14 882.
parameters,” in Proceedings of the 32nd ACM International L. C. Magister, J. Mallinson, J. Adamek, E. Malmi,
Conference on Information and Knowledge Management, and A. Severyn, “Teaching small language models to
CIKM 2023, Birmingham, United Kingdom, October 21- reason,” in Proceedings of the 61st Annual Meeting of the
25, 2023, I. Frommholz, F. Hopfgartner, M. Lee, Association for Computational Linguistics (Volume 2: Short
M. Oakes, M. Lalmas, M. Zhang, and R. L. T. Santos, Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki,
Eds. ACM, 2023, pp. 4435–4439. [Online]. Available: Eds. Toronto, Canada: Association for Computational
https://doi.org/10.1145/3583780.3615285 Linguistics, Jul. 2023, pp. 1773–1781. [Online]. Available:
T. Xie, Y. Wan, W. Huang, Z. Yin, Y. Liu, S. Wang, https://aclanthology.org/2023.acl-short.151
Q. Linghu, C. Kit, C. Grazian, W. Zhang, I. Razzak, Y. Fu, H. Peng, L. Ou, A. Sabharwal, and T. Khot, “Specializ-
and B. Hoex, “DARWIN series: Domain specific ing smaller language models towards multi-step reason-
large language models for natural science,” CoRR, ing,” 2023.
vol. abs/2308.13565, 2023. [Online]. Available: https: L. H. Li, J. Hessel, Y. Yu, X. Ren, K.-W. Chang, and Y. Choi,
//doi.org/10.48550/arXiv.2308.13565 “Symbolic chain-of-thought distillation: Small models can
Y. Dan, Z. Lei, Y. Gu, Y. Li, J. Yin, J. Lin, L. Ye, Z. Tie, also “think” step-by-step,” in Proceedings of the 61st Annual
Y. Zhou, Y. Wang, A. Zhou, Z. Zhou, Q. Chen, J. Zhou, Meeting of the Association for Computational Linguistics
L. He, and X. Qiu, “Educhat: A large-scale language (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber,
35

and N. Okazaki, Eds. Toronto, Canada: Association G. Guo, R. Zhao, T. Tang, X. Zhao, and J.-R. Wen, “Beyond
for Computational Linguistics, Jul. 2023, pp. 2665–2679. imitation: Leveraging fine-grained quality signals for
[Online]. Available: https://aclanthology.org/2023.acl- alignment,” in The Twelfth International Conference
long.150 on Learning Representations, 2024. [Online]. Available:
W. Liu, G. Li, K. Zhang, B. Du, Q. Chen, X. Hu, H. Xu, https://openreview.net/forum?id=LNLjU5C5dK
J. Chen, and J. Wu, “Mind’s mirror: Distilling self- Z. Allen-Zhu and Y. Li, “Towards understanding ensemble,
evaluation capability and comprehensive thinking from knowledge distillation and self-distillation in deep learn-
large language models,” 2023. ing,” arXiv preprint arXiv:2012.09816, 2020.
S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, T. Zheng, S. Guo, X. Qu, J. Guo, W. Zhang, X. Du, C. Lin,
D. Zhou, Q. V. Le, B. Zoph, J. Wei et al., “The flan collec- W. Huang, W. Chen, J. Fu et al., “Kun: Answer polish-
tion: Designing data and methods for effective instruction ment for chinese self-alignment with instruction back-
tuning,” arXiv preprint arXiv:2301.13688, 2023. translation,” arXiv preprint arXiv:2401.06477, 2024.
Y. Anand, Z. Nussbaum, B. Duderstadt, B. Schmidt, and X. Li, P. Yu, C. Zhou, T. Schick, O. Levy, L. Zettlemoyer, J. E.
A. Mulyar, “Gpt4all: Training an assistant-style chatbot Weston, and M. Lewis, “Self-alignment with instruction
with large scale data distillation from gpt-3.5-turbo,” backtranslation,” in The Twelfth International Conference
GitHub, 2023. on Learning Representations, 2024. [Online]. Available:
Q. Si, T. Wang, Z. Lin, X. Zhang, Y. Cao, and W. Wang, https://openreview.net/forum?id=1oijHJBRsT
“An empirical study of instruction-tuning large language B. Zhao, H. Hajishirzi, and Q. Cao, “Apt: Adaptive pruning
models in chinese,” in EMNLP (Findings). Association and tuning pretrained language models for efficient train-
for Computational Linguistics, 2023, pp. 4086–4107. ing and inference,” arXiv preprint arXiv:2401.12200, 2024.
Y. Ji, Y. Deng, Y. Gong, Y. Peng, Q. Niu, L. Zhang, B. Ma, and A. Singh, J. D. Co-Reyes, R. Agarwal, A. Anand, P. Patil, P. J.
X. Li, “Exploring the impact of instruction data scaling on Liu, J. Harrison, J. Lee, K. Xu, A. Parisi et al., “Beyond hu-
large language models: An empirical study on real-world man data: Scaling self-training for problem-solving with
use cases,” 2023. language models,” arXiv preprint arXiv:2312.06585, 2023.
M. Wu, A. Waheed, C. Zhang, M. Abdul-Mageed, and A. F. W. Chen, D. Song, and B. Li, “Grath: Gradual self-truthifying
Aji, “Lamini-lm: A diverse herd of distilled models from for large language models,” 2024.
large-scale instructions,” 2023. A. Hosseini, X. Yuan, N. Malkin, A. Courville, A. Sordoni,
W. Guo, J. Yang, K. Yang, X. Li, Z. Rao, Y. Xu, and D. Niu, and R. Agarwal, “V-star: Training verifiers for self-taught
“Instruction fusion: Advancing prompt evolution through reasoners,” 2024.
hybridization,” 2023. A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli,
Y. Yu, Y. Zhuang, J. Zhang, Y. Meng, A. Ratner, R. Krishna, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma,
J. Shen, and C. Zhang, “Large language model as at- N. Elhage, Z. Hatfield-Dodds, D. Hernandez, J. Kernion,
tributed training data generator: A tale of diversity and K. Ndousse, C. Olsson, D. Amodei, T. Brown, J. Clark,
bias,” 2023. S. McCandlish, C. Olah, and J. Kaplan, “A general lan-
F. Wan, X. Huang, D. Cai, X. Quan, W. Bi, and S. Shi, guage assistant as a laboratory for alignment,” 2021.
“Knowledge fusion of large language models,” in The J. Huang, S. Gu, L. Hou, Y. Wu, X. Wang, H. Yu, and
Twelfth International Conference on Learning Representations, J. Han, “Large language models can self-improve,” in
2024. [Online]. Available: https://openreview.net/forum? Proceedings of the 2023 Conference on Empirical Methods
id=jiDsk12qcz in Natural Language Processing, H. Bouamor, J. Pino, and
Q. Zhao and B. Zhu, “Towards the fundamental K. Bali, Eds. Singapore: Association for Computational
limits of knowledge transfer over finite domains,” Linguistics, Dec. 2023, pp. 1051–1068. [Online]. Available:
in NeurIPS 2023 Workshop on Mathematics of Modern https://aclanthology.org/2023.emnlp-main.67
Machine Learning, 2023. [Online]. Available: https: H. Chen, X. Quan, H. Chen, M. Yan, and J. Zhang, “Knowl-
//openreview.net/forum?id=9qxoXqxa0N edge distillation for closed-source language models,”
C. Qin, W. Xia, F. Jiao, and S. Joty, “Improving in-context arXiv preprint arXiv:2401.07013, 2024.
learning via bidirectional alignment,” 2023. I. Sason and S. Verdú, “f -divergence inequalities,” IEEE
N. Boizard, K. El-Haddad, C. Hudelot, and P. Colombo, Transactions on Information Theory, vol. 62, no. 11, pp. 5973–
“Towards cross-tokenizer distillation: the universal logit 6006, 2016.
distillation loss for llms,” arXiv preprint arXiv:2402.12030, S. Sun, Y. Cheng, Z. Gan, and J. Liu, “Patient knowledge
2024. distillation for bert model compression,” 2019.
Q. Zhong, L. Ding, L. Shen, J. Liu, B. Du, and D. Tao, “Revis- Z. Sun, H. Yu, X. Song, R. Liu, Y. Yang, and
iting knowledge distillation for autoregressive language D. Zhou, “MobileBERT: a compact task-agnostic BERT
models,” 2024. for resource-limited devices,” in Proceedings of the 58th
M. Kim, S. Lee, J. Lee, S. Hong, D.-S. Chang, W. Sung, Annual Meeting of the Association for Computational
and J. Choi, “Token-scaled logit distillation for ternary Linguistics, D. Jurafsky, J. Chai, N. Schluter, and
weight generative language models,” arXiv preprint J. Tetreault, Eds. Online: Association for Computational
arXiv:2308.06744, 2023. Linguistics, Jul. 2020, pp. 2158–2170. [Online]. Available:
Z. Chen, K. Zhou, W. X. Zhao, J. Wan, F. Zhang, D. Zhang, https://aclanthology.org/2020.acl-main.195
and J.-R. Wen, “Improving large language models via fine- X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang,
grained reinforcement learning with minimum editing and Q. Liu, “TinyBERT: Distilling BERT for natural
constraint,” 2024. language understanding,” in Findings of the Association for
36

Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, for diverse people? tuning llms via debate to generate
and Y. Liu, Eds. Online: Association for Computational controllable controversial statements,” 2024.
Linguistics, Nov. 2020, pp. 4163–4174. [Online]. Available: M. Kang, S. Lee, J. Baek, K. Kawaguchi, and S. J. Hwang,
https://aclanthology.org/2020.findings-emnlp.372 “Knowledge-augmented reasoning distillation for small
L. Hou, Z. Huang, L. Shang, X. Jiang, X. Chen, and language models in knowledge-intensive tasks,” 2023.
Q. Liu, “Dynabert: Dynamic bert with adaptive width and R. Yang, L. Song, Y. Li, S. Zhao, Y. Ge, X. Li, and Y. Shan,
depth,” Advances in Neural Information Processing Systems, “Gpt4tools: Teaching large language model to use tools
vol. 33, pp. 9782–9793, 2020. via self-instruction,” 2023.
S. Zuo, Q. Zhang, C. Liang, P. He, T. Zhao, and W. Chen, A. Yehudai, B. Carmeli, Y. Mass, O. Arviv, N. Mills,
“Moebert: from bert to mixture-of-experts via importance- A. Toledo, E. Shnarch, and L. Choshen, “Genie: Achieving
guided adaptation,” arXiv preprint arXiv:2204.07675, 2022. human parity in content-grounded datasets generation,”
K. J. Liang, W. Hao, D. Shen, Y. Zhou, W. Chen, C. Chen, and 2024.
L. Carin, “Mixkd: Towards efficient distillation of large- Y. Zhang, R. Zhang, J. Gu, Y. Zhou, N. Lipka, D. Yang, and
scale language models,” in 9th International Conference on T. Sun, “Llavar: Enhanced visual instruction tuning for
Learning Representations, ICLR 2021, Virtual Event, Austria, text-rich image understanding,” 2023.
May 3-7, 2021. OpenReview.net, 2021. [Online]. Available: C. Lyu, M. Wu, L. Wang, X. Huang, B. Liu, Z. Du, S. Shi, and
https://openreview.net/forum?id=UFGEelJkLu5 Z. Tu, “Macaw-llm: Multi-modal language modeling with
Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Ja- image, audio, video, and text integration,” arXiv preprint
yaraman, Y. Zhu, L. Fan, and A. Anandkumar, “Eureka: arXiv:2306.09093, 2023.
Human-level reward design via coding large language B. Li, Y. Zhang, L. Chen, J. Wang, F. Pu, J. Yang, C. Li,
models,” 2023. and Z. Liu, “Mimic-it: Multi-modal in-context instruction
J.-C. Pang, P. Wang, K. Li, X.-H. Chen, J. Xu, Z. Zhang, and tuning,” 2023.
Y. Yu, “Language model self-improvement by reinforce- Z. Zhao, L. Guo, T. Yue, S. Chen, S. Shao, X. Zhu, Z. Yuan,
ment learning contemplation,” 2023. and J. Liu, “Chatbridge: Bridging modalities with large
Y. Du, O. Watkins, Z. Wang, C. Colas, T. Darrell, P. Abbeel, language model as a language catalyst,” 2023.
A. Gupta, and J. Andreas, “Guiding pretraining in Y. Zhao, B. Yu, B. Hui, H. Yu, F. Huang, Y. Li, and N. L.
reinforcement learning with large language models,” Zhang, “A preliminary study of the intrinsic relationship
in Proceedings of the 40th International Conference on between complexity and alignment,” 2023.
Machine Learning, ser. Proceedings of Machine Learning A. Gudibande, E. Wallace, C. Snell, X. Geng, H. Liu,
Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, P. Abbeel, S. Levine, and D. Song, “The false
S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, promise of imitating proprietary llms,” arXiv preprint
23–29 Jul 2023, pp. 8657–8677. [Online]. Available: arXiv:2305.15717, 2023.
https://proceedings.mlr.press/v202/du23f.html C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma,
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and A. Efrat, P. Yu, L. YU, S. Zhang, G. Ghosh, M. Lewis,
O. Klimov, “Proximal policy optimization algorithms,” L. Zettlemoyer, and O. Levy, “LIMA: Less is more
2017. for alignment,” in Thirty-seventh Conference on Neural
R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Man- Information Processing Systems, 2023. [Online]. Available:
ning, and C. Finn, “Direct preference optimization: Your https://openreview.net/forum?id=KBMOKmX2he
language model is secretly a reward model,” 2023. M. Li, Y. Zhang, S. He, Z. Li, H. Zhao, J. Wang, N. Cheng,
F. Song, B. Yu, M. Li, H. Yu, F. Huang, Y. Li, and H. Wang, and T. Zhou, “Superfiltering: Weak-to-strong data filtering
“Preference ranking optimization for human alignment,” for fast instruction-tuning,” 2024. [Online]. Available:
arXiv preprint arXiv:2306.17492, 2023. https://api.semanticscholar.org/CorpusID:267365346
Z. Yuan, H. Yuan, C. Tan, W. Wang, S. Huang, and B. Xu, A. Yang, J. Lin, Q. Wang, C. Zhou, Y. Zhang, and
F. Huang, “Rrhf: Rank responses to align language mod- Z. Mao, “Expertprompting: Instructing large language
els with human feedback without tears,” arXiv preprint models to be distinguished experts,” 2023.
arXiv:2304.05302, 2023. W. Liu, W. Zeng, K. He, Y. Jiang, and J. He, “What makes
M. Li, L. Chen, J. Chen, S. He, and T. Zhou, good data for alignment? a comprehensive study of auto-
“Reflection-tuning: Recycling data for better instruction- matic data selection in instruction tuning,” 2023.
tuning,” in NeurIPS 2023 Workshop on Instruction Tuning R. Lou, K. Zhang, J. Xie, Y. Sun, J. Ahn, H. Xu, Y. Su, and
and Instruction Following, 2023. [Online]. Available: W. Yin, “Muffin: Curating multi-faceted instructions for
https://openreview.net/forum?id=xaqoZZqkPU improving instruction-following,” 2023.
M. Li, L. Chen, J. Chen, S. He, J. Gu, and T. Zhou, “Selective T. Schick, J. Dwivedi-Yu, Z. Jiang, F. Petroni, P. Lewis,
reflection-tuning: Student-selected data recycling for G. Izacard, Q. You, C. Nalmpantis, E. Grave, and S. Riedel,
llm instruction-tuning,” 2024. [Online]. Available: https: “Peer: A collaborative language model,” 2022.
//api.semanticscholar.org/CorpusID:267682220 A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao,
X. Geng, A. Gudibande, H. Liu, E. Wallace, P. Abbeel, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang,
S. Levine, and D. Song, “Koala: A dialogue model S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yaz-
for academic research,” Blog post, April 2023. [Online]. danbakhsh, and P. Clark, “Self-refine: Iterative refinement
Available: https://bair.berkeley.edu/blog/2023/04/03/ with self-feedback,” 2023.
koala/ W. Saunders, C. Yeh, J. Wu, S. Bills, L. Ouyang, J. Ward,
M. Li, J. Chen, L. Chen, and T. Zhou, “Can llms speak and J. Leike, “Self-critiquing models for assisting human
37

evaluators,” 2022. findings-naacl.18


D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, A. Glaese, N. McAleese, M. Trebacz, J. Aslanides, V. Firoiu,
D. Amodei, P. Christiano, and G. Irving, “Fine-tuning T. Ewalds, M. Rauh, L. Weidinger, M. Chadwick,
language models from human preferences,” arXiv preprint P. Thacker et al., “Improving alignment of dialogue
arXiv:1909.08593, 2019. agents via targeted human judgements,” arXiv preprint
N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. Voss, arXiv:2209.14375, 2022.
A. Radford, D. Amodei, and P. F. Christiano, “Learning H. Sun, Z. Zhang, F. Mi, Y. Wang, W. Liu, J. Cui, B. Wang,
to summarize with human feedback,” Advances in Neu- Q. Liu, and M. Huang, “MoralDial: A framework to
ral Information Processing Systems, vol. 33, pp. 3008–3021, train and evaluate moral dialogue systems via moral
2020. discussions,” in Proceedings of the 61st Annual Meeting
J. Wu, L. Ouyang, D. M. Ziegler, N. Stiennon, R. Lowe, of the Association for Computational Linguistics (Volume 1:
J. Leike, and P. Christiano, “Recursively summarizing Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki,
books with human feedback,” 2021. Eds. Toronto, Canada: Association for Computational
Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. Das- Linguistics, Jul. 2023, pp. 2213–2230. [Online]. Available:
Sarma, D. Drain, S. Fort, D. Ganguli, T. Henighan et al., https://aclanthology.org/2023.acl-long.123
“Training a helpful and harmless assistant with rein- J. Yao, X. Yi, X. Wang, J. Wang, and X. Xie, “From instructions
forcement learning from human feedback,” arXiv preprint to intrinsic human values – a survey of alignment goals
arXiv:2204.05862, 2022. for big models,” 2023.
A. Köpf, Y. Kilcher, D. von Rütte, S. Anagnostidis, Z.-R. Tam, Y. Liu, Y. Yao, J.-F. Ton, X. Zhang, R. G. H. Cheng,
K. Stevens, A. Barhoum, N. M. Duc, O. Stanley, R. Nagyfi, Y. Klochkov, M. F. Taufiq, and H. Li, “Trustworthy llms:
S. ES, S. Suri, D. Glushkov, A. Dantuluri, A. Maguire, a survey and guideline for evaluating large language
C. Schuhmann, H. Nguyen, and A. Mattick, “Openassis- models’ alignment,” arXiv preprint arXiv:2308.05374, 2023.
tant conversations – democratizing large language model J. Qian, H. Wang, Z. Li, S. Li, and X. Yan, “Limitations of
alignment,” 2023. language models in arithmetic and symbolic induction,”
G. Wang, S. Cheng, X. Zhan, X. Li, S. Song, and Y. Liu, 2022.
“Openchat: Advancing open-source language models X. She, Y. Liu, Y. Zhao, Y. He, L. Li, C. Tantithamthavorn,
with mixed-quality data,” 2023. Z. Qin, and H. Wang, “Pitfalls in language models for
L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato, P.- code intelligence: A taxonomy and survey,” 2023.
S. Huang, M. Cheng, M. Glaese, B. Balle, A. Kasirzadeh, H. Manikandan, Y. Jiang, and J. Z. Kolter, “Language models
Z. Kenton, S. Brown, W. Hawkins, T. Stepleton, C. Biles, are weak learners,” 2023.
A. Birhane, J. Haas, L. Rimell, L. A. Hendricks, W. Isaac, Y. Liang, C. Wu, T. Song, W. Wu, Y. Xia, Y. Liu, Y. Ou,
S. Legassick, G. Irving, and I. Gabriel, “Ethical and social S. Lu, L. Ji, S. Mao, Y. Wang, L. Shou, M. Gong, and
risks of harm from language models,” 2021. N. Duan, “Taskmatrix.ai: Completing tasks by connecting
J. Ji, M. Liu, J. Dai, X. Pan, C. Zhang, C. Bian, C. Zhang, foundation models with millions of apis,” 2023.
R. Sun, Y. Wang, and Y. Yang, “Beavertails: Towards G. Mialon, R. Dessı̀, M. Lomeli, C. Nalmpantis, R. Pa-
improved safety alignment of llm via a human-preference sunuru, R. Raileanu, B. Rozière, T. Schick, J. Dwivedi-
dataset,” 2023. Yu, A. Celikyilmaz, E. Grave, Y. LeCun, and T. Scialom,
I. Solaiman and C. Dennison, “Process for adapting lan- “Augmented language models: a survey,” 2023.
guage models to society (palms) with values-targeted A. Parisi, Y. Zhao, and N. Fiedel, “Talm: Tool augmented
datasets,” Advances in Neural Information Processing Sys- language models,” 2022.
tems, vol. 34, pp. 5861–5873, 2021. R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim,
L. Qiu, Y. Zhao, J. Li, P. Lu, B. Peng, J. Gao, and S.-C. C. Hesse, S. Jain, V. Kosaraju, W. Saunders, X. Jiang,
Zhu, “Valuenet: A new dataset for human value driven K. Cobbe, T. Eloundou, G. Krueger, K. Button, M. Knight,
dialogue system,” in Proceedings of the AAAI Conference B. Chess, and J. Schulman, “Webgpt: Browser-assisted
on Artificial Intelligence, vol. 36, no. 10, 2022, pp. 11 183– question-answering with human feedback,” 2022.
11 191. Y. Qin, Z. Cai, D. Jin, L. Yan, S. Liang, K. Zhu,
J. Kiesel, M. Alshomary, N. Handke, X. Cai, H. Wachsmuth, Y. Lin, X. Han, N. Ding, H. Wang, R. Xie, F. Qi,
and B. Stein, “Identifying the human values behind Z. Liu, M. Sun, and J. Zhou, “WebCPM: Interactive
arguments,” in Proceedings of the 60th Annual Meeting web search for Chinese long-form question answering,”
of the Association for Computational Linguistics (Volume 1: in Proceedings of the 61st Annual Meeting of the
Long Papers), S. Muresan, P. Nakov, and A. Villavicencio, Association for Computational Linguistics (Volume 1: Long
Eds. Dublin, Ireland: Association for Computational Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki,
Linguistics, May 2022, pp. 4459–4471. [Online]. Available: Eds. Toronto, Canada: Association for Computational
https://aclanthology.org/2022.acl-long.306 Linguistics, Jul. 2023, pp. 8968–8988. [Online]. Available:
R. Liu, G. Zhang, X. Feng, and S. Vosoughi, “Aligning https://aclanthology.org/2023.acl-long.499
generative language models with human values,” in Y. Song, W. Xiong, D. Zhu, W. Wu, H. Qian, M. Song,
Findings of the Association for Computational Linguistics: H. Huang, C. Li, K. Wang, R. Yao, Y. Tian, and S. Li,
NAACL 2022, M. Carpuat, M.-C. de Marneffe, and I. V. “Restgpt: Connecting large language models with real-
Meza Ruiz, Eds. Seattle, United States: Association world restful apis,” 2023.
for Computational Linguistics, Jul. 2022, pp. 241– T. Cai, X. Wang, T. Ma, X. Chen, and D. Zhou, “Large
252. [Online]. Available: https://aclanthology.org/2022. language models as tool makers,” 2023.
38

Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang, S. Kambhampati, “On the planning abilities of
“Hugginggpt: Solving ai tasks with chatgpt and its friends large language models - a critical investigation,”
in hugging face,” 2023. in Thirty-seventh Conference on Neural Informa-
S. Hao, T. Liu, Z. Wang, and Z. Hu, “Toolkengpt: Augment- tion Processing Systems, 2023. [Online]. Available:
ing frozen language models with massive tools via tool https://openreview.net/forum?id=X6dEqXIsEW
embeddings,” 2024. T. Sumers, K. Marino, A. Ahuja, R. Fergus, and I. Dasgupta,
S. Yuan, K. Song, J. Chen, X. Tan, Y. Shen, R. Kan, D. Li, “Distilling internet-scale vision-language models into em-
and D. Yang, “Easytool: Enhancing llm-based agents with bodied agents,” in Proceedings of the 40th International
concise tool instruction,” 2024. Conference on Machine Learning, ser. ICML’23. JMLR.org,
S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, 2023.
C. Dewan, M. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, Y. Yang, T. Zhou, K. Li, D. Tao, L. Li, L. Shen, X. He, J. Jiang,
S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, and Y. Shi, “Embodied multi-modal agent trained by an
T. Wang, and L. Zettlemoyer, “Opt: Open pre-trained llm from a parallel textworld,” 2023.
transformer language models,” 2022. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones,
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Ka- A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention
plan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, is all you need,” Advances in neural information processing
A. Askell et al., “Language models are few-shot learners,” systems, vol. 30, 2017.
Advances in neural information processing systems, vol. 33, Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy,
pp. 1877–1901, 2020. M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A
W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Lan- robustly optimized bert pretraining approach,” 2019.
guage models as zero-shot planners: Extracting actionable J. Li, L. Gui, Y. Zhou, D. West, C. Aloisi, and Y. He, “Dis-
knowledge for embodied agents,” in International Confer- tilling chatgpt for explainable automated student answer
ence on Machine Learning. PMLR, 2022, pp. 9118–9147. assessment,” in EMNLP (Findings). Association for Com-
I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Trem- putational Linguistics, 2023, pp. 6007–6026.
blay, D. Fox, J. Thomason, and A. Garg, “Progprompt: R. Tang, X. Han, X. Jiang, and X. Hu, “Does synthetic
Generating situated robot task plans using large language data generation of llms help clinical text mining?” arXiv
models,” 2022. preprint arXiv:2303.04360, 2023.
D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, X. He, I. Nassar, J. Kiros, G. Haffari, and M. Norouzi,
D. Schuurmans, C. Cui, O. Bousquet, Q. Le, and E. Chi, “Generate, annotate, and learn: NLP with synthetic text,”
“Least-to-most prompting enables complex reasoning in Trans. Assoc. Comput. Linguistics, vol. 10, pp. 826–842,
large language models,” 2023. 2022. [Online]. Available: https://transacl.org/ojs/index.
C. H. Song, J. Wu, C. Washington, B. M. Sadler, W.-L. Chao, php/tacl/article/view/3811
and Y. Su, “Llm-planner: Few-shot grounded planning for Y. Meng, J. Huang, Y. Zhang, and J. Han,
embodied agents with large language models,” in Proceed- “Generating training data with language models:
ings of the IEEE/CVF International Conference on Computer Towards zero-shot language understanding,” in
Vision, 2023, pp. 2998–3009. Advances in Neural Information Processing Systems 35:
Z. Wang, S. Cai, A. Liu, X. Ma, and Y. Liang, “Describe, Annual Conference on Neural Information Processing
explain, plan and select: Interactive planning with large Systems 2022, NeurIPS 2022, New Orleans, LA,
language models enables open-world multi-task agents,” USA, November 28 - December 9, 2022, 2022.
arXiv preprint arXiv:2302.01560, 2023. [Online]. Available: http://papers.nips.cc/paper files/
S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, paper/2022/hash/0346c148ba1c21c6b4780a961ea141dc-
and K. Narasimhan, “Tree of thoughts: Deliberate prob- Abstract-Conference.html
lem solving with large language models,” arXiv preprint J. Wang, Z. Yao, A. Mitra, S. Osebe, Z. Yang, and H. Yu,
arXiv:2305.10601, 2023. “UMASS BioNLP at MEDIQA-chat 2023: Can LLMs
B. Liu, Y. Jiang, X. Zhang, Q. Liu, S. Zhang, J. Biswas, generate high-quality synthetic note-oriented doctor-
and P. Stone, “Llm+ p: Empowering large language mod- patient conversations?” in Proceedings of the 5th Clinical
els with optimal planning proficiency,” arXiv preprint Natural Language Processing Workshop, T. Naumann,
arXiv:2304.11477, 2023. A. Ben Abacha, S. Bethard, K. Roberts, and A. Rumshisky,
S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, and Eds. Toronto, Canada: Association for Computational
Z. Hu, “Reasoning with language model is planning with Linguistics, Jul. 2023, pp. 460–471. [Online]. Available:
world model,” arXiv preprint arXiv:2305.14992, 2023. https://aclanthology.org/2023.clinicalnlp-1.49
M. Hu, Y. Mu, X. Yu, M. Ding, S. Wu, W. Shao, Q. Chen, Z. Yang, S. Cherian, and S. Vucetic, “Data augmentation
B. Wang, Y. Qiao, and P. Luo, “Tree-planner: Efficient for radiology report simplification,” in Findings of the
close-loop task planning with large language models,” Association for Computational Linguistics: EACL 2023,
arXiv preprint arXiv:2310.08582, 2023. A. Vlachos and I. Augenstein, Eds. Dubrovnik,
B. Y. Lin, C. Huang, Q. Liu, W. Gu, S. Sommerer, and Croatia: Association for Computational Linguistics,
X. Ren, “On grounded planning for embodied tasks with May 2023, pp. 1922–1932. [Online]. Available: https:
language models,” in Proceedings of the AAAI Conference //aclanthology.org/2023.findings-eacl.144
on Artificial Intelligence, vol. 37, no. 11, 2023, pp. 13 192– Z. Cai, C. Tao, T. Shen, C. Xu, X. Geng, X. A. Lin, L. He, and
13 200. D. Jiang, “Hyper: Multitask hyper-prompted training en-
K. Valmeekam, M. Marquez, S. Sreedharan, and ables large-scale retrieval generalization,” in The Eleventh
39

International Conference on Learning Representations, 2022. language models as efficient dataset generators for infor-
C. Liu, C. Tao, X. Geng, T. Shen, D. Zhao, C. Xu, B. Jiao, mation retrieval,” arXiv preprint arXiv:2301.01820, 2023.
and D. Jiang, “Adam: Dense retrieval distillation with W. Sun, Z. Chen, X. Ma, L. Yan, S. Wang, P. Ren, Z. Chen,
adaptive dark examples,” arXiv preprint arXiv:2212.10192, D. Yin, and Z. Ren, “Instruction distillation makes large
2022. language models efficient zero-shot rankers,” 2023.
J. Feng, C. Tao, X. Geng, T. Shen, C. Xu, G. Long, D. Zhao, C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
and D. Jiang, “Knowledge refinement via interaction be- M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring
tween search engines and large language models,” arXiv the limits of transfer learning with a unified text-to-text
preprint arXiv:2305.07402, 2023. transformer,” J. Mach. Learn. Res., vol. 21, no. 1, jan 2020.
T. Shen, G. Long, X. Geng, C. Tao, T. Zhou, and D. Jiang, S. Bruch, X. Wang, M. Bendersky, and M. Najork, “An
“Large language models are strong zero-shot retriever,” analysis of the softmax cross entropy loss for learning-
arXiv preprint arXiv:2304.14233, 2023. to-rank with binary relevance,” in Proceedings of the
X. Ma, X. Zhang, R. Pradeep, and J. Lin, “Zero-shot listwise 2019 ACM SIGIR International Conference on Theory of
document reranking with a large language model,” 2023. Information Retrieval, ICTIR 2019, Santa Clara, CA, USA,
Z. Qin, R. Jagerman, K. Hui, H. Zhuang, J. Wu, J. Shen, October 2-5, 2019, 2019, pp. 75–78. [Online]. Available:
T. Liu, J. Liu, D. Metzler, X. Wang, and M. Bendersky, https://doi.org/10.1145/3341981.3344221
“Large language models are effective text rankers with C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds,
pairwise ranking prompting,” 2023. N. Hamilton, and G. Hullender, “Learning to rank
X. Ma, Y. Gong, P. He, H. Zhao, and N. Duan, “Query using gradient descent,” in Proceedings of the 22nd
rewriting in retrieval-augmented large language models,” International Conference on Machine Learning, ser. ICML
in Proceedings of the 2023 Conference on Empirical Methods ’05. New York, NY, USA: Association for Computing
in Natural Language Processing, H. Bouamor, J. Pino, and Machinery, 2005, p. 89–96. [Online]. Available: https:
K. Bali, Eds. Singapore: Association for Computational //doi.org/10.1145/1102351.1102363
Linguistics, Dec. 2023, pp. 5303–5315. [Online]. Available: X. Wang, C. Li, N. Golbandi, M. Bendersky, and M. Najork,
https://aclanthology.org/2023.emnlp-main.322 “The lambdaloss framework for ranking metric
D. Sachan, M. Lewis, M. Joshi, A. Aghajanyan, W.- optimization,” in Proceedings of the 27th ACM International
t. Yih, J. Pineau, and L. Zettlemoyer, “Improving Conference on Information and Knowledge Management,
passage retrieval with zero-shot question generation,” ser. CIKM ’18. New York, NY, USA: Association for
in Proceedings of the 2022 Conference on Empirical Computing Machinery, 2018, p. 1313–1322. [Online].
Methods in Natural Language Processing, Y. Goldberg, Available: https://doi.org/10.1145/3269206.3271784
Z. Kozareva, and Y. Zhang, Eds. Abu Dhabi, W. Wang, X. Lin, F. Feng, X. He, and T.-S. Chua, “Generative
United Arab Emirates: Association for Computational recommendation: Towards next-generation recommender
Linguistics, Dec. 2022, pp. 3781–3797. [Online]. Available: paradigm,” 2023.
https://aclanthology.org/2022.emnlp-main.249 S. Dai, N. Shao, H. Zhao, W. Yu, Z. Si, C. Xu, Z. Sun,
D. S. Sachan, M. Lewis, D. Yogatama, L. Zettlemoyer, X. Zhang, and J. Xu, “Uncovering chatgpt’s capabilities
J. Pineau, and M. Zaheer, “Questions are all you need in recommender systems,” in Proceedings of the 17th
to train a dense passage retriever,” Transactions of the ACM Conference on Recommender Systems, ser. RecSys
Association for Computational Linguistics, vol. 11, pp. ’23. New York, NY, USA: Association for Computing
600–616, 2023. [Online]. Available: https://aclanthology. Machinery, 2023, p. 1126–1132. [Online]. Available:
org/2023.tacl-1.35 https://doi.org/10.1145/3604915.3610646
T. Schick and H. Schütze, “Generating datasets with Y. Xi, W. Liu, J. Lin, X. Cai, H. Zhu, J. Zhu, B. Chen,
pretrained language models,” in Proceedings of the 2021 R. Tang, W. Zhang, R. Zhang, and Y. Yu, “Towards open-
Conference on Empirical Methods in Natural Language world recommendation with knowledge augmentation
Processing, M.-F. Moens, X. Huang, L. Specia, and from large language models,” 2023.
S. W.-t. Yih, Eds. Online and Punta Cana, Dominican X. Ren, W. Wei, L. Xia, L. Su, S. Cheng, J. Wang, D. Yin, and
Republic: Association for Computational Linguistics, C. Huang, “Representation learning with large language
Nov. 2021, pp. 6943–6951. [Online]. Available: https: models for recommendation,” 2023.
//aclanthology.org/2021.emnlp-main.555 W. Wei, X. Ren, J. Tang, Q. Wang, L. Su, S. Cheng, J. Wang,
Z. Peng, X. Wu, and Y. Fang, “Soft prompt tuning for D. Yin, and C. Huang, “Llmrec: Large language models
augmenting dense retrieval with large language models,” with graph augmentation for recommendation,” 2024.
arXiv preprint arXiv:2307.08303, 2023. L. Wang, S. Zhang, Y. Wang, E.-P. Lim, and Y. Wang,
J. Saad-Falcon, O. Khattab, K. Santhanam, R. Florian, “LLM4Vis: Explainable visualization recommendation
M. Franz, S. Roukos, A. Sil, M. A. Sultan, and C. Potts, using ChatGPT,” in Proceedings of the 2023 Conference
“UDAPDR: unsupervised domain adaptation via LLM on Empirical Methods in Natural Language Processing:
prompting and distillation of rerankers,” in Proceedings Industry Track, M. Wang and I. Zitouni, Eds. Singapore:
of the 2023 Conference on Empirical Methods in Natural Association for Computational Linguistics, Dec. 2023, pp.
Language Processing, EMNLP 2023, Singapore, December 675–692. [Online]. Available: https://aclanthology.org/
6-10, 2023, 2023, pp. 11 265–11 279. [Online]. Available: 2023.emnlp-industry.64
https://aclanthology.org/2023.emnlp-main.693 Z. Cui, J. Ma, C. Zhou, J. Zhou, and H. Yang, “M6-rec:
V. Jeronymo, L. Bonifacio, H. Abonizio, M. Fadaee, Generative pretrained language models are open-ended
R. Lotufo, J. Zavrel, and R. Nogueira, “Inpars-v2: Large recommender systems,” 2022.
40

P. Liu, L. Zhang, and J. A. Gulla, “Pre-train, prompt and 2023.


recommendation: A comprehensive survey of language Y. Li, Z. Li, K. Zhang, R. Dan, S. Jiang, and Y. Zhang,
modelling paradigm adaptations in recommender sys- “Chatdoctor: A medical chat model fine-tuned on a large
tems,” 2023. language model meta-ai (llama) using medical domain
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a knowledge,” Cureus, vol. 15, no. 6, 2023.
method for automatic evaluation of machine translation,” T. Han, L. C. Adams, J. Papaioannou, P. Grundmann,
in Proceedings of the 40th Annual Meeting on Association for T. Oberhauser, A. Löser, D. Truhn, and K. K.
Computational Linguistics, ser. ACL ’02. USA: Association Bressem, “Medalpaca - an open-source collection of
for Computational Linguistics, 2002, p. 311–318. [Online]. medical conversational AI models and training data,”
Available: https://doi.org/10.3115/1073083.1073135 CoRR, vol. abs/2304.08247, 2023. [Online]. Available:
C.-Y. Lin, “ROUGE: A package for automatic evaluation https://doi.org/10.48550/arXiv.2304.08247
of summaries,” in Text Summarization Branches Out. C. Wu, W. Lin, X. Zhang, Y. Zhang, Y. Wang, and W. Xie,
Barcelona, Spain: Association for Computational “Pmc-llama: Towards building open-source language
Linguistics, Jul. 2004, pp. 74–81. [Online]. Available: models for medicine,” arXiv preprint arXiv:2305.10415,
https://aclanthology.org/W04-1013 vol. 6, 2023.
C. Su and C. McMillan, “Distilled GPT for source Z. Bao, W. Chen, S. Xiao, K. Ren, J. Wu, C. Zhong,
code summarization,” CoRR, vol. abs/2308.14731, 2023. J. Peng, X. Huang, and Z. Wei, “Disc-medllm: Bridging
[Online]. Available: https://doi.org/10.48550/arXiv.2308. general large language models and real-world medical
14731 consultation,” CoRR, vol. abs/2308.14346, 2023. [Online].
W. Guo, J. Yang, K. Yang, X. Li, Z. Rao, Y. Xu, and D. Niu, Available: https://doi.org/10.48550/arXiv.2308.14346
“Instruction fusion: Advancing prompt evolution through Z. Gou, Z. Shao, Y. Gong, yelong shen, Y. Yang,
hybridization,” CoRR, vol. abs/2312.15692, 2023. [Online]. M. Huang, N. Duan, and W. Chen, “ToRA: A tool-
Available: https://doi.org/10.48550/arXiv.2312.15692 integrated reasoning agent for mathematical problem
O. Sener and S. Savarese, “Active learning for convolutional solving,” in The Twelfth International Conference on
neural networks: A core-set approach,” in 6th International Learning Representations, 2024. [Online]. Available: https:
Conference on Learning Representations, ICLR 2018, //openreview.net/forum?id=Ep0TtjVoap
Vancouver, BC, Canada, April 30 - May 3, 2018, E. Perkowski, R. Pan, T. D. Nguyen, Y. Ting, S. Kruk,
Conference Track Proceedings, 2018. [Online]. Available: T. Zhang, C. O’Neill, M. Jablonska, Z. Sun, M. J.
https://openreview.net/forum?id=H1aIuk-RW Smith, H. Liu, K. Schawinski, K. Iyer, I. Ciuca,
H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with and UniverseTBD, “Astrollama-chat: Scaling astrollama
visual instruction tuning,” 2023. with conversational and diverse datasets,” CoRR,
S. Zhang, P. Sun, S. Chen, M. Xiao, W. Shao, W. Zhang, vol. abs/2401.01916, 2024. [Online]. Available: https:
Y. Liu, K. Chen, and P. Luo, “Gpt4roi: Instruction tuning //doi.org/10.48550/arXiv.2401.01916
large language model on region-of-interest,” 2023. J. Gao, R. Pi, J. Zhang, J. Ye, W. Zhong, Y. Wang, L. Hong,
OpenAI, “Gpt-4v(ision) system card,” 2023. [Online]. J. Han, H. Xu, Z. Li, and L. Kong, “G-llava: Solving
Available: https://api.semanticscholar.org/CorpusID: geometric problem with multi-modal large language
263218031 model,” CoRR, vol. abs/2312.11370, 2023. [Online].
B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, Available: https://doi.org/10.48550/arXiv.2312.11370
J. Hockenmaier, and S. Lazebnik, “Flickr30k entities: H. Zhao, S. Liu, C. Ma, H. Xu, J. Fu, Z.-H. Deng, L. Kong,
Collecting region-to-phrase correspondences for richer and Q. Liu, “GIMLET: A unified graph-text model
image-to-sentence models,” in Proceedings of the IEEE in- for instruction-based molecule zero-shot learning,” in
ternational conference on computer vision, 2015, pp. 2641– Thirty-seventh Conference on Neural Information Processing
2649. Systems, 2023. [Online]. Available: https://openreview.
L. Li, Z. Xie, M. Li, S. Chen, P. Wang, L. Chen, Y. Yang, net/forum?id=Tt6DrRCgJV
B. Wang, and L. Kong, “Silkie: Preference distilla- A. N. Rubungo, C. Arnold, B. P. Rand, and A. B. Dieng,
tion for large visual language models,” arXiv preprint “Llm-prop: Predicting physical and electronic properties
arXiv:2312.10665, 2023. of crystalline solids from their text descriptions,”
H. Ha, P. Florence, and S. Song, “Scaling up and distilling CoRR, vol. abs/2310.14029, 2023. [Online]. Available:
down: Language-guided robot skill acquisition,” in Con- https://doi.org/10.48550/arXiv.2310.14029
ference on Robot Learning. PMLR, 2023, pp. 3766–3777. H. Cao, Z. Liu, X. Lu, Y. Yao, and Y. Li, “Instructmol:
S. Wu, Z. Liu, Z. Zhang, Z. Chen, W. Deng, W. Zhang, Multi-modal integration for building a versatile and
J. Yang, Z. Yao, Y. Lyu, X. Xin, S. Gao, P. Ren, Z. Ren, reliable molecular assistant in drug discovery,” CoRR,
and Z. Chen, “fuzi.mingcha,” https://github.com/irlab- vol. abs/2311.16208, 2023. [Online]. Available: https:
sdu/fuzi.mingcha, 2023. //doi.org/10.48550/arXiv.2311.16208
H. Xiong, S. Wang, Y. Zhu, Z. Zhao, Y. Liu, Q. Wang, and H. Abdine, M. Chatzianastasis, C. Bouyioukos, and
D. Shen, “Doctorglm: Fine-tuning your chinese doctor M. Vazirgiannis, “Prot2text: Multimodal protein’s
is not a herculean task,” arXiv preprint arXiv:2304.01097, function generation with GNNs and transform-
2023. ers,” in Deep Generative Models for Health
X. Zhang, C. Tian, X. Yang, L. Chen, Z. Li, and L. R. Pet- Workshop NeurIPS 2023, 2023. [Online]. Available:
zold, “Alpacare: Instruction-tuned large language models https://openreview.net/forum?id=EJ7YNgWYFj
for medical application,” arXiv preprint arXiv:2310.14558, Y. Luo, J. Zhang, S. Fan, K. Yang, Y. Wu, M. Qiao,
41

and Z. Nie, “Biomedgpt: Open multimodal generative C. Wu, X. Zhang, Y. Zhang, Y. Wang, and W. Xie, “Pmc-
pre-trained transformer for biomedicine,” arXiv preprint llama: Further finetuning llama on medical papers,”
arXiv:2308.09442, 2023. CoRR, vol. abs/2304.14454, 2023. [Online]. Available:
B. Chen, X. Cheng, P. Li, Y. Geng, J. Gong, S. Li, Z. Bei, https://doi.org/10.48550/arXiv.2304.14454
X. Tan, B. Wang, X. Zeng, C. Liu, A. Zeng, Y. Dong, Z. Bao, W. Chen, S. Xiao, K. Ren, J. Wu, C. Zhong, J. Peng,
J. Tang, and L. Song, “xtrimopglm: Unified 100b-scale X. Huang, and Z. Wei, “Disc-medllm: Bridging general
pre-trained transformer for deciphering the language large language models and real-world medical consulta-
of protein,” CoRR, vol. abs/2401.06199, 2024. [Online]. tion,” arXiv preprint arXiv:2308.14346, 2023.
Available: https://doi.org/10.48550/arXiv.2401.06199 S. Xue, F. Zhou, Y. Xu, H. Zhao, S. Xie, Q. Dai,
C. Deng, T. Zhang, Z. He, Y. Xu, Q. Chen, Y. Shi, L. Fu, C. Jiang, J. Zhang, J. Zhou, D. Xiu, and H. Mei,
W. Zhang, X. Wang, C. Zhou, Z. Lin, and J. He, “K2: “Weaverbird: Empowering financial decision-making
A foundation language model for geoscience knowledge with large language model, knowledge base, and search
understanding and utilization,” 2023. engine,” CoRR, vol. abs/2308.05361, 2023. [Online].
Z. Bi, N. Zhang, Y. Xue, Y. Ou, D. Ji, G. Zheng, and Available: https://doi.org/10.48550/arXiv.2308.05361
H. Chen, “Oceangpt: A large language model for ocean S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze,
science tasks,” CoRR, vol. abs/2310.02031, 2023. [Online]. S. Gehrmann, P. Kambadur, D. S. Rosenberg, and
Available: https://doi.org/10.48550/arXiv.2310.02031 G. Mann, “Bloomberggpt: A large language model
Z. Zheng, J. Zhang, T. Vu, S. Diao, Y. H. W. Tim, and for finance,” CoRR, vol. abs/2303.17564, 2023. [Online].
S. Yeung, “Marinegpt: Unlocking secrets of ocean to Available: https://doi.org/10.48550/arXiv.2303.17564
the public,” CoRR, vol. abs/2310.13596, 2023. [Online]. D. Lu, H. Wu, J. Liang, Y. Xu, Q. He, Y. Geng, M. Han,
Available: https://doi.org/10.48550/arXiv.2310.13596 Y. Xin, and Y. Xiao, “Bbt-fin: Comprehensive construction
Z. Lin, C. Deng, L. Zhou, T. Zhang, Y. Xu, Y. Xu, Z. He, of chinese financial domain pre-trained language model,
Y. Shi, B. Dai, Y. Song, B. Zeng, Q. Chen, T. Shi, corpus and benchmark,” CoRR, vol. abs/2302.09432, 2023.
T. Huang, Y. Xu, S. Wang, L. Fu, W. Zhang, J. He, [Online]. Available: https://doi.org/10.48550/arXiv.2302.
C. Ma, Y. Zhu, X. Wang, and C. Zhou, “Geogalactica: 09432
A scientific large language model in geoscience,” Y. Yang, Y. Tang, and K. Y. Tam, “Investlm: A large language
CoRR, vol. abs/2401.00434, 2024. [Online]. Available: model for investment using financial domain instruction
https://doi.org/10.48550/arXiv.2401.00434 tuning,” CoRR, vol. abs/2309.13064, 2023. [Online].
D. Zhang, A. Petrova, D. Trautmann, and F. Schilder, “Un- Available: https://doi.org/10.48550/arXiv.2309.13064
leashing the power of large language models for legal Q. Xie, W. Han, X. Zhang, Y. Lai, M. Peng, A. Lopez-
applications,” in Proceedings of the 32nd ACM International Lira, and J. Huang, “PIXIU: A large language model,
Conference on Information and Knowledge Management, 2023, instruction data and evaluation benchmark for finance,”
pp. 5257–5258. CoRR, vol. abs/2306.05443, 2023. [Online]. Available:
Z. Sun, “A short survey of viewing large language models https://doi.org/10.48550/arXiv.2306.05443
in legal aspect,” arXiv preprint arXiv:2303.09136, 2023. N. Wang, H. Yang, and C. D. Wang, “Fingpt: Instruction
J. Lai, W. Gan, J. Wu, Z. Qi, and P. S. Yu, “Large language tuning benchmark for open-source large language models
models in law: A survey,” arXiv preprint arXiv:2312.03718, in financial datasets,” CoRR, vol. abs/2310.04793, 2023.
2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.
S. Yue, W. Chen, S. Wang, B. Li, C. Shen, S. Liu, Y. Zhou, 04793
Y. Xiao, S. Yun, W. Lin et al., “Disc-lawllm: Fine-tuning R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn,
large language models for intelligent legal services,” arXiv E. Saravia, A. Poulton, V. Kerkez, and R. Stojnic,
preprint arXiv:2309.11325, 2023. “Galactica: A large language model for science,”
H. Zhong, C. Xiao, C. Tu, T. Zhang, Z. Liu, and M. Sun, CoRR, vol. abs/2211.09085, 2022. [Online]. Available:
“Jec-qa: a legal-domain question answering dataset,” in https://doi.org/10.48550/arXiv.2211.09085
Proceedings of the AAAI Conference on Artificial Intelligence, J. Yin, S. Dash, F. Wang, and M. Shankar, “FORGE:
vol. 34, no. 05, 2020, pp. 9701–9708. pre-training open foundation models for science,”
K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, in Proceedings of the International Conference for High
L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis, D. Neal, Performance Computing, Networking, Storage and Analysis,
M. Schaekermann, A. Wang, M. Amin, S. Lachgar, P. A. SC 2023, Denver, CO, USA, November 12-17, 2023,
Mansfield, S. Prakash, B. Green, E. Dominowska, B. A. D. Arnold, R. M. Badia, and K. M. Mohror, Eds.
y Arcas, N. Tomasev, Y. Liu, R. Wong, C. Semturs, ACM, 2023, pp. 81:1–81:13. [Online]. Available: https:
S. S. Mahdavi, J. K. Barral, D. R. Webster, G. S. //doi.org/10.1145/3581784.3613215
Corrado, Y. Matias, S. Azizi, A. Karthikesalingam, and Z. Azerbayev, H. Schoelkopf, K. Paster, M. D. Santos,
V. Natarajan, “Towards expert-level medical question S. McAleer, A. Q. Jiang, J. Deng, S. Biderman, and
answering with large language models,” CoRR, vol. S. Welleck, “Llemma: An open language model for
abs/2305.09617, 2023. [Online]. Available: https://doi. mathematics,” CoRR, vol. abs/2310.10631, 2023. [Online].
org/10.48550/arXiv.2305.09617 Available: https://doi.org/10.48550/arXiv.2310.10631
W. Zhu, X. Wang, H. Zheng, M. Chen, and B. Tang, F. Yu, A. Gao, and B. Wang, “Outcome-supervised
“Promptcblue: A chinese prompt tuning benchmark for verifiers for planning in mathematical reasoning,”
the medical domain,” arXiv preprint arXiv:2310.14151, CoRR, vol. abs/2311.09724, 2023. [Online]. Available:
2023. https://doi.org/10.48550/arXiv.2311.09724
42

T. D. Nguyen, Y. Ting, I. Ciuca, C. O’Neill, Z. Sun, Z. Yao, R. Yazdani Aminabadi, M. Zhang, X. Wu, C. Li, and
M. Jablonska, S. Kruk, E. Perkowski, J. W. Miller, Y. He, “Zeroquant: Efficient and affordable post-training
J. Li, J. Peek, K. Iyer, T. Rózanski, P. Khetarpal, quantization for large-scale transformers,” Advances in
S. Zaman, D. Brodrick, S. J. R. Méndez, T. Bui, Neural Information Processing Systems, vol. 35, pp. 27 168–
A. Goodman, A. Accomazzi, J. P. Naiman, J. Cranney, 27 183, 2022.
K. Schawinski, and UniverseTBD, “Astrollama: Towards G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han,
specialized foundation models in astronomy,” CoRR, “Smoothquant: Accurate and efficient post-training quan-
vol. abs/2309.06126, 2023. [Online]. Available: https: tization for large language models,” 2023.
//doi.org/10.48550/arXiv.2309.06126 X. Ma, G. Fang, and X. Wang, “Llm-pruner: On the struc-
J. Roberts, T. Lüddecke, S. Das, K. Han, and S. Albanie, tural pruning of large language models,” 2023.
“Gpt4geo: How a language model sees the world’s ge- M. Zhang, H. Chen, C. Shen, Z. Yang, L. Ou, X. Yu,
ography,” 2023. and B. Zhuang, “Loraprune: Pruning meets low-rank
Z. Lin, C. Deng, L. Zhou, T. Zhang, Y. Xu, Y. Xu, Z. He, parameter-efficient fine-tuning,” 2023.
Y. Shi, B. Dai, Y. Song, B. Zeng, Q. Chen, T. Shi, T. Huang, E. Frantar and D. Alistarh, “Sparsegpt: Massive language
Y. Xu, S. Wang, L. Fu, W. Zhang, J. He, C. Ma, Y. Zhu, models can be accurately pruned in one-shot,” 2023.
X. Wang, and C. Zhou, “Geogalactica: A scientific large M. Xu, Y. L. Xu, and D. P. Mandic, “Tensorgpt: Efficient
language model in geoscience,” 2023. compression of the embedding layer in llms based on the
C. Wang, D. Engler, X. Li, J. Hou, D. J. Wald, K. Jaiswal, tensor-train decomposition,” 2023.
and S. Xu, “Near-real-time earthquake-induced fatality Y. Li, Y. Yu, Q. Zhang, C. Liang, P. He, W. Chen, and
estimation using crowdsourced data and large-language T. Zhao, “Losparse: Structured compression of large lan-
models,” 2023. guage models based on low-rank and sparse approxima-
L. Chen, S. Li, J. Yan, H. Wang, K. Gunaratna, V. Yadav, tion,” 2023.
Z. Tang, V. Srinivasan, T. Zhou, H. Huang, and H. Jin, Z. Hu, L. Wang, Y. Lan, W. Xu, E.-P. Lim, L. Bing, X. Xu,
“Alpagasus: Training a better alpaca with fewer data,” S. Poria, and R. K.-W. Lee, “Llm-adapters: An adapter
2023. family for parameter-efficient fine-tuning of large lan-
Y. Cao, Y. Kang, and L. Sun, “Instruction mining: High- guage models,” 2023.
quality instruction data selection for large language mod- H. Liu, D. Tam, M. Mohammed, J. Mohta, T. Huang,
els,” 2023. M. Bansal, and C. Raffel, “Few-shot parameter-
M. Li, Y. Zhang, Z. Li, J. Chen, L. Chen, N. Cheng, efficient fine-tuning is better and cheaper than in-
J. Wang, T. Zhou, and J. Xiao, “From quantity to quality: context learning,” in Advances in Neural Information
Boosting llm performance with self-guided data selection Processing Systems, A. H. Oh, A. Agarwal, D. Belgrave,
for instruction tuning,” ArXiv, vol. abs/2308.12032, and K. Cho, Eds., 2022. [Online]. Available: https:
2023. [Online]. Available: https://api.semanticscholar. //openreview.net/forum?id=rBCvMG-JsPd
org/CorpusID:261076515 Y. Wang, S. Agarwal, S. Mukherjee, X. Liu, J. Gao,
Q. Du, C. Zong, and J. Zhang, “Mods: Model-oriented data A. H. Awadallah, and J. Gao, “AdaMix: Mixture-
selection for instruction tuning,” 2023. of-adaptations for parameter-efficient model tuning,”
Y. Li, B. Hui, X. Xia, J. Yang, M. Yang, L. Zhang, S. Si, in Proceedings of the 2022 Conference on Empirical
J. Liu, T. Liu, F. Huang, and Y. Li, “One shot learning as Methods in Natural Language Processing, Y. Goldberg,
instruction data prospector for large language models,” Z. Kozareva, and Y. Zhang, Eds. Abu Dhabi,
2023. United Arab Emirates: Association for Computational
E. Frantar, S. P. Singh, and D. Alistarh, “Optimal brain com- Linguistics, Dec. 2022, pp. 5744–5760. [Online]. Available:
pression: A framework for accurate post-training quanti- https://aclanthology.org/2022.emnlp-main.388
zation and pruning,” 2023. E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang,
T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, L. Wang, and W. Chen, “Lora: Low-rank adaptation of
“GPT3.int8(): 8-bit matrix multiplication for transformers large language models,” 2021.
at scale,” in Advances in Neural Information Processing X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous
Systems, A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, prompts for generation,” in Proceedings of the 59th Annual
Eds., 2022. [Online]. Available: https://openreview.net/ Meeting of the Association for Computational Linguistics and
forum?id=dXiGWqBoxaD the 11th International Joint Conference on Natural Language
Y. J. Kim, R. Henry, R. Fahim, and H. H. Awadalla, Processing (Volume 1: Long Papers), C. Zong, F. Xia,
“Finequant: Unlocking efficiency with fine-grained W. Li, and R. Navigli, Eds. Online: Association for
weight-only quantization for llms,” 2023. Computational Linguistics, Aug. 2021, pp. 4582–4597.
C. Tao, L. Hou, W. Zhang, L. Shang, X. Jiang, Q. Liu, P. Luo, [Online]. Available: https://aclanthology.org/2021.acl-
and N. Wong, “Compression of generative pre-trained long.353
language models via quantization,” in Proceedings of the X. Liu, K. Ji, Y. Fu, W. Tam, Z. Du, Z. Yang, and J. Tang, “P-
60th Annual Meeting of the Association for Computational tuning: Prompt tuning can be comparable to fine-tuning
Linguistics (Volume 1: Long Papers), S. Muresan, P. Nakov, across scales and tasks,” in Proceedings of the 60th Annual
and A. Villavicencio, Eds. Dublin, Ireland: Association Meeting of the Association for Computational Linguistics
for Computational Linguistics, May 2022, pp. 4821–4836. (Volume 2: Short Papers), S. Muresan, P. Nakov, and
[Online]. Available: https://aclanthology.org/2022.acl- A. Villavicencio, Eds. Dublin, Ireland: Association for
long.331 Computational Linguistics, May 2022, pp. 61–68. [Online].
43

Available: https://aclanthology.org/2022.acl-short.8 for continual learning,” in Proceedings of the IEEE/CVF


T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, Conference on Computer Vision and Pattern Recognition, 2022,
“Qlora: Efficient finetuning of quantized llms,” 2023. pp. 139–149.
J. Kim, J. H. Lee, S. Kim, J. Park, K. M. Yoo, S. J. Kwon, Z. Hu, Y. Li, J. Lyu, D. Gao, and N. Vasconcelos, “Dense
and D. Lee, “Memory-efficient fine-tuning of compressed network expansion for class incremental learning,” in
large language models via sub-4-bit integer quantization,” Proceedings of the IEEE/CVF Conference on Computer Vision
2023. and Pattern Recognition, 2023, pp. 11 858–11 867.
S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee, D. Chen, X. Li, L. Lin, S. Wang, and C. Qian, “Unlock the power:
and S. Arora, “Fine-tuning language models with just Competitive distillation for multi-modal large language
forward passes,” 2024. models,” arXiv preprint arXiv:2311.08213, 2023.
Z. Wan, X. Wang, C. Liu, S. Alam, Y. Zheng, J. Liu, Z. Qu, M. Zeng, W. Xue, Q. Liu, and Y. Guo, “Continual learning
S. Yan, Y. Zhu, Q. Zhang, M. Chowdhury, and M. Zhang, with dirichlet generative-based rehearsal,” arXiv preprint
“Efficient large language models: A survey,” 2024. arXiv:2309.06917, 2023.
Y.-S. Lee, M. Sultan, Y. El-Kurdi, T. Naseem, A. Munawar, Z. Zhang, M. Fang, L. Chen, and M.-R. Namazi-Rad, “Citb:
R. Florian, S. Roukos, and R. Astudillo, “Ensemble- A benchmark for continual instruction tuning,” arXiv
instruct: Instruction tuning data generation with a preprint arXiv:2310.14510, 2023.
heterogeneous mixture of LMs,” in Findings of the C. Burns, P. Izmailov, J. H. Kirchner, B. Baker, L. Gao,
Association for Computational Linguistics: EMNLP 2023, L. Aschenbrenner, Y. Chen, A. Ecoffet, M. Joglekar,
H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: J. Leike, I. Sutskever, and J. Wu, “Weak-to-strong
Association for Computational Linguistics, Dec. 2023, pp. generalization: Eliciting strong capabilities with weak
12 561–12 571. [Online]. Available: https://aclanthology. supervision,” CoRR, vol. abs/2312.09390, 2023. [Online].
org/2023.findings-emnlp.836 Available: https://doi.org/10.48550/arXiv.2312.09390
W. Chen, Y. Zhou, N. Du, Y. Huang, J. Laudon, Z. Chen, and M. Li, Y. Zhang, S. He, Z. Li, H. Zhao, J. Wang,
C. Cui, “Lifelong language pretraining with distribution- N. Cheng, and T. Zhou, “Superfiltering: Weak-to-
specialized experts,” in International Conference on Machine strong data filtering for fast instruction-tuning,” CoRR,
Learning. PMLR, 2023, pp. 5383–5395. vol. abs/2402.00530, 2024. [Online]. Available: https:
S. Kotha, J. M. Springer, and A. Raghunathan, “Under- //doi.org/10.48550/arXiv.2402.00530
standing catastrophic forgetting in language models via J. Ji, B. Chen, H. Lou, D. Hong, B. Zhang, X. Pan,
implicit inference,” arXiv preprint arXiv:2309.10105, 2023. J. Dai, and Y. Yang, “Aligner: Achieving efficient
B. Koloski, B. Škrlj, M. Robnik-Šikonja, and S. Pollak, “Mea- alignment through weak-to-strong correction,” CoRR,
suring catastrophic forgetting in cross-lingual transfer vol. abs/2402.02416, 2024. [Online]. Available: https:
paradigms: Exploring tuning strategies,” arXiv preprint //doi.org/10.48550/arXiv.2402.02416
arXiv:2309.06089, 2023.
T. Wu, L. Luo, Y.-F. Li, S. Pan, T.-T. Vu, and G. Haffari,
“Continual learning for large language models: A sur-
vey,” arXiv preprint arXiv:2402.01364, 2024.
Y. Luo, Z. Yang, F. Meng, Y. Li, J. Zhou, and Y. Zhang,
“An empirical study of catastrophic forgetting in large
language models during continual fine-tuning,” arXiv
preprint arXiv:2308.08747, 2023.
J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Des-
jardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho,
A. Grabska-Barwinska et al., “Overcoming catastrophic
forgetting in neural networks,” Proceedings of the national
academy of sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
M. Rostami, S. Kolouri, and P. K. Pilly, “Complementary
learning for overcoming catastrophic forgetting using ex-
perience replay,” arXiv preprint arXiv:1903.04566, 2019.
D. Rolnick, A. Ahuja, J. Schwarz, T. Lillicrap, and G. Wayne,
“Experience replay for continual learning,” Advances in
Neural Information Processing Systems, vol. 32, 2019.
S.-W. Lee, J.-H. Kim, J. Jun, J.-W. Ha, and B.-T. Zhang,
“Overcoming catastrophic forgetting by incremental mo-
ment matching,” Advances in neural information processing
systems, vol. 30, 2017.
A. Mallya, D. Davis, and S. Lazebnik, “Piggyback: Adapting
a single network to multiple tasks by learning to mask
weights,” in Proceedings of the European conference on com-
puter vision (ECCV), 2018, pp. 67–82.
Z. Wang, Z. Zhang, C.-Y. Lee, H. Zhang, R. Sun, X. Ren,
G. Su, V. Perot, J. Dy, and T. Pfister, “Learning to prompt

You might also like