Achieving Peak Performance for Large Language Models: A Systematic Review (2024)
Budapest, Hungary
Corresponding author: Zhyar Rzgar K. Rostam ([email protected])
ABSTRACT In recent years, large language models (LLMs) have achieved remarkable success in natural
language processing (NLP). LLMs require an enormous number of parameters to attain high performance.
As models grow into the trillion-parameter range, computational and memory costs increase significantly.
This makes it difficult for many researchers to access the resources needed to train or apply these
models. Optimizing LLM performance involves two main approaches: fine-tuning pre-trained models for
specific tasks to achieve state-of-the-art performance, and reducing costs or improving training time while
maintaining similar performance. This paper presents a systematic literature review (SLR) following the
Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement. We reviewed
65 publications out of 983 from 2017 to December 2023, retrieved from 5 databases. The study presents
methods to optimize and accelerate LLMs while achieving cutting-edge results without sacrificing accuracy.
We begin with an overview of the development of language modeling, followed by a detailed explanation
of commonly used frameworks and libraries, and a taxonomy for improving and speeding up LLMs based
on three classes: LLM training, LLM inference, and system serving. We then delve into recent optimization
and acceleration strategies such as training optimization, hardware optimization, scalability and reliability,
accompanied by the taxonomy and categorization of these strategies. Finally, we provide an in-depth
comparison of each class and strategy, with two case studies on optimizing model training and enhancing
inference efficiency. These case studies showcase practical approaches to address LLM resource limitations
while maintaining performance.
INDEX TERMS Distributed training, GPU acceleration, large language model, LLM, LLM acceleration,
LLM frameworks, LLM optimization.
I. INTRODUCTION
In recent years, dense deep learning models have seen an extraordinary growth in the number of parameters [1], [2], [3]. The Transformer, as an effective deep learning architecture, has been widely used over the recent years, and transformer-based models have achieved notable success and recognition in various fields, including language modeling, compared to existing models [4], [5], [6], [7], [8], [9], [10], [11], [12], [13].
To achieve significant accuracy in deep learning, large models with billions to trillions of parameters are essential. Therefore, deep learning models continue to grow in complexity, with an array of large-scale models ranging from Bidirectional Encoder Representations from Transformers (BERT-large, 340 million parameters) [8] and Generative Pre-trained Transformer-3 (GPT-3, 175 billion parameters) [14] to General Language Model (GLM-3, 1.75 trillion parameters) [15].
With models now reaching trillions of parameters, even the most powerful GPUs are struggling to keep up [1]. This resource-intensive requirement makes it difficult for many researchers to access the computational resources they need to train these models [1], [4], [16]. Handling, managing, and fitting these models into device memory is also a daunting challenge due to memory limitations, and the tremendous size of the data brings complexity and requires high-end computing resources with significant memory to process [5], [17], [18], [19]. Training large-scale models effectively requires significant adjustments [20], [21], [22], [23], [24], especially in terms of increasing training throughput and loading these kinds of large models into GPU memory [18].
As a result, developing frameworks and libraries and proposing new techniques to overcome the mentioned challenges has become an essential task. Many studies have explored optimization and acceleration for large models, using various techniques to achieve state-of-the-art (SOTA) results without sacrificing accuracy. These remarkable advancements in the field of language models (LMs) call for a systematic review of recent LM optimization and acceleration techniques. To address these challenges and guide future research, this SLR paper aims to:
• Analyze recent optimization and acceleration techniques for LLMs.
• Identify challenges associated with training, inference, and system serving for LLMs (billions/trillions of parameters).
• Develop a structured taxonomy to categorize LLM optimization techniques.
• Review and evaluate recent libraries and frameworks designed for LLM optimization.
• Identify promising areas for future research in LLM development, focusing on efficiency, scalability, and flexibility.
In this SLR, we make the following contributions:
• Comprehensive overview: We offer a comprehensive overview of the development of language modeling (Section II), detailing commonly used frameworks and libraries (Section IV) and recently used techniques and strategies (Sections V, VI, VII). This serves as a valuable resource for understanding the current landscape of LLM optimization.
• Taxonomy of optimization strategies: We categorize optimization strategies into three classes: training optimization, hardware optimization, and scalability and reliability. This taxonomy helps clarify the various approaches and their specific applications (presented in Fig. 4, Sections V, VI, VII).
• Detailed analysis of techniques: Our analysis explores recent optimization and acceleration strategies; we provide two comparative analyses regarding performance, cost, and scalability for the reviewed strategies (presented in Tables 6 and 7) and their core categories: training optimization, hardware optimization, and scalability and reliability (presented in Table 5). In the latter analysis, we also consider the focus of the classes.
• Case studies: We include two in-depth case studies that demonstrate practical approaches to optimizing model training and enhancing inference efficiency. These case studies highlight how resource limitations can be addressed while maintaining performance (Sections VIII-A, VIII-B).
• Future direction: We explore a range of promising future directions for LLM development. These areas, detailed in specific sections, focus on enhancing efficiency, scalability, and flexibility for LLMs (Section X).
This review paper is organized as follows: an overview of language modeling development (Section II) is followed by an in-depth explanation of the most commonly utilized frameworks and libraries specifically designed for optimizing and accelerating large language models (LLMs) (Section IV, Tables 3 and 4), accompanied by taxonomy and categorization. The paper then delves into recent optimization and acceleration strategies employed within LLMs, including the taxonomy and categorization of these strategies (presented in Fig. 1) (Sections V, VI, VII); Table 8 summarizes the reviewed papers, excluding those already covered in Tables 3 and 4 or the main text. Moreover, we present an individual comparison in terms of performance, cost, and scalability for the reviewed strategies discussed in Tables 6 and 7, and for the classes (training optimization, hardware optimization, scalability and reliability) presented in Table 5. In addition to the mentioned factors, we consider the classes' focus in this comparison. Finally, we illustrate these concepts with two real-world examples: optimizing model training and improving inference efficiency through case studies (Section VIII).

A. RELATED WORKS
In this section, we present the related studies that investigate optimization and acceleration of dense deep learning models and LLMs. Jahan et al. [25] present a systematic literature review (SLR) comparing 31 BERT-inspired language models published between 2018 and 2020, to help researchers choose the best model for their requirements. By analyzing each model's performance against RoBERTa, the study identified seven models that performed better, while the remaining studies were investigated with different parameter settings. The outperforming models varied in dataset size, suggesting that both large and small datasets can be effective depending on the model's architecture. Ultimately, this research provides valuable insights for researchers seeking the optimal language model for their specific tasks. Yu et al. [26] conduct a survey that explores the growing challenges and opportunities for optimizing large-scale deep learning systems. By highlighting recent advances in optimization techniques, it proposes a new way to categorize and explain the different computing approaches used. Zhao et al. [24] carry out a survey that focuses on the recent advancements in LLMs. The study concentrates
TABLE 1. Research queries executed.
1 The initial search query in arXiv with this RQ was broad, returning 440 studies, many irrelevant to our research. To refine the results, minimize the risk of bias, and ensure retrieval of high-quality, relevant papers, we employed the AND operator along with the title field within the search query specifically on arXiv.
systematic reviews. Our approach included a detailed search strategy across multiple databases, explicit inclusion and exclusion criteria, and a thorough study selection process. We documented each step meticulously, including the study selection and exclusion procedures (presented in Fig. 2).
Eligibility Criteria: This review focuses on the optimization and acceleration of LLMs, examining the most recent and widely utilized libraries, frameworks, and techniques in this field. To ensure a focused analysis, strict eligibility criteria are applied. Only studies published between 2017 and December 2023 are considered, excluding publications not written in English and retracted papers. Additionally, studies are excluded if they are irrelevant to our SLR, or do not explicitly address ''Optimization,'' ''Acceleration,'' or ''Large Language Models'' in their titles, abstracts, or keywords.
Information Sources: To ensure a comprehensive search for authentic studies, a variety of sources, including databases, websites, and tools, were employed. Digital libraries like IEEE Xplore, Web of Science, and Scopus, alongside open access libraries like arXiv and dedicated tools like Zotero, facilitated the data collection and reference management. The last search was conducted on May 25th, 2024. Additionally, Rayyan and the researchrabbit.ai websites were utilized for data exploration and study selection.
Search Strategy: This systematic review leveraged two web-based AI tools, ResearchRabbit [31] and Rayyan [32], for both data collection and study selection. In all databases and websites, we were particularly interested in finding studies that focused on language modeling, particularly those that focused on LLM optimization and acceleration. We employed various queries in each source (see Table 1) and exported the retrieved studies for import into Rayyan. Rayyan's AI capabilities facilitated both the selection of desired studies and the exclusion of irrelevant ones.
Selection Process: The process of selecting which works to review in this study employed strict inclusion criteria. In this SLR we explore techniques and methods that were primarily examined based on their focus on large-scale language modeling, including transformer-based models such as PLMs, LLMs, and even general NLP models. The Rayyan platform facilitated the selection process. Two stages were involved: initial screening using eligibility and inclusion criteria, followed by author selection of the most relevant and impactful studies. Finally, the ''compute rating'' function in Rayyan was used, and the authors double-checked excluded studies for accuracy.
Data Extraction: In this stage, we focused on extracting relevant data from selected studies. Our aim was to collect information on two key aspects:
Outcomes: We were particularly interested in outcomes related to LLM optimization and acceleration. Specifically, we sought data on:
• Performance metrics: This could include metrics like perplexity [19], BLEU score, ROUGE score [33], or task-specific accuracy measures depending on the study's focus.
• Training time reduction: We looked for data on how different techniques impacted the time required to train LLMs.
• Resource usage: If studies reported resource (memory) usage changes with different optimization techniques, we collected that data.
We aimed to collect all relevant results within these outcome domains whenever possible. This included considering data from different measures, time points, and analyses reported by the study authors.
Additional Variables: In addition to the main outcomes, we also extracted data on the following aspects of the studies:
• LLM architecture: The specific type of LLM architecture used in the study.
• Optimization techniques: Detailed description of the optimization techniques employed in the study.
• Hardware/Software platforms: The hardware and software platforms used for training, inference, serving, and evaluation.
Data Collection Process: ResearchRabbit is a web-based tool powered by AI that helps and guides researchers to find relevant studies in a variety of digital libraries and allows researchers to export retrieved results in a collection to reference manager tools (similar to Zotero). ResearchRabbit's search is powered by SemanticScholar and shows only the top 50 search results for a single query, aiming to maintain the research focus effectively [31]. Initially, we applied our queries to the ResearchRabbit website and then added the most relevant retrieved results to our collection. Following that, we applied the same queries in digital libraries like IEEE Xplore, Web of Science, Scopus, and arXiv (see Table 2). The papers were reviewed on a case-by-case basis. Then, a precise summary of each paper was written. Finally, the interesting data that directly addressed the issues the papers attempted to address were extracted from the summaries.
Study Risk of Bias Assessment: In this SLR, we followed a meticulous process to assess the risk of bias in the included studies, adhering to best practices for ensuring the reliability and validity of our findings.
Automation Tools:
• We utilized Rayyan, an AI-powered tool, to facilitate the initial screening and selection process. Rayyan's AI capabilities helped in identifying potential biases and categorizing studies based on relevance and quality.
• ResearchRabbit was used for gathering relevant studies, which provided a focused list of top search results, aiding in maintaining the research scope effectively.
Reviewer Process:
• Each study was assessed by three independent reviewers. This approach helps to minimize subjective bias and ensures a more balanced evaluation.
• The reviewers independently examined each study based on predefined criteria including selection bias, performance bias, detection bias, attrition bias, and reporting bias.
Independent Review and Consensus:
• The reviewers worked independently during the initial assessment phase to ensure unbiased evaluations.
• After the independent assessments, the reviewers compared their findings. Any discrepancies or disagreements were resolved through discussion and consensus.
We adhered to a rigorous and systematic approach to assess the risk of bias, which involved multiple independent reviewers and the use of validated tools. Automation tools such as Rayyan and ResearchRabbit played a crucial role in streamlining the screening and selection process, thereby enhancing the efficiency and accuracy of our assessments. By combining independent reviews, consensus discussions, and advanced AI tools, we ensured a robust and unbiased evaluation of the included studies.
Synthesis Methods: To enable a comprehensive and insightful analysis of LLM optimization techniques across diverse contexts, a three-tiered categorization scheme will be employed. The initial categorization will consist of grouping studies based on the utilized LLM libraries/frameworks and the optimization techniques investigated. Subgroups within these categories will be further established based on the specific type of LLM or the NLP task addressed by the studies. This method enables a highly detailed examination of how the effectiveness of optimization techniques varies across different LLM and NLP task configurations. Additionally, key findings from each individual study will be summarized in tables, including details like the optimization technique used, LLM type, NLP task addressed, achieved performance
metrics, and the study's aims. Finally, a narrative synthesis will be conducted to analyze recurring themes across the studies. This thematic analysis will focus on the effectiveness of LLM libraries and optimization techniques in achieving performance improvements while considering resource constraints. It will also explore potential explanations for observed variations in effectiveness, with particular attention paid to factors like LLM size, resources used, and the NLP task addressed.
TABLE 2. Studies retrieved per database / search engine.
Reporting Bias Assessment and Certainty Assessment: To minimize the risk of bias in our systematic review, we implemented a multifaceted strategy. First, to address reporting bias, we utilized Rayyan and ResearchRabbit, as AI-powered tools, during the initial screening and selection process. These tools can categorize studies based on relevance and quality and can help flag studies with characteristics suggestive of reporting bias, such as those focusing solely on positive outcomes. Second, to strengthen the certainty of our findings and minimize subjective bias, we implemented a multi-reviewer approach. Each study underwent independent assessment by three reviewers based on predefined criteria. This approach ensures a more balanced evaluation and reduces the influence of individual reviewer bias.

II. LANGUAGE MODELING DEVELOPMENT
Language modeling is a fundamental approach to enhancing the ability of machines to understand and process human language. A language model is a computational model that can learn and predict the probabilities of incoming (or missing) tokens [24]. The development of language models can be classified as follows (see Fig. 3):
• N-gram language models, like bigrams and trigrams, are basic methods that learn from the frequency of word sequences in text [34], [35]. However, their limited context window restricts their ability to capture long-range dependencies and understand the deeper semantic relationships between words.
• Markov assumption language models refer to those models that predict the next word based on the most recent words in the context [24]. Both n-gram and Markov assumption language models are commonly used to improve task performance in NLP and information retrieval (IR) [36].
• Machine learning models: these models apply machine learning algorithms to enhance language comprehension. They are trained on extensive text corpora to discern patterns and relationships [37]. The adoption of machine learning in NLP introduced a more advanced methodology, enabling the creation of applications such as spam detection [38] and sentiment analysis [39].
• Neural language models: these models are built on neural networks (NNs) for working with sequences of data. They have a special ability to learn effective features for words or sentences. The studies [40], [41], [42] initiated the use of language models for representation learning (beyond word sequence modeling) and show that these models have an important impact on the field of NLP [24], [43].
• Transformer language models refer to those models that leverage the capabilities of a deep learning architecture called the Transformer to process and understand human language [44], [45]. These models achieved remarkable results by using a ''special attention mechanism'' to understand the relationships between words and sentences. They capture context-aware representations instead of learning fixed word representations, first pre-training and then fine-tuning according to specific downstream tasks [2], [8], [24], [46]. The Transformer architecture has been used to build PLMs such as BERT [8], GPT-2 [47], and BART [48]. These models underwent training using bidirectional language models and specifically designed pre-training tasks applied to extensive unlabeled datasets. The growth in model size and data size has revolutionized the way we approach downstream tasks, enabling large-sized PLMs to achieve remarkable performance gains. These models exhibit unique characteristics compared to smaller PLMs, such as 330M-BERT and 1.5B-GPT-2, demonstrating exceptional abilities in solving complex tasks. As a result, LLM is the term used to refer to large-sized PLMs [46], [49], [50].
FIGURE 3. Language model development.

III. MACHINE LEARNING MODELS
The process of building, deploying, and managing a machine learning model involves three distinct phases: training, inference, and system serving. Training is the foundation of machine learning, where a vast dataset of labeled data is used to develop a model that can identify patterns and relationships within the data. Inference is the application of the trained model, where new, unseen data is fed into the model to obtain predictions or classifications based on the learned patterns. System serving ensures the model's longevity and effectiveness in real-world applications, handling large volumes of requests, monitoring the model's performance, and providing continuous updates or modifications as needed [11], [19], [51]. In Section IV, we provide a categorization of the most recent frameworks and libraries utilized for LLM optimization, structured into three primary classes: training, inference, and deployment and system serving (presented in Fig. 4). However, certain studies can
1) GPIPE
GPipe [3] introduces a novel pipeline parallelism framework
based on batch partitioning. It divides each mini-batch of examples into smaller micro-batches, which are subsequently
executed in sequence across the cells. During training,
it employs synchronous mini-batch gradient descent, where
gradients from all micro-batches are aggregated and applied
to the model at the end of the mini-batch. GPipe has been
shown to train two large-scale models: a convolutional model
for image classification and a transformer model for machine
translation. The convolutional model, AmoebaNet, was
trained on 480 × 480 input from the ImageNet 2012 dataset.
To enhance its performance, the model width was expanded,
and its parameters were scaled up to 557 million. The model
achieved a top-1 validation accuracy of 84.4%. Meanwhile, the transformer model, a single 128-layer, 6-billion-parameter multilingual model trained across 103 languages, was also evaluated. GPipe achieved superior performance compared to training 350-million-parameter bilingual Transformer-Big models individually across 100 language pairs. GPipe demonstrates its efficiency by boosting performance on a variety of devices, supporting flexibility for any deep network architecture, utilizing synchronous gradient descent, and ensuring consistent training regardless of the number of partitions.
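To make the micro-batch idea concrete, the following minimal PyTorch sketch (an illustration, not GPipe's actual API; the two-stage split, layer sizes, and device placement are assumptions) accumulates gradients over micro-batches and applies a single synchronous update per mini-batch, mirroring the synchronous mini-batch gradient descent described above. A real GPipe-style pipeline additionally overlaps micro-batches across accelerators and re-materializes activations to save memory.

import torch
import torch.nn as nn

# Hypothetical two-stage partition; the stages could sit on "cuda:0"/"cuda:1",
# CPU is used here only so the sketch runs anywhere.
dev0, dev1 = "cpu", "cpu"
stage0 = nn.Sequential(nn.Linear(64, 128), nn.ReLU()).to(dev0)
stage1 = nn.Sequential(nn.Linear(128, 10)).to(dev1)
opt = torch.optim.SGD(list(stage0.parameters()) + list(stage1.parameters()), lr=0.1)
loss_fn = nn.CrossEntropyLoss(reduction="sum")

def train_step(x, y, num_micro_batches=4):
    """Split one mini-batch into micro-batches; gradients from all micro-batches
    are accumulated and applied once at the end (synchronous update)."""
    opt.zero_grad()
    batch = x.size(0)
    for xm, ym in zip(x.chunk(num_micro_batches), y.chunk(num_micro_batches)):
        h = stage0(xm.to(dev0))              # first cell
        out = stage1(h.to(dev1))             # second cell
        loss = loss_fn(out, ym.to(dev1)) / batch
        loss.backward()                      # accumulates into .grad
    opt.step()                               # one synchronous update per mini-batch

x = torch.randn(32, 64); y = torch.randint(0, 10, (32,))
train_step(x, y)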
2) BYTETRANSFORMER
ByteTransformer [4] is a high-performance transformer framework optimized for variable-length inputs in NLP tasks. It has been used by some famous applications, including TikTok and Douyin of ByteDance. The model was evaluated on an NVIDIA A100, focusing on the forward pass of BERT-like transformers, including BERT [8], ALBERT [55], DistilBERT, and DeBERTa. It showcased a significant improvement, enhancing the fused MHA mechanism by 6.13× compared to PyTorch attention. Additionally, ByteTransformer outperformed PyTorch, TensorFlow, Tencent TurboTransformers [11], Microsoft DeepSpeed [5], and NVIDIA FasterTransformer by 87%, 131%, 138%, 74%, and 55%, respectively, in terms of the end-to-end performance of a standard BERT transformer.

3) MEGATRON-LM
Megatron-LM [19] is a deep learning library for training LLMs efficiently and effectively. It enables the training of very large transformer models with billions of parameters. It offers a set of optimization methods for distributed training, including strategies like intra-layer model parallelism and mixed-precision training. These optimization techniques significantly enhance training efficiency and speed, facilitating effective distributed training across multiple GPUs. Megatron-LM operates independently, without requiring new compiler or library changes. This makes it orthogonal and complementary to pipeline model parallelism, allowing for seamless integration and flexibility within existing NLP frameworks.
The library has been shown to be highly effective for training LLMs. A Megatron-LM model with 8.3 billion parameters was trained on 512 NVIDIA V100 GPUs using 8-way model parallelism and achieved sustained performance of up to 15.1 PetaFLOPs across the entire application. This is significantly faster than previous approaches to training LLMs. Additionally, it has been shown to achieve SOTA results on several NLP benchmarks. A Megatron-LM model with 8.3 billion parameters achieved a perplexity of 10.8 on the WikiText103 benchmark. It also achieved an accuracy of 66.5% on the LAMBADA dataset, which outperforms the previous SOTA of 63.2%.
4) LIGHTSEQ2
LightSeq2 [52] proposes a software library that accelerates the training of transformer-based models on GPUs. It is a system-level optimization that maintains accuracy and training behavior. The system works with BERT (encoder), GPT (decoder), Transformer (encoder-decoder), and vision transformer models. The system uses three techniques to improve training speed and efficiency. First (layer-specific kernels): after analyzing Transformer-specific layers in detail, it rewrites the kernels with dependencies and other techniques to improve parallelism, and uses small kernels to improve GPU utilization. Second (mixed-precision trainer): instead of applying batch updates to many individual full-precision parameters, it applies batch updates to reduced-precision parameters. Finally, it introduces an efficient memory management technique to minimize the need for frequent allocation and release calls. This strategy involves recycling the memory space of tensors that remain unused during the backward pass.
The system accelerates the entire training process for transformer models. LightSeq2 achieves significant performance improvements on a variety of NLP tasks, including machine translation: on the WMT14 English-German machine translation task, it achieved a 308% speedup compared to PyTorch.
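As a point of reference for the mixed-precision idea, the sketch below shows a standard PyTorch automatic mixed-precision loop (not LightSeq2's fused trainer; the model, sizes, and hyperparameters are illustrative assumptions): the forward and backward passes run in reduced precision, a gradient scaler protects small gradients from underflow, and the optimizer updates full-precision master weights.

import torch
import torch.nn as nn

use_cuda = torch.cuda.is_available()       # falls back to a plain fp32 loop on CPU
device = "cuda" if use_cuda else "cpu"
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 2)).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 512, device=device)
y = torch.randint(0, 2, (16,), device=device)

for _ in range(3):
    opt.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16, enabled=use_cuda):
        loss = loss_fn(model(x), y)          # forward in reduced precision
    scaler.scale(loss).backward()            # scaled loss avoids fp16 underflow
    scaler.step(opt)                         # update applied to fp32 master weights
    scaler.update()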
5) COLLIE
CoLLiE [53] introduces a library designed to efficiently facilitate the collaborative training of LLMs using 3D parallelism [24] (Sections V-C4, V-C1, V-C2), parameter-efficient fine-tuning (PEFT) methods, and optimizers. The library demonstrated significantly improved training efficiency compared to existing solutions. The study empirically evaluates the correlation between model size and GPU memory consumption under different optimization methods and analyzes throughput. Additionally, the study investigates training methods to improve the abilities of the LLaMA-65B model, specifically focusing on following user instructions. Techniques like LoRA [33], LOMO [56] (Section V-A4), AdaLomo [57], and AdamW demonstrated success in boosting the model's instruction-following capabilities without sacrificing its overall performance. Notably, LoRA and AdaLomo achieved impressive results, enabling the model to achieve an average score of 56.9.
6) LLM TRAINING FRAMEWORKS AND LIBRARIES:
CHALLENGES AND KEY FINDINGS
This section explores five prominent frameworks and libraries: GPipe [3], ByteTransformer [4], Megatron-LM [19], LightSeq2 [52], and CoLLiE [53]. Each offers unique functionalities to overcome limitations in LLM training.
Addressing Training Challenges:
• Distributed training: As LLMs grow complex, training them on a single device becomes impractical. Frameworks like Megatron-LM [19] and CoLLiE [53] employ distributed training algorithms that split the model across multiple GPUs, enabling parallel processing and faster training.
• Efficiency and speed: LightSeq2 [52] tackles training speed through system-level optimizations. It utilizes techniques like layer-specific kernels and mixed-precision training to enhance GPU utilization and reduce memory usage. Similarly, ByteTransformer [4] accelerates transformer models for variable-length inputs in NLP tasks.
• Memory management: Efficient memory allocation is crucial for LLM training. CoLLiE [53] overcomes memory constraints in LLM training by utilizing 3D parallelism to efficiently distribute memory across training machines and GPUs, enabling the training of large models even in resource-limited environments.
• Fine-tuning and performance: CoLLiE [53] investigates methods to improve specific LLM capabilities, such as following user instructions. It explores parameter-efficient fine-tuning methods that enhance model performance in targeted areas without compromising overall functionality.
Key Findings:
• GPipe [3] demonstrates successful training of a large multilingual transformer model, achieving superior results compared to training individual smaller models.
• ByteTransformer [4] significantly outperforms existing frameworks in terms of performance for BERT-like transformers on various benchmarks.
• Megatron-LM [19] facilitates training of LLMs with billions of parameters, achieving SOTA on NLP tasks while offering high throughput.
• LightSeq2 [52] accelerates transformer model training by up to 308%, showcasing substantial performance improvements.
• CoLLiE [53] introduces a library for collaborative LLM training, demonstrating improved efficiency and effectiveness in training large models like LLaMA-65B. It explores methods to enhance specific functionalities without impacting overall performance.

B. LLM INFERENCE FRAMEWORKS AND LIBRARIES
This section introduces the LLM frameworks and libraries designed particularly for inference tasks, followed by a summary of each one (see Table 4).

1) DEEPSPEED INFERENCE
DeepSpeed Inference [5] presents a comprehensive system solution for efficient transformer inference. It has the potential to enable new and innovative applications of transformer models in cloud datacenters and other resource-constrained environments. The system consists of two main parts: DeepSpeed Transformer and ZeRO-Inference [1]. DeepSpeed Transformer is a GPU-only solution that leverages a variety of optimizations to minimize latency and maximize throughput for transformer models of all sizes. Specifically, in the first phase, DeepSpeed Transformer uses tensor-slicing and inference-optimized pipeline parallelism to scale dense transformer models across GPUs. For sparse transformer models, it provides a massive-GPU sparse transformer layer that can extend the scalability of Mixture-of-Experts (MoE) transformer layers to hundreds of GPUs. This is achieved through a combination of parallelism techniques and communication optimization strategies. DeepSpeed Transformer then employs optimized sparse kernels to reduce the computational burden on a single GPU. ZeRO-Inference [1] is a heterogeneous solution that leverages GPU, CPU, and NVMe memory to enable massive transformer inference with limited GPU resources.
ZeRO-Inference is particularly useful for inferring models that are too large to fit in GPU memory. It works by partitioning the model weights across multiple GPUs and offloading unused weights to CPU and NVMe memory. This allows ZeRO-Inference to infer models that are much larger than would be possible with GPU-only solutions. As a result, DeepSpeed Inference boosts throughput by more than 1.5× for throughput-oriented scenarios and reduces latency by more than 7.3× compared to existing solutions for latency-oriented scenarios. It facilitates real-time inference at a trillion-parameter scale by utilizing hundreds of GPUs, marking an unparalleled achievement in terms of inference scale. This technology allows for the inference of models that are 25 times larger than what GPU-only solutions can handle, achieving a substantial throughput of 84 TFLOPS, which is over 50% of the A6000 peak performance.
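The heterogeneous-memory idea behind ZeRO-Inference can be sketched in a few lines: weights stay in host memory and are streamed onto the accelerator one layer at a time during the forward pass. The code below is a deliberately simplified illustration under assumed layer sizes and device names, not the DeepSpeed API; the real system also uses NVMe offload, prefetching, and overlap of transfers with computation.

import torch
import torch.nn as nn

compute_device = "cuda" if torch.cuda.is_available() else "cpu"
layers = [nn.Linear(1024, 1024) for _ in range(8)]   # weights resident on CPU

@torch.no_grad()
def offloaded_forward(x):
    x = x.to(compute_device)
    for layer in layers:
        layer.to(compute_device)   # fetch weights for this layer only
        x = torch.relu(layer(x))
        layer.to("cpu")            # release accelerator memory again
    return x

out = offloaded_forward(torch.randn(4, 1024))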
2) FLEXGEN
Accelerating LLM inference typically relies on multiple high-end accelerators, owing to the high computational and memory requirements. The FlexGen [17] study proposes an offloading engine that focuses on using limited resources (resource-constrained devices) to reach high-throughput LLM inference. The engine can be flexibly configured with different hardware resources by aggregating memory and computation from the GPU, CPU, and disk. To optimize throughput within the search space, the researchers developed a linear programming-based search algorithm, through which the model can find efficient patterns for storing and accessing tensors. It has a larger space of batch size options to choose from without sacrificing accuracy, by using 4-bit quantization to compress weights and the attention cache without the need for retraining or calibration. The model's efficiency was evaluated using NVIDIA T4 (16 GB) GPUs to run OPT-175B. It significantly outperforms DeepSpeed ZeRO-Inference [1], [5] and Hugging Face Accelerate by enabling significantly larger batch sizes, often reaching orders of magnitude higher than its competitors. As a result, it can achieve significant speedups in throughput. On a single T4 GPU equipped with 208 GB CPU DRAM and a 1.5 TB SSD, with input sequence length 512 and output sequence length 32: at a latency of 5,000 seconds, FlexGen (effective batch size 64) surpasses DeepSpeed ZeRO-Inference (batch size 1) by over 40×, whereas Hugging Face Accelerate fails to complete a single batch. Furthermore, it can reach 69× higher throughput at a higher latency of 12,000 seconds compared to the baselines (effective batch size 256, or 8,192 tokens in total). Finally, the model can achieve 100× higher maximum throughput (effective batch size 144, or 4,608 tokens in total) with 4-bit quantization for compression and a latency of 4,000 seconds, by holding all weights in the CPU and getting rid of disk offloading. The model achieved these results by aggregating memory and computation from the GPU, CPU, and disk, and by using a number of techniques to improve efficiency, such as I/O task scheduling, possible compression techniques, and distributed pipeline parallelism. FlexGen is a significant advancement in LLM inference, as it enables high-throughput generation on resource-constrained devices. This opens new possibilities for deploying and using LLMs in a wider range of applications.
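The 4-bit weight compression used by such offloading engines can be illustrated with a simple group-wise min-max quantizer. The sketch below is a toy example (group size, rounding, and the absence of bit-packing are assumptions, and FlexGen also compresses the attention KV cache), but it shows why the memory footprint drops by roughly 4× relative to 16-bit weights at a small reconstruction error.

import torch

def quantize_4bit(w: torch.Tensor, group_size: int = 64):
    """Toy group-wise 4-bit min-max quantization of a weight matrix."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    g = w.reshape(out_features, -1, group_size)
    w_min = g.min(dim=-1, keepdim=True).values
    w_max = g.max(dim=-1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0              # 16 levels -> [0, 15]
    q = torch.clamp(torch.round((g - w_min) / scale), 0, 15).to(torch.uint8)
    return q, scale, w_min

def dequantize_4bit(q, scale, w_min, shape):
    return (q.float() * scale + w_min).reshape(shape)

w = torch.randn(256, 256)
q, scale, zero = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale, zero, w.shape)
print((w - w_hat).abs().max())   # small reconstruction error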
3) NLP-FAST
NLP-Fast [58] is a system that accelerates the performance of large-scale heterogeneous NLP models by identifying performance-critical operations and applying holistic model partitioning, cross-operation zero skipping, and model/config adaptive hardware reconfiguration. NLP-Perf, a performance analysis tool, collects performance data for NLP models and identifies performance-critical operations. Holistic model partitioning is a comprehensive optimization technique that integrates three model partitioning approaches: partial-head update, column-based algorithm, and feed-forward splitting, to facilitate end-to-end model partitioning. Cross-operation zero skipping skips zero or near-zero values across multiple operations, which can significantly reduce the amount of computation required; these two optimizations can be executed on different hardware platforms. Model/config adaptive hardware reconfiguration adapts the model architecture to the specific hardware platform it is running on, which can further improve performance. NLP-Fast has been evaluated on a variety of NLP models and hardware platforms, including CPU, GPU, and FPGA. The evaluation results show that NLP-Fast can improve throughput by up to 2.92×, 1.59×, and 4.47× over the baseline performance on each platform.

4) TURBOTRANSFORMERS
TurboTransformers [11] is a lightweight, easy-to-use system that enables efficient deployment of transformer models for online services. It achieves SOTA performance on GPU platforms by proposing three innovative features that distinguish it from other similar systems. Firstly, it proposes an efficient parallel GPU-based batch reduction kernel algorithm for Softmax and LayerNorm. Secondly, it proposes a sequence-length-aware memory allocation algorithm to efficiently balance memory allocation and deallocation; this algorithm overcomes the problem of variability in the input sentences. Finally, the framework employs a novel batch scheduler that leverages dynamic programming to achieve optimal throughput on variable-length requests. It is a lightweight and easy-to-use system that can be integrated into PyTorch code with just a few lines of code. This makes it a very accessible option for researchers and practitioners who want to use transformer models for online services.
5) PETS
The existing large-scale transformer models follow the pre-train-then-fine-tune paradigm; copying the entire model for each downstream task consumes a lot of storage, and this approach is unsuited for multi-purpose serving. Parameter-Efficient Transformers (PETs) reduce the resource overhead. They share the pre-trained model among tasks and only fine-tune a specific portion based on the task parameters. Prior to PetS [59], serving systems did not have any mechanism to provide flexibility for PET task management, and there was no efficient method to serve queries to different task batches. PetS is the first unified framework for multi-task PET serving in a single system. As a class of transformer models, PETs have been designed to be more efficient in terms of both parameters and computation. Therefore, PETs are well suited for deployment in resource-constrained environments. Conventional serving frameworks move data between GPU and CPU memory when the GPU cannot hold all of the data for the tasks being processed, which reduces the throughput of the system. PetS has the potential to revolutionize the way that LLMs are served, making it possible to deploy and run LLMs on a wider range of devices and with lower resource requirements. This could make LLMs more accessible to a wider range of users and businesses. The PetS framework consists of a flexible PET task management mechanism and a specialized PET Inference Engine (PIE) that allows both inter-task and inter-algorithm query-batching. It enables 26× more concurrent tasks and enhances serving throughput by 1.53× on desktop GPUs and 1.63× on server GPUs.

6) PETALS
PETALS [60] emerges as a collaborative platform specifically designed for the distributed inference and fine-tuning of LLMs over the internet. It aims to overcome the limitations associated with existing approaches, offering a range of advantages. The platform focuses on achieving high performance by leveraging pipeline parallelism, effectively enhancing the efficiency of LLM inference and fine-tuning processes. Furthermore, it showcases scalability, demonstrating its capability to support a substantial number of users and accommodate large-scale LLMs. This adaptability is complemented by the provision of a flexible API, allowing users to tailor the inference and fine-tuning processes according to their specific requirements. The key feature of PETALS is its emphasis on collaboration, providing a framework that enables multiple participants to actively engage in LLM inference and fine-tuning tasks collectively. The collaborative nature of PETALS contributes to its potential in democratizing access to LLMs, making them more accessible and valuable across a diverse range of applications. In summary, PETALS emerges as a promising platform with the potential to enhance the accessibility and utility of LLMs. It can offload a 176B-parameter model in 5.5 seconds for a regular setup and 11 seconds for a multi-GPU setup. These results demonstrate PETALS's superior performance for running large models with limited resources.

7) LIGHTSEQ
LightSeq [61] is a lightweight inference library that addresses the need for efficient and convenient deployment of Transformer models in online services. It utilizes a combination of GPU optimization techniques, including coarse-grained fused kernel functions, hierarchical auto-regressive search, and a dynamic GPU memory reuse strategy, to achieve significant performance gains compared to TensorFlow and FasterTransformer (FT). It supports a wide range of models and search algorithms, encompassing BERT, GPT, Transformer, and Variational Autoencoders (VAEs), while seamlessly integrating with popular models like BERT [8], RoBERTa [66], GPT, VAEs, MT Transformer, and Speech Transformer. The library is user-friendly, with a serving system and CUDA implementations, enabling easy deployment of popular models online without code modification. It addresses the deployment challenges of resource-intensive sequence models, narrowing the performance gap between large models and the demands of online services. In machine translation benchmarks, it consistently outperforms TensorFlow and FasterTransformer (FT), achieving up to 14× and 1.4× speedups, respectively.
8) EET
Easy and Efficient Transformer (EET) [62] offers a library designed to accelerate transformer inference. It encompasses a range of optimizations for transformer inference, spanning both algorithmic and implementation aspects. To address the inefficiencies of explicit matrix addition and masked attention, the study implements custom CUDA kernels. Also, to extend all kernels to support larger model sizes (up to 12288) and longer sequences (above 4096), the research proposes a new method called thread block folding. Furthermore, the study introduces a CUDA memory management mechanism aimed at minimizing memory usage for models of this size. EET was evaluated against Fairseq, LightSeq, and FasterTransformer (FT). On a 2080Ti and an A100, EET achieves speedups of 4.48-20.27× and 4.30-27.43×, respectively, compared to Fairseq. On a 2080Ti, EET outperforms LightSeq [61] with a speedup of 0.82-2.46× for model sizes of 768 and 1024. EET attains a speedup of 1.21-6.30× and 1.62-8.12× over FT v3.1 on a 2080Ti and A100, respectively, and a speedup of 1.40-4.20× over FT v4.0 on an A100.

9) SPLITWISE
Splitwise [63] separates the compute-intensive and memory-intensive phases of LLM inference onto different machines with specialized hardware and suitable interconnects. This approach enables the design of clusters optimized for throughput, cost, or power consumption. The model achieves up to 1.4× higher throughput at 20% lower cost, or 2.35× higher throughput with the same cost and power consumption. This approach improves LLM inference efficiency by leveraging hardware specialization, leading to more cost-effective and power-efficient deployments.

10) LLMCOMPASS
Zhang et al. [64] propose LLMCompass, a library that efficiently evaluates hardware designs for LLMs. LLMCompass considers various hardware options and identifies the optimal configuration for a specific task. The study also uses a cost model to recommend the most economical design. The library demonstrates high accuracy, with an average error of 10.4% for predicting task execution time and 4.1% for LLM tasks, compared to real hardware. Notably, the model can simulate running a massive LLM like GPT-3 175B on a powerful computer setup with 4× A100 GPUs in just 16 minutes. Leveraging LLMCompass, the study identified hardware designs that are more affordable than current options (e.g., using less powerful components or cheaper memory) while still offering good performance. These designs could make LLMs more accessible to a wider range of users.

11) POWERINFER
PowerInfer [65] is a high-performance inference engine designed to run LLMs efficiently on consumer-grade GPUs. It leverages the power-law distribution of neuron activation in LLMs, assigning frequently activated (hot) neurons to the GPU and input-specific (cold) neurons to the CPU. This hybrid approach significantly reduces the pressure on GPU memory and minimizes data transfers between CPU and GPU. Furthermore, PowerInfer incorporates adaptive predictors and neuron-aware sparse operators to optimize performance and maintain model accuracy. Evaluations demonstrate that PowerInfer on an NVIDIA RTX 4090 GPU achieves inference speeds up to 11.69× faster than systems like llama.cpp. It delivers an average token generation rate of 13.20 tokens per second, rivaling the performance of top-tier server-grade GPUs.
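The hot/cold neuron split can be sketched for a single feed-forward block as follows. This is a conceptual illustration under assumed sizes and activation statistics, not PowerInfer's implementation, which relies on learned activation predictors and neuron-aware sparse kernels; the point is that partitioning the neurons between GPU and CPU preserves the exact output of the dense layer.

import torch

# Assumed FFN dimensions and a stand-in for measured activation frequencies.
d_model, d_ff = 512, 2048
w_up = torch.randn(d_ff, d_model)      # rows = FFN neurons
w_down = torch.randn(d_model, d_ff)    # columns = FFN neurons
act_freq = torch.rand(d_ff)
hot_idx = torch.topk(act_freq, k=d_ff // 4).indices   # frequently active neurons
hot_set = set(hot_idx.tolist())
cold_idx = torch.tensor([i for i in range(d_ff) if i not in hot_set])

gpu = "cuda" if torch.cuda.is_available() else "cpu"
w_up_hot, w_down_hot = w_up[hot_idx].to(gpu), w_down[:, hot_idx].to(gpu)
w_up_cold, w_down_cold = w_up[cold_idx], w_down[:, cold_idx]   # stay on CPU

def ffn(x_cpu):
    # Hot neurons evaluated on the GPU, cold neurons on the CPU; the two
    # partial results sum to the same output as the full dense FFN.
    x_gpu = x_cpu.to(gpu)
    y_hot = torch.relu(x_gpu @ w_up_hot.T) @ w_down_hot.T
    y_cold = torch.relu(x_cpu @ w_up_cold.T) @ w_down_cold.T
    return y_hot.cpu() + y_cold

y = ffn(torch.randn(2, d_model))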
top-tier server-grade GPUs.
Key Findings: for deployment and serving applications. This section will
• Hardware specialization: Splitwise [63] proposes sepa- discuss the key challenges and findings associated with LLM
rating compute-intensive and memory-intensive phases deployment and serving.
onto different machines with specialized hardware. Challenges of LLM Deployment and Serving:
• Resource optimization: FlexGen [17] utilizes various • Memory limitations: Large LLMs can easily overwhelm
techniques like I/O scheduling, compression, and dis- the memory capacity of a single GPU. This limits their
tributed processing to efficiently use resources from deployment and serving for real-world applications.
CPU, GPU, and disk. • Scalability: Effectively handling multiple user requests
• Algorithmic optimizations: Libraries like EET [62] simultaneously with large LLMs requires efficient
and LightSeq [61] implement custom algorithms and scaling solutions.
memory management techniques to accelerate inference • Variability of input: LLM performance can suffer
on GPUs. when dealing with input sequences of varying lengths,
• Heterogeneous platforms: NLP-Fast [58] leverages requiring dynamic memory allocation strategies.
different hardware platforms (CPU, GPU, FPGA) by • Ease of deployment: Integrating complex LLM serving
identifying performance-critical operations and apply- systems into existing workflows can be challenging for
ing targeted optimizations. researchers and practitioners.
• Distributed inference: PETALS [60] facilitates collab- Key Findings:
orative inference and fine-tuning of LLMs across a • PagedAttention: This algorithm (introduced by vLLM
network, enabling scalability and efficient resource [67]) breaks down the KV cache into manageable
utilization. blocks, minimizing wasted memory and enabling effi-
• Efficiency gains: Several frameworks achieve signif- cient sharing across requests. This is a significant
icant performance improvements. DeepSpeed Infer- improvement for processing large LLMs.
ence [5] boasts throughput boosts of 1.5× and latency • Efficient GPU utilization: TurboTransformers [11] uti-
reductions of 7.3×. FlexGen demonstrates even greater lize techniques like parallel GPU kernels and dynamic
throughput gains, particularly on resource-constrained batch scheduling to optimize performance on GPUs.
devices. Other frameworks like NLP-Fast [58], Turbo- This translates to faster inference for transformer-based
Transformers [11], LightSeq [61], and EET [62] show models.
promising results in accelerating inference. • System-level optimizations: LightSeq2 [52] demon-
strates how system-level optimizations within the train-
C. LLM DEPLOYMENT AND SERVING LIBRARIES ing process can significantly improve training speed
As mentioned in section IV, some of the frameworks and efficiency for transformer models. This translates to
and libraries are utilized for multiple purposes. Besides faster deployment of LLMs in general.
vLLM [67] (Section IV-C1), the models used for deployment These findings from vLLM [67], TurboTransformers [11],
and serving purposes are mentioned in these sections and LightSeq2 [52] offer promising solutions for overcoming
LightSeq2 IV-A4, TurboTransformer IV-B4, PetS IV-B5. challenges in LLM deployment and serving. By focusing
on memory management, efficient GPU utilization, user-
1) VLLM friendly tools, and co-optimization.
vLLM [67] is a high performance system that efficiently
handles LLMs at a large scale. The model tackles the V. TRAINING OPTIMIZATION
memory limitations of existing LLM serving systems through Training optimization in LLMs involves improving the
a novel algorithm called PagedAttention (Section V-A1). efficiency and effectiveness of the training process. This
PagedAttention splits the KV cache into manageable blocks, encompasses a range of techniques and strategies aimed
minimizing wasted memory and enabling efficient sharing at improving factors such as convergence speed, model
across requests. vLLM is a distributed system that supports generalization, and resource utilization. The goal of training
popular LLMs and even models exceeding single GPU optimization is to achieve the desired model performance
memory. Evaluations present vLLM significantly improves with faster training times, reduced resource requirements,
throughput by 2-4× faster compared to existing systems, and improved overall training effectiveness. In this section,
especially for complex tasks involving long sequences, large we will focus on model optimization, size reduction,
models, and intricate decoding algorithms. This makes vLLM distributed training, and heterogeneous training (Fig. 5).
a significant advancement for efficient LLM processing,
enabling faster and more scalable LLM applications. A. MODEL OPTIMIZATION
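A block-table view of the KV cache clarifies how PagedAttention avoids reserving a maximum-length buffer per request. The sketch below is a toy illustration with assumed block and head sizes rather than vLLM's kernel-level implementation: memory is allocated one fixed-size block at a time, so at most one partially filled block per sequence is wasted, and block tables could additionally be shared across requests with a common prefix.

import torch

BLOCK_SIZE, NUM_BLOCKS, HEAD_DIM = 16, 64, 64
k_pool = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM)   # shared physical pool
v_pool = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM)
free_blocks = list(range(NUM_BLOCKS))

class Sequence:
    def __init__(self):
        self.block_table = []    # logical block index -> physical block id
        self.length = 0

    def append_kv(self, k, v):
        """Store one token's key/value, allocating a new block only when the
        last block is full."""
        slot = self.length % BLOCK_SIZE
        if slot == 0:
            self.block_table.append(free_blocks.pop())
        block = self.block_table[-1]
        k_pool[block, slot] = k
        v_pool[block, slot] = v
        self.length += 1

    def gather_kv(self):
        """Reassemble contiguous K/V tensors for the attention computation."""
        k = k_pool[self.block_table].reshape(-1, HEAD_DIM)[: self.length]
        v = v_pool[self.block_table].reshape(-1, HEAD_DIM)[: self.length]
        return k, v

seq = Sequence()
for _ in range(40):   # 40 tokens occupy 3 blocks instead of a pre-reserved max-length buffer
    seq.append_kv(torch.randn(HEAD_DIM), torch.randn(HEAD_DIM))
k, v = seq.gather_kv()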
2) LLM DEPLOYMENT AND SERVING LIBRARIES: CHALLENGES AND KEY FINDINGS
As explored in the previous sections (Sections IV-A6 and IV-B12), a variety of LLM frameworks exist that hold promise for deployment and serving applications. This section discusses the key challenges and findings associated with LLM deployment and serving.
Challenges of LLM Deployment and Serving:
• Memory limitations: Large LLMs can easily overwhelm the memory capacity of a single GPU. This limits their deployment and serving for real-world applications.
• Scalability: Effectively handling multiple user requests simultaneously with large LLMs requires efficient scaling solutions.
• Variability of input: LLM performance can suffer when dealing with input sequences of varying lengths, requiring dynamic memory allocation strategies.
• Ease of deployment: Integrating complex LLM serving systems into existing workflows can be challenging for researchers and practitioners.
Key Findings:
• PagedAttention: This algorithm (introduced by vLLM [67]) breaks down the KV cache into manageable blocks, minimizing wasted memory and enabling efficient sharing across requests. This is a significant improvement for processing large LLMs.
• Efficient GPU utilization: TurboTransformers [11] utilizes techniques like parallel GPU kernels and dynamic batch scheduling to optimize performance on GPUs. This translates to faster inference for transformer-based models.
• System-level optimizations: LightSeq2 [52] demonstrates how system-level optimizations within the training process can significantly improve training speed and efficiency for transformer models. This translates to faster deployment of LLMs in general.
These findings from vLLM [67], TurboTransformers [11], and LightSeq2 [52] offer promising solutions for overcoming challenges in LLM deployment and serving by focusing on memory management, efficient GPU utilization, user-friendly tools, and co-optimization.

V. TRAINING OPTIMIZATION
Training optimization in LLMs involves improving the efficiency and effectiveness of the training process. This encompasses a range of techniques and strategies aimed at improving factors such as convergence speed, model generalization, and resource utilization. The goal of training optimization is to achieve the desired model performance with faster training times, reduced resource requirements, and improved overall training effectiveness. In this section, we focus on model optimization, size reduction, distributed training, and heterogeneous training (Fig. 5).

A. MODEL OPTIMIZATION
Model optimization in LLMs refers to the process of improving the model's architecture, structure, or parameters to enhance its overall performance. We review various techniques aimed at achieving better accuracy, efficiency, or both. Common model optimization strategies for LLMs include algorithmic optimization (Section V-A1), layer-specific kernels (Section V-A2), model partitioning (Section V-A3), fine-tuning (Section V-A4), and scheduler optimization (Section V-A5).
It can use different accelerators, each of which supports re-materialization. GPipe partitions the model across the accelerators, with each accelerator responsible for a sequence of layers (called a cell).
Megatron-LM [19] introduces a new method for training LLMs, which enables the training of exceptionally large transformer models with billions of parameters on GPUs. Megatron-LM uses intra-layer model parallelism, a strategy that subdivides the model into smaller submodels capable of being trained separately.
4) FINE-TUNING
AlphaTuning [74] is a novel method specifically designed for large-scale pre-trained language models (PLMs). It combines the quantization of PLMs with fine-tuning: only a subset of quantized parameters is fine-tuned for the target task. This selective approach significantly decreases the overall memory footprint and the number of parameters to be trained. Despite these reductions, it maintains performance levels comparable to full fine-tuning across a diverse range of downstream tasks.
QFT [75] proposes a novel framework designed for memory-efficient fine-tuning of LLMs. The model utilizes quantization techniques to significantly reduce memory usage during fine-tuning while preserving model performance. The framework adopts the Lion optimizer, known for its memory efficiency and compatibility with quantization, and converts all model states into integers to minimize the memory footprint. The study also features a specialized gradient flow and parameter update scheme tailored for quantized weights. Extensive evaluations show the framework's effectiveness, allowing fine-tuning of large LLaMA-7B models with less than 30 GB of memory on a single A6000 GPU, a substantial reduction compared to standard methods, while maintaining similar performance across various benchmarks.
LOMO [56] is a novel technique for training LLMs on machines with limited GPU capacity. LOMO proposes a memory-efficient update method that greatly lowers memory consumption compared to traditional methods. This enables fine-tuning large models, such as those with 65 billion parameters, on consumer-grade GPUs like the RTX 3090. The study validates LOMO's efficiency through analyses of memory usage, performance testing, and benchmark task evaluations. Existing techniques like LOMO reduce memory usage but compromise performance.
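The memory saving in LOMO-style training comes from fusing the gradient computation with the parameter update, so gradients for the whole model never need to be stored at once. The sketch below is a simplified illustration of that idea using PyTorch's post-accumulate gradient hooks (available from PyTorch 2.1); the learning rate and model are assumptions, and LOMO itself adds gradient normalization and loss scaling on top of this mechanism.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
lr = 1e-3

def fused_sgd_update(param: torch.Tensor) -> None:
    with torch.no_grad():
        param.add_(param.grad, alpha=-lr)   # apply the update immediately
    param.grad = None                       # free the gradient buffer right away

for p in model.parameters():
    p.register_post_accumulate_grad_hook(fused_sgd_update)

x = torch.randn(8, 1024)
loss = model(x).pow(2).mean()
loss.backward()                             # updates happen during the backward pass
print(all(p.grad is None for p in model.parameters()))   # True: no stored gradients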
AdaLomo [57] offers a better solution. It incorporates a key feature of the powerful AdamW optimizer (an adaptive learning rate) but uses careful techniques to stay memory-friendly. This allows AdaLomo to match AdamW's performance on various tasks, making LLM training more accessible with less memory. On average, AdaLomo achieved scores of 30.8, 39.7, 51.0, and 56.9 on the LLaMA benchmark for models with 7B, 13B, 30B, and 65B parameters, respectively.
LoRA [33] is a method designed to adapt LLMs, such as GPT-3, for specific tasks, addressing the challenges of traditional fine-tuning. Instead of adjusting all pre-trained model weights, LoRA introduces trainable rank decomposition matrices into each layer of the Transformer architecture, significantly reducing the number of trainable parameters needed for downstream tasks. This approach reduces the number of trainable parameters by 10,000× and reduces GPU memory requirements by 3× compared to GPT-3 175B fine-tuned with Adam, while maintaining or improving model quality in benchmarks with RoBERTa, DeBERTa, GPT-2, and GPT-3. LoRA achieves higher training throughput with no added inference latency and facilitates efficient task switching by sharing the pre-trained model and only optimizing the small low-rank matrices, thereby reducing storage and hardware costs. It is versatile, can be combined with other methods, and is applicable to any neural network with dense layers. For GPT-3 175B, LoRA with 4.7M parameters achieves 73.4% accuracy on WikiSQL, 91.7% on MNLI-m, and ROUGE-1/2/L scores of 53.8/29.8/45.9 on SAMSum, demonstrating its superior performance and efficiency.
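The rank-decomposition idea can be written down directly: a frozen base projection is augmented with a trainable low-rank update B·A. The sketch below is a minimal illustration (the rank, scaling, and layer sizes are assumptions; see [33] for the full method), showing how the number of trainable parameters collapses from d_out·d_in to r·(d_in + d_out).

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA adapter around a frozen linear layer:
    y = W0 x + (alpha / r) * B (A x), where only A and B are trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # freeze pre-trained weights
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_a.T) @ self.lora_b.T

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 12,288 adapter parameters instead of 590,592 for the full layer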
5) SCHEDULER OPTIMIZATION
TurboTransformers [11] (Section IV-B4) introduces a novel sequence-length-aware batch scheduler that utilizes dynamic programming (DP) to optimize response throughput. This approach overcomes the limitations of traditional batch schedulers that struggle with varying input lengths, as the scheduler considers sequence length in its batching decisions. The scheduler's core algorithm operates in O(n²) time complexity, making it efficient for real-time applications.
a smaller model architecture requires less computation power AlphaTuning [74] is a compression-aware parameter-
and allows to boost up the gradient computations, as a efficient adaptation method for large-scale PLMs. It com-
result the rates of the improved model within the time bines the quantization of PLMs with fine-tuning, but only a
remain nearly unchanged. The study shared that doing subset of quantized parameters are fine-tuned for the target
modifications to the training methodology leverages scaling task. This significantly reduces the total memory footprint
law to bring about enhancements by increasing the effective and the number of trainable parameters, while still achieving
rate of gradient computations without sacrificing the model comparable performance to full fine-tuning on a variety of
size. Two model setups have been analyzed: one utilizing a downstream tasks. It relies on binary-coding quantization,
classical rtx2080ti GPU, and the other employing a modern a technique that decomposes full-precision parameters into
rtxa4000 or rtxa6000 GPUs. Each setup was configured binary parameters alongside a distinct set of scaling factors.
with 4 CPU cores and 32 GB of RAM. The paper proposes The model is evaluated across various PLMs and downstream
several modifications to the standard training pipeline to tasks, and achieves comparable performance to full fine-
make it possible to train a language model on a single GPU in tuning, even at low bitwidths. While it was applied to GPT-2
one day. As a result, each of these modifications has a direct and OPT, it achieved a compression ratio of over 10 times
impact on the model size reduction such as smaller model under 4-bit quantization and a reduction in the number of
architecture, shorter training schedule, lower learning rate, trainable parameters by over 1,000-fold, while still achieving
mixed precision training, and specialized training library. competitive performance on a variety of downstream tasks.
GPTQ [78] proposes a new highly accurate and highly
1) MODEL COMPRESSION AND QUANTIZATION efficient post-training quantization method based on approx-
FlexGen [17] (Sections IV-B, and V-A1) through a lin- imate second-order information which is called a new
ear programming-based search algorithm identifies opti- one-shot weight quantization. This model reaches a level
mal patterns for tensor storage and retrieval. Furthermore, that is considered acceptable to precisely quantize models to
it employs 4-bit quantization to compress weights and 3 or 4 bits per parameter, it requires a few hours at most to
attention cache without compromising accuracy, significantly run on a model that has hundreds of billions of parameters.
reducing model size and memory footprint. These opti- The model experimented on both OPT-175B and BLOOM-
mizations enable it to achieve impressive throughput gains 176B it took approximately 4 GPU hours by reducing the
compared to existing LLM inference systems. bitwidth down to 3 or 4 bits per weight, with minimal
SWARM parallelism [10], proposes a model for training loss of accuracy compared to the uncompressed baseline.
a large model with unreliable heterogeneous devices with Compared to previous one-shot quantization methods the
low network bandwidth by using dynamically generated, model achieves more than twice the compression without
randomized pipelines instead of static pipelines dynam- sacrificing accuracy. Also, within the method for first-time
ically instead of statically. The study incorporates 8-bit models with 175 billion parameters can execute inside a
compression to minimize model size and facilitate train- single GPU for generative inference. The results show that
ing on resource-constrained devices with limited network these enhancements can boost performance by up to 3.25×
bandwidth. This compression technique significantly reduces while using high-end GPUs (NVIDIA A100) over FP16 and
the amount of data that needs to be transferred between reach 4.5× and up to 4.5× while using more cost-effective
nodes during training, leading to improved efficiency and GPUs (NVIDIA A6000). This model can also achieve
throughput. robust accuracy results even using an extreme quantization
QMoE [77] is a compression and execution framework regime, while the weights are quantized to 2-bit or ternary
that reduces memory usage significantly. This is achieved quantization level.
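To make the storage format concrete, the sketch below shows plain group-wise, round-to-nearest 4-bit weight quantization with per-group scales. It is only a baseline illustration of the representation such methods target; GPTQ itself chooses the rounding using approximate second-order (Hessian-based) information, and the group size and helper names here are assumptions.

```python
# Minimal sketch of group-wise 4-bit weight-only quantization (round-to-nearest).
# This illustrates the storage format only; it is NOT GPTQ's error-compensating solver.
import numpy as np

def quantize_4bit(weights: np.ndarray, group_size: int = 128):
    """Quantize a weight matrix to signed 4-bit integers with one scale per group."""
    w = weights.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # symmetric int4 range [-7, 7]
    scale = np.where(scale == 0, 1.0, scale)             # guard against all-zero groups
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4096, 4096)).astype(np.float32)
q, scale = quantize_4bit(w)
w_hat = dequantize(q, scale, w.shape)
print("mean absolute quantization error:", float(np.abs(w - w_hat).mean()))
```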
FPTQ [79] also proposes a novel post-training quantization technique to address the challenge of deploying LLMs. The technique compresses LLMs into a format using 4-bit weights and 8-bit activations (W4A8) and achieves SOTA performance on popular LLMs such as BLOOM [80], LLaMA [14], and LLaMA-2 without requiring further fine-tuning. FPTQ offers a significant advantage by
optimizing both memory usage and computation efficiency during the inference stage without sacrificing accuracy. This simplifies the deployment process for LLMs and makes them more practical for real-world use. The method was validated on various datasets, including LAMBADA, MMLU, and a set of Common Sense QA tasks, and compared with an existing technique called LLM-QAT (LLM-Quantization-Aware Training). Limited data availability for LLM-QAT restricted the comparison to the Common Sense QA dataset, and the analysis was only possible for the 7B and 13B parameter LLaMA models; on this task, FPTQ came closer to the FP16 baseline than LLM-QAT and consistently performed better across all subsets of the dataset, with average scores of 73.38 and 76.81 for LLaMA-7B and 13B, respectively. These findings suggest that FPTQ is an effective approach for LLM quantization.

Norm Tweaking [54] introduces a novel quantization technique specifically for LLMs. While existing quantization methods like GPTQ [78] achieve acceptable 4-bit weight-only quantization, attempts at lower-bit quantization often lead to significant performance degradation. Norm Tweaking rectifies the quantized activation distribution to restore accuracy: the method generates calibration data and applies channel-wise distance constraints to the weights of normalization layers. Experiments show significant improvements in both weight-only quantization and joint quantization of weights and activations, achieving high accuracy even at 2-bit quantization. It offers a practical solution for reducing computational and storage costs in LLMs while maintaining performance.

FineQuant [81] introduces a weight-only quantization technique that significantly decreases memory usage and speeds up LLM inference with minimal quality loss. Key features include using pre-trained model weights without further fine-tuning, applying adaptive-granularity quantization to minimize accuracy loss, and an efficient GPU processing approach. Tested on large-scale models such as OPT-175B, FineQuant shows minimal accuracy loss, achieves up to 3.65× higher throughput with the same number of GPUs, and reduces resource demands.

PETALS [60] (Section IV-B6) is a collaborative platform for distributed inference and fine-tuning of LLMs over the internet. To enhance efficiency, quantization is employed to store more parameters per GPU, decreasing the number of consecutive devices and communication rounds required, and the weights are compressed to 8-bit precision, reducing the number of nodes needed to store all layers. Dynamic blockwise quantization makes data transfer between pipeline stages more efficient, and 8-bit mixed matrix decomposition for matrix multiplication allows the model to keep weights in 8-bit precision, significantly reducing the memory footprint compared to 16-bit weights.

QFT [75] (Section V-A4) addresses memory limitations during LLM fine-tuning by introducing a novel quantization framework. It converts all model states into integers to minimize the memory footprint and employs the Lion optimizer for its memory efficiency and compatibility with quantization. Additionally, the framework incorporates a specialized scheme for handling quantized weights during training.

QuantEase [82] is a framework for post-training quantization of LLMs that enhances their deployment efficiency. It addresses layer-wise quantization by optimizing each layer individually, using Coordinate Descent (CD) to reach high-quality solutions efficiently without complex matrix operations. The framework includes an outlier-aware variant that keeps crucial ‘‘outlier’’ weights in full precision to enhance accuracy. Demonstrating SOTA performance, QuantEase significantly improves perplexity and zero-shot accuracy compared to existing methods like GPTQ [78], with up to 15% relative improvement. Efficient linear-algebra optimizations allow quantization of large models such as Falcon-180B on a single GPU in under 3 hours, and the outlier-aware variant supports near- or sub-3-bit quantization with minimal accuracy loss, outperforming methods like SpQR by up to two times in perplexity reduction.

LLM-Pruner [83] (Section V-B2) compresses LLMs by removing non-essential parts based on gradient information while preserving their functionality. This significantly reduces model size with minimal accuracy loss, achieved through fine-tuning with a small amount of data.

2) PRUNING
SparseGPT [70] develops an efficient and precise post-training pruning technique for significantly reducing the size of large-scale GPT-family models. The method achieves at least 50% sparsity in a single step, without retraining, and can process the largest open-source models, such as OPT-175B and BLOOM-176B, in less than 4.5 hours. It reaches 50-60% unstructured sparsity with a negligible increase in perplexity, removing more than 100 billion weights with minimal impact on accuracy. The study demonstrates that the parameterization of massive GPT models enables pruning without relying on gradient information, and highlights that sparse models with accuracy comparable to the dense models can be found within the ‘‘close neighborhood’’ of the dense models. The findings also show that larger models are easier to prune: for a fixed sparsity level, the accuracy drop of larger sparse models is smaller, to the point where 50% sparsity causes practically no accuracy decrease on the largest models.
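The sketch below illustrates the kind of one-shot unstructured sparsification this line of work targets, using plain magnitude pruning as a stand-in. SparseGPT itself selects and updates the remaining weights with a layer-wise reconstruction solver rather than a simple magnitude threshold, so this is a simplified baseline under stated assumptions, not its algorithm.

```python
# Minimal sketch of one-shot unstructured pruning to a target sparsity via magnitude.
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the smallest-magnitude entries so that roughly `sparsity` of them are removed."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values   # k-th smallest magnitude
    mask = weight.abs() > threshold
    return weight * mask

w = torch.randn(4096, 4096)
w_sparse = magnitude_prune(w, sparsity=0.5)
print("achieved sparsity:", (w_sparse == 0).float().mean().item())
```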
Sheared LLAMA [72] is used to reduce the size of the LLaMA2-7B model to 1.3B and 2.7B parameters, and the resulting models performed better than other open-source models of the same size on a variety of downstream and instruction-tuning evaluations. LLM-shearing also requires only about 3% of the compute needed to train the same models from scratch. One of the main steps of Sheared LLAMA is a novel pruning algorithm, an extended version of CoFiPruning, that can prune a source model to any specified target architecture based on the desired model size and performance requirements. Pre-trained models are typically well optimized to balance expressivity and inference efficiency, so their configurations are used as the target architectures.

LLM-Pruner [83] introduces a framework for compressing LLMs in a task-agnostic way while minimizing the need for the original training corpus. The framework uses structural pruning to remove non-critical parts of the model based on gradient information. The pruned models' performance is recovered using LoRA [33] tuning, which takes just 3 hours and 50K data samples. Experiments on LLaMA [14], Vicuna [84], and ChatGLM [85] show that the compressed models maintain 94.97% of their original performance even after removing 20% of the parameters. However, higher pruning rates lead to significant performance drops and incoherent sentence generation.

3) HYPERPARAMETER OPTIMIZATION
Selecting the right hyperparameters is essential for developing effective LLMs, as these parameters significantly influence the model's convergence speed, generalization ability, and overall performance across language tasks. Although it is often an iterative and computationally demanding process, hyperparameter optimization is crucial for achieving optimal model performance. Cramming [16], for example, employs a lower learning rate to stabilize the training process and prevent overfitting, enabling effective model training within limited computational resources.
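Because exhaustive tuning at LLM scale is rarely affordable, such searches are usually run as small randomized sweeps on a proxy budget. The sketch below shows a generic random search over learning rate and batch size; the search space, trial budget, and the placeholder train_and_evaluate function are illustrative assumptions, not a procedure prescribed by the reviewed papers.

```python
# Generic random-search sketch for hyperparameter optimization on a small proxy budget.
# `train_and_evaluate` is a hypothetical stand-in that would normally launch a short training run.
import math
import random

def train_and_evaluate(lr: float, batch_size: int) -> float:
    # Placeholder objective returning a synthetic "validation loss".
    return abs(math.log10(lr) + 3.5) + abs(batch_size - 64) / 256

def random_search(trials: int = 20, seed: int = 0):
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        lr = 10 ** rng.uniform(-5, -2)                  # log-uniform learning rate
        batch_size = rng.choice([16, 32, 64, 128, 256])
        score = train_and_evaluate(lr, batch_size)
        if best is None or score < best[0]:
            best = (score, lr, batch_size)
    return best

print(random_search())
```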
C. DISTRIBUTED TRAINING
Distributed training refers to the process of training LLMs across multiple computing devices or processing units. This approach harnesses parallelism to spread the computational burden, enabling faster training of large models with millions or even billions of parameters, and is crucial for managing the massive datasets and computational demands associated with cutting-edge LLMs.

1) DATA PARALLELISM
Data parallelism is a parallel training technique that replicates the entire model across multiple GPUs or devices and distributes the training data among them. Each device handles a portion of the data, performs forward and backward propagation, and computes gradients independently. These gradients are then aggregated across all devices to update the global model parameters. It is a fundamental and widely used technique for improving the training throughput of deep learning models; its simplicity, scalability, and effectiveness make it a valuable tool for researchers and practitioners in the field of machine learning [15], [24], [69], [80].
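The sketch below simulates this cycle in a single process with explicit gradient averaging. Real systems rely on collective communication (for example, an all-reduce) and wrappers such as PyTorch's DistributedDataParallel, so the replica list and helper function here are purely illustrative.

```python
# Minimal single-process sketch of the data-parallel update cycle described above.
import copy
import torch
import torch.nn as nn

def data_parallel_step(replicas, shards, lr=1e-3):
    """Each replica processes its own data shard; gradients are then averaged and applied."""
    for model, (x, y) in zip(replicas, shards):
        model.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
    # stand-in for an all-reduce: average corresponding gradients, apply the same SGD step
    for params in zip(*(m.parameters() for m in replicas)):
        avg_grad = torch.stack([p.grad for p in params]).mean(dim=0)
        for p in params:
            with torch.no_grad():
                p -= lr * avg_grad

base = nn.Linear(16, 1)
replicas = [copy.deepcopy(base) for _ in range(4)]            # one replica per "device"
shards = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(4)]
data_parallel_step(replicas, shards)
```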
2) MODEL PARALLELISM
Model parallelism can be classified into two groups: tensor parallelism (Section V-C2a) and pipeline parallelism (Section V-C2b).

a: TENSOR PARALLELISM
Tensor parallelism involves partitioning a tensor across an array of devices, necessitating a distributed matrix-matrix multiplication algorithm for the mathematical computations. Using tensor parallelism reduces the response time for individual queries [15], [17]. Megatron-LM introduced 1D tensor parallelism (Section IV-A3), which partitions the linear layers within the Transformer architecture along either the row or the column dimension; within Megatron-LM, tensors are split along a single dimension [15], [19].
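A minimal illustration of the column-wise (1D) split is sketched below with NumPy: the weight matrix is sliced along its output dimension, each slice is multiplied independently, and the partial results are concatenated, which corresponds to the all-gather step in a real multi-GPU setting. This is a single-process simulation, not Megatron-LM code; row-wise splitting followed by a summing all-reduce is the complementary pattern.

```python
# Sketch of 1D (column-wise) tensor parallelism for a linear layer, simulated in one process.
import numpy as np

def column_parallel_linear(x, full_weight, world_size=4):
    shards = np.split(full_weight, world_size, axis=1)       # one output slice per device
    partial_outputs = [x @ w_shard for w_shard in shards]    # computed independently
    return np.concatenate(partial_outputs, axis=-1)          # all-gather of the results

x = np.random.randn(8, 1024)
W = np.random.randn(1024, 4096)
assert np.allclose(column_parallel_linear(x, W), x @ W)      # same result as the unsplit layer
```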
b: PIPELINE PARALLELISM
FlexGen [17] (Section IV-B2) utilizes pipeline parallelism to distribute an l-layer LLM evenly across m GPUs, enabling parallel execution of all layers. Each GPU executes the same sequence of operations, essentially reducing the problem to running an (l/m)-layer transformer on a single GPU, which allows FlexGen to reuse the policy search algorithm developed for the single-GPU case. To implement micro-batch pipelining, a new repetition statement (for-loop) is added to the algorithm, effectively merging the iteration-level pipeline-parallel execution schedule with the single-device offloading runtime.

PETALS [60] (Section IV-B6) utilizes pipeline parallelism to efficiently distribute the computation of LLMs among multiple servers. Servers are organized into a chain, with each server responsible for executing a portion of the model pipeline. This enables efficient parallel processing, improving the overall performance of inference and fine-tuning tasks.

GPipe [3] (Section IV-A1) employs a novel pipeline-parallelism algorithm based on batch splitting, where mini-batch examples are divided into smaller micro-batches and executed sequentially across cells during training. The model uses synchronous mini-batch gradient descent, accumulating gradients from all micro-batches and applying them at the end of the mini-batch. Its efficiency is demonstrated through the successful training of large-scale models, including a convolutional model (AmoebaNet) for image classification and a transformer model for machine translation, and it shows flexibility across various deep network architectures, achieving superior results and consistent training performance on diverse devices.
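The sketch below shows the batch-splitting idea in PyTorch: a mini-batch is chunked into micro-batches that pass through the pipeline stages in turn, gradients are accumulated, and a single synchronous update is applied at the end. Stage placement on separate accelerators and the actual pipelining schedule are omitted, so this is a conceptual sketch rather than GPipe's implementation.

```python
# Conceptual sketch of GPipe-style batch splitting with gradient accumulation.
import torch
import torch.nn as nn

stages = nn.ModuleList([nn.Linear(32, 32) for _ in range(4)])   # 4 pipeline stages
opt = torch.optim.SGD(stages.parameters(), lr=1e-2)

def train_step(minibatch: torch.Tensor, targets: torch.Tensor, num_micro: int = 4):
    opt.zero_grad()
    for x, y in zip(minibatch.chunk(num_micro), targets.chunk(num_micro)):
        h = x
        for stage in stages:              # each stage would live on its own accelerator
            h = torch.relu(stage(h))
        (nn.functional.mse_loss(h, y) / num_micro).backward()   # accumulate gradients
    opt.step()                            # one synchronous update per mini-batch

train_step(torch.randn(64, 32), torch.randn(64, 32))
```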
DFX [86] is a low-latency multi-FPGA appliance for accelerating transformer-based text generation. It uses model parallelism to split the transformer model across multiple
FPGAs. This allows each FPGA to process a different part of the model in parallel, thereby accelerating the overall text generation process. It also interconnects the FPGAs with an efficient ring-topology network to minimize communication overhead. The system was built with four Xilinx Alveo U280 FPGAs and evaluated on the GPT-2 language model, demonstrating a 5.58× acceleration in speed and a 3.99× improvement in energy efficiency compared to four NVIDIA V100 GPUs. In addition to its performance and energy-efficiency benefits, the solution is more cost-effective than GPU-based alternatives, offering an 8.21× cost advantage over a GPU appliance delivering similar performance.

3) COMBINED PARALLELISM
Narayanan et al. [69] proposed a technique called PTD-P for training LLMs on GPU clusters. PTD-P combines pipeline parallelism, tensor parallelism, and data parallelism to achieve high computational performance and graceful scaling. Data parallelism divides the training data into smaller batches, which are processed in parallel on all the GPU servers, allowing PTD-P to train faster by exploiting the parallel computing capabilities of the cluster. GPipe [3] and ZeRO [18] (Sections IV-A1 and V-C4, respectively) are other examples of combined parallelism.

4) ZERO
ZeRO [18] proposes solutions to overcome the limitations of existing methods and to train large models efficiently. With existing systems, memory consumption can be classified into two main parts: model states and residual states. When working with large models, most of the memory is used by model states (such as the momentum and variance in Adam, gradients, and parameters), while the rest is occupied by residual states (such as activations, temporary buffers, and unusable fragmented memory). As models grow from billions to trillions of parameters, optimizing both model-state and residual-state memory becomes crucial for training them efficiently. The study introduces a novel memory optimization technique aimed at substantially improving training speed while scaling the model size in proportion to the number of devices at high efficiency; by carefully evaluating communication volume and memory capacity requirements, the approach can scale to over 1 trillion parameters on current hardware. For optimizing model-state memory, which occupies most of the memory during training, the study introduces ZeRO-DP, ZeRO-powered data parallelism, which has three main optimization stages: in the first stage, only the optimizer states are partitioned; in the second stage, both optimizer states and gradients are partitioned; and in the final stage, all three model states are partitioned. This results in a significant boost in memory efficiency. The remaining memory consumed by residual states can become a secondary bottleneck, which the study addresses with ZeRO-R through three techniques. First, activation partitioning optimizes activation memory by identifying and removing activation replication in existing model parallelism (MP) and, when appropriate, offloading activations to the CPU. Second, appropriately sized temporary buffers strike a balance between memory and computation efficiency. Finally, because tensors have varying lifetimes, memory becomes fragmented during training, and the resulting lack of contiguous memory can cause allocation failures even when sufficient free memory is available; ZeRO-R therefore manages memory proactively based on the distinct lifetimes of tensors, preventing fragmentation. Remarkably, the system achieves a throughput of 15 petaflops when training models with over 100 billion parameters, demonstrating super-linear speedup on 400 GPUs, an 8× increase in model size, and a 10× increase in performance compared to recent SOTA systems.
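A conceptual sketch of the stage-1 idea (optimizer-state partitioning) is given below: each data-parallel rank keeps the full parameters for the forward and backward passes but stores Adam's momentum and variance only for its own shard, so optimizer-state memory shrinks roughly by the number of ranks. The class, the omission of bias correction, and the stand-ins for reduce-scatter and all-gather are simplifying assumptions, not DeepSpeed's implementation.

```python
# Conceptual sketch of ZeRO stage-1 style optimizer-state partitioning (bias correction omitted).
import numpy as np

class ShardedAdamState:
    def __init__(self, num_params: int, rank: int, world_size: int):
        shard = num_params // world_size                 # contiguous shard owned by this rank
        self.lo, self.hi = rank * shard, (rank + 1) * shard
        self.m = np.zeros(self.hi - self.lo)             # momentum only for the local shard
        self.v = np.zeros(self.hi - self.lo)             # variance only for the local shard

    def step(self, params, grads, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        g = grads[self.lo:self.hi]                       # in practice: output of a reduce-scatter
        self.m = b1 * self.m + (1 - b1) * g
        self.v = b2 * self.v + (1 - b2) * g * g
        params[self.lo:self.hi] -= lr * self.m / (np.sqrt(self.v) + eps)
        return params                                    # in practice: all-gather the updated shard

params = np.random.randn(1024)
state = ShardedAdamState(num_params=1024, rank=0, world_size=4)
params = state.step(params, grads=np.random.randn(1024))
```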
5) SEQUENCE PARALLELISM
Sequence parallelism [15], [87] is a novel approach proposed to efficiently train Transformers with longer sequences on GPUs. It addresses the quadratic memory requirements of self-attention in Transformer models: unlike traditional methods, it does not require a single device to handle the entire sequence. By splitting sequences into chunks and distributing them across devices, it achieves effective training with arbitrarily long sequences. It introduces Ring Self-Attention to support this process, demonstrating superior performance in batch size and sequence length compared to tensor parallelism and handling sequences over 27× longer than existing methods.

6) AUTOMATIC PARALLELISM
Automatic selection of parallelization strategies is among the latest advances in parallel training, as demonstrated by FlexFlow [71] and Alpa [24], [88]. Alpa is an automated system that generates execution plans for distributed model-parallel training; its architecture can automatically derive efficient parallel execution plans at each parallelism level. Unlike specialized systems, it can handle models with heterogeneous architectures and models without manually designed plans. However, it is not hardware-aware, does not consider network topology, and does not search over activation checkpointing, which can lead to suboptimal results. Alpa has been evaluated on training large models with billions of parameters, and its performance has been compared with SOTA systems such as Megatron-LM [19] and DeepSpeed [5] on an Amazon EC2 cluster with 64 GPUs. It delivers training performance similar to that of Megatron-LM on GPT models and
outperforms DeepSpeed on GShared MoE models with up [63] (section IV-B9) improves LLM inference by separating
to 9.7× speedup. Moreover, it generalized well to models workload onto different machines for high throughput, cost,
without manual strategies and demonstrated 80% liner or power efficiency. It allows for building both homogeneous
scaling efficiency on Wide-ResNet. The results presented that and heterogeneous clusters depending on the optimization
Alpa’s performance in training large models efficiently and goal.
its ability to generalize to diverse models.
E. TRAINING OPTIMIZATION: CHALLENGES AND KEY
D. HETEROGENEOUS TRAINING FINDINGS
ZeRO-Offload [20], aims to democratize large-scale model In the previous sections we have offered a comprehensive
training, making it accessible to a wider audience. It achieves overview of training optimization (Section V) which includes
this by using a single GPU to train models with over model optimization (Section V-A), size reduction optimiza-
13 billion parameters, eliminating the need for data scientists tion (Section V-B), distributed training (Section V-C), and
to modify the models or sacrifice computational efficiency. heterogeneous training (Section V-D). In this section and the
The study introduces ZeRO-Offload, a novel heterogeneous following paragraphs, we will discuss training optimization’s
deep learning (DL) training technology. The model leverages challenges and key findings.
both CPU memory and compute for offloading and offers an Challenges of Model Optimization:
efficient scaling path on multiple GPUs through collaboration • Resource constraints: LMs demand significant memory
with ZeRO-powered data parallelism [18]. Through first- and computational power, limiting training and deploy-
principle analysis, the study asserts that the model provides ment on single devices.
an optimal solution, maximizing memory savings while • Balancing efficiency and accuracy: Optimizing LLMs
minimizing communication and CPU compute overhead for requires finding a balance between efficient resource
large model training. utilization and maintaining model performance.
ZeRO-Infinity [1] introduces an innovative system technol- • Memory bottlenecks: Distributing LMs across devices
ogy that enables the model scaling on constrained resources. introduces memory limitations on each device.
It achieves this without the need for extensive model code • Communication overhead: Data exchange between
modifications by harnessing the power of GPU, CPU, and devices during training can become a bottleneck,
NVMe memory. The model made up of five innovative slowing down the process.
technologies: 1) infinity offload engine, this technique • Hardware heterogeneity: Efficiently utilizing devices
uses simultaneous exploitation of GPU, CPU, and NVMe with varying memory capacities and processing speeds
memory, as well as GPU and CPU compute to fully leverage in a distributed setting is challenging.
heterogeneous architecture on modern clusters, 2) memory- • Scalability limitation: Traditional methods might not
centric tiling, handle extensive operators without necessity scale well with increasing device numbers due to
of model parallelism, 3) bandwidth-centric partitioning, memory and communication constraints.
is employed to make the most of the aggregate memory Key Findings:
bandwidth across all parallel devices, 4) overlap-centric • Algorithmic: Techniques like FlexGen [17], Light-
design, is implemented to enable the simultaneous execution Seq [61], and NLP-Fast [58] improve efficiency by
of compute and communication tasks, 5) ease-inspired optimizing computations, memory access, and utilizing
implementation, to prevent the need for extensive model specialized hardware kernels.
code refactoring. SWARM Parallelism [10] (section V-B1) • Model partitioning: Techniques like GPipe [3] and
introduced a model aimed at training large models efficiently, Megatron-LM [19] partition models for efficient pro-
particularly on unreliable heterogeneous devices with limited cessing across multiple devices.
network bandwidth. Instead of employing static pipelines, • Fine-tuning for efficiency: Techniques like AlphaTun-
the model utilizes dynamically generated and randomized ing [74] and LoRA [33] enable fine-tuning large
pipelines to adapt to varying conditions. This allows each models on limited memory by reducing the number of
device to share its results with any other device that is parameters requiring adjustment.
responsible for the next stage of the pipeline. This enables • Scheduler optimization: Techniques like TurboTrans-
devices with high performance to process inputs from formers [11] improve response throughput and task
multiple predecessors, distribute their results across multiple execution on GPUs.
weaker peers, and rebalance the workload in case of failure • Size reduction optimization: This approach focuses
to improve utilization. on reducing model complexity through techniques
NLP-Fast [58] (Section IV-B3) is a system designed like quantization (reducing storage bits) and pruning
to enhance the performance of large-scale heterogeneous (removing non-essential parts).
NLP models by pinpointing the most resource-intensive • Parallelism strategies: 1) Data parallelism: Distributes
operations and employing a combination of techniques: holis- training data across devices for faster training. 2) Model
tic model partitioning, cross-operation zero skipping, and parallelism: Splits the model across devices for parallel
model/config adaptive hardware reconfiguration. Splitwise computations (tensor, pipeline, sequence parallelism).
1) MEMORY MANAGEMENT
3) Combined parallelism: Combines data and model TurboTransformers [11] (section IV-B4), proposed a
parallelism for even faster training (PTD-P, ZeRO [18], sequence length aware algorithm for memory allocation
GPipe [3]). to efficiently balance memory allocation and deallocation,
• Memory optimization: ZeRO [18] optimizes memory this algorithm overcomes the problem of variability of
for trillions of parameters, Activation Partitioning input sentence. LightSeq2 [52] introduces an innovative
deals with activation memory efficiently, and ZeRO- memory management approach, specifically designed for
Offload [20] and ZeRO-Infinity [1], which allow the Transformer structure. This strategy efficiently reduces
training on single GPUs or limited resources by utilizing peak memory consumption and minimizes the need for
CPU and NVMe memory. frequent allocation and release calls. Notably, LightSeq2
• Heterogeneous optimization: SWARM Parallelism [10] stands out as the pioneer in accelerating the entire training
tackles unreliable devices with limited bandwidth by process of Transformers. In real-time applications where
adapting workloads, NLP-Fast [58] optimizes execution response time is crucial, model parallelism and pipeline
on mixed platforms by pinpointing resource-heavy parallelism can introduce significant delays due to the extra
operations, and Splitwise [63] distributes work across communication overhead caused by splitting tensors or
heterogeneous machines considering different goals like layers, even with technologies like NVLink and GPUDirect.
throughput, cost, and power consumption. EET [62] (section IV-B8) focuses on minimizing memory
• Automatic parallelism: Alpa [88] automatically gen- usage for loading large models in online services. The
erates execution plans for distributed model parallel proposed solution involves dynamic memory management,
training, applicable to diverse models. specifically targeting the reduction of memory consumption
Overcoming these challenges and leveraging these tech- for activation caches and operation result buffers, as weights
niques, model training can be made more efficient, scalable, and certain pre-allocated caches are inherently difficult
and accessible, paving the way for even more powerful and to compress. They introduce a dynamic CUDA memory
versatile LLMs. management mechanism specifically designed to reduce
CUDA memory usage for the same model size, unlike the
VI. HARDWARE OPTIMIZATION
manual memory allocation required by FT.
Hardware optimization is a systematic approach to improving
the performance, efficiency, and functionality of computer
hardware. By identifying and addressing bottlenecks in hard- B. HARDWARE-AWARE OPTIMIZATION
ware architecture [18], software, and the operating system, Hardware-aware optimization (HAO) is the process of
hardware optimization can enhance overall speed, reduce optimizing the hardware utilization of deep learning models
power consumption, and improve the reliability of hardware to achieve maximum performance on specific hardware
components (Fig. 6). Splitwise [63] (Section IV-B9) is a platforms [91]. In this section, we will explain offloading and
technique to optimize hardware utilization by separating mixed precision optimization.
the prompt computation and token generation phases onto
different machines. This approach allows designing clusters 1) OFFLOADING
optimized for cost, throughput, and power consumption. The FlexGen [17] (SectionIV-B2) presents an offloading
model achieves up to 1.4× higher throughput at 20% lower framework for LLMs that optimizes I/O efficiency and
cost or 2.35× higher throughput with the same cost and throughput by considering computation schedule, tensor
power. placement, and computation delegation. It utilizes a linear
programming-based search algorithm and unifies the place-
A. MEMORY OPTIMIZATION ment of weights, activations, and the KV cache, enabling
In the process of training deep learning models, memory significantly larger batch sizes compared to existing methods.
usage is primarily attributed to various factors, including ZeRO-Offload [20] model facilitates the training of
model parameters, layer activations, gradients, and optimizer large model heterogeneous on GPU + CPU systems,
states, such as momentum and variances in the Adam enabling the handling of models up to 10× larger on
algorithm [15], [18]. The terms ‘‘model states’’ [18] or a single GPU without sacrificing efficiency by using a
‘‘model data’’ [15] encompass model parameters, gradients, unique optimal offload strategy. Also, the design achieves
a highly scalable multi-GPU configuration by integrating a system copies each piece of gradients, parameters to/from
the offload strategy with ZeRO-powered data parallelism, its FP32 counterpart in one training step, ensuring the
enabling ZeRO-Offload to achieve nearly linear scalability, accurate update of FP32 parameters with the loaded FP32
and smooth integration with model-parallel training. This gradient by the trainer kernel.
combination allows for the training of even larger models FP8-LM [90] introduces a novel FP8 automatic mixed-
than using ZeRO-Offload or model parallelism indepen- precision framework for training LLMs, optimizing
dently. Moreover, the model enhances CPU performance by mixed-precision and distributed parallel training through
introducing a high-performance Adam optimizer, achieving a three levels of FP8 utilization. By gradually incorporating
6× improvement over SOTA Adam implementations. It also 8-bit gradients, optimizer states, and distributed learning,
employs a one-step delayed parameter update strategy to the framework significantly enhances training efficiency.
overlap GPU forward and backward passes with CPU param- During the training of a GPT-175B model on the H100 GPU
eter updates. Additionally, the model’s size has increased platform, the FP8 framework reduced memory usage by 39%
by a factor of 10 compared to widely used frameworks and increased training speed by 75% compared to the BF16
such as PyTorch. To maintain computational efficiency, the framework, outperforming Nvidia’s Transformer Engine by
model minimizes data traffic to and from the GPU, increases 37%. This advancement leads to substantial cost reductions
GPU memory utilization, and allows offloading data and for training large models and is adaptable to various tasks
computation to the CPU. On a single NVIDIA V100 GPU, the such as instruction tuning and reinforcement learning with
model can achieve 40 TFlops/GPU for 10 billion parameters, human feedback.
and it can scale up to 128 GPUs when available. The model
also supports model parallelism, enabling training models C. HARDWARE OPTIMIZATION: CHALLENGES AND KEY
with more than 70 billion parameters on a single DGX-2 FINDINGS
box, resulting in a 4.5× increase in model size compared to Challenges of Hardware Optimization:
employing model parallelism alone. • Memory limitation: Deep learning models can require
Eliseev and Mazur [89] propose a model to efficiently run vast amounts of memory to store parameters, activations,
large sparse MoE language models on hardware with limited and gradients. This limits the size and complexity of
GPU memory. Using parameter offloading and leveraging models that can be trained on a single device.
the properties of MoE models enabled Mixtral-8 × 7B • Limited hardware utilization: Traditional training meth-
with mixed quantization to operate on desktop hardware ods may not fully utilize the capabilities of modern
and free-tier Google Colab instances. The study showed hardware like GPUs.
that some experts are reused between adjacent tokens, and • Balancing speed and accuracy: Techniques like mixed
early layers can predict subsequent experts. This led to an precision training aim to improve training speed by
MoE-specific offloading strategy employing an LRU (Least reducing memory usage, but this can potentially com-
Recently Used) cache and advanced prediction of needed promise model accuracy.
experts. The model significantly improves speed, achieving Key Findings:
2-3 tokens per second on various consumer GPUs, and offers • Memory management: Techniques like sequence length
a practical solution for running large MoE models on limited aware allocation and dynamic memory management can
hardware. significantly reduce memory usage during training.
• Hardware-aware optimization: Offloading computations
2) MIXED PRECISION to CPUs or leveraging mixed precision training can
Mixed precision training [92] proposes a method for training improve hardware utilization and training speed.
deep neural networks using half-precision floating-point • Model parallelism: Splitting models across multiple
numbers, aiming to reduce memory requirements by almost devices can handle larger models but can introduce
half and accelerate arithmetic on modern GPUs without communication overhead, impacting training speed.
compromising model accuracy or requiring adjustments to • Large model training: Frameworks like ZeRO-Offload
hyperparameters. [20] enable training models significantly larger than
Cramming [16] conducts all experiments and ablation what a single GPU can handle.
studies using a consistent setup that employs automated In the domain of hardware optimization, a continuous
mixed precision for both standard 16-bit and 32-bit floating- stream of novel methodologies is emerging, demonstrably
point precision. expanding the frontiers of feasibility within the training
LightSeq2 [52] (section IV-A4) optimizes the training pro- paradigm.
cess by implementing batched updates on reduced-precision
parameters instead of numerous individual updates on full- VII. SCALABILITY AND RELIABILITY OPTIMIZATION
precision parameters. In mixed precision training, where Scalability optimization focuses on improving hardware sys-
parameters and gradients are in FP16 during forward and tems’ capacity to flexibly handle varying workloads, enabling
backward propagation, maintaining FP32 copies is necessary smooth scaling adjustments to meet evolving demands [1],
for accuracy during the update values calculation. Typically, [5], [18], [19], [20], [69], and reliability optimization aims
TABLE 8. Summary on reviewed papers excluding those already covered in Tables 3 and 4 or the main text.
limited hardware. The initial challenge was the high com- • Accuracy maintenance: The pruned models exhibited
putational and memory requirements that exceeded the negligible increases in perplexity and retained perfor-
capabilities of available resources, making it difficult to mance levels very similar to their dense counterparts.
efficiently train the model within a reasonable timeframe and • Scalability: The study revealed that larger models are
budget. easier to prune, with practically no accuracy decrease
Optimization Strategy: The primary optimization strategies observed at 50% sparsity.
involved in SparseGPT are:
This case study demonstrates the efficacy of SparseGPT’s
One-Shot Pruning: To achieve significant sparsity in
one-shot pruning approach for reducing the size of mas-
the LLM in a single pruning step, eliminating the need sive language models. By leveraging unstructured sparsity
for iterative pruning and retraining. One-Shot Pruning: and parametrization strategies without gradient dependence,
SparseGPT implements its pruning strategy through a
SparseGPT achieves substantial reductions in model size
streamlined process. First, a thorough model analysis is
and resource requirements while maintaining high levels
conducted to pinpoint parameters that can be removed
of performance. This approach enables more efficient and
without significant impact. This analysis leverages pruning accessible deployment of large language models in various
criteria that assess parameter importance without requiring applications, making them more practical for real-world use.
gradient calculations, saving on computational resources.
Finally, SparseGPT employs a single step pruning approach,
achieving substantial sparsity (at least 50% for massive B. ENHANCING INFERENCE EFFICIENCY WITH QMOE
models) in a single step. This one-shot approach significantly Background: LLMs with trillions of parameters are becoming
reduces the time and complexity compared to iterative increasingly common. However, training and deploying these
pruning methods. models is challenging due to their immense computational
Unstructured Sparsity: To reduce the number of parame- and memory demands. Existing compression techniques
ters while maintaining model accuracy through unstructured struggle to handle such large models effectively. QMoE [77]
pruning, where individual weights are removed based on framework addresses this challenge by introducing novel
their importance. This approach focuses on eliminating compression methods to make these models more practical
individual weights within the model that are deemed less for real-world use.
important. By analyzing the model’s internal structure, Strategy Selection: QMoE was chosen as the optimization
SparseGPT achieves impressive sparsity levels of 50-60%, strategy. This approach allows for the compression of large
significantly reducing model size. This aggressive pruning models by quantizing their parameters to extremely low
strategy is remarkable because it achieves this with minimal precision, which drastically reduces the model size while
impact on the model’s ability to perform language modeling maintaining its performance. This strategy is particularly
tasks accurately. For instance, SparseGPT can remove over useful for handling the large parameter counts typical of MoE
100 billion weights from massive models like OPT-175B and models.
BLOOM-176B without compromising their performance on Optimization Strategy: The core optimization strategies
language modeling tasks. involved in QMoE are:
Parametrization Without Gradient Dependence: To lever- Scalable Compression Algorithm: QMoE tackles the
age the parametrization of massive GPT models to enable challenge of massive model sizes with a scalable com-
pruning without relying on gradient information. This method pression algorithm. This innovative technique achieves
allows the identification of sparse counterparts within a impressive sub-1-bit compression for trillion-parameter MoE
close range of the original dense model, ensuring these models, without requiring retraining. In the case of the
sparse models maintain similar performance. Interestingly, SwitchTransformer-c2048 model, this translates to a dramatic
the strategy highlights that larger models are even easier size reduction from 3.2 TB to a mere 160 GB (roughly 0.8 bits
to prune using this approach. They experience minimal per parameter). Remarkably, this is achieved with minimal
accuracy drops even at significant sparsity levels (e.g., compromise on accuracy, as measured by performance on
50%). This observation underscores the effectiveness of the pretraining validation tasks and zero-shot data.
parametrization technique in enabling aggressive pruning Customized Compression Format and GPU Kernels:
while preserving model performance. QMoE takes advantage of custom designed GPU kernels
Outcomes: The application of SparseGPT led to remark- to unlock the potential of its compressed format. These
able results: specialized kernels enable swift, on-the-fly decoding of
• Model size reduction: SparseGPT achieved 50-60% the model, ensuring efficient processing during use. This
sparsity, significantly reducing the model size by allows the compressed model to run seamlessly on common
removing more than 100 billion weights in models like hardware like 8 NVIDIA RTX 3090 or 4× NVIDIA
OPT-175B and BLOOM-176B. A6000 GPUs. Even with this readily available hardware,
• Processing time: The pruning process was completed in the runtime overhead stays below 5% compared to an
less than 4.5 hours for the largest open-source models, uncompressed model, which would require a staggering
demonstrating high efficiency. 20 times more GPUs.
Outcomes: The implementation of QMoE resulted in enhances training speed through system-level optimizations
significant improvements: such as layer-specific kernels and mixed-precision training,
• Compression ratio: The model size was reduced by which improve GPU utilization and reduce memory usage.
approximately 95%, allowing the SwitchTransformer- Similarly, ByteTransformer [4] is designed to accelerate
c2048 model to fit within the memory constraints of transformer models, particularly for variable-length inputs
standard hardware. This reduction from 3.2 TB to less in NLP tasks, thereby improving performance and reducing
than 160 GB translates to a compression ratio of around latency.
0.8 bits per parameter. Memory Management: Efficient memory allocation is
• Inference speed: The QMoE framework enables the crucial for training large models. CoLLiE [53] addresses
efficient execution of massive MoE models on com- memory constraints in LLM training through a comprehen-
modity hardware with a runtime overhead of less sive strategy. It implements 3D parallelism to effectively
than 5%. This efficiency allows the trillion-parameter distribute memory across training machines and GPUs. This
SwitchTransformer-c2048 model to run on a single approach allows CoLLiE to train large language models even
commodity GPU server. in environments with limited resources.
• Accuracy: Despite the substantial compression, the Fine-Tuning and Performance: CoLLiE [53] also focuses
model maintains high performance on pretraining val- on enhancing specific capabilities of LLMs through PEFT
idation tasks and zero-shot data, with only a minor methods. These methods allow models to be fine-tuned for
decline in accuracy. particular tasks or user instructions without compromising
This case study demonstrates the feasibility of deploying their overall performance. This targeted improvement is vital
trillion-parameter models in real-world applications through for developing models that can adapt to specific application
the use of advanced compression techniques. The QMoE needs while maintaining high general performance.
approach not only reduces resource requirements but also
enhances the deployability of cutting-edge language models
B. LLM TRAINING KEY FINDINGS
across various environments. By leveraging a scalable
compression algorithm, a customized compression format, The advancements in these frameworks have led to several
and bespoke GPU kernels, QMoE achieves significant significant findings:
improvements in model efficiency and performance. This GPipe: Demonstrates the successful training of a large
makes large-scale models more accessible and practical for multilingual transformer model, achieving superior results
real-world applications. It addresses key limitations of MoE compared to smaller, individually trained models [3].
architectures and promotes their wider adoption, paving the ByteTransformer: Outperforms existing frameworks in
way for further research and advancements in this field. terms of performance for BERT-like transformers on various
benchmarks [4].
IX. DISCUSSION
Megatron-LM: Enabled the training of LLMs with billions
This section examines optimization and acceleration tech- of parameters, achieving SOTA results on numerous NLP
niques for LLMs. We will discuss the relevant libraries tasks while providing high throughput [19].
and frameworks that facilitate these advancements, alongside LightSeq2: Accelerates transformer model training by
challenges and key findings of various optimization strate- up to 308%, showcasing substantial performance improve-
gies. ments [52].
CoLLiE: Introduces collaborative training methodologies
A. LLM TRAINING CHALLENGES
that improved efficiency and effectiveness in training large
models like LLaMA-65B, exploring ways to enhance specific
Training LLMs poses significant challenges due to their com-
functionalities without impacting overall performance [53].
plexity and resource requirements. Recent advancements in
frameworks like GPipe [3], ByteTransformer [4], Megatron-
LM [19], LightSeq2 [52], and CoLLiE [53] have made C. LLM INFERENCE CHALLENGES
significant strides in addressing these challenges: Efficient inference of LLMs is critical for their practical
Distributed Training: As LLMs become increasingly application, as these models are computationally expensive
complex, training them on a single device becomes imprac- due to their size and complexity. In this section, we will
tical. Megatron-LM [19] and CoLLiE [53] address this discuss and explore the challenges and key findings of various
by employing distributed training algorithms that partition frameworks and libraries designed to enhance the efficiency
the model across multiple GPUs. This approach enables of LLM inference.
parallel processing and significantly accelerates training Computational Expense: The massive size and complex
times. By distributing the workload, these frameworks architecture of LLMs make traditional inference methods
mitigate the memory bottlenecks that arise when trying to inefficient, especially on resource-constrained devices.
train massive models on single devices. Balancing Speed, Accuracy, and Resource Utilization:
Efficiency and Speed: Efficiency and speed are critical Achieving an optimal balance between these factors are
for the practical deployment of LLMs. LightSeq2 [52] crucial for real-world deployment of LLMs.
A. OPTIMIZATION FOR RESOURCE-CONSTRAINED framework. This provides greater control and adaptability for
ENVIRONMENTS different use cases.
Hybrid Processing: Develop hybrid processing techniques,
where computation is split between GPUs and CPUs to E. PERFORMANCE OPTIMIZATION TECHNIQUES
optimize memory usage and computational load. Adaptive Algorithms: Develop algorithms that can adapt to
Efficient Offloading Mechanisms: Extend the capabilities varying input sizes and sequences, optimizing both memory
of models like FlexGen [17] and DeepSpeed Inference [5] allocation and computational load dynamically.
by refining offloading techniques. This includes better Custom Kernel Implementations: Continue to develop and
utilization of CPU, GPU, and NVMe memory to handle larger refine custom kernel implementations for key operations
models with fewer resources. like Softmax and LayerNorm to achieve better performance,
Resource-Aware Scheduling: Implement intelligent as seen in TurboTransformers [11]. This could also involve
scheduling mechanisms that consider the specific resource hardware-specific optimizations for different GPU architec-
constraints of the hardware, optimizing the allocation of tures.
GPU, CPU, and memory resources for different types of
tasks. F. ADVANCED COMPRESSION AND QUANTIZATION
Sophisticated Compression Techniques: To reduce model size
B. MEMORY AND COMPUTATION OPTIMIZATION without significant accuracy loss instigate new methods for
Advanced Memory Management: Implement various tech- both lossless and lossy compression going beyond FlexGen’s
niques like dynamic catching, memory recycling, and 4-bit quantization [17].
efficient layer normalization (as presented in ByteTrans- Dynamic Quantization: Develop dynamic quantization
former [4] and LightSeq2 [52]) to overcome the memory techniques that adjust the precision of weights and activations
overhead problem. in real time based on the computational requirements and
Mixed-Precision Training In order to significantly reduce available resources.
training time and resource consumption without sacrificing
accuracy, develop robust mixed-precision methods (like XI. LIMITATIONS
Megatron-LM [19] and LightSeq2 [52]). m In this section, we will present the limitations of our SLR.
Dynamic Input Handling: Focusing on variable-length Here, we acknowledge that while our review offers valuable
inputs, like ByteTransformer [4], is seen as a promising insights, it is essential to consider its scope and boundaries.
area for improvement in ML, especially for NLP tasks that The limitations of our SLR can be stated as follows:
often deal with data of varying lengths. By developing more Timeframe: This SLR focused on studies published
advanced algorithms to handle these inputs and minimize between 2017 and December 2023. While this timeframe
unnecessary computations, frameworks could achieve signif- deliberately captured a period of significant advancement
icant performance gains in NLP. in LLM optimization techniques, it is acknowledged that
relevant research published before 2017 or after December
C. PARALLELISM AND DISTRIBUTION
2023 might have been excluded. This could potentially limit
the comprehensiveness of the analysis, particularly regarding
Adaptive Parallelism: Develop more advanced techniques
foundational concepts or emerging advancements outside the
that can dynamically adapt the parallelism strategy based on
chosen timeframe.
the model size and hardware configuration. This includes
Search Strategy: The chosen search queries might not have
both data and model parallelism that can be adjusted on-the-
encompassed all possible relevant terminology used in LLM
fly to optimize performance.
optimization research. This limitation could result in missing
Distributed Training and Inference: Improve frameworks
out on studies that use different terminologies or keywords to
like PETALS [60] and CoLLiE [53] to better leverage dis-
describe similar concepts and techniques.
tributed and heterogeneous hardware resources for efficient
Database Coverage: If the search excluded specific
training and inference.
databases that are highly relevant to LLM research, signif-
icant studies might have been overlooked. Comprehensive
D. SCALABLE AND MODULAR ARCHITECTURE
database coverage is crucial to ensure the inclusion of all
Composable Frameworks: Design frameworks with modular pertinent research.
components, similar to NLP-Fast [58]. These components act
like building blocks for inference pipelines. Users can easily LIST OF ABBREVIATIONS
swap or optimize individual components independently, AdaLomo Low-Memory Optimization with Adaptive
allowing for greater flexibility and customization. Learning Rate
Flexible APIs: Create user-friendly APIs, like those in BART Bidirectional and Auto-Regressive Trans-
PETALS [60]. These APIs allow users to customize inference formers
and fine-tuning processes according to their specific needs BERT Bidirectional Encoder Representations from
without having to make extensive changes to the underlying Transformers
BLOOM: BigScience Large Open-science Open-access Multilingual Language Model
CD: Coordinate Descent
EET: Easy and Efficient Transformer
FPGA: Field Programmable Gate Arrays
FPTQ: Fine-Grained Post-Training Quantization
FT: Faster Transformer
GLM: General Language Model
GPT: Generative Pre-trained Transformer
GPU: Graphical Processing Unit
HAO: Hardware-Aware Optimization
IR: Information Retrieval
KV: Key Value
LAMBADA: LAnguage Modeling Broadened to Account for Discourse Aspects
LLaMA: Large Language Model Meta AI
LLM-QAT: LLM-Quantization-Aware Training
LM: Language Model
LOMO: Low-Memory Optimization
LoRA: Low-Rank Adaptation
MHA: Multi-Head Attention
MoE: Mixture-of-Experts
MMLU: Massive Multitask Language Understanding
NLP: Natural Language Processing
NN: Neural Network
OPT: Open Pre-trained Transformer
PET: Parameter Efficient Transformers
PetS: Parameter-Efficient Transformers Serving
PEFT: Parameter-Efficient Fine-Tuning
PIE: PET Inference Engine
PLM: Pre-trained Language Model
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
PTM: Pre-Trained Model
PTQ: Post-Training Quantization
SLR: Systematic Literature Review
SWARM: Stochastically Wired Adaptively Rebalanced Model
VAE: Variational Autoencoder
W4A8: 4-bit weights and 8-bit activations

ACKNOWLEDGMENT
The authors are grateful to the members of the Applied Machine Learning Research Group of Óbuda University John von Neumann Faculty of Informatics for constructive comments and suggestions. They would also like to acknowledge the support of the Doctoral School of Applied Informatics and Applied Mathematics of Óbuda University.

REFERENCES
[1] S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, and Y. He, "ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., Nov. 2021, pp. 1–15.
[2] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, "Pre-trained models for natural language processing: A survey," Sci. China Technol. Sci., vol. 63, no. 10, pp. 1872–1897, 2020.
[3] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, and Y. Wu, "GPipe: Efficient training of giant neural networks using pipeline parallelism," in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019, pp. 1–20.
[4] Y. Zhai, C. Jiang, L. Wang, X. Jia, S. Zhang, Z. Chen, X. Liu, and Y. Zhu, "ByteTransformer: A high-performance transformer boosted for variable-length inputs," in Proc. IEEE Int. Parallel Distrib. Process. Symp., May 2023, pp. 344–355.
[5] R. Y. Aminabadi, S. Rajbhandari, A. A. Awan, C. Li, D. Li, E. Zheng, O. Ruwase, S. Smith, M. Zhang, J. Rasley, and Y. He, "DeepSpeed-Inference: Enabling efficient inference of transformer models at unprecedented scale," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., Nov. 2022, pp. 1–15.
[6] Y. Gong, "Multilevel large language models for everyone," 2023, arXiv:2307.13221.
[7] B. Spector and C. Re, "Accelerating LLM inference with staged speculative decoding," 2023, arXiv:2308.04623.
[8] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2018, arXiv:1810.04805.
[9] L. Torbarina, T. Ferkovic, L. Roguski, V. Mihelcic, B. Sarlija, and Z. Kraljevic, "Challenges and opportunities of using transformer-based multi-task learning in NLP through ML lifecycle: A survey," 2023, arXiv:2308.08234.
[10] M. Ryabinin, T. Dettmers, M. Diskin, and A. Borzunov, "SWARM parallelism: Training large models can be surprisingly communication-efficient," in Proc. Int. Conf. Mach. Learn., 2023, pp. 29416–29440.
[11] J. Fang, Y. Yu, C. Zhao, and J. Zhou, "TurboTransformers: An efficient GPU serving system for transformer models," in Proc. 26th ACM SIGPLAN Symp. Princ. Pract. Parallel Program., Feb. 2021, pp. 389–402.
[12] R. Anil et al., "PaLM 2 technical report," 2023, arXiv:2305.10403.
[13] L. J. Laki and Z. G. Yang, "Sentiment analysis with neural models for Hungarian," Acta Polytechnica Hungarica, vol. 20, no. 5, pp. 109–128, 2023.
[14] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, "LLaMA: Open and efficient foundation language models," 2023, arXiv:2302.13971.
[15] S. Li, H. Liu, Z. Bian, J. Fang, H. Huang, Y. Liu, B. Wang, and Y. You, "Colossal-AI: A unified deep learning system for large-scale parallel training," in Proc. 52nd Int. Conf. Parallel Process., Aug. 2023, pp. 766–775.
[16] J. Geiping and T. Goldstein, "Cramming: Training a language model on a single GPU in one day," in Proc. Int. Conf. Mach. Learn., 2023, pp. 11117–11143.
[17] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, D. Y. Fu, Z. Xie, B. Chen, C. Barrett, J. E. Gonzalez, P. Liang, C. Ré, I. Stoica, and C. Zhang, "FlexGen: High-throughput generative inference of large language models with a single GPU," 2023, arXiv:2303.06865.
[18] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, "ZeRO: Memory optimizations toward training trillion parameter models," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., Nov. 2020, pp. 1–16.
[19] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, "Megatron-LM: Training multi-billion parameter language models using model parallelism," 2019, arXiv:1909.08053.
[20] J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, and Y. He, "ZeRO-Offload: Democratizing billion-scale model training," in Proc. USENIX Annu. Tech. Conf., 2021, pp. 551–564.
[21] T. Chen, B. Xu, C. Zhang, and C. Guestrin, "Training deep nets with sublinear memory cost," 2016, arXiv:1604.06174.
[22] B. Yuan, Y. He, J. Davis, T. Zhang, T. Dao, B. Chen, P. S. Liang, C. Re, and C. Zhang, "Decentralized training of foundation models in heterogeneous environments," in Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 25464–25477.
[23] M. Ryabinin and A. Gusev, "Towards crowdsourced training of large neural networks using decentralized mixture-of-experts," in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 3659–3672.
[24] W. X. Zhao et al., "A survey of large language models," 2023, arXiv:2303.18223.
[25] M. Shah Jahan, H. U. Khan, S. Akbar, M. Umar Farooq, S. Gul, and A. Amjad, "Bidirectional language modeling: A systematic literature review," Sci. Program., vol. 2021, pp. 1–15, May 2021.
[26] F. Yu, D. Wang, L. Shangguan, M. Zhang, X. Tang, C. Liu, and X. Chen, "A survey of large-scale deep learning serving system optimization: Challenges and opportunities," 2021, arXiv:2111.14247.
[27] G. Bai, Z. Chai, C. Ling, S. Wang, J. Lu, N. Zhang, T. Shi, Z. Yu, M. Zhu, Y. Zhang, C. Yang, Y. Cheng, and L. Zhao, "Beyond efficiency: A systematic survey of resource-efficient large language models," 2024, arXiv:2401.00625.
[28] H. Wang, Z. Qu, Q. Zhou, H. Zhang, B. Luo, W. Xu, S. Guo, and R. Li, "A comprehensive survey on training acceleration for large machine learning models in IoT," IEEE Internet Things J., vol. 9, no. 2, pp. 939–963, Jan. 2022.
[29] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heintz, and D. Roth, "Recent advances in natural language processing via large pre-trained language models: A survey," ACM Comput. Surveys, vol. 56, no. 2, pp. 1–40, Feb. 2024.
[30] Z. Wan, X. Wang, C. Liu, S. Alam, Y. Zheng, J. Liu, Z. Qu, S. Yan, Y. Zhu, Q. Zhang, M. Chowdhury, and M. Zhang, "Efficient large language models: A survey," 2023, arXiv:2312.03863.
[31] V. Cole and M. Boutet, "ResearchRabbit," J. Can. Health Libraries Assoc., vol. 44, no. 2, p. 43, 2023.
[32] M. Ouzzani, H. Hammady, Z. Fedorowicz, and A. Elmagarmid, "Rayyan—A web and mobile app for systematic reviews," Systematic Rev., vol. 5, no. 1, pp. 1–10, Dec. 2016.
[33] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-rank adaptation of large language models," 2021, arXiv:2106.09685.
[34] J. Gao and C.-Y. Lin, "Introduction to the special issue on statistical language modeling," ACM Trans. Asian Lang. Inf. Process., vol. 3, no. 2, pp. 87–93, Jun. 2004.
[35] A. Pauls and D. Klein, "Faster and smaller N-gram language models," in Proc. 49th Annu. Meeting Assoc. Comput. Linguistics, Hum. Lang. Technol., 2011, pp. 258–267.
[36] S. M. Thede and M. P. Harper, "A second-order hidden Markov model for part-of-speech tagging," in Proc. 37th Annu. Meeting Assoc. Comput. Linguistics Comput. Linguistics, 1999, pp. 175–182.
[37] M. U. Hadi, R. Qureshi, A. Shah, M. Irfan, A. Zafar, M. B. Shaikh, N. Akhtar, J. Wu, and S. Mirjalili, "Large language models: A comprehensive survey of its applications, challenges, limitations, and future prospects," 2023, doi: 10.36227/techrxiv.23589741.v4.
[38] M. Crawford, T. M. Khoshgoftaar, J. D. Prusa, A. N. Richter, and H. A. Najada, "Survey of review spam detection using machine learning techniques," J. Big Data, vol. 2, no. 1, pp. 1–24, Dec. 2015.
[39] A. López-Chau, D. Valle-Cruz, and R. Sandoval-Almazán, "Sentiment analysis of Twitter data through machine learning techniques," Softw. Eng. Era Cloud Comput., vol. 1, pp. 185–209, 2020.
[40] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch," J. Mach. Learn. Res., vol. 12, pp. 2493–2537, Aug. 2011.
[41] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proc. Adv. Neural Inf. Process. Syst., vol. 26, 2013, pp. 1–17.
[42] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, arXiv:1301.3781.
[43] Y. Bengio, R. Ducharme, and P. Vincent, "A neural probabilistic language model," in Proc. Adv. Neural Inf. Process. Syst., vol. 13, 2000, pp. 1–11.
[44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–16.
[45] T. A. Chang and B. K. Bergen, "Language model behavior: A comprehensive survey," 2023, arXiv:2303.11504.
[46] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus, "Emergent abilities of large language models," 2022, arXiv:2206.07682.
[47] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[48] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," 2019, arXiv:1910.13461.
[49] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, and R. Stojnic, "Galactica: A large language model for science," 2022, arXiv:2211.09085.
[50] M. Shanahan, "Talking about large language models," Commun. ACM, vol. 67, no. 2, pp. 68–79, Feb. 2024.
[51] C. Olston, N. Fiedel, K. Gorovoy, J. Harmsen, L. Lao, F. Li, V. Rajashekhar, S. Ramesh, and J. Soyke, "TensorFlow-Serving: Flexible, high-performance ML serving," 2017, arXiv:1712.06139.
[52] X. Wang, Y. Wei, Y. Xiong, G. Huang, X. Qian, Y. Ding, M. Wang, and L. Li, "LightSeq2: Accelerated training for transformer-based models on GPUs," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., Nov. 2022, pp. 1–14.
[53] K. Lv, S. Zhang, T. Gu, S. Xing, J. Hong, K. Chen, X. Liu, Y. Yang, H. Guo, T. Liu, Y. Sun, Q. Guo, H. Yan, and X. Qiu, "CoLLiE: Collaborative training of large language models in an efficient way," in Proc. Conf. Empirical Methods Natural Lang. Process., Syst. Demonstrations, 2023, pp. 527–542.
[54] L. Li, Q. Li, B. Zhang, and X. Chu, "Norm tweaking: High-performance low-bit quantization of large language models," 2023, arXiv:2309.02784.
[55] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A lite BERT for self-supervised learning of language representations," 2019, arXiv:1909.11942.
[56] K. Lv, Y. Yang, T. Liu, Q. Gao, Q. Guo, and X. Qiu, "Full parameter fine-tuning for large language models with limited resources," 2023, arXiv:2306.09782.
[57] K. Lv, H. Yan, Q. Guo, H. Lv, and X. Qiu, "AdaLomo: Low-memory optimization with adaptive learning rate," 2023, arXiv:2310.10195.
[58] J. Kim, S. Hur, E. Lee, S. Lee, and J. Kim, "NLP-Fast: A fast, scalable, and flexible system to accelerate large-scale heterogeneous NLP models," 2017, arXiv:1712.06139.
[59] Z. Zhou, X. Wei, J. Zhang, and G. Sun, "PetS: A unified framework for parameter-efficient transformers serving," in Proc. USENIX Annu. Tech. Conf., 2022, pp. 489–504.
[60] A. Borzunov, D. Baranchuk, T. Dettmers, M. Ryabinin, Y. Belkada, A. Chumachenko, P. Samygin, and C. Raffel, "Petals: Collaborative inference and fine-tuning of large models," 2022, arXiv:2209.01188.
[61] X. Wang, Y. Xiong, Y. Wei, M. Wang, and L. Li, "LightSeq: A high performance inference library for transformers," 2020, arXiv:2010.13887.
[62] G. Li, Y. Xi, J. Ding, D. Wang, B. Liu, C. Fan, X. Mao, and Z. Zhao, "Easy and efficient transformer: Scalable inference solution for large NLP model," 2021, arXiv:2104.12470.
[63] P. Patel, E. Choukse, C. Zhang, A. Shah, Í. Goiri, S. Maleki, and R. Bianchini, "Splitwise: Efficient generative LLM inference using phase splitting," 2023, arXiv:2311.18677.
[64] H. Zhang, A. Ning, R. Prabhakar, and D. Wentzlaff, "A hardware evaluation framework for large language model inference," 2023, arXiv:2312.03134.
[65] Y. Song, Z. Mi, H. Xie, and H. Chen, "PowerInfer: Fast large language model serving with a consumer-grade GPU," 2023, arXiv:2312.12456.
[66] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," 2019, arXiv:1907.11692.
[67] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, "Efficient memory management for large language model serving with PagedAttention," in Proc. 29th Symp. Operating Syst. Princ., Oct. 2023, pp. 611–626.
[68] C.-C. Huang, G. Jin, and J. Li, "SwapAdvisor: Pushing deep learning beyond the GPU memory limit via smart swapping," in Proc. 25th Int. Conf. Architectural Support Program. Lang. Operating Syst., Mar. 2020, pp. 1341–1355.
[69] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, and M. Zaharia, "Efficient large-scale language model training on GPU clusters using Megatron-LM," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., Nov. 2021, pp. 1–14.
[70] E. Frantar and D. Alistarh, "SparseGPT: Massive language models can be accurately pruned in one-shot," in Proc. Int. Conf. Mach. Learn., 2023, pp. 10323–10337.
[71] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, "FlexFlow: A flexible dataflow accelerator architecture for convolutional neural networks," in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2017, pp. 553–564.
[72] M. Xia, T. Gao, Z. Zeng, and D. Chen, "Sheared LLaMA: Accelerating language model pre-training via structured pruning," 2023, arXiv:2310.06694.
[73] H. Jin, X. Han, J. Yang, Z. Jiang, C.-Y. Chang, and X. Hu, "GrowLength: Accelerating LLMs pretraining by progressively growing training length," 2023, arXiv:2310.00576.
[74] S. Jung Kwon, J. Kim, J. Bae, K. Min Yoo, J.-H. Kim, B. Park, B. Kim, J.-W. Ha, N. Sung, and D. Lee, "AlphaTuning: Quantization-aware parameter-efficient adaptation of large-scale pre-trained language models," 2022, arXiv:2210.03858.
[75] Z. Li, X. Liu, B. Zhu, Z. Dong, Q. Gu, and K. Keutzer, "QFT: Quantized full-parameter tuning of LLMs with affordable resources," 2023, arXiv:2310.07147.
[76] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, "Scaling laws for neural language models," 2020, arXiv:2001.08361.
[77] E. Frantar and D. Alistarh, "QMoE: Practical sub-1-bit compression of trillion-parameter models," 2023, arXiv:2310.16795.
[78] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, "GPTQ: Accurate post-training quantization for generative pre-trained transformers," 2022, arXiv:2210.17323.
[79] Q. Li, Y. Zhang, L. Li, P. Yao, B. Zhang, X. Chu, Y. Sun, L. Du, and Y. Xie, "FPTQ: Fine-grained post-training quantization for large language models," 2023, arXiv:2308.15987.
[80] T. L. Scao, "BLOOM: A 176B-parameter open-access multilingual language model," 2022, arXiv:2211.05100.
[81] Y. Jin Kim, R. Henry, R. Fahim, and H. Hassan Awadalla, "FineQuant: Unlocking efficiency with fine-grained weight-only quantization for LLMs," 2023, arXiv:2308.09723.
[82] K. Behdin, A. Acharya, A. Gupta, Q. Song, S. Zhu, S. Keerthi, and R. Mazumder, "QuantEase: Optimization-based quantization for language models," 2023, arXiv:2309.01885.
[83] X. Ma, G. Fang, and X. Wang, "LLM-Pruner: On the structural pruning of large language models," in Proc. Adv. Neural Inf. Process. Syst., vol. 36, 2023, pp. 21702–21720.
[84] B. Peng, C. Li, P. He, M. Galley, and J. Gao, "Instruction tuning with GPT-4," 2023, arXiv:2304.03277.
[85] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, W. Lam Tam, Z. Ma, Y. Xue, J. Zhai, W. Chen, P. Zhang, Y. Dong, and J. Tang, "GLM-130B: An open bilingual pre-trained model," 2022, arXiv:2210.02414.
[86] S. Hong, S. Moon, J. Kim, S. Lee, M. Kim, D. Lee, and J.-Y. Kim, "DFX: A low-latency multi-FPGA appliance for accelerating transformer-based text generation," in Proc. 55th IEEE/ACM Int. Symp. Microarchitecture (MICRO), Oct. 2022, pp. 616–630.
[87] S. Li, F. Xue, C. Baranwal, Y. Li, and Y. You, "Sequence parallelism: Long sequence training from system perspective," 2022, arXiv:2105.13120.
[88] L. Zheng, Z. Li, H. Zhang, Y. Zhuang, Z. Chen, Y. Huang, Y. Wang, Y. Xu, D. Zhuo, E. P. Xing, J. E. Gonzalez, and I. Stoica, "Alpa: Automating inter- and intra-operator parallelism for distributed deep learning," in Proc. 16th USENIX Symp. Operating Syst. Design Implement., 2022, pp. 559–578.
[89] A. Eliseev and D. Mazur, "Fast inference of mixture-of-experts language models with offloading," 2023, arXiv:2312.17238.
[90] H. Peng, K. Wu, Y. Wei, G. Zhao, Y. Yang, Z. Liu, Y. Xiong, Z. Yang, B. Ni, J. Hu, R. Li, M. Zhang, C. Li, J. Ning, R. Wang, Z. Zhang, S. Liu, J. Chau, H. Hu, and P. Cheng, "FP8-LM: Training FP8 large language models," 2023, arXiv:2310.18313.
[91] Z. Dong, Y. Gao, Q. Huang, J. Wawrzynek, H. K. H. So, and K. Keutzer, "HAO: Hardware-aware neural architecture optimization for efficient inference," in Proc. IEEE 29th Annu. Int. Symp. Field-Program. Custom Comput. Mach. (FCCM), May 2021, pp. 50–59.
[92] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu, "Mixed precision training," 2017, arXiv:1710.03740.

ZHYAR RZGAR K. ROSTAM (Member, IEEE) received the B.Sc. and M.Sc. degrees in computer science from the University of Sulaimani, KRG-Iraq, in 2013 and 2019, respectively. He is currently pursuing the Ph.D. degree in information science and technology with Óbuda University, Budapest, Hungary. His current research interests include large language models, scientific text classification, deep learning techniques, machine learning algorithms, and artificial intelligence.

SÁNDOR SZÉNÁSI (Member, IEEE) received the Ph.D. degree from the Doctoral School of Applied Informatics and Applied Mathematics, Óbuda University, Budapest, Hungary, in 2013. Currently, he is a Professor with the John von Neumann Faculty of Informatics, Óbuda University. His research interests include (data) parallel algorithms, GPU programming, and medical image processing. He engages both in theoretical fundamentals and in algorithmic issues with respect to realization of practical requirements and given constraints.

GÁBOR KERTÉSZ (Senior Member, IEEE) received the Ph.D. degree in information science and technology, in 2019. He is currently an Associate Professor and the Vice-Dean for Research with the John von Neumann Faculty of Informatics, Óbuda University, Budapest, Hungary, and also a part-time Research Fellow with the HUN-REN SZTAKI (Institute for Computer Science and Control). He is the Leader of the Applied Machine Learning Research Group, the John von Neumann Faculty of Informatics; the main areas of his research were computer vision, parallel processing, and deep machine learning. His current research interests include distributed deep learning, metric learning, and applied machine intelligence. Dr. Kertész is the Founding President of the High Performance Computing Division, John von Neumann Computer Society, and the President of the IEEE Computational Intelligence Hungary Chapter.