EDGE-LLM: Enabling Efficient Large Language Model Adaptation On Edge Devices Via Layerwise Unified Compression and Adaptive Layer Tuning & Voting
Abstract

Efficient adaptation of large language models (LLMs) on edge devices is essential for applications requiring continuous and privacy-preserving adaptation and inference. However, existing tuning techniques fall short because of their high computation and memory overheads. To this end, we introduce a computation- and memory-efficient LLM tuning framework, called Edge-LLM, to facilitate affordable and effective LLM adaptation on edge devices. Specifically, Edge-LLM features three core components: (1) a layer-wise unified compression (LUC) technique that reduces the computation overhead by generating layer-wise pruning sparsity and quantization bit-width policies, (2) an adaptive layer tuning and voting scheme that reduces the memory overhead by shrinking the backpropagation depth, and (3) a complementary hardware scheduling strategy that handles the irregular computation patterns introduced by LUC and adaptive layer tuning, thereby achieving efficient computation and data movement. Extensive experiments demonstrate that Edge-LLM achieves a 2.92× speedup and a 4× memory overhead reduction compared to vanilla tuning methods with comparable task accuracy. Our code is available at https://github.com/GATECH-EIC/Edge-LLM

ACM Reference Format:
Zhongzhi Yu, Zheng Wang, Yuhan Li, Haoran You, Ruijie Gao, Xiaoya Zhou, Sreenidhi Reedy Bommu, Yang (Katie) Zhao, Yingyan (Celine) Lin. 2024. EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning & Voting. In 61st ACM/IEEE Design Automation Conference (DAC '24), June 23–27, 2024, San Francisco, CA, USA. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3649329.3658473

1 Introduction

Recently, large language models (LLMs), such as GPT-4 [1], have demonstrated dominant performance across various applications that are revolutionizing human life. Following this trend, there is an increasing demand for efficient tuning techniques that enable LLMs in applications requiring continuous and privacy-preserving adaptation. However, the massive model size of LLMs hinders direct LLM adaptation on edge devices (e.g., edge GPUs and smartphones). The challenges are twofold: (1) the excessive computation overhead of calculating the forward and backward passes of LLMs [2], and (2) the cumbersome memory overhead of storing the massive model weights and activations throughout the tuning process. As shown in recent works [2, 11], LLMs are typically tuned on cutting-edge GPUs (e.g., with 40 GB or 80 GB of GPU memory), taking more than a GPU-day to complete. Even with the state-of-the-art (SOTA) efficient tuning methods, effectively tuning relatively small-scale LLMs (e.g., LLaMA-7B) on edge devices remains impractical [2].

Although several existing efforts aim to address the aforementioned challenges, each has its own drawbacks. (1) To reduce the computation overhead, compressing target LLMs first to reduce the model size is a common approach [2, 3]. However, how to effectively reduce the redundancy of LLMs while maintaining their adaptability is still largely unexplored [2]. (2) To mitigate the memory overhead, existing methods primarily focus on shortening the backpropagation depth [19, 23]. Unfortunately, the reduced backpropagation depth results in only a fraction of the blocks in LLMs being updated, limiting the achievable performance.

In this paper, we develop a comprehensive solution to tackle the two aforementioned memory and computation challenges and achieve effective LLM adaptation. Specifically, we make the following contributions:

• We propose a comprehensive framework, dubbed Edge-LLM, that tackles the memory and computation challenges of LLM adaptation from both the algorithm and hardware perspectives, enabling effective LLM adaptation on edge devices with limited memory and computation resources.
[...] the memory required for tuning and the memory available on commonly used edge devices (e.g., 8 GB for TX2 [14] and 12 GB for Quest Pro [13]).

[...] techniques often adopt a uniform compression approach, reducing redundancy from only one dimension [2].

Figure 2. Comparison between (a) the compressing-then-tuning baseline and (b/c) our proposed Edge-LLM method.

On the other hand, to further reduce the memory overhead, based on our analysis in Sec. 2.1, we summarize that the key to improving the achievable accuracy-memory trade-off lies in the ability to update all layers in the LLM with a limited backpropagation depth. Inspired by the early-exit mechanism developed for efficient model inference [21], we hypothesize that the outputs from early layers in the LLM can provide meaningful information for prediction. Thus, it is possible to start backpropagation from an early exit layer and still effectively update the model. In this scenario, since backpropagation can be initiated from various early exit layers, the backpropagation depth required for updating all layers in the LLM can be minimized.

3 Edge-LLM Algorithm

3.1 Overview

Motivated by the opportunities identified in Sec. 2.3, we then introduce the algorithm design of our proposed Edge-LLM framework to facilitate effective and efficient LLM adaptation with limited computation and memory overhead. As shown in Fig. 2, our proposed Edge-LLM tuning algorithm integrates two key enablers, each leveraging one of the aforementioned opportunities to reduce the computation and memory overhead. Specifically: (1) To reduce the computation overhead, we propose the LUC technique to diminish the redundancy of the target LLM. This technique is motivated by our empirical observation of the diverse layer-wise sensitivities of LLMs to quantization and pruning. Based on this observation, we develop a low-cost, mean-square-error-based (MSE-based) identifier in LUC to generate a layer-wise compression policy (e.g., a layer-wise bit-width and pruning sparsity allocation), aiming to improve the accuracy-efficiency trade-off of LUC over existing compression techniques in compressing-then-tuning frameworks (Sec. 3.2). (2) To reduce the memory overhead, we propose an adaptive layer tuning scheme that dynamically connects the output of a selected layer (potentially different in each iteration) to the final classification layer with a skip connection during the forward pass. During backpropagation, only a few layers preceding the selected layer receive gradient updates. Because the layers selected for updates vary with different inputs, this approach ensures that all layers are effectively updated while minimizing the memory overhead; the saving comes from the reduced backpropagation depth enabled by the introduced skip connections. Furthermore, during inference, we introduce a voting mechanism to enhance the accuracy of LLMs tuned with adaptive layer tuning. This mechanism capitalizes on the ability of adaptively tuned LLMs to produce reasonable outputs from multiple layers: each layer generates logits, and a voting process is employed to determine the final output (see Sec. 3.3).
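To make this flow concrete, below is a minimal PyTorch-style sketch of the adaptive layer tuning and voting idea. The toy model, the uniform random choice of exit layer, the fixed backpropagation depth, and the use of logit averaging as the voting step are illustrative assumptions rather than our exact implementation.

```python
import random
import torch
import torch.nn as nn

class ToyAdaptiveLLM(nn.Module):
    """Toy stand-in for a decoder-only LLM (names and sizes are illustrative)."""
    def __init__(self, vocab=100, dim=64, num_layers=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )
        self.head = nn.Linear(dim, vocab)  # final classification layer, shared by all exits

    def forward(self, tokens, exit_layer, backprop_depth=2):
        h = self.embed(tokens)
        first_trainable = max(0, exit_layer + 1 - backprop_depth)
        # Frozen prefix: no autograd graph is built, so its activations are
        # not retained for backward (this is where the memory saving comes from).
        with torch.no_grad():
            for layer in self.layers[:first_trainable]:
                h = layer(h)
        # Only the last `backprop_depth` layers before the exit are tracked.
        for layer in self.layers[first_trainable:exit_layer + 1]:
            h = layer(h)
        # Skip connection from the selected exit directly to the shared head.
        return self.head(h)

model = ToyAdaptiveLLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
tokens = torch.randint(0, 100, (2, 16))
labels = torch.randint(0, 100, (2, 16))

# Tuning: a different exit layer is drawn each iteration, so all layers are
# eventually updated even though each step backpropagates through few layers.
for _ in range(4):
    k = random.randrange(len(model.layers))
    logits = model(tokens, exit_layer=k)
    loss = nn.functional.cross_entropy(logits.flatten(0, 1), labels.flatten())
    opt.zero_grad()
    loss.backward()
    opt.step()

# Inference: combine the logits of several exits; plain averaging stands in
# for the voting scheme here.
with torch.no_grad():
    votes = torch.stack([model(tokens, exit_layer=k) for k in (3, 5, 7)])
    prediction = votes.mean(dim=0).argmax(dim=-1)
```

The point of the sketch is that only the layers after `first_trainable` build an autograd graph, so per-step activation memory and backpropagation depth stay small, while randomizing the exit layer across iterations lets every layer be updated over time.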
3.2 Layer-wise Unified Compression (LUC)

Motivating observation on LLM's layer-wise sensitivity. In prior studies on model compression, a common understanding is that different layers in a model exhibit different sensitivities to different compression techniques [4]. However, the sensitivities of different layers in LLMs to different compression techniques remain an open question. To address this question, we first explore the layer-wise sensitivities of the target LLM to pruning and quantization. Specifically, we apply different quantization bit-widths and pruning sparsities to each layer of a pretrained LLaMA-7B [22] model. By comparing the averaged MSE between the compressed and original layer outputs of the target LLM fed with the same inputs from the WikiText dataset [12], we observe that, as shown in Fig. 3, only a small fraction of layers in the LLM have high sensitivities to compression.

Our hypothesis and the proposed LUC. Based on the observation above, we hypothesize that the high sensitivity (i.e., high MSE) is due to limited redundancy in the corresponding layer, thereby necessitating a lower compression ratio.
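As a concrete illustration of this sensitivity probe, the sketch below compresses one linear layer at a time and records the MSE between its original and compressed outputs on a calibration batch. The simple symmetric quantizer, the magnitude pruner, and the helper names are assumptions for illustration, not the exact MSE-based identifier used in LUC.

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric uniform fake-quantization (illustrative stand-in)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

def magnitude_prune(w: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude weights (illustrative stand-in)."""
    k = int(w.numel() * sparsity)
    if k == 0:
        return w
    threshold = w.abs().flatten().kthvalue(k).values
    return torch.where(w.abs() <= threshold, torch.zeros_like(w), w)

@torch.no_grad()
def layer_sensitivity(layer: torch.nn.Linear, x: torch.Tensor,
                      bits: int = 4, sparsity: float = 0.5) -> float:
    """MSE between the original and the compressed outputs of one layer."""
    reference = layer(x)
    original_weight = layer.weight.clone()
    layer.weight.copy_(fake_quantize(magnitude_prune(original_weight, sparsity), bits))
    mse = torch.mean((layer(x) - reference) ** 2).item()
    layer.weight.copy_(original_weight)  # restore the uncompressed weights
    return mse

layer = torch.nn.Linear(512, 512)
x = torch.randn(8, 512)             # stand-in for calibration activations
print(layer_sensitivity(layer, x))  # higher MSE -> more sensitive layer
```

Per-layer MSEs collected this way (e.g., with forward hooks over a calibration set such as WikiText) can then be mapped to the layer-wise bit-width and sparsity policy, assigning milder compression to the few high-MSE layers.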
Table 1. Benchmarking Edge-LLM on the MMLU dataset.

Method          Avg. Bit  Sparsity  Norm. Mem.  MMLU
LoRA            8.0       0%        1.00×       33.60
Partial Tuning  5.0       50%       0.25×       30.94
Ours            5.1       50%       0.25×       31.64
LST             4.0       0%        0.29×       29.04
Partial Tuning  4.0       50%       0.25×       28.70
Ours            4.1       50%       0.25×       29.89
Partial Tuning  3.0       50%       0.25×       26.61
Ours            3.1       50%       0.25×       27.68

Table 2. Ablation on LUC's performance with its variants.

Method     Avg. Bit  Sparsity  Perplexity
SparseGPT  8.0       50%       15.88
LLM-QAT    8.0       0%        13.34
Uniform    5.0       50%       17.61
Random     5.1       50%       16.21
Ours       5.1       50%       15.71
Uniform    4.0       50%       19.86
Random     4.1       50%       19.81
Ours       4.1       50%       18.58
Uniform    3.0       50%       32.52
Random     3.1       50%       31.71
Ours       3.1       50%       30.03
We compare LUC against SOTA compression techniques, including SparseGPT [7] and LLM-QAT [11], and two variants: (1) Uniform, which uses the same quantization bit-width and pruning sparsity across all layers, and (2) Random, which randomly assigns our generated layer-wise pruning sparsities and quantization bit-widths across all layers. As shown in Table 2, our proposed method achieves a 1.28∼2.49 lower perplexity compared to the Uniform baseline under similar resource constraints and a 0.50∼1.68 lower perplexity compared to the Random baseline under the same efficiency, showing the effectiveness of our proposed LUC.
5.3 Hardware Evaluation

We evaluate the proposed techniques on a baseline systolic accelerator designed for transformer training [17], with proper modifications to support the proposed techniques: (1) since the proposed adaptive layer tuning can run naturally on the baseline accelerator, no modification is needed for it; and (2) for LUC, we update the baseline to store the compressed weights on DRAM and SSD. To simplify the design, we do not modify the compute core for sparsity and use a simple spatial-temporal flexible-precision MAC unit [8]. We apply our proposed hardware scheduling search method to find the optimal algorithm-to-hardware mappings. SCALE-Sim [16] simulation results show that the adaptive layer tuning alone achieves a 2.24× speedup, pruning together with adaptive layer tuning achieves a 2.37× speedup, and combining LUC (4-bit/5-bit) with adaptive layer tuning achieves a 3.38×/2.92× overall speedup, respectively.
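As background on the flexible-precision MAC mentioned above: precision-scalable MAC designs in the spirit of [8] typically compose a wider multiplication from several narrow sub-multiplications that can also serve low-bit operands directly. The sketch below shows this decomposition in plain Python; the 4-bit chunking and unsigned operands are illustrative assumptions, not the exact unit used in our accelerator.

```python
def mul8_from_4bit(a: int, b: int) -> int:
    """Compose an 8-bit x 8-bit (unsigned) multiply from four 4-bit multiplies.

    The same 4-bit multipliers can serve 4-bit operands directly (one
    sub-product each) or be combined spatially/temporally for wider operands.
    """
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    partial_products = (
        (a_hi * b_hi) << 8,  # high x high
        (a_hi * b_lo) << 4,  # cross terms
        (a_lo * b_hi) << 4,
        (a_lo * b_lo),       # low x low
    )
    return sum(partial_products)

assert mul8_from_4bit(0xB7, 0x5C) == 0xB7 * 0x5C
```

Reusing the same narrow multipliers across precisions is, broadly, what allows a single compute core to serve the mixed bit-widths produced by a layer-wise policy such as LUC's.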
6 Conclusion

In this paper, we introduce an LLM tuning framework, Edge-LLM, that achieves efficient LLM adaptation on edge devices. Experiments demonstrate that Edge-LLM achieves effective adaptation with performance comparable to vanilla tuning, while delivering a 2.92× speedup and a 4× memory reduction.

Acknowledgement

This work was supported in part by CoCoSys, one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, and the National Science Foundation (NSF) through the NSF CAREER funding (Award number: 2048183).

References

[1] Bubeck et al. 2023. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712 (2023).
[2] Dettmers et al. 2023. QLoRA: Efficient finetuning of quantized LLMs. arXiv (2023).
[3] Kim et al. 2024. Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization. NeurIPS 36 (2024).
[4] Yu et al. 2022. Unified visual transformer compression. arXiv preprint arXiv:2203.08243 (2022).
[5] Yu et al. 2023. Hint-Aug: Drawing hints from foundation vision transformers towards boosted few-shot parameter-efficient tuning. In CVPR. 11102–11112.
[6] Yu et al. 2023. Master-ASR: Achieving multilingual scalability and low-resource adaptation in ASR with modular learning. In ICML. PMLR, 40475–40487.
[7] Frantar et al. 2023. SparseGPT: Massive language models can be accurately pruned in one-shot. (2023).
[8] Fu et al. 2021. Enabling random precision switch for winning both adversarial robustness and efficiency. In MICRO. 225–237.
[9] Hendrycks et al. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 (2020).
[10] Hu et al. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
[11] Liu et al. 2023. LLM-QAT: Data-free quantization aware training for large language models. arXiv (2023).
[12] Merity et al. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843 (2016).
[13] Meta. 2022. Quest Pro. https://www.meta.com/quest/quest-pro/.
[14] NVIDIA. 2020. NVIDIA Jetson TX2. www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-tx2/.
[15] Pearce et al. 2021. Understanding softmax confidence and uncertainty. arXiv preprint arXiv:2106.04972 (2021).
[16] Samajdar et al. 2023. Systolic CNN AcceLErator Simulator (SCALE Sim). https://github.com/ARM-software/SCALE-Sim.
[17] Shao et al. 2023. An efficient training accelerator for transformers with hardware-algorithm co-optimization. VLSI (2023).
[18] Sheng et al. 2023. FlexGen: High-throughput generative inference of large language models with a single GPU. (2023).
[19] Sung et al. 2022. LST: Ladder side-tuning for parameter and memory efficient transfer learning. NeurIPS 35 (2022), 12991–13005.
[20] Taori et al. 2023. Stanford Alpaca: An instruction-following LLaMA model.
[21] Teerapittayanon et al. 2016. BranchyNet: Fast inference via early exiting from deep neural networks. In ICPR.
[22] Touvron et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
[23] Zhang et al. 2023. LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv (2023).