
EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning & Voting

Zhongzhi Yu1, Zheng Wang1, Yuhan Li1, Haoran You1, Ruijie Gao1, Xiaoya Zhou3, Sreenidhi Reedy Bommu1, Yang (Katie) Zhao2, Yingyan (Celine) Lin1
1 Georgia Institute of Technology, 2 University of Minnesota, Twin Cities, 3 University of California, Santa Barbara
{zyu401, zwang3478, yli3326, hyou37, eiclab.gatech, sbommu3, celine.lin}@gatech.edu, [email protected], [email protected]

arXiv:2406.15758v1 [cs.LG] 22 Jun 2024

Abstract

Efficient adaptation of large language models (LLMs) on edge devices is essential for applications requiring continuous and privacy-preserving adaptation and inference. However, existing tuning techniques fall short because of their high computation and memory overheads. To this end, we introduce a computation- and memory-efficient LLM tuning framework, called Edge-LLM, to facilitate affordable and effective LLM adaptation on edge devices. Specifically, Edge-LLM features three core components: (1) a layer-wise unified compression (LUC) technique that reduces the computation overhead by generating layer-wise pruning sparsity and quantization bit-width policies, (2) an adaptive layer tuning and voting scheme that reduces the memory overhead by shortening the backpropagation depth, and (3) a complementary hardware scheduling strategy that handles the irregular computation patterns introduced by LUC and adaptive layer tuning, thereby achieving efficient computation and data movement. Extensive experiments demonstrate that Edge-LLM achieves a 2.92× speedup and a 4× memory overhead reduction compared to vanilla tuning methods with comparable task accuracy. Our code is available at https://github.com/GATECH-EIC/Edge-LLM

ACM Reference Format:
Zhongzhi Yu, Zheng Wang, Yuhan Li, Haoran You, Ruijie Gao, Xiaoya Zhou, Sreenidhi Reedy Bommu, Yang (Katie) Zhao, Yingyan (Celine) Lin. 2024. EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning & Voting. In 61st ACM/IEEE Design Automation Conference (DAC '24), June 23–27, 2024, San Francisco, CA, USA. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3649329.3658473

1 Introduction

In recent years, large language models (LLMs), such as GPT-4 [1], have shown dominating performance across applications that are revolutionizing human life. Following this trend, there is an increasing demand for efficient tuning techniques that enable LLMs in applications requiring continuous and privacy-preserving adaptation. However, the massive model size of LLMs hinders LLM adaptation directly on edge devices (e.g., edge GPUs and smartphones). The challenges are twofold: (1) the excessive computation overhead of calculating the forward and backward passes of LLMs [2], and (2) the cumbersome memory overhead of storing massive model weights and activations throughout the tuning process. As shown in recent works [2, 11], LLMs are typically tuned on cutting-edge GPUs (e.g., with 40GB or 80GB of GPU memory), taking more than a GPU-day to complete. Even with the state-of-the-art (SOTA) efficient tuning methods, effectively tuning relatively small-scale LLMs (e.g., LLaMA-7B) on edge devices remains impractical [2].

Although several existing efforts aim to address the aforementioned challenges, each has its own drawbacks. (1) To reduce the computation overhead, compressing the target LLM first to reduce the model size is a common approach [2, 3]. However, how to effectively reduce the redundancy of LLMs while maintaining their adaptability is still largely unexplored [2]. (2) To mitigate the memory overhead, existing methods primarily focus on shortening the backpropagation depth [19, 23]. Unfortunately, the reduced backpropagation depth results in only a fraction of blocks in the LLM being updated, limiting the achievable performance.

In this paper, we develop a comprehensive solution that tackles the two aforementioned memory and computation challenges and achieves effective LLM adaptation. Specifically, we make the following contributions:

• We propose a comprehensive framework, dubbed Edge-LLM, that tackles the memory and computation challenges of LLM adaptation from both the algorithm and hardware perspectives, enabling effective LLM adaptation on edge devices with limited memory and computation resources.

• On the algorithm side, we accomplish this goal from two directions, each primarily targeting one of the aforementioned challenges: (1) To reduce the computation overhead, we propose a low-cost layer-wise unified compression (LUC) method based on our empirical observation of LLMs' layer-wise sensitivities to quantization and pruning. (2) To reduce the memory overhead, we introduce an adaptive layer tuning and voting scheme. In adaptive layer tuning, we selectively update distinct segments of the target LLM and reduce the memory footprint by directly connecting the output of the currently updated segment to the final layer. Further, in adaptive layer voting, we harness the outputs of different segments of the target LLM by voting for an optimized output.

• On the hardware side, to better handle the irregular computation patterns (i.e., diverse layer-wise quantization bit-widths, layer-wise pruning sparsities, and varying LLM segments to update) introduced by the proposed algorithms, we further integrate a complementary hardware scheduling module into Edge-LLM. The hardware scheduling module includes a search space and a search strategy that consider potential offloading strategies, computation schedules, and tensor placements, aiming to better convert the theoretical reduction in computation overhead into hardware efficiency improvements.

• Experiment results and ablation studies validate the effectiveness of our proposed Edge-LLM framework. Specifically, Edge-LLM achieves a 0.70%∼1.29% higher MMLU score compared with baseline methods tuned under the same resource constraints, and a perplexity on WikiText-2 comparable to LoRA tuning with a 2.92× lower latency and a 4× reduction in memory overhead during each tuning iteration.

2 Background and Motivation

2.1 Efficient Tuning Techniques

Parameter-efficient tuning (PET) comprises techniques for tuning LLMs to new tasks using a limited number of trainable parameters, typically less than 10% of the total parameters in the target LLMs [5, 6, 10, 19]. It offers two major advantages: (1) reduced storage overhead, facilitating scalable multitask deployment, and (2) a marginal reduction in computation and memory overhead, thanks to the reduced number of trainable parameters [10]. Despite PET's widespread use, directly applying it for on-device LLM adaptation remains impractical because the remaining memory overhead is still significant. This is because PET typically inserts a learnable adapter into most, if not all, layers of the target LLM, leading to significant memory overhead for storing intermediate activations during tuning.

Memory-efficient tuning (MET) aims to minimize the memory footprint during the tuning process by reducing the backpropagation depth, thereby decreasing the number of activations that must be stored in memory [19, 23]. Existing MET techniques achieve this goal either by using partial tuning to update only the final few layers [23] or by leveraging side tuning to add a bypass connection between each adapter module and the final output [19]. While the reduction of the memory footprint during tuning is highly desirable, existing MET techniques still face an unsatisfactory trade-off between accuracy and memory footprint in LLM tuning. Specifically, for partial tuning, existing attempts on LLMs need to tune more than 80% of the layers of the target LLM to achieve a satisfactory task accuracy [23], while side tuning suffers from biased optimization and struggles to achieve task accuracy comparable to SOTA PET techniques [19].

Compressing-then-tuning is a series of emerging efficient tuning techniques motivated by the observation that the computation overhead in LLM tuning is dominated by the forward and backward passes of the LLM's backbone, due to the backbone's excessive size [2]. Thus, some pioneering works propose to compress the LLM backbone before tuning to reduce the computation and data movement overheads [2]. However, existing SOTA compressing-then-tuning techniques primarily aim to improve tuning speed, neglecting the extreme memory overhead (e.g., the SOTA compressing-then-tuning method still needs an A100 GPU with 40GB of memory to achieve effective tuning on the LLaMA-70B model [2]). This oversight limits the effectiveness of compressing-then-tuning techniques in tuning LLMs on resource-constrained edge devices.

2.2 Memory Overhead During Tuning

To better understand the gap between the memory needed by existing tuning techniques and the memory available on edge devices, we profile the memory requirements of tuning a LLaMA-7B model [23] with LoRA [10], one of the SOTA PET techniques, and QLoRA [2], one of the SOTA compressing-then-tuning techniques, respectively. As shown in Fig. 1, the memory overhead of LoRA is dominated by storing the LLM's backbone weights and the activations for backpropagation. Even after QLoRA compresses the LLM backbone to 4-bit and reduces the overall memory footprint by 41.2% over LoRA, there remains a 1.48×∼2.22× gap between the memory required for tuning and the memory available on commonly used edge devices (e.g., 8 GB for TX2 [14] and 12 GB for Quest Pro [13]).

Figure 1. Profiling results on the memory footprint when tuning LLaMA-7B with LoRA [10] and QLoRA [2] on the Alpaca [20] dataset.
2.3 Opportunities for Efficient LLM Tuning

To tackle the aforementioned limitations of existing tuning methods, we identify potential opportunities to improve these methods toward effective LLM tuning frameworks.

On one hand, to further reduce the computation overhead, we identify a mismatch between the previously successful practice of reducing model redundancy and the vanilla compression techniques used in existing compressing-then-tuning methods. Specifically, previous efforts (e.g., [4]) observe that deep learning models exhibit redundancy across different dimensions (e.g., bit-width and sparsity) and at different layers. In contrast, existing compressing-then-tuning techniques often adopt a uniform compression approach, reducing redundancy along only one dimension [2].

On the other hand, to further reduce the memory overhead, based on our analysis in Sec. 2.1, we summarize that the key to improving the achievable accuracy-memory trade-off lies in the ability to update all layers of the LLM with a limited backpropagation depth. Inspired by the early-exit mechanism developed for efficient model inference [21], we hypothesize that the outputs from early layers in the LLM can provide meaningful information for prediction. Thus, it is possible to start backpropagation from an early exit layer and still effectively update the model. In this scenario, since backpropagation can be initiated from various early exit layers, the backpropagation depth required for updating all layers of the LLM can be minimized.

Figure 2. Comparison between (a) the compressing-then-tuning baseline and (b/c) our proposed Edge-LLM method.

3 Edge-LLM Algorithm

3.1 Overview

Motivated by the opportunities identified in Sec. 2.3, we introduce the algorithm design of our proposed Edge-LLM framework to facilitate effective and efficient LLM adaptation with limited computation and memory overhead. As shown in Fig. 2, our proposed Edge-LLM tuning algorithm integrates two key enablers, each leveraging one of the aforementioned opportunities for reducing the computation and memory overhead. Specifically: (1) To reduce the computation overhead, we propose the LUC technique to diminish the redundancy of the target LLM. This technique is motivated by our empirical observation of the diverse layer-wise sensitivities of LLMs to quantization and pruning. Based on this observation, we develop a low-cost, mean-square-error-based (MSE-based) identifier in LUC to generate a layer-wise compression policy (e.g., layer-wise bit-width and pruning sparsity allocation), aiming to improve the accuracy-efficiency trade-off of LUC over the compression techniques used in existing compressing-then-tuning frameworks (Sec. 3.2). (2) To reduce the memory overhead, we propose an adaptive layer tuning scheme that dynamically connects the output of a selected layer (potentially different in each iteration) to the final classification layer with a skip connection during the forward pass. During backpropagation, only a few layers preceding the selected layer receive gradient updates. Because the layers selected for updates vary with different inputs, this approach ensures that all layers are effectively updated while minimizing memory overhead; the efficiency comes from the reduced backpropagation depth enabled by the skip connections. Furthermore, during inference, we introduce a voting mechanism to enhance the accuracy of LLMs tuned with adaptive layer tuning. This method capitalizes on the ability of adaptively tuned LLMs to produce reasonable outputs from multiple layers: each exit layer generates logits, and a voting process determines the final output (see Sec. 3.3).

3.2 Layer-wise Unified Compression (LUC)

Motivating observation on LLM's layer-wise sensitivity. In prior studies on model compression, a common understanding is that different layers in a model exhibit different sensitivities to different compression techniques [4]. However, the sensitivities of different layers in LLMs to different compression techniques remain an open question. To address this question, we first explore the layer-wise sensitivities of the target LLM to pruning and quantization. Specifically, we apply different quantization bit-widths and pruning sparsities to each layer of a pretrained LLaMA-7B [22] model. By comparing the averaged MSE between the compressed and original layer outputs of the target LLM fed with the same input from the WikiText dataset [12], we observe that, as shown in Fig. 3, only a small fraction of layers in the LLM have high sensitivities to compression.

Figure 3. Visualization of LLaMA-7B's layer-wise sensitivity to (a) quantization and (b) pruning.
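The measurement described above can be reproduced with a simple per-layer probe of the following form. This is only a minimal sketch under our own assumptions (symmetric per-tensor fake quantization and magnitude pruning applied to a single linear layer; the paper itself relies on LLM-QAT [11] and SparseGPT [7]), and the function names are ours.

import torch

def fake_quantize(w, bits):
    # Symmetric per-tensor fake quantization (an assumed stand-in for LLM-QAT).
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

def magnitude_prune(w, sparsity):
    # Zero out the smallest-magnitude weights (an assumed stand-in for SparseGPT).
    k = int(w.numel() * sparsity)
    if k == 0:
        return w
    threshold = w.abs().flatten().kthvalue(k).values
    return torch.where(w.abs() > threshold, w, torch.zeros_like(w))

@torch.no_grad()
def layer_sensitivity(layer, x, bits=4, sparsity=0.5):
    # Average MSE between the original and compressed layer outputs for one input batch.
    y_ref = layer(x)
    w = layer.weight.data.clone()
    layer.weight.data = fake_quantize(w, bits)
    s_quant = torch.mean((layer(x) - y_ref) ** 2).item()
    layer.weight.data = magnitude_prune(w, sparsity)
    s_prune = torch.mean((layer(x) - y_ref) ** 2).item()
    layer.weight.data = w  # restore the original weights
    return s_quant, s_prune

Sweeping such a probe over every transformer layer (and over several candidate bit-widths and sparsities) yields per-layer sensitivity profiles of the kind visualized in Fig. 3.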
Our hypothesis and the proposed LUC. Based on the observation above, we hypothesize that the high sensitivity (i.e., high MSE) is due to limited redundancy in the corresponding layer, thereby necessitating a lower compression ratio. To this end, we propose the following mapping functions to map the layer-wise MSE to the layer-wise quantization bit-width and pruning sparsity, respectively. For quantization, given an LLM M with L layers, denoted L = {l_0, l_1, · · · , l_{L-1}}, a base quantization bit-width B, and the quantization sensitivity (i.e., the MSE between the output of the original layer and the output of the B-bit quantized layer) of layer l_i denoted s^i_quant, we define the optimized quantization bit-width b_j at layer l_j as

b_j = B + \mathbb{1}\!\left( s^{j}_{quant} \ge \frac{\sum_{i=0}^{L-1} s^{i}_{quant}}{L} \right),        (1)

where \mathbb{1}(\cdot) is the indicator function. For pruning, given a target overall pruning sparsity P, we define the pruning sparsity p_j at layer l_j as

p_j = P \times L \times \frac{s^{j}_{prune}}{\sum_{i=0}^{L-1} s^{i}_{prune}},        (2)

where s^j_prune is the pruning sensitivity of layer l_j.
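A compact sketch of the two mapping functions in Eq. (1) and Eq. (2), assuming the per-layer sensitivities have already been collected (the helper and its names are ours, for illustration only, not the paper's released implementation):

def luc_policy(s_quant, s_prune, base_bits=4, target_sparsity=0.5):
    # s_quant, s_prune: per-layer MSE sensitivities, one entry per layer (length L).
    L = len(s_quant)
    # Eq. (1): layers whose quantization sensitivity is at or above the mean
    # receive one extra bit on top of the base bit-width B.
    mean_q = sum(s_quant) / L
    bits = [base_bits + (1 if s >= mean_q else 0) for s in s_quant]
    # Eq. (2): distribute the overall sparsity budget P across layers in
    # proportion to their pruning sensitivity, so the average sparsity stays at P.
    total_p = sum(s_prune)
    sparsity = [target_sparsity * L * s / total_p for s in s_prune]
    return bits, sparsity

Under a skewed sensitivity profile, only a small fraction of layers receive the extra bit, which is consistent with the 3.1/4.1/5.1 average bit-widths reported in Sec. 5.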
3.3 Adaptive Layer Tuning and Voting

In this enabler, our objective is to facilitate effective tuning with reduced memory overhead, thereby fitting the tuning process into edge devices with limited memory capacity. To achieve this, the primary challenge we have identified is enabling efficient updates across all layers of the target LLM with a restricted backpropagation depth, as analyzed in Sec. 2.3.

In Edge-LLM, we alleviate this challenge by constructing a set of exit layers T = {t_0, t_1, · · · , t_{T-1}}. Each exit layer t_i connects to the output of layer l_{Ceil((i+1)×L/T)} in the target LLM, functioning as the final output layer. Note that T represents the number of selectable exit layers and L denotes the total number of layers in the target LLM, ensuring that T < L. In each tuning iteration, we randomly select t_i ∈ T as the only exit layer to use, and update the following set of layers: {l_{Ceil((i+1)×L/T)-m}, l_{Ceil((i+1)×L/T)-m+1}, · · · , l_{Ceil((i+1)×L/T)}, t_i}. Each layer in this set is equipped with LoRA adapters. Here, m = Ceil(L/T) denotes the number of layers that have unfrozen trainable parameters in this configuration.
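A minimal sketch of the per-iteration selection step described above, written purely over layer indices so it stays self-contained (the helper name, the 0-indexed conversion, and the example values are ours):

import math
import random

def select_update_window(L, T, rng=None):
    # Randomly pick an exit layer t_i and return the 0-indexed backbone block it
    # attaches to, plus the indices of the blocks (from l_{Ceil((i+1)L/T)-m} through
    # the exit block) whose LoRA adapters receive gradients in this iteration.
    assert T < L
    rng = rng or random.Random()
    i = rng.randrange(T)                          # exit layer index i in [0, T)
    exit_block = math.ceil((i + 1) * L / T) - 1   # l_{Ceil((i+1)L/T)}, converted to 0-indexed
    m = math.ceil(L / T)                          # backpropagation depth per iteration
    trainable = list(range(max(0, exit_block - m), exit_block + 1))
    return exit_block, trainable

# Example: a 32-block backbone (e.g., LLaMA-7B) with T = 8 candidate exit layers
# backpropagates through at most m + 1 = 5 blocks plus the exit head per iteration.
print(select_update_window(L=32, T=8, rng=random.Random(0)))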
Furthermore, with the adaptive layer tuning described above, the tuned LLM can generate outputs from all exit layers t ∈ T. Although directly using the final output layer t_{T-1} already achieves competitive performance, having multiple available exit layers provides an opportunity to further enhance performance at inference time by adaptively combining the outputs of different layers. To this end, we propose a voting mechanism that makes predictions based on the outputs from all exit layers. Specifically, inspired by existing findings on the relationship between post-softmax probability and prediction confidence [15], we determine the final output index by choosing the one with the highest post-softmax probability across all exit layers. Given an output probability matrix M, with each element m_{(i,j)} representing the output probability of index j from exit layer t_i ∈ T, we first find the location of the maximum value in M with (i_max, j_max) = arg max_{i,j}(m_{(i,j)}), and then generate the final output as o = j_max.
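The voting step reduces to an arg-max over the exit-layer-by-vocabulary probability matrix; a minimal PyTorch sketch (ours, with toy numbers):

import torch

def vote(exit_logits):
    # exit_logits: tensor of shape (T, vocab_size), one row of logits per exit layer.
    probs = torch.softmax(exit_logits, dim=-1)   # post-softmax confidence per exit layer
    flat_idx = torch.argmax(probs)               # global maximum over the whole matrix M
    i_max, j_max = divmod(flat_idx.item(), probs.shape[-1])
    return j_max                                 # the predicted index o = j_max (i_max tells which exit won)

# Example with three exit layers over a toy 5-token vocabulary.
print(vote(torch.tensor([[0.1, 2.0, 0.3, 0.0, 0.1],
                         [0.0, 0.5, 3.1, 0.2, 0.0],
                         [0.2, 0.1, 0.4, 0.3, 0.0]])))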
4 Edge-LLM Hardware Scheduling

Motivation. The aforementioned algorithm designs introduce an irregular computation pattern (i.e., diverse layer-wise quantization bit-widths, layer-wise pruning sparsities, and varying layers to update). This complexity makes it challenging for real devices to fully benefit from the algorithm's theoretical reduction in computation overhead. To address this challenge, we propose a complementary hardware scheduling module, focusing on efficient scheduling and offloading strategies tailored for optimizing LLM tuning throughput. The limited on-chip accelerator SRAM size (512KB∼1MB) makes it impossible to hold all model weights and activations, necessitating offloading to secondary storage media such as DRAM (8GB∼16GB) and SSD (128GB∼256GB). Our hardware acceleration is therefore motivated by the need to establish a comprehensive cost model, serving as the basis for efficient memory scheduling and offloading strategies for each early exit block in the system.

4.1 Overview

In pursuit of optimized scheduling and offloading strategies for LLM hardware accelerators, our methodology first allocates bit-widths and pruning sparsities to each layer based on sensitivity (see Sec. 3.2). Subsequently, we conduct a nuanced exploration to identify the optimal offloading strategy for each early exit block. As depicted in Fig. 4 (a) and (b), these two steps take the algorithm hyperparameters as inputs and yield the final allocation strategy and hardware schedules as outputs.

Figure 4. The overview of our hardware scheduling.


4.2 Searching Objective

We conceptualize LLM tuning with offloading as a graph traversal problem, following [18]. In Fig. 4 (c), we present an illustrative computational graph consisting of three dimensions: batches, layers, and tokens. In the depicted graph, each square denotes the computation of a specific layer, and squares sharing the same color use identical layer weights. A valid path is defined as a trajectory that traverses (i.e., computes) all squares while adhering to the following constraints:

• During LLM forwarding or backpropagation, a square's computation depends on the squares to its left or right in its row being completed, respectively.
• To compute a square, all its inputs (weights, activations, cache) must be loaded onto the on-chip SRAM.
• At any given time, the cumulative size of tensors stored on the accelerator must not exceed its memory capacity.

The objective is to identify a valid path that minimizes the overall execution time, encompassing both compute costs and the I/O costs incurred when moving tensors.

4.3 Block Search Space

Building upon the aforementioned search objective, we establish a search space encompassing potential valid strategies.

• Row-by-row. Existing systems often use solely row-by-row traversal for its activation-footprint savings. However, this strategy does not consider the weight sharing between adjacent squares among different batches, leading to repetitive weight-loading I/O costs.
• Mixed column-by-column and row-by-row. Alternatively, to reduce the I/O costs related to weights, an approach involves traversing the graph column-by-column. This leverages weight sharing among all squares in a column, allowing weights to be kept in DRAM for reuse while activations are loaded and unloaded. As our proposed algorithm techniques can greatly reduce the activation memory footprint requirement, we include mixed column-by-column and row-by-row traversal in the search space.

Considerations. Overlapping. Another optimization is overlapping. This entails concurrently handling the load of weights for the next layer, the load of activations for the subsequent batch, the storage of activations from the preceding batch, and the computation of the current batch. Integrating overlapping into the block schedule is necessary for delivering the final scheduling.

Tensor Placement. In addition to the computation schedule, an effective strategy must delineate the placement of tensors within the memory hierarchy. Three variables, namely w_sram, w_dram, and w_ssd, define the percentages of weights stored on the SRAM, DRAM, and SSD, respectively. Similarly, three variables a_sram, a_dram, and a_ssd articulate the percentages of activations, and three variables g_sram, g_dram, and g_ssd articulate the percentages of gradients.
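These placement variables can be captured in a small record with a basic sanity check. The sketch below is only illustrative: the field names mirror the variables in the text, while the capacity check and example sizes are our own simplifications.

from dataclasses import dataclass

@dataclass
class Placement:
    # Fractions of each tensor class resident in SRAM / DRAM / SSD (each triple sums to 1).
    w: tuple   # (w_sram, w_dram, w_ssd) for weights
    a: tuple   # (a_sram, a_dram, a_ssd) for activations
    g: tuple   # (g_sram, g_dram, g_ssd) for gradients

    def is_valid(self, sizes, sram_bytes, dram_bytes):
        # sizes: dict with total byte counts for "w", "a", and "g".
        for frac in (self.w, self.a, self.g):
            if abs(sum(frac) - 1.0) > 1e-6:
                return False
        sram_used = sum(sizes[k] * getattr(self, k)[0] for k in ("w", "a", "g"))
        dram_used = sum(sizes[k] * getattr(self, k)[1] for k in ("w", "a", "g"))
        return sram_used <= sram_bytes and dram_used <= dram_bytes

# Example: keep all weights/activations/gradients off-chip in DRAM under a 1 MB SRAM budget.
p = Placement(w=(0.0, 1.0, 0.0), a=(0.0, 1.0, 0.0), g=(0.0, 1.0, 0.0))
print(p.is_valid({"w": 3.5e9, "a": 5e8, "g": 2e7}, sram_bytes=1 << 20, dram_bytes=8 * 2**30))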
4.4 Cost Models

Having established the search objective and the search space, the next step is the development of an analytical cost model. This model estimates the execution time from the specified algorithm parameters and hardware specifications. The total latency for computing a block can be estimated as T_dec. Assuming perfect overlapping, T_dec can be estimated as

T_{dec} = \max\!\left( r_{to\_sram},\, w_{to\_dram},\, r_{to\_dram},\, w_{to\_ssd},\, T_{comp} \right),        (3)

where r_{to\_sram}, w_{to\_dram}, r_{to\_dram}, w_{to\_ssd}, and T_{comp} denote the latency of reads from DRAM to SRAM, writes from SRAM to DRAM, reads from SSD to DRAM, writes from DRAM to SSD, and computation, respectively, during LLM tuning.
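Under the perfect-overlap assumption of Eq. (3), each candidate schedule can be scored and the cheapest one kept. The sketch below is our own, highly simplified illustration of that structure; in practice the per-stream latency terms would come from the hardware model and Scale-Sim-style simulation rather than the toy numbers used here.

def t_dec(r_to_sram, w_to_dram, r_to_dram, w_to_ssd, t_comp):
    # Eq. (3): with perfect overlap, the block latency is bounded by the slowest stream.
    return max(r_to_sram, w_to_dram, r_to_dram, w_to_ssd, t_comp)

def pick_schedule(candidates):
    # candidates: iterable of (name, per-block latency terms); returns the fastest option.
    best_name, best_cost = None, float("inf")
    for name, terms in candidates:
        cost = t_dec(*terms)
        if cost < best_cost:
            best_name, best_cost = name, cost
    return best_name, best_cost

# Toy per-block latencies (milliseconds) for two traversal orders of the computational graph.
print(pick_schedule([
    ("row-by-row",               (2.0, 1.5, 3.0, 0.5, 2.5)),   # re-loads weights per batch
    ("mixed column/row hybrid",  (1.2, 1.5, 1.0, 0.5, 2.5)),   # reuses weights across batches
]))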
5 Evaluation

5.1 Evaluation Setup

Datasets: Two commonly used benchmarking datasets, MMLU [9] and WikiText [12]. Model: LLaMA-7B [22]. Algorithm baselines: The SOTA PET technique LoRA [10]; the SOTA MET technique LST [19]; the SOTA compression techniques SparseGPT [7] and LLM-QAT [11]; and seven variants of our proposed methods. Hardware baseline: The SOTA systolic accelerator [17] dedicated to transformer training. Algorithm implementation: We use LLM-QAT and SparseGPT as the quantization and pruning techniques, respectively, and tune the model following the settings in [2]. Hardware configuration: The accelerator's DRAM is set to 8GB LPDDR4 and the on-chip SRAM to 1MB, in line with SOTA edge devices [14]; other hardware configurations follow the baseline training accelerator design. Evaluation methodology: We use the SOTA Scale-Sim [16] simulator to simulate both the baseline accelerator and the accelerator after applying our techniques.

5.2 Algorithm Evaluation

To evaluate the performance of our proposed method, we first benchmark it against existing baseline methods, including partial tuning, LST, and LoRA tuning, on the commonly used MMLU dataset. As shown in Table 1, our method consistently achieves a 0.70%∼1.29% higher accuracy with the same computation efficiency and a 4× reduction in memory over the baseline methods. To further validate the key enablers in Edge-LLM, we then evaluate LUC's perplexity separately on the WikiText-2 dataset against two SOTA compression techniques, SparseGPT and LLM-QAT, and two variants: (1) Uniform, which uses the same quantization bit-width and pruning sparsity across all layers, and (2) Random, which randomly assigns our generated layer-wise pruning sparsities and quantization bit-widths across layers. As shown in Table 2, our proposed method achieves a 1.28∼2.49 lower perplexity compared to the Uniform baseline under similar resource constraints and a 0.50∼1.68 lower perplexity compared to the Random baseline under the same efficiency, showing the effectiveness of our proposed LUC.

Table 1. Benchmarking Edge-LLM on the MMLU dataset.
Method          Avg. Bit   Sparsity   Norm. Mem.   MMLU
LoRA            8.0        0%         1.00×        33.60
Partial Tuning  5.0        50%        0.25×        30.94
Ours            5.1        50%        0.25×        31.64
LST             4.0        0%         0.29×        29.04
Partial Tuning  4.0        50%        0.25×        28.70
Ours            4.1        50%        0.25×        29.89
Partial Tuning  3.0        50%        0.25×        26.61
Ours            3.1        50%        0.25×        27.68

Table 2. Ablation on LUC's performance with its variants.
Method      Avg. Bit   Sparsity   Perplexity
SparseGPT   8.0        50%        15.88
LLM-QAT     8.0        0%         13.34
Uniform     5.0        50%        17.61
Random      5.1        50%        16.21
Ours        5.1        50%        15.71
Uniform     4.0        50%        19.86
Random      4.1        50%        19.81
Ours        4.1        50%        18.58
Uniform     3.0        50%        32.52
Random      3.1        50%        31.71
Ours        3.1        50%        30.03

5.3 Hardware Evaluation

We evaluate the proposed techniques on the baseline systolic accelerator designed for transformer training [17], with proper modifications to support the proposed techniques: (1) since the proposed adaptive layer tuning can run natively on the baseline accelerator, no modification is needed for it; and (2) for LUC, we update the baseline to store the compressed weights on DRAM and SSD. To simplify the design, we do not modify the compute core for sparsity and use a simple spatial-temporal flexible-precision MAC unit [8]. We apply our proposed hardware scheduling search method to find the optimal algorithm-to-hardware mappings. Scale-Sim simulation results show that adaptive layer tuning alone achieves a 2.24× speedup; pruning combined with adaptive layer tuning delivers a 2.37× speedup; and combining LUC (4-bit/5-bit) with adaptive layer tuning yields a 3.38×/2.92× overall speedup, respectively.

6 Conclusion

In this paper, we introduce an LLM tuning framework, Edge-LLM, that achieves efficient LLM adaptation on edge devices. Experiments demonstrate that Edge-LLM achieves efficient adaptation with performance comparable to vanilla tuning while offering a 2.92× speedup and a 4× memory reduction.

Acknowledgement

This work was supported in part by CoCoSys, one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, and by the National Science Foundation (NSF) through the NSF CAREER funding (Award number: 2048183).

References
[1] Bubeck et al. 2023. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712 (2023).
[2] Dettmers et al. 2023. QLoRA: Efficient finetuning of quantized LLMs. arXiv (2023).
[3] Kim et al. 2024. Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization. NeurIPS 36 (2024).
[4] Yu et al. 2022. Unified visual transformer compression. arXiv preprint arXiv:2203.08243 (2022).
[5] Yu et al. 2023. Hint-Aug: Drawing hints from foundation vision transformers towards boosted few-shot parameter-efficient tuning. In CVPR. 11102–11112.
[6] Yu et al. 2023. Master-ASR: Achieving multilingual scalability and low-resource adaptation in ASR with modular learning. In ICML. PMLR, 40475–40487.
[7] Frantar et al. 2023. SparseGPT: Massive language models can be accurately pruned in one-shot. (2023).
[8] Fu et al. 2021. Enabling random precision switch for winning both adversarial robustness and efficiency. In MICRO. 225–237.
[9] Hendrycks et al. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 (2020).
[10] Hu et al. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
[11] Liu et al. 2023. LLM-QAT: Data-free quantization aware training for large language models. arXiv (2023).
[12] Merity et al. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843 (2016).
[13] Meta. 2022. Quest Pro. https://www.meta.com/quest/quest-pro/.
[14] NVIDIA. 2020. NVIDIA Jetson TX2. www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-tx2/.
[15] Pearce et al. 2021. Understanding softmax confidence and uncertainty. arXiv preprint arXiv:2106.04972 (2021).
[16] Samajdar et al. 2023. Systolic CNN AcceLErator Simulator (SCALE-Sim). https://github.com/ARM-software/SCALE-Sim.
[17] Shao et al. 2023. An efficient training accelerator for transformers with hardware-algorithm co-optimization. VLSI (2023).
[18] Sheng et al. 2023. FlexGen: High-throughput generative inference of large language models with a single GPU. (2023).
[19] Sung et al. 2022. LST: Ladder side-tuning for parameter and memory efficient transfer learning. NeurIPS 35 (2022), 12991–13005.
[20] Taori et al. 2023. Stanford Alpaca: An instruction-following LLaMA model.
[21] Teerapittayanon et al. 2016. BranchyNet: Fast inference via early exiting from deep neural networks. In ICPR.
[22] Touvron et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
[23] Zhang et al. 2023. LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv (2023).