On-Device Language Models - A Comprehensive Review
Qun Wang∗
Computer Science Department, San Francisco State University
{qunwang}@sfsu
∗Equal contribution
Abstract
1 Introduction
The emergence of Large Language Models (LLMs) has catalyzed a transformative shift in
natural language processing (NLP) applications. By leveraging the transformer architecture
(Vaswani et al., 2017), LLMs such as OpenAI’s GPT series (Radford et al., 2019; Brown
et al., 2020; Achiam et al., 2023) and Meta’s LLaMA series (Touvron et al., 2023a;b; Meta,
2024; Dubey et al., 2024) have demonstrated unparalleled proficiency in understanding
and generating human-like text, profoundly influencing fields ranging from automated
customer support to advanced content creation. The ability of these models to seamlessly
perform a variety of NLP tasks has positioned them as the backbone of modern AI-driven
applications (Wu et al., 2023b; Ge et al., 2024; Nam et al., 2024; Zheng et al., 2024a; Yang
et al., 2024b).
However, the traditional deployment of LLMs predominantly on cloud servers presents
several challenges, particularly in terms of latency, security, and the need for continuous
Internet connectivity. These concerns are driving the burgeoning interest in deploying
LLMs on edge devices—a shift that promises reduced response times, and personalized
user experiences directly on user devices such as smartphones, automotive systems, and
personal wearables. This paradigm shift not only aligns with the increasing user demand
for immediate and personalized assistance but also mitigates the bandwidth and energy
costs associated with cloud computing.
Figure 1: The global market size for on-device edge AI, by end-user sector (manufacturing, automotive, healthcare, and others), from 2022 to 2032, in USD billion. The market is projected to grow at a CAGR of 25.9%, from $15.2B in 2022 to a forecasted $143.6B in 2032 (Market.us, 2024).
The growing interest in on-device AI deployment is reflected in the rapidly expanding edge
AI market. As illustrated in Figure 1, the edge AI market is projected to experience substan-
tial growth across various sectors from 2022 to 2032. The market size is expected to increase
from $15.2 billion in 2022 to $143.6 billion by 2032, representing a nearly tenfold growth over
a decade (Market.us, 2024). This growth spans multiple industries, with manufacturing, au-
tomotive, and government sectors showing significant contributions. The projected market
expansion underscores the increasing demand for edge AI solutions, including on-device
language models, driven by the need for faster, more private, and efficient AI capabilities
across diverse applications. This market trend aligns with the technological push towards
more localized AI processing, further emphasizing the importance of developing efficient
on-device LLM solutions.
Despite the compelling advantages, integrating computationally intensive language models
within the constraints of edge devices poses significant challenges. The primary obstacles
include limited computational power, reduced memory capacity, and energy constraints,
which collectively complicate the direct adoption of cloud-based LLM architectures.
Figure 2: Structure of this review, covering the limitations of cloud-based LLM inference and the advantages of on-device inference, on-device training techniques (sparse update, Tiny Training Engine, contribution analysis), performance indicators for on-device LLMs, architectural design principles for on-device LLMs (parameter sharing, modular architectures, compact representations), model compression and parameter sharing, hardware acceleration (e.g., FPGA), and examples and applications of on-device large models (e.g., Gemini Nano, Honor MagicLM, Apple OpenELM and Ferret-v2, MiniCPM, NOMI GPT, Gemma2-9B, Qwen2-0.5B, translation, healthcare).
This review paper provides a comprehensive exploration of the current strategies and
advancements in the deployment of LLMs on edge devices. We aim to critically analyze
the various techniques and architectures that have been developed to adapt LLMs to the
constraints of edge computing. This includes a detailed examination of model compression
techniques, energy-efficient computing strategies, and the development of novel lightweight
model architectures. Furthermore, the paper will delve into deployment strategies that
enable the effective use of LLMs in edge scenarios, highlighting key industry applications
and the resulting benefits.
Through this review, we intend to illuminate the pathways and challenges in transitioning
from cloud-based to on-device language models, providing insights into how this shift
could redefine the landscape of applications and AI accessibility. The structure of this paper
is illustrated in Fig. 2. We begin by exploring the foundations and preliminaries in Section
2, including the evolution of LLMs on-device, architectural foundations, and on-device
training techniques. Section 3 delves into efficient architectures for on-device language
models, discussing innovative design principles, model compression, and collaborative
approaches. Section 4 continues with an in-depth examination of model compression and
optimization techniques, covering quantization, pruning, knowledge distillation, and low-
rank factorization. Section 5 investigates hardware acceleration and deployment strategies,
highlighting popular on-device LLM frameworks and hardware-specific optimizations. To
contextualize these advancements, in Section 6, we present examples of existing on-device
language models and their real-world applications across various domains. Finally, Section
7 discusses future directions and open challenges in the field, and Section 8 concludes
our review. By focusing on the intersection of LLM capabilities and edge computing
requirements, this paper contributes to the ongoing discourse in AI research, offering a
comprehensive perspective on achieving the delicate balance between model performance
and computational efficiency in resource-constrained environments.
The evolution of on-device LLMs is a process closely linked to technological progress. Figure
3 provides a comprehensive timeline of on-device language model development since 2023,
illustrating the rapid advancement in this field. As shown in the figure, the exploration and
experimentation of large language models on the edge began in earnest in 2023. We saw
the emergence of several influential model series with parameters below 10B, making it
possible for LLMs to run on edge devices. Notable examples include:
• Meta’s LLaMA series (Touvron et al. (2023a;b); Meta (2024); Dubey et al. (2024))
• Microsoft’s Phi series (Gunasekar et al. (2023); Li et al. (2023c); Abdin et al. (2024))
• Zhipu AI’s ChatGLM series (GLM et al. (2024))
• Alibaba’s Qwen series (Bai et al. (2023a); Qwen Team (2024))
• 01.AI’s Yi series (Young et al. (2024); 01.AI (2024))
• Mistral’s series (Jiang et al. (2023; 2024a))
• Shanghai AI Laboratory’s InternLM series (Team (2023); Cai et al. (2024b))
In addition, models such as Falcon, released by TII (Almazrouei et al., 2023), and MPT, released by Mosaic ML (MosaicML, 2023), have also entered this competitive space. Although these small-parameter models do not match the performance of traditional large-parameter models, they make it possible for LLMs to run on edge devices, and their appearance signals how much importance the language model industry attaches to edge-device application scenarios. At the same time, with techniques such as mixture-of-experts, quantization, and compression, the performance of small-parameter models continues to improve markedly while their parameter counts stay fixed.
Figure 3: Timeline of on-device language models released between February 2023 and July 2024, covering both text-only and multimodal models from NEXA AI (Octopus v2/v3, Octo-planner), Meta (Llama 1–3), Zhipu AI (ChatGLM 1–3, GLM-4/4v), Mosaic ML (MPT), Microsoft (Phi 1/1.5/2), Baichuan AI (Baichuan 1/2), Mistral (Mistral, Mixtral 8x), Alibaba Cloud (Qwen 1/1.5/2, Qwen-VL), University of Wisconsin-Madison (LLaVA 1.0/1.5/NeXT), 01.AI (Yi, Yi VL, Yi 1.5), Google (Gemini Nano, Gemma 1/2), ModelBest (MiniCPM, MiniCPM-V 2.0), Apple (OpenELM, DCLM), AI2 (OLMo), Shanghai AI Laboratory (InternLM2/2.5), M-A-P (MAP Neo), and Hugging Face (SmolLM), with parameter counts ranging from roughly 0.5B to 13B.
Figure 3 also highlights the emergence of multimodal models since 2023, such as the LLaVA
series (Liu et al., 2024a;b), Qwen-VL (Bai et al., 2023b), Gemini Nano (Team et al., 2023), and
Yi VL (Young et al., 2024). These models represent valuable attempts to deploy multimodal
LLMs on the edge, adapting to more complex and changing user scenarios on mobile
devices.
Entering 2024, the pace of innovation accelerated, as is evident from the dense cluster of new
models in the rightmost section of the figure. This period saw the introduction of numerous new text and multimodal models.
Figure 3 clearly shows an increased focus on multimodal capabilities in 2024, with many new
models offering both text and multimodal functionalities to address diverse task-processing
scenarios. As illustrated by the variety and progression of models, on-device language
models are rapidly evolving and diversifying. This trend, coupled with the continuous
maturation of intelligent hardware and software technologies, enables the integration of
these models into smartphones, Internet-connected cars, computers, robots, and other
terminal equipment, showcasing their growing application potential and value.
2. Sparse update: Selectively update the weights of only a portion of the layers in the network, skipping the gradient calculations of less important layers and sub-tensors, thereby reducing memory usage and computational cost (Liu et al., 2023; Ansell et al., 2024); a minimal sketch of sparse updating follows this list.
3. Tiny Training Engine (TTE): Prunes redundant nodes in the backward graph, such as the gradient nodes of frozen weights, and reorders operations to achieve in-place updates (Lin et al., 2023a; Khouas et al., 2024).
4. Contribution analysis: Automatically determines the sparse update scheme, that is, identifies which parameters (weights/biases) contribute most to downstream accuracy, so that the layers or tensor slices to be updated can be selected under a limited memory budget (Lin et al., 2022; Ren et al., 2024; Zeng et al., 2023a).
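To make the sparse-update idea concrete, below is a minimal PyTorch-style sketch (an illustration, not the implementation of the cited works) that freezes all weights and re-enables gradients only for bias terms and a hand-picked subset of blocks; in practice, contribution analysis would replace the hand-picked subset with a data-driven selection, and the module names used here are hypothetical.

```python
import torch
from torch import nn

def configure_sparse_update(model: nn.Module,
                            trainable_prefixes=("blocks.10.", "blocks.11.")):
    """Freeze everything, then unfreeze biases and the selected blocks only."""
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith(".bias") or \
            any(name.startswith(p) for p in trainable_prefixes)

    # Only trainable tensors go to the optimizer, so no optimizer state
    # (momentum/variance) is ever allocated for the frozen weights.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=1e-3)
```

During training, gradients are computed and stored only for the unfrozen tensors, which is where the memory and compute savings of a sparse update come from.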
Figure 4: Vote distribution over different LLM deployment strategies for personal LLM agents (Li et al., 2024c).
Although cloud-based LLMs offer powerful capabilities, they come with certain drawbacks,
including potential latency issues (Wang et al., 2024b) and data concerns due to their depen-
dency on networks. Hence, the concept of on-device deployment through edge computing
has emerged to reduce latency and safeguard user data (Gerganov, 2023). Processing oc-
curs locally, eliminating the need for data transmission. Moreover, the proliferation of
customized hardware accelerators on mobile devices has made it feasible to run large LLMs
with billions of parameters directly on devices.
On-device inference provides a compelling case for reducing latency because it allows
models to run directly on the user’s device without sending data to a cloud server. This
approach is particularly beneficial for applications that require real-time responses. In the
case of cloud-served GPT-4, each token takes about 200 ms to generate, while common on-device models can already generate tokens faster than this (taivo, 2023).
The ability to run models offline reduces reliance on network connectivity, making applica-
tions more accessible in areas with poor network coverage or other offline environments. For
example, Google’s Gemini Nano-based TalkBack, a feature that uses multimodal capabilities
to recognize image content to provide voice broadcasts to people with disabilities, can work
properly even when completely offline (Google, 2024b). On-device inference also optimizes
the use of limited computing resources through techniques such as model quantization,
allowing language models to run efficiently even on devices with limited memory.
The deployment of LLMs on mobile devices is further facilitated by user-friendly interfaces
that abstract away the complexities of AI, making the technology accessible to users without
specialized knowledge. Moreover, these applications are not just limited to text generation
but can extend their functionality to interact with device features, such as making calls,
conducting web searches, and managing calendar events, through innovative text-to-actions
features.
Latency is the time it takes from the user inputting a request to the system starting to
respond. It usually refers to the time from when the model receives the input text to when it
starts generating the first output. We generally use TTFT (Time-to-First-Token) to measure
this metric (Hu et al., 2024a; Agrawal et al., 2024b;a).
Inference speed refers to the rate at which the LLM autoregressively predicts the next token given all of the tokens seen so far. Beyond decoding the initial prompt, generation proceeds one token at a time, because each new token depends on the previously generated tokens and cannot be computed in advance. This decoding step accounts for most of the inference time of a large language model, so its speed largely determines whether a dialogue with the model feels smooth and thus directly affects the user experience (Çöplü et al., 2023; Cai et al., 2024a; Zheng et al., 2024b).
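As a rough illustration of how these two indicators can be measured in practice, the sketch below times the first token (TTFT) and the steady-state decoding rate of a hypothetical streaming generator; `stream_tokens` is an assumed callable that yields tokens one at a time and is not tied to any particular framework.

```python
import time

def measure_ttft_and_speed(stream_tokens, prompt: str):
    """Return (TTFT in seconds, decode speed in tokens/second)."""
    start = time.perf_counter()
    ttft, n_tokens = None, 0
    for _ in stream_tokens(prompt):              # assumed: yields one token per iteration
        if ttft is None:
            ttft = time.perf_counter() - start   # prompt prefill + first decode step
        n_tokens += 1
    total = time.perf_counter() - start
    decode_time = total - (ttft or 0.0)
    speed = (n_tokens - 1) / decode_time if n_tokens > 1 and decode_time > 0 else float("nan")
    return ttft, speed
```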
Peak RAM/VRAM usage is another key performance indicator of language model operation. Because of how language models work, the memory they consume during inference scales with the number of model parameters; it is impractical, for example, to deploy a 70B-parameter model on a personal office laptop. This constraint is crucial for the many edge devices with limited RAM, so engineers must use various model compression technologies to minimize the memory occupied by language model inference (Kwon et al., 2023; Zhao et al., 2024b;c).
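A back-of-the-envelope estimate of this footprint can be obtained from the parameter count, the bits used per weight after quantization, and the key-value (KV) cache; the sketch below is a deliberate simplification that ignores activations and runtime overheads, and the example shapes are merely illustrative.

```python
def estimate_inference_memory_gb(n_params: float, bits_per_weight: int,
                                 n_layers: int, hidden_size: int,
                                 context_len: int, kv_bytes_per_value: int = 2) -> float:
    """Rough lower bound: quantized weights + KV cache (keys and values per layer)."""
    weight_bytes = n_params * bits_per_weight / 8
    kv_cache_bytes = 2 * n_layers * context_len * hidden_size * kv_bytes_per_value
    return (weight_bytes + kv_cache_bytes) / 1e9

# Example: a 7B model with 4-bit weights, 32 layers, hidden size 4096, 4K context
print(estimate_inference_memory_gb(7e9, 4, 32, 4096, 4096))  # roughly 5.6 GB
```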
In addition, the storage space occupied by the model and the energy consumed during inference are important indicators on edge devices. These indicators largely determine whether LLMs can run on edge devices at all and for how long. In most cases, LLM inference keeps the processor fully loaded; if this continues for too long, it drains the battery of the mobile device and creates new problems. For example, inference with a 7B-parameter LLM consumes about 0.7 J per token. For an iPhone with a battery capacity of about 50 kJ, this means that, at a typical generation rate of roughly ten tokens per second, a conversation with the model can last for at most about two hours, not counting other issues such as device heating caused by model inference (Liu et al., 2024c; Stojkovic et al., 2024; Jiang et al., 2024b).
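The battery-life figure follows from simple arithmetic, reproduced in the sketch below under the stated assumptions (0.7 J per token, a roughly 50 kJ battery) plus the assumed generation rate of about ten tokens per second.

```python
def battery_hours(battery_joules: float = 50_000.0,
                  joules_per_token: float = 0.7,
                  tokens_per_second: float = 10.0) -> float:
    """Hours of continuous generation before the battery is exhausted (heat ignored)."""
    total_tokens = battery_joules / joules_per_token      # ~71,000 tokens
    return total_tokens / tokens_per_second / 3600.0

print(round(battery_hours(), 1))  # ~2.0 hours
```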
Designing language models for on-device deployment involves several architectural princi-
ples and innovations aimed at overcoming the resource constraints typical of mobile and
edge devices. Key strategies include 1) parameter sharing (Lin et al., 2023b; Cao et al., 2024),
which involves reusing weights across different parts of the model to reduce the overall
parameter count; 2) modular architectures (Ning et al., 2023; Ostapenko et al., 2024; Shen
et al., 2024), which break down the LLM into smaller, independent components or modules
that can be processed separately or in parallel; and 3) compact representations, which focus
on reducing the memory footprint of LLMs through techniques like quantization and weight
pruning (Liu et al., 2024c; Zhang et al., 2024b; Xu et al., 2023). To provide a comprehensive comparison of these architectures, we consider their performance, computational efficiency, and memory requirements, which are summarized in Table 1.
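As a concrete illustration of the parameter-sharing principle, the sketch below builds a tiny encoder that reuses one transformer layer across every depth position, so adding depth costs no additional parameters; this is an illustrative pattern in the spirit of cross-layer weight sharing, not the architecture of any specific on-device model.

```python
import torch
from torch import nn

class SharedLayerEncoder(nn.Module):
    """Applies a single TransformerEncoderLayer repeatedly (cross-layer weight sharing)."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_repeats: int = 6):
        super().__init__()
        self.shared_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.n_repeats = n_repeats

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.n_repeats):
            x = self.shared_layer(x)   # the same weights serve every "layer"
        return x

model = SharedLayerEncoder()
print(sum(p.numel() for p in model.parameters()))  # parameter count of one layer, not six
```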
Efficient memory and computational resource utilization are critical for deploying large
language models (LLMs) on mobile and edge devices. Various techniques and innovations
aim to optimize the use of limited resources to ensure that LLMs can perform effectively
without overwhelming the device’s capabilities. This subsection reviews key research works
focusing on enhancing memory and computational efficiency for on-device LLMs.
Researchers from Samsung Electronics propose innovative memory solutions to address the memory bottlenecks in LLM deployment (Kim et al., 2024c). The authors introduce Processing-in-Memory (PIM) and Processing-near-Memory (PNM) technologies:
• Aquabolt-XL (Kim et al., 2021) and LPDDR-PIM (Kim et al., 2024a): These PIM devices embed logic within the memory core, boosting internal memory bandwidth and supporting high-performance computing tasks, including LLM acceleration.
• AXDIMM (Ke et al., 2021) and CXL-PNM: These PNM solutions place computational logic near the memory core, enhancing memory bandwidth and capacity. CXL-PNM integrates computational logic into the CXL memory controller, significantly improving memory capacity and performance.
Experimental results show that these memory solutions achieve up to 4.5× performance
improvement and 71% energy reduction compared to traditional memory architectures,
making them highly suitable for LLM inference on resource-constrained devices.
MELTing Point introduces the MELT infrastructure, designed to facilitate the execution and
benchmarking of LLMs on mobile devices (Laskaridis et al., 2024). The MELT framework
supports Android, iOS, and Nvidia Jetson devices and provides detailed performance and
energy metrics. MELT systematically evaluates on-device LLM execution, providing insights
into performance, energy efficiency, and memory usage across various models. The paper
examines the impact of model quantization on performance and accuracy, demonstrating
that while quantization reduces memory requirements, it incurs an accuracy cost. The
results highlight the importance of balancing memory and computational efficiency with
performance to make LLMs viable for mobile applications.
Memory and computational efficiency are paramount for deploying LLMs on mobile and
edge devices. The research works reviewed in this subsection present innovative solutions
to overcome the memory wall and optimize resource usage. Samsung’s memory solutions,
such as PIM and PNM, significantly enhance memory bandwidth and capacity, enabling
efficient LLM inference. The MELT infrastructure provides a comprehensive evaluation
framework, offering valuable insights into the trade-offs between performance, energy
efficiency, and memory usage. These advancements are crucial for ensuring that LLMs can
operate effectively on resource-constrained devices, paving the way for more practical and
efficient AI applications in mobile and edge environments.
Achieving efficient deployment of LLMs on edge devices involves a range of strategies aimed
at improving overall performance while managing computational and memory constraints.
This subsection reviews key research works that introduce innovative approaches to enhance
the efficiency and effectiveness of on-device LLMs.
Any-Precision LLM proposes a novel method to deploy various LLMs with different pre-
cisions in a memory-efficient manner (Park et al., 2024). Any-Precision model extends
any-precision deep neural networks to LLMs, allowing a single n-bit quantized model
to support multiple lower bit-width models down to 3 bits. This reduces memory usage
without significant performance loss. Post-training quantization (PTQ) creates low-bit
models and incrementally upscales them to higher bit widths. This avoids multiple training
phases for each precision, saving time and resources. A new software engine optimized for
any-precision support manages memory bandwidth and improves serving efficiency, ensur-
ing practical deployment of LLMs on edge devices. The experimental results demonstrate
substantial memory savings and improved serving efficiency, making any-precision LLMs
suitable for a variety of on-device applications.
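To ground the discussion, the following is a minimal sketch of symmetric per-channel post-training weight quantization, the basic building block that any-precision and other PTQ schemes extend; it is a simplification for illustration and not the algorithm of Park et al. (2024).

```python
import torch

def quantize_per_channel(w: torch.Tensor, n_bits: int = 8):
    """Symmetric per-output-channel quantization of a 2-D weight matrix."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax  # one scale per row
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)
q, scale = quantize_per_channel(w, n_bits=4)     # 4-bit values stored in int8 containers
print((dequantize(q, scale) - w).abs().mean())   # mean reconstruction error
```

Any-precision quantization goes further by arranging the quantized values so that several bit widths can be served from a single stored model, but a per-channel scaling step of this kind is common to most weight-only PTQ methods.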
Yan et al. (2023) explore the use of LLMs in software-hardware co-design to optimize the
development of compute-in-memory (CiM) deep neural network (DNN) accelerators. The
LCDA framework integrates LLMs into the design process of hardware and software, lever-
aging their extensive training on diverse datasets to speed up co-design. By incorporating
heuristic knowledge from pre-trained LLMs, the framework bypasses the cold start problem,
enabling faster convergence to optimal solutions. The framework shows a 25x speedup in
the design process compared to state-of-the-art methods while maintaining comparable
performance levels in designing efficient DNN models and hardware architectures. This
approach highlights the potential of LLMs to enhance the co-design process, improving
both software and hardware efficiency for advanced AI applications.
General efficiency and performance improvements are crucial for the practical deployment
of LLMs on edge devices. The research works reviewed in this subsection introduce innova-
tive methods to enhance memory efficiency, computational speed, and overall performance.
The Any-Precision LLM approach offers a flexible and memory-efficient solution for deploy-
ing multiple LLMs with different precisions, while the LCDA framework demonstrates the
benefits of integrating LLMs into the co-design process for optimizing both software and
hardware. These advancements contribute to making LLMs more accessible and effective in
resource-constrained environments, enabling a broader range of AI applications on mobile
and edge devices.
4.1 Quantization
4.2 Pruning
1. Structured Pruning: This approach removes entire subsets of parameters like lay-
ers, channels, or filters, which is beneficial for hardware optimization due to more
regular memory access patterns and simplified computations. The ‘LLM-Pruner’
(Kaddour et al., 2023) employs structured pruning to eliminate non-essential groups
based on gradient data, thus maintaining critical functionalities. It also facilitates
performance recovery through techniques such as LoRA, allowing efficient restora-
tion with minimal data.
2. Unstructured Pruning: Unlike structured pruning, unstructured pruning removes individual weights across the model, offering finer granularity and potentially higher compression rates (Li et al., 2023a). However, this method typically results in sparse matrices, which can be less compatible with traditional hardware architectures, compromising computational efficiency. It is most suitable where maximum compression is needed without constraints on structural preservation; a minimal sketch of magnitude-based unstructured pruning follows this list.
3. Contextual Pruning: This advanced method prunes based on the operational con-
text of the model, targeting weights or neurons that are only relevant under specific
conditions or for particular tasks. Contextual pruning ensures that reductions align
dynamically with the model’s operational needs, thereby preserving performance
where it matters most.
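As referenced above, the sketch below applies magnitude-based unstructured pruning with PyTorch's built-in pruning utilities, zeroing the globally smallest 50% of the weights of a toy model; it illustrates the general mechanism rather than any of the specific methods cited in this subsection.

```python
import torch
from torch import nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

# Prune the 50% smallest-magnitude weights across all Linear layers at once.
to_prune = [(m, "weight") for m in model.modules() if isinstance(m, nn.Linear)]
prune.global_unstructured(to_prune, pruning_method=prune.L1Unstructured, amount=0.5)

for module, name in to_prune:
    prune.remove(module, name)        # bake the zeroed mask into the weights

weights = [p for p in model.parameters() if p.dim() > 1]
sparsity = sum((w == 0).sum().item() for w in weights) / sum(w.numel() for w in weights)
print(f"sparsity: {sparsity:.2f}")    # ~0.50
```

Note that the resulting sparse matrices only translate into real speedups on hardware or kernels that exploit sparsity, which is exactly the compatibility caveat raised above.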
Knowledge Distillation (KD) is a technique for transferring knowledge from a large, com-
putationally intensive model (teacher) to a smaller, more efficient model (student). This
method is crucial for condensing the capabilities of large language models (LLMs) into more
manageable forms without significantly impacting performance.
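A common and deliberately simplified formulation of the distillation objective combines a temperature-softened KL-divergence term against the teacher's output distribution with the usual cross-entropy on the ground-truth labels, as sketched below; production LLM distillation pipelines add many refinements on top of this basic loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """alpha balances the soft-target (teacher) term and the hard-label term."""
    soft_teacher = F.log_softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between softened distributions, scaled by T^2 by convention.
    kd_term = F.kl_div(soft_student, soft_teacher, reduction="batchmean",
                       log_target=True) * temperature ** 2
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term
```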
Low-rank factorization (LRF) approximates large weight matrices as products of smaller low-rank factors, an approach that has proven indispensable in applications such as image processing,
dimensionality reduction in machine learning models, and data compression (Saha et al.,
2023). This methodology not only maintains essential data characteristics but also ensures
efficient storage and processing, highlighting its crucial role in modern computational
tasks. Further extending its application, a study by Yao et al. (2024b) integrates LRF with
Post-training Quantization (PTQ) in Large Language Models. This innovative approach,
termed Low-Rank Compensation (LoRC), enhances model efficiency by significantly reduc-
ing model size and preserving accuracy, effectively mitigating the detrimental effects of
activation quantization. This synthesis of LRF and PTQ demonstrates a significant advance-
ment in optimizing computational efficiency while maintaining performance integrity in
complex models.
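As a simple stand-alone illustration of low-rank factorization (distinct from the LoRC method described above), the sketch below replaces an m x n weight matrix with a rank-r pair of factors obtained from a truncated SVD, which reduces both storage and multiply cost when r is small.

```python
import torch

def low_rank_factorize(w: torch.Tensor, rank: int):
    """Return factors a (m x r) and b (r x n) such that a @ b approximates w."""
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]     # fold the singular values into the left factor
    b = vh[:rank, :]
    return a, b

w = torch.randn(1024, 1024)
a, b = low_rank_factorize(w, rank=64)
print(w.numel(), a.numel() + b.numel())    # 1,048,576 vs 131,072 parameters
print((a @ b - w).norm() / w.norm())       # relative approximation error
```

In practice, the achievable rank depends on how much of a weight matrix's energy is concentrated in its leading singular values, so the accuracy impact must be validated per layer.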
Deployment strategies for LLMs can vary significantly depending on the use case and the
available infrastructure, ranging from fully cloud-based solutions to edge-only deployments.
1. Edge-only
(a) Llama.cpp
• Description: Llama.cpp (Gerganov, 2023) is a C/C++ library designed for efficient inference of large language models on a broad range of hardware platforms. It supports integer quantization, GPU acceleration, and CPU+GPU hybrid inference (a brief usage sketch via its Python bindings follows this list).
• Training: Supports fine-tuning LoRA adapters on-device.
• Inference: Supports CPU and CPU+GPU hybrid inference across ARM and x86 architectures.
(b) MNN
• Description: MNN (Alibaba, 2024) leverages Mobile Neural Network tech-
nology for efficient LLM inference on various platforms, optimized for
mobile devices with dynamic inputs and multimodal interactions.
• Training: Supports full-sized fine-tuning and LoRA fine-tuning on-device.
• Inference: Supports model deployment for ONNX and MNN formats
across diverse backends including CPU, CUDA, and OpenCL.
(c) PowerInfer
• Description: PowerInfer (Song et al., 2023), along with its successor PowerInfer-2 (Xue et al., 2024b), is a high-speed inference engine optimized for deploying LLMs on PCs with consumer-grade GPUs, utilizing a locality-centric design.
• Training: No built-in training capabilities.
• Inference: Supports various computing platforms including x86-64 CPUs
and Apple M Chips, optimized for Windows and Linux.
(d) ExecuTorch
• Description: ExecuTorch (PyTorch, 2024) is part of the PyTorch Edge ecosys-
tem, designed for deploying PyTorch models efficiently on edge devices
like mobile phones and wearables.
• Training: No built-in training capabilities.
• Inference: Leverages full hardware capabilities like CPUs, NPUs, and DSPs
across various computing platforms.
(e) MediaPipe
• Description: Developed by Google, MediaPipe (AI, 2024b) is a framework
for building and deploying multimodal machine learning pipelines involv-
ing video, audio, and other time-series data.
• Training: No built-in training capabilities.
• Inference: Supports multiple platforms including Android, iOS, macOS,
Windows, and Linux, leveraging CPU and GPU resources.
2. Edge-cloud
(a) MLC-LLM
• Description: MLC-LLM (team, 2023) is a machine learning compiler and
high-performance deployment engine, supporting universal LLM deploy-
ment on edge devices and in cloud environments.
• Training: No built-in training capabilities.
• Inference: Supports inference on various platforms including CPUs and
GPUs across ARM and x86 architectures.
(b) vLLM
• Description: vLLM (Team, 2024) is optimized for edge-cloud environments,
supporting advanced quantization methods for efficient key and value
memory management during inference.
• Training: No built-in training capabilities.
• Inference: Supports multiple GPU platforms and integrates with Vulkan,
CUDA, Metal, and WebGPU technologies.
(c) OpenLLM by BentoML
• Description: OpenLLM (BentoML, 2024) enables the deployment of various
open-source LLMs as OpenAI-compatible API endpoints, optimized for
high throughput and streamlined cloud deployment.
• Training: No built-in training capabilities.
• Inference: Compatible with various model architectures and backend im-
plementations for efficient deployment in production settings.
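As a usage illustration for the edge-only frameworks above (referenced in the Llama.cpp entry), the snippet below runs a locally stored, quantized GGUF model through the llama-cpp-python bindings to Llama.cpp; the model path is a placeholder, and the exact keyword arguments may vary between library versions.

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder path to a quantized model
    n_ctx=2048,      # context window
    n_threads=4,     # CPU threads available on the edge device
)

output = llm("Q: What are the benefits of on-device inference? A:",
             max_tokens=64, stop=["Q:"])
print(output["choices"][0]["text"])
```

Frameworks such as MLC-LLM or ExecuTorch expose analogous entry points, but each has its own model-packaging format and runtime configuration.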
1. GPU: Graphics Processing Units (GPUs) have become the standard for training
and accelerating large language models due to their massive parallelism and high
memory bandwidth. NVIDIA’s Tensor Cores, introduced in the Volta architecture
and improved in subsequent generations, offer specialized hardware for mixed-
precision matrix multiply-accumulate operations, which are crucial for transformer-
based models. Recent advancements like NVIDIA’s A100 GPU with 80GB HBM2e
memory enable training of models with billions of parameters on a single device.
Techniques such as tensor parallelism and pipeline parallelism, implemented in
frameworks like Megatron-LM, allow efficient scaling of LLMs across multiple
GPUs. The use of mixed-precision training, particularly with the FP16 and BF16 formats, significantly reduces memory footprint and increases computational throughput on modern GPUs; reduced precision yields similar benefits at inference time (a brief sketch follows this list).
2. NPU: Neural Processing Units (NPUs), also known as AI accelerators, are special-
ized chips designed for machine learning workloads. Google’s Tensor Processing
Units (TPUs) are a prominent example, with the latest v4 offering 275 TFLOPS of
BF16 performance per chip. TPUs utilize a systolic array architecture for efficient
matrix multiplications, which is particularly well-suited for transformer layers in
LLMs. The TPU Pod configuration allows scaling to thousands of chips, enabling
training of models like GPT-3 and PaLM. Huawei’s Ascend AI processors and
Apple’s Neural Engine are other examples of NPUs that offer on-device acceleration
for inference of smaller LLMs, utilizing techniques like quantization and pruning
to reduce model size and computational requirements.
3. FPGA: Field-Programmable Gate Arrays (FPGAs) offer a flexible hardware platform
for accelerating LLMs, particularly for inference. Recent work has demonstrated
efficient implementations of transformer layers on FPGAs, utilizing techniques such
as sparse matrix multiplication and quantization. For example, Microsoft’s Project
Brainwave uses Intel Stratix 10 FPGAs to accelerate BERT inference, achieving low
latency and high throughput. FPGAs excel in energy efficiency and can be optimized
for specific model architectures, making them suitable for edge deployment of
smaller LLMs. However, their lower computational density compared to GPUs and
ASICs limits their application in training large-scale models.
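To illustrate the memory effect of reduced precision mentioned in the GPU discussion above, the sketch below casts a small transformer to FP16 when a GPU is available and compares parameter memory before and after; it is a generic PyTorch pattern rather than a recipe for any particular accelerator.

```python
import torch
from torch import nn

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=6)

fp32_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32   # FP16 kernels target the GPU
model = model.to(device=device, dtype=dtype)

reduced_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(fp32_bytes, reduced_bytes)   # on a GPU the second figure is half the first

x = torch.randn(1, 128, 512, device=device, dtype=dtype)
with torch.no_grad():
    y = model(x)
print(y.dtype)
```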
In recent years, the rapid development of artificial intelligence technology and the continuous upgrading of mobile device hardware have made the deployment of large language models on edge devices a reality. Smartphones are among the most commonly used devices in people's daily lives, and the language models deployed on them have attracted particular attention. Major smartphone manufacturers worldwide have developed and released a number of advanced models that run on-device or adopt device-cloud collaboration strategies, as displayed in Table 2. These models not only mark a major leap
forward in mobile computing but also bring users a series of advantages that traditional
cloud deployments cannot match.
the Octopus model on mobile devices demonstrated fast response times, completing
function calls in 1.1 to 1.7 seconds for a typical query of 20 to 30 tokens, even on a
standard Android phone (Chen et al., 2024b; Chen & Li, 2024a;b;c).
3. Apple OpenELM and Ferret-v2: Apple has developed OpenELM (Mehta et al.,
2024), a substantial large language model integrated within iOS to augment ap-
plication functionalities, analogous to essential system services such as location
tracking. OpenELM employs a layer-wise scaling architecture, efficiently deploying
its 1.1 billion parameters to achieve a 2.36% increase in accuracy compared to prior
models, while requiring only half the pre-training tokens. Moreover, it is compatible
with the MLX library, facilitating direct fine-tuning on Apple devices. In parallel,
Ferret-v2 (Zhang et al., 2024a) marks a significant upgrade over its predecessor,
incorporating features such as any-resolution grounding, multi-granularity visual
encoding through the integration of a DINOv2 encoder, and a sophisticated three-
stage training regimen. These enhancements markedly improve performance by
advancing high-resolution image processing and enriching visual comprehension,
thereby ensuring robust, on-device functionality for iOS users.
4. Microsoft Phi series: Microsoft's latest Phi-3-mini (Abdin et al., 2024) is a compact yet powerful 3.8 billion parameter language model, trained on an extensive 3.3 trillion token dataset. Despite its small size, suitable for mobile deployment, Phi-3-mini delivers performance competitive with larger models like Mixtral 8x7B and GPT-3.5, achieving 69% on MMLU and 8.38 on MT-bench. The model benefits from a unique training dataset, an expanded version of the one used for Phi-2, which combines heavily filtered publicly available web data with synthetic data, enhancing robustness, safety, and chat functionality. Microsoft has also reported initial results for its scaled models, Phi-3-small and Phi-3-medium, trained on 4.8 trillion tokens with 7 billion and 14 billion parameters respectively, showing superior capabilities (75% and 78% on MMLU, and scores of 8.7 and 8.9 on MT-bench). Expanding further, Microsoft introduced Phi-3-vision, a 4.2 billion parameter model derived from Phi-3-mini, designed with enhanced reasoning abilities for both image and text prompts.
5. MiniCPM: The MiniCPM-Llama3-V 2.5, a recent addition to the open-source
MiniCPM-V lineup crafted by the collaborative efforts of Tsinghua University and
ModelBest, boasts a substantial parameter count of 8.5 billion (Tsinghua University,
2024). This model has demonstrated exceptional performance across the Open-
Compass assessment platform, which encompasses a wide array of 11 multimodal
benchmarks. With a noteworthy average score of 65.1, MiniCPM-Llama3-V 2.5 has
surpassed leading industry models, including GPT-4V-1106 at 63.5, Gemini Pro at
62.9, Claude 3, and Qwen-VL-Max, even though it possesses only a fraction of the
parameters these models have.
In specific evaluations focusing on Optical Character Recognition (OCR) and scene
text comprehension, MiniCPM-Llama3-V 2.5 has excelled, securing a score surpass-
ing the 700-point mark on OCRBench, thereby outdoing its counterparts such as
GPT-4 and Gemini Pro. Moreover, it has attained remarkable accuracy rates of
76.6% on the TextVQA benchmark and an impressive 84.8% on DocVQA, effectively
establishing a new standard for the performance of open-source models in these
domains.
6. Gemma2-9B: Gemma is a lightweight, state-of-the-art family of open models from Google. Gemma 2 is Google's upgraded version of Gemma, available in two sizes, 9B and 27B. The 9B version was trained on roughly 8 trillion tokens of web, code, and math data. The authors take a novel approach to attention, interleaving layers of sliding-window attention with layers of global attention, and also employ techniques such as knowledge distillation and model merging. The Gemma2-9B model performs well in its size class, outperforming Llama 3-8B and other comparable open models in several domains such as reasoning, math, and code. The model is also well supported by major AI frameworks such as Hugging Face, as well as Keras 3.0, vLLM, Gemma.cpp, and Llama.cpp (Google, 2024a).
7. Qwen2-0.5B: Alibaba Cloud's Qwen team has upgraded the Qwen model series to Qwen2, which is available in five sizes. Among them, Qwen2-0.5B has the smallest parameter count and a context length of 32K. In multiple tests, Qwen2-0.5B performs similarly to Gemma-2B and Phi-2 (Qwen Team, 2024) despite its smaller parameter count, which positions it to play a significant role in the future smart home industry. In addition, to address the limited context length, the Qwen-Agent framework adopts an agentic RAG approach that extends the effective context to 1M tokens, enabling long-text understanding (Bai et al., 2023a).
On-device language models are ushering in a new era of intelligent, responsive, and per-
sonalized applications. By bringing the power of advanced natural language processing
directly to end-user devices, these models are transforming how we interact with technology
in our daily lives and professional endeavors. From instantaneous message suggestions
to real-time language translation, from confidential medical consultations to cutting-edge
autonomous vehicles, on-device LLMs are proving to be versatile tools with far-reaching
implications. The following examples, as summarized in Figure 5, illustrate the breadth and
depth of on-device LLM applications, showcasing how this technology is not only enhanc-
ing existing services but also enabling entirely new categories of intelligent, responsive, and
secure applications across diverse domains.
Figure 5: Applications of on-device LLMs across domains, including messaging, translation, meeting summarization, healthcare, scientific research, companion robots, disability support, and automobiles.
1. Text Generation for Messaging: In the past, quick-reply features based on cloud LLMs were limited by generation speed and network latency, so replies were generated slowly, which is inefficient in fast-paced instant conversations. Thanks to on-device LLMs, Gboard (Google's keyboard app) can use Gemini Nano, an on-device LLM by Google (AI, 2024a). When it detects that the user is chatting online, Gemini Nano can quickly generate conversation-aware quick replies for the user to choose from based on the chat content. Because the language model does not need to wait for a server response over the Internet, this feature delivers genuinely fast responses.
2. Translation: LLMs have been widely used in language translation. They can translate using terminology and style suited to a specific domain, which is not possible with traditional machine translation methods. However, cloud-based LLMs still suffer from slow response times and the need to upload information. On-device LLMs address these problems: with smaller parameter counts they respond faster and can also run in offline environments, which additionally provides data security in many scenarios. In terms of translation quality, using small models does not significantly reduce accuracy; the token generation accuracy of the T5-small model is only 4% lower than that of larger T5 language models (Xu et al., 2023). In addition, faster response times make on-device models better suited to immediate translation settings such as simultaneous interpretation.
3. Meeting Summarizing: Distill-CLI, a cloud-based solution released by Amazon's CTO, uses Anthropic's Claude 3 Sonnet model and Amazon Transcribe to generate real-time meeting summaries (Vogels, 2024). Similar applications include Plaud Note, built on the GPT-4o model (Plaud, 2024), and Zoom IQ (Zoom, 2024). The disadvantage of cloud-based models, however, is that they incur subscription fees and suffer network-induced latency. By employing an on-device model, the data remains localized and does not need to be uploaded to a cloud server.
4. Healthcare Application: Current medical models, such as Med-PaLM Multimodal (Tu et al., 2024), can combine and analyze patient statements, electronic record information, X-rays, and other medical images to generate long-form responses with high accuracy. Edge deployment allows patients to get answers offline, ensuring the model's availability in emergencies and keeping information about the patient's condition local. Encouragingly, models fine-tuned from pre-trained models for professional medical domains have emerged, such as BioMistral-7B (Labrak et al., 2024) and HuatuoGPT-7B-II (Chen et al., 2023). These low-parameter models have the potential to be deployed on terminal devices.
5. Scientific Research Support: Traditional research-support LLMs such as GatorTronGPT (Peng et al., 2023) are trained on large amounts of domain-specific professional data. This enables them to generate high-quality professional text, thereby accelerating scientific research, especially in areas where data is scarce or sensitive. Switching to on-device LLMs reduces the hardware cost of using language models for research tasks, yields faster responses, and protects the confidentiality of research information.
6. Companion Robot: There are already research cases that use language models to enhance the capabilities of robots or Internet of Things (IoT) devices (Ahn et al., 2022; Xu et al., 2024a). The powerful planning and reasoning capabilities of LLMs can decompose human instructions into a series of textual subtasks, allowing robots to better understand natural language instructions (Zeng et al., 2023b). For example, the Figure 01 robot, based on OpenAI's multimodal language models, can hold in-depth conversations with people and make independent decisions and take actions based on the content of the conversation (AI, 2024c). With the rise of small models, robots that deploy on-device language models can outperform traditional cloud-based robots in response generation speed, and the on-device model ensures that the robot retains its intelligent capabilities even when offline.
7. Disability Support: For visually impaired users, converting images into text is a basic and important function. There are now many on-device large multimodal models, such as Octopus v3 (Chen & Li, 2024b) and MiniCPM-Llama3-V 2.5 (Tsinghua University, 2024), that can achieve this through their multimodal abilities. With these models, blind users can easily access the information contained in the images and videos of a conversation.
Google is about to launch a TalkBack feature based on Gemini Nano, helping people who are blind or have low vision obtain richer and clearer descriptions of what is happening in an image (Google, 2024b). Because Gemini Nano is deployed on the edge, these descriptions appear quickly and work even without a network connection.
Similar capabilities can also be used for sign language recognition, and there are projects that use the ChatGPT model for sign language translation (Sincan et al., 2024). In comparison, an on-device model can generate text translations corresponding to sign language with lower latency while remaining available offline.
8. Autonomous Vehicles: Using language models to drive autonomous cars may sound like a distant future, but examples already exist today. DriveVLM Dual is a system that combines autonomous driving technology with a large visual language model (VLM) to improve the understanding of complex and long-tail scenes in urban environments. The system uses language to describe the driving environment and identify key objects in the scene, and it gradually develops a plan from meta-actions and decision descriptions to waypoints. DriveVLM surpasses existing state-of-the-art methods on both public benchmarks and the researchers' own benchmarks, especially in handling complex and dynamic scenes. Notably, DriveVLM can be deployed locally on the car, which also enables immediate responses (Tian et al., 2024).
Figure 6: Future directions and open challenges for on-device LLMs: data security techniques, adaptive edge-cloud collaboration, continual learning and personalization, multi-modal and cross-modal learning, resource-efficient solutions, scalability and deployment optimization, hardware-software co-design, and robustness and reliability.
As on-device LLMs continue to evolve, several vital areas emerge as promising future
research and development directions. The field of on-device LLMs is rapidly advancing,
driven by the increasing demand for 1) data security, 2) low-latency, and 3) personalized AI
experiences on edge devices. This progress is exemplified by recent developments such as
TinyLlama (Zhang et al., 2024c), MobileVLM (Murthy et al., 2024; Chu et al., 2024), and novel
approaches like the OpenELM (Mehta et al., 2024). However, deploying LLMs on resource-
constrained devices presents unique challenges that differ significantly from traditional
cloud-based implementations. These challenges span multiple areas, including model
compression, efficient inference, security, energy efficiency, and seamless integration with
diverse hardware platforms. Moreover, the dynamic nature of edge environments and the
need for continuous adaptation introduce additional complexities that must be considered.
We outline the most pressing challenges and opportunities in advancing the field of LLMs
on-device. By identifying these key areas and stimulating innovation in developing more
capable, efficient, and reliable on-device language models, we aim to provide insights for
future research efforts. We should notice that the challenges and opportunities discussed
here are interconnected: the progress in one area often has implications for others. Therefore,
a holistic approach that considers the interplay between different aspects of on-device LLM
deployment is essential for achieving significant advancements in the field. We delve into
the current state of research, identifying key challenges and proposing potential directions
for future work, summarized in Fig. 6. By addressing these challenges, researchers and
practitioners can push the boundaries of what is possible with on-device LLMs, ultimately
leading to more intelligent, efficient, and user-centric computing experiences across various
applications and domains.
On-device language models may offer inherent data security advantages, since all the data
can remain localized. Future work should focus on:
As on-device language models continue to evolve, the synergy between edge computing
and cloud infrastructure presents both opportunities and challenges. Future research in
adaptive edge-cloud collaboration for on-device LLMs should explore:
As LLMs expand to incorporate multiple modalities, there is a growing need for efficient
multi-modal architectures suitable for on-device deployment (Carreira et al., 2023; Liu et al.,
2024c). Key research directions include:
The deployment of LLMs on edge devices raises concerns about energy consumption and
environmental impact. Future research should prioritize:
Closer integration between hardware and software development is crucial for optimizing
on-device LLM performance. Future research directions include:
Ensuring the robustness and reliability of on-device language models under various operat-
ing conditions is paramount for their widespread adoption. Future work should address:
• Investigating methods for detecting and mitigating potential biases and hallucina-
tions in on-device LLM outputs, particularly in safety-critical applications (Ailem
et al., 2024).
• Exploring formal verification and validation frameworks for assessing the reliability
of on-device language models in real-world scenarios (Zhang et al., 2023b).
• Leveraging ensemble methods for variance and bias reduction (Xu & Sen, 2023;
2024). Exploring probabilistic inference methods to quantify and propagate uncer-
tainty through the LLM pipeline.
Efficiently scaling on-device LLMs to support a growing number of users and applications
presents significant challenges. Future research should explore:
• Developing dynamic resource allocation and load balancing techniques for dis-
tributed LLM inference across heterogeneous edge devices (Yang et al., 2024c;
Wilkins et al., 2024).
• Investigating optimization strategies for reducing latency and improving through-
put in collaborative edge computing scenarios, potentially leveraging techniques
such as model sharding and pipelined inference (Zhang et al., 2024b; Dhar et al.,
2024).
• Exploring efficient methods for managing and updating multiple LLM versions across diverse edge devices, considering factors such as network constraints and device capabilities, and building cyber-infrastructure to enhance the reusability and reproducibility of models and datasets (Wolf et al., 2019; Lhoest et al., 2021; Deng et al., 2019).
8 Conclusion
This comprehensive review has illuminated the state-of-the-art in on-device language
models. The extensive analysis presented herein has highlighted significant advancements
in model compression techniques, efficient architectural designs, and hardware-software co-
optimization strategies, all of which collectively facilitate the deployment of sophisticated
language models on resource-constrained edge devices. The potential impact of these improvements is extensive, enabling improved data protection, reduced latency, and more equitable access to advanced AI capabilities across different industries and applications.
The transition from cloud-centric to edge-based LLM deployment signifies more than a
mere technological progression; it represents a shift of human-AI interaction paradigms. By
bringing advanced natural language processing capabilities directly to end-user devices, this
transformation opens new avenues for personalized, context-aware, and instant AI experi-
ences. On-device LLMs will revolutionize user interactions and facilitate more intelligent,
responsive technologies, from mobile phones and the IoT to healthcare and autonomous
systems.
However, the trajectory towards ubiquitous on-device LLMs faces significant challenges.
Striking an optimal balance between model performance and the inherent resource limi-
tations of edge devices remains a critical research problem. Ensuring model robustness
across heterogeneous operating conditions and developing effective continual learning
mechanisms present additional hurdles. Furthermore, as the boundaries of on-device AI
are pushed, questions about energy efficiency, sustainability, and responsible deployment
become increasingly salient, necessitating innovative solutions and careful ethical consider-
ations.
Realizing the full potential of on-device language models requires a concerted, multidis-
ciplinary effort. The research community must continue advancing the frontiers of model
compression techniques and efficient architecture design while concurrently addressing po-
tential issues of data security and system reliability. Practitioners in the field should explore
novel hardware-software co-design methodologies and adaptive edge-cloud collaboration
strategies to optimize real-world deployments. Industry stakeholders play a pivotal role in
developing specialized hardware accelerators and promoting open standards for on-device
AI deployment.
As research in this area evolves, on-device language models are positioned at the forefront
of imminent technological breakthroughs. The convergence of increasingly efficient models,
more powerful edge hardware, and innovative deployment strategies promises to unlock
unprecedented possibilities in human-AI interaction. By addressing the challenges and
capitalizing on the opportunities identified in this survey, the research community can work towards a
future where sophisticated AI capabilities are seamlessly integrated into daily life, augment-
ing human abilities while respecting personalization and individuality. The journey towards
ubiquitous, intelligent computing is well underway, and on-device LLMs are poised to play
a pivotal role in shaping this exciting future.
In conclusion, this review serves as a comprehensive resource for researchers and practition-
ers, thoroughly analyzing the current state of on-device LLMs and illuminating critical areas
for future research and development. As the field of on-device LLMs continues to evolve
rapidly, it is imperative that the research community remains committed to addressing the
challenges and embracing the opportunities presented by this transformative technology.
References
01.AI. Yi 1.5. https://github.com/01-ai/Yi-1.5, 2024.
Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah,
Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl,
Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Qin Cai, Martin Cai, Caio
César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Dong Chen, Dongdong Chen,
Yen-Chun Chen, Yi-Ling Chen, Parul Chopra, Xiyang Dai, Allie Del Giorno, Gustavo
de Rosa, Matthew Dixon, Ronen Eldan, Victor Fragoso, Dan Iter, Mei Gao, Min Gao,
Jianfeng Gao, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng
Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos
Karampatziakis, Dongwoo Kim, Mahoud Khademi, Lev Kurilenko, James R. Lee, Yin Tat
Lee, Yuanzhi Li, Yunsheng Li, Chen Liang, Lars Liden, Ce Liu, Mengchen Liu, Weishung
Liu, Eric Lin, Zeqi Lin, Chong Luo, Piyush Madan, Matt Mazzola, Arindam Mitra, Hardik
Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet,
Reid Pryzant, Heyang Qin, Marko Radmilac, Corby Rosset, Sambudha Roy, Olatunji
Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning
Shang, Hiteshi Sharma, Swadheen Shukla, Xia Song, Masahiro Tanaka, Andrea Tupini,
Xin Wang, Lijuan Wang, Chunyu Wang, Yu Wang, Rachel Ward, Guanhua Wang, Philipp
Witte, Haiping Wu, Michael Wyatt, Bin Xiao, Can Xu, Jiahang Xu, Weijian Xu, Sonali Yadav,
Fan Yang, Jianwei Yang, Ziyi Yang, Yifan Yang, Donghan Yu, Lu Yuan, Chengruidong
Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang,
and Xiren Zhou. Phi-3 technical report: A highly capable language model locally on your
phone, 2024.
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al.
Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Amey Agrawal, Anmol Agarwal, Nitin Kedia, Jayashree Mohan, Souvik Kundu, Nipun
Kwatra, Ramachandran Ramjee, and Alexey Tumanov. Metron: Holistic performance
evaluation framework for llm inference systems. arXiv preprint arXiv:2407.07000, 2024a.
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S
Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency
tradeoff in llm inference with sarathi-serve. arXiv preprint arXiv:2403.02310, 2024b.
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David,
Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can,
not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691,
2022.
Google AI. Gboard smart reply. Google AI Developer Website, 2024a. URL https://developer.android.com/ai/aicore#gboard-smart.
Google AI. Mediapipe solutions guide. Google AI Developer Website, 2024b. URL https://ai.google.dev/edge/mediapipe/solutions/guide.
Open AI. Figure 01 robot. Figure website, 2024c. URL https://www.figure.ai/.
Melissa Ailem, Katerina Marazopoulou, Charlotte Siska, and James Bono. Examining the
robustness of llm evaluation to the distributional assumptions of benchmarks. arXiv
preprint arXiv:2404.16966, 2024.
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón,
and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from
multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
Alibaba. Mnn: A lightweight deep neural network inference engine. https://github.com/alibaba/MNN, 2024.
Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, et al. The falcon series of open language models. arXiv preprint arXiv:2311.16867, 2023.
Alan Ansell, Ivan Vulić, Hannah Sterz, Anna Korhonen, and Edoardo M Ponti. Scaling
sparse fine-tuning to large language models. arXiv preprint arXiv:2401.16405, 2024.
Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi,
Ziyang Yu, Mengdan Zhu, Yifei Zhang, et al. Beyond efficiency: A systematic survey of
resource-efficient large language models. arXiv preprint arXiv:2401.00625, 2024.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin
Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609,
2023a.
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang
Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding,
localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023b.
BentoML. Openllm: Open-source library for language model lifecycle management. https:
//github.com/bentoml/OpenLLM, 2024.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.
Language models are few-shot learners. Advances in neural information processing systems,
33:1877–1901, 2020.
Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and
Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding
heads. arXiv preprint arXiv:2401.10774, 2024a.
Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui
Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297,
2024b.
Zouying Cao, Yifei Yang, and Hai Zhao. Head-wise shareable attention for large language
models. arXiv preprint arXiv:2402.11819, 2024.
Samuel Carreira, Tomás Marques, José Ribeiro, and Carlos Grilo. Revolutionizing mobile interaction: Enabling a 3 billion parameter gpt llm on mobile. arXiv preprint arXiv:2310.01434,
2023.
Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He,
Chenlong Wang, Huichi Zhou, Yiqiang Li, et al. Gui-world: A dataset for gui-oriented
multimodal llm-based agents. arXiv preprint arXiv:2406.10819, 2024a.
Junying Chen, Xidong Wang, Anningzhe Gao, Feng Jiang, Shunian Chen, Hongbo Zhang,
Dingjie Song, Wenya Xie, Chuyi Kong, Jianquan Li, et al. Huatuogpt-ii, one-stage training
for medical adaption of llms. arXiv preprint arXiv:2311.09774, 2023.
Wei Chen and Zhiyuan Li. Octopus v2: On-device language model for super agent. arXiv
preprint arXiv:2404.01744, 2024a.
Wei Chen and Zhiyuan Li. Octopus v3: Technical report for on-device sub-billion multimodal ai agent. arXiv preprint arXiv:2404.11459, 2024b.
Wei Chen and Zhiyuan Li. Octopus v4: Graph of language models. arXiv preprint
arXiv:2404.19296, 2024c.
Wei Chen, Zhiyuan Li, and Mingyuan Ma. Octopus: On-device language model for function
calling of software apis. arXiv preprint arXiv:2404.01549, 2024b.
Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. Towards understanding
the mixture-of-experts layer in deep learning. Advances in neural information processing
systems, 35:23049–23062, 2022.
Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun,
Yiming Hu, Xinyang Lin, Bo Zhang, et al. Mobilevlm v2: Faster and stronger baseline for
vision language model. arXiv preprint arXiv:2402.03766, 2024.
Tolga Çöplü, Marc Loedi, Arto Bendiken, Mykhailo Makohin, Joshua J Bouw, and Stephen
Cobb. A performance evaluation of a quantized large language model on various smartphones. arXiv preprint arXiv:2312.12472, 2023.
Badhan Chandra Das, M Hadi Amini, and Yanzhao Wu. Security and privacy challenges of
large language models: A survey. arXiv preprint arXiv:2402.00888, 2024.
Yunxiao Deng, Carl Kesselman, Suvrajeet Sen, and Jiajun Xu. Computational operations
research exchange (core): A cyber-infrastructure for analytics. In 2019 Winter Simulation
Conference (WSC), pp. 3447–3456. IEEE, 2019.
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Gpt3.int8(): 8-bit
matrix multiplication for transformers at scale. Advances in Neural Information Processing
Systems, 35:30318–30332, 2022.
Nobel Dhar, Bobin Deng, Dan Lo, Xiaofeng Wu, Liang Zhao, and Kun Suo. An empirical
analysis and resource footprint study of deploying large language models on edge devices.
In Proceedings of the 2024 ACM Southeast Conference, pp. 69–76, 2024.
Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu,
Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of
language models with mixture-of-experts. In International Conference on Machine Learning,
pp. 5547–5569. PMLR, 2022.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle,
Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3
herd of models. arXiv preprint arXiv:2407.21783, 2024.
Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences.
Minds and Machines, 30:681–694, 2020.
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate
post-training quantization for generative pre-trained transformers. arXiv preprint
arXiv:2210.17323, 2022.
Zihao Fu, Wai Lam, Qian Yu, Anthony Man-Cho So, Shengding Hu, Zhiyuan Liu, and Nigel
Collier. Decoder-only or encoder-decoder? interpreting language model as a regularized
encoder-decoder. arXiv preprint arXiv:2304.04052, 2023.
Mingqi Gao, Xinyu Hu, Jie Ruan, Xiao Pu, and Xiaojun Wan. Llm-based nlg evaluation:
Current status and challenges. arXiv preprint arXiv:2402.01383, 2024.
Yingqiang Ge, Wenyue Hua, Kai Mei, Juntao Tan, Shuyuan Xu, Zelong Li, Yongfeng Zhang,
et al. Openagi: When llm meets domain experts. Advances in Neural Information Processing
Systems, 36, 2024.
Georgi Gerganov. llama.cpp: LLM inference in C/C++. https://github.com/ggerganov/llama.cpp, 2023.
Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas,
Guanyu Feng, Hanlin Zhao, Hanyu Lai, et al. Chatglm: A family of large language
models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793, 2024.
Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun
Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language
model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023.
Google. Gemma 2-9b. Google website, 2024a. URL https://storage.googleapis.com/
deepmind-media/gemma/gemma-2-report.pdf.
Google. Google talkback. Google website, 2024b. URL https://store.google.com/intl/
en/ideas/articles/gemini-nano-google-pixel/.
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord,
Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. Olmo: Accelerating the science of language models. arXiv preprint arXiv:2402.00838, 2024.
Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large
language models. In The Twelfth International Conference on Learning Representations, 2023.
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno,
Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi,
et al. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023.
Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin,
Yu Qiao, Peng Gao, and Xiangyu Yue. Onellm: One framework to align all modalities
with language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 26584–26595, 2024.
Zixu Hao, Huiqiang Jiang, Shiqi Jiang, Ju Ren, and Ting Cao. Hybrid slm and llm for
edge-cloud collaborative inference. In Proceedings of the Workshop on Edge and Mobile
Foundation Models, pp. 36–41, 2024.
Yongjun He, Yao Lu, and Gustavo Alonso. Deferred continuous batching in resource-
efficient large language model serving. In Proceedings of the 4th Workshop on Machine
Learning and Systems, pp. 98–106, 2024.
Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao
Feng, Chenxi Wang, Sa Wang, Yungang Bao, et al. Inference without interference: Disaggregate llm inference for mixed downstream workloads. arXiv preprint arXiv:2401.11181,
2024a.
Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei
Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small
language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024b.
Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele
Magno, and Xiaojuan Qi. Billm: Pushing the limit of post-training quantization for llms.
arXiv preprint arXiv:2402.04291, 2024a.
Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li,
Xiaofan Zhang, and Deming Chen. New solutions on llm acceleration, optimization, and
application. arXiv preprint arXiv:2406.10903, 2024b.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network
training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456. PMLR, 2015.
Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive
mixtures of local experts. Neural computation, 3(1):79–87, 1991.
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh
Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile
Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary,
Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian
Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024a.
Peng Jiang, Christian Sonne, Wangliang Li, Fengqi You, and Siming You. Preventing the
immense increase in the life-cycle energy and carbon footprints of llm-powered intelligent
chatbots. Engineering, 2024b.
Christoforos Kachris. A survey on hardware accelerators for large language models. arXiv
preprint arXiv:2401.09890, 2024.
Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and
Robert McHardy. Challenges and applications of large language models. arXiv preprint
arXiv:2307.10169, 2023.
Liu Ke, Xuan Zhang, Jinin So, Jong-Geon Lee, Shin-Haeng Kang, Sukhan Lee, Songyi Han,
YeonGon Cho, Jin Hyun Kim, Yongsuk Kwon, et al. Near-memory processing in action:
Accelerating personalized recommendation with axdimm. IEEE Micro, 42(1):116–127,
2021.
Aymen Rayane Khouas, Mohamed Reda Bouadjenek, Hakim Hacid, and Sunil Aryal.
Training machine learning models at the edge: A survey. arXiv preprint arXiv:2403.02619,
2024.
Byeongho Kim, Sanghoon Cha, Sangsoo Park, Jieun Lee, Sukhan Lee, Shin-haeng Kang,
Jinin So, Kyungsoo Kim, Jin Jung, Jong-Geon Lee, et al. The breakthrough memory
solutions for improved performance on llm inference. IEEE Micro, 2024a.
Jin Hyun Kim, Shin-haeng Kang, Sukhan Lee, Hyeonsu Kim, Woongjae Song, Yuhwan
Ro, Seungwon Lee, David Wang, Hyunsung Shin, Bengseng Phuah, et al. Aquabolt-xl:
Samsung hbm2-pim with in-memory processing for ml accelerators and beyond. In 2021
IEEE Hot Chips 33 Symposium (HCS), pp. 1–26. IEEE, 2021.
Siwon Kim, Sangdoo Yun, Hwaran Lee, Martin Gubri, Sungroh Yoon, and Seong Joon Oh.
Propile: Probing privacy leakage in large language models. Advances in Neural Information
Processing Systems, 36, 2024d.
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu,
Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large
language model serving with pagedattention. In Proceedings of the 29th Symposium on
Operating Systems Principles, pp. 611–626, 2023.
Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier,
and Richard Dufour. Biomistral: A collection of open-source pretrained large language
models for medical domains. arXiv preprint arXiv:2402.10373, 2024.
Stefanos Laskaridis, Kleomenis Kateveas, Lorenzo Minto, and Hamed Haddadi. Melting
point: Mobile evaluation of language transformers. arXiv preprint arXiv:2403.12844, 2024.
Quentin Lhoest, Albert Villanova Del Moral, Yacine Jernite, Abhishek Thakur, Patrick
Von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall,
et al. Datasets: A community library for natural language processing. arXiv preprint
arXiv:2109.02846, 2021.
Chenyang Li, Jihoon Chung, Biao Cai, Haimin Wang, Xianlian Zhou, and Bo Shen. On model
compression for neural networks: Framework, algorithm, and convergence guarantee.
arXiv preprint arXiv:2303.06815, 2023a.
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal,
Etash Guha, Sedrick Keh, Kushal Arora, et al. Datacomp-lm: In search of the next
generation of training sets for language models. arXiv preprint arXiv:2406.11794, 2024a.
Jing Li, Zhijie Sun, Xuan He, Li Zeng, Yi Lin, Entong Li, Binfan Zheng, Rongqian Zhao,
and Xin Chen. Locmoe: A low-overhead moe for large language model training. arXiv
preprint arXiv:2401.13920, 2024b.
Yansong Li, Zhixing Tan, and Yang Liu. Privacy-preserving prompt tuning for large language model services. arXiv preprint arXiv:2305.06212, 2023b.
Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu,
Wenxing Xu, Xiang Wang, Yi Sun, et al. Personal llm agents: Insights and survey about
the capability, efficiency and security. arXiv preprint arXiv:2401.05459, 2024c.
Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat
Lee. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463,
2023c.
Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning,
and Li Yuan. Moe-llava: Mixture of experts for large vision-language models. arXiv
preprint arXiv:2401.15947, 2024a.
Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, and Song Han. On-
device training under 256kb memory. Advances in Neural Information Processing Systems,
35:22941–22954, 2022.
Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, and Song Han. Tiny machine learning:
progress and futures [feature]. IEEE Circuits and Systems Magazine, 23(3):8–34, 2023a.
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight
quantization for on-device llm compression and acceleration. Proceedings of Machine
Learning and Systems, 6:87–100, 2024b.
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight
quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978, 2024c.
Ye Lin, Mingxuan Wang, Zhexi Zhang, Xiaohui Wang, Tong Xiao, and Jingbo Zhu. Understanding parameter sharing in transformers. arXiv preprint arXiv:2306.09380, 2023b.
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual
instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 26296–26306, 2024a.
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.
Advances in neural information processing systems, 36, 2024b.
Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov,
Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, et al. Mobilellm: Optimizing sub-billion parameter language models for on-device use cases. arXiv
preprint arXiv:2402.14905, 2024c.
Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali
Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. Deja vu: Contextual
sparsity for efficient llms at inference time. In International Conference on Machine Learning,
pp. 22137–22176. PMLR, 2023.
Lefteris Loukas, Ilias Stogiannidis, Odysseas Diamantopoulos, Prodromos Malakasiotis,
and Stavros Vassos. Making llms worth every penny: Resource-limited text classification
in banking. In Proceedings of the Fourth ACM International Conference on AI in Finance, pp.
392–400, 2023.
Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I Venieris, Hongxiang Fan, et al. Hardware-aware parallel prompt decoding for memory-efficient acceleration of llm inference. arXiv preprint arXiv:2405.18628, 2024.
Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang,
Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei. The era of 1-bit llms: All large
language models are in 1.58 bits. arXiv preprint arXiv:2402.17764, 2024.
Amama Mahmood, Junxiang Wang, Bingsheng Yao, Dakuo Wang, and Chien-Ming Huang.
Llm-powered conversational voice assistants: Interaction patterns, opportunities, challenges, and design guidelines. arXiv preprint arXiv:2309.13879, 2023.
Market.us. Edge ai market. Market.us Online Report, July 2024. Accessed on 2024-07-28.
Saeed Masoudnia and Reza Ebrahimpour. Mixture of experts: a literature survey. Artificial
Intelligence Review, 42:275–293, 2014.
Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp
Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, et al. Mm1: Methods,
analysis & insights from multimodal llm pre-training. arXiv preprint arXiv:2403.09611,
2024.
Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin,
Chenfan Sun, Seyed Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal,
et al. Openelm: An efficient language model family with open training and inference
framework. In Workshop on Efficient Systems for Foundation Models II, 2024.
Meta. Meta llama 3. https://ai.meta.com/blog/meta-llama-3/, 2024.
MosaicML. Mpt-7b. https://www.databricks.com/blog/mpt-7b, 2023.
Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby
Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, et al. Mobileaibench: Benchmarking
llms and lmms for on-device use cases. arXiv preprint arXiv:2406.10290, 2024.
Markus Nagel, Marios Fournarakis, Yelysei Bondarenko, and Tijmen Blankevoort. Overcoming oscillations in quantization-aware training. In International Conference on Machine
Learning, pp. 16318–16330. PMLR, 2022.
Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers.
Using an llm to help with code understanding. In Proceedings of the IEEE/ACM 46th
International Conference on Software Engineering, pp. 1–13, 2024.
Xuefei Ning, Zinan Lin, Zixuan Zhou, Huazhong Yang, and Yu Wang. Skeleton-of-thought:
Large language models can do parallel decoding. arXiv preprint arXiv:2307.15337, 2023.
Oleksiy Ostapenko, Zhan Su, Edoardo Maria Ponti, Laurent Charlin, Nicolas Le Roux,
Matheus Pereira, Lucas Caccia, and Alessandro Sordoni. Towards modular llms by
building and reusing a library of loras. arXiv preprint arXiv:2405.11157, 2024.
Yeonhong Park, Jake Hyun, SangLyul Cho, Bonggeun Sim, and Jae W Lee. Any-precision llm:
Low-cost deployment of multiple, different-sized llms. arXiv preprint arXiv:2402.10517,
2024.
Bohao Peng, Zhuotao Tian, Shu Liu, Mingchang Yang, and Jiaya Jia. Scalable language
model with generalized continual learning. arXiv preprint arXiv:2404.07470, 2024.
Cheng Peng, Xi Yang, Aokun Chen, Kaleb E Smith, Nima PourNejatian, Anthony B Costa,
Cheryl Martin, Mona G Flores, Ying Zhang, Tanja Magoc, et al. A study of generative
large language model for medical research and healthcare. NPJ digital medicine, 6(1):210,
2023.
Plaud. Plaud note summarizer. Plaud website, 2024. URL https://www.plaud.ai/.
PyTorch. executorch: Overview. PyTorch Official Website, 2024. URL https://pytorch.
org/executorch-overview.
Biqing Qi, Xinquan Chen, Junqi Gao, Dong Li, Jianxing Liu, Ligang Wu, and Bowen Zhou.
Interactive continual learning: Fast and slow thinking. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 12882–12892, 2024.
Ali Cloud Qwen Team. Qwen 2-0.5b. Github, 2024. URL https://github.com/QwenLM/
Qwen2.
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language
understanding by generative pre-training. 2018.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al.
Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive
multimodal large language model for long video understanding. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14313–14323, 2024.
Rajarshi Saha, Varun Srivastava, and Mert Pilanci. Matrix compression via randomized low
rank and low precision factorization. Advances in Neural Information Processing Systems, 36,
2023.
Sivan Schwartz, Avi Yaeli, and Segev Shlomov. Enhancing trust in llm-based ai automation
agents: New considerations and future challenges. arXiv preprint arXiv:2308.05391, 2023.
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey
Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-
of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
Yikang Shen, Zhen Guo, Tianle Cai, and Zengyi Qin. Jetmoe: Reaching llama2 performance
with 0.1M dollars. arXiv preprint arXiv:2404.07413, 2024.
Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, and Hao
Wang. Continual learning of large language models: A comprehensive survey. arXiv
preprint arXiv:2404.16789, 2024.
Ozge Mercanoglu Sincan, Necati Cihan Camgoz, and Richard Bowden. Using an llm to turn
sign spottings into spoken language sentences. arXiv preprint arXiv:2403.10434, 2024.
Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell
Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, et al. Dolma: An open
corpus of three trillion tokens for language model pretraining research. arXiv preprint
arXiv:2402.00159, 2024.
Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. Powerinfer: Fast large language model
serving with a consumer-grade gpu. arXiv preprint arXiv:2312.12456, 2023.
Jovan Stojkovic, Esha Choukse, Chaojie Zhang, Inigo Goiri, and Josep Torrellas. Towards
greener llms: Bringing energy-efficiency to the forefront of llm inference. arXiv preprint
arXiv:2403.20306, 2024.
Jing Su, Chufeng Jiang, Xin Jin, Yuxin Qiao, Tingsong Xiao, Hongda Ma, Rong Wei, Zhi
Jing, Jiajun Xu, and Junhong Lin. Large language models for forecasting and anomaly
detection: A systematic literature review. arXiv preprint arXiv:2402.10350, 2024.
taivo. Gpt4 response time. OpenAI community, 2023. URL https://community.openai.
com/t/gpt-3-5-and-gpt-4-api-response-time-measurements-fyi/237394/.
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui
Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family
of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju,
Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al.
Gemma: Open models based on gemini research and technology. arXiv preprint
arXiv:2403.08295, 2024.
InternLM Team. Internlm: A multilingual language model with progressively enhanced
capabilities, 2023.
MLC team. MLC-LLM, 2023. URL https://github.com/mlc-ai/mlc-llm.
VLLM Project Team. Vllm documentation. VLLM Documentation Website, 2024. URL
https://docs.vllm.ai/en/stable/.
Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Jia,
Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and
large vision-language models. arXiv preprint arXiv:2402.12289, 2024.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux,
Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al.
Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,
2023a.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2:
Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
Junlin Xie, Zhihong Chen, Ruifei Zhang, Xiang Wan, and Guanbin Li. Large multimodal
agents: A survey. arXiv preprint arXiv:2402.15116, 2024.
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai
Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pp. 10524–10533.
PMLR, 2020.
Daliang Xu, Wangsong Yin, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe
Liu. Llmcad: Fast and scalable on-device large language model inference. arXiv preprint
arXiv:2309.04255, 2023.
Jiajun Xu and Suvrajeet Sen. Compromise policy for multi-stage stochastic linear programming: Variance and bias reduction. Computers & Operations Research, 153:106132,
2023.
Jiajun Xu and Suvrajeet Sen. Ensemble variance reduction methods for stochastic mixed-
integer programming and their application to the stochastic facility location problem.
INFORMS Journal on Computing, 36(2):587–599, 2024.
Jiajun Xu, Qun Wang, Yuhang Cao, Baitao Zeng, and Sicheng Liu. A general-purpose device
for interaction with llms. arXiv preprint arXiv:2408.10230, 2024a.
Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang
Wu, Yihao Zhao, Chen Yang, Shihe Wang, et al. A survey of resource-efficient llm and
multimodal foundation models. arXiv preprint arXiv:2401.08092, 2024b.
Nan Xue, Yaping Sun, Zhiyong Chen, Meixia Tao, Xiaodong Xu, Liang Qian, Shuguang Cui,
and Ping Zhang. Wdmoe: Wireless distributed large language models with mixture of
experts. arXiv preprint arXiv:2405.03131, 2024a.
Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, and Haibo Chen. Powerinfer-2:
Fast large language model inference on a smartphone. arXiv preprint arXiv:2406.06282,
2024b.
Zheyu Yan, Yifan Qin, Xiaobo Sharon Hu, and Yiyu Shi. On the viability of using llms for
sw/hw co-design: An example in designing cim dnn accelerators. In 2023 IEEE 36th
International System-on-Chip Conference (SOCC), pp. 1–6. IEEE, 2023.
Haoyan Yang, Zhitao Li, Yong Zhang, Jianzong Wang, Ning Cheng, Ming Li, and Jing Xiao.
Pfid: Privacy first inference delegation framework for llms. arXiv preprint arXiv:2406.12238,
2024a.
Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang,
Shaochen Zhong, Bing Yin, and Xia Hu. Harnessing the power of llms in practice: A
survey on chatgpt and beyond. ACM Transactions on Knowledge Discovery from Data, 18(6):
1–32, 2024b.
Zheming Yang, Yuanhao Yang, Chang Zhao, Qi Guo, Wenkai He, and Wen Ji. Perllm:
Personalized inference scheduling with edge-cloud collaboration for diverse llm services.
arXiv preprint arXiv:2405.14636, 2024c.
Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Zhibo Sun, and Yue Zhang. A survey
on large language model (llm) security and privacy: The good, the bad, and the ugly.
High-Confidence Computing, pp. 100211, 2024a.
Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, and Yuxiong He. Exploring post-training
quantization in llms from comprehensive study to low rank compensation. In Proceedings
of the AAAI Conference on Artificial Intelligence, volume 38, pp. 19377–19385, 2024b.
Zhi Yao, Zhiqing Tang, Jiong Lou, Ping Shen, and Weijia Jia. Velo: A vector database-assisted
cloud-edge collaborative llm qos optimization framework. arXiv preprint arXiv:2406.13399,
2024c.
Rongjie Yi, Liwei Guo, Shiyun Wei, Ao Zhou, Shangguang Wang, and Mengwei Xu. Edgemoe: Fast on-device inference of moe-based large language models. arXiv preprint
arXiv:2308.14352, 2023.
Wangsong Yin, Mengwei Xu, Yuanchun Li, and Xuanzhe Liu. Llm as a system service on
mobile devices. arXiv preprint arXiv:2403.11805, 2024.
Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li,
Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. Yi: Open foundation models by 01.AI.
arXiv preprint arXiv:2403.04652, 2024.
Yizhen Yuan, Rui Kong, Yuanchun Li, and Yunxin Liu. Wip: An on-device llm-based
approach to query privacy protection. In Proceedings of the Workshop on Edge and Mobile
Foundation Models, pp. 7–9, 2024.
Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang.
Agenttuning: Enabling generalized agent abilities for llms. arXiv preprint arXiv:2310.12823,
2023a.
Fanlong Zeng, Wensheng Gan, Yongheng Wang, Ning Liu, and Philip S Yu. Large language
models for robotics: A survey. arXiv preprint arXiv:2311.07226, 2023b.
Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural
Information Processing Systems, 32, 2019.
Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen,
Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, et al. Ferret-v2: An improved baseline for referring and grounding with large language models. arXiv preprint
arXiv:2404.07973, 2024a.
Mingjin Zhang, Jiannong Cao, Xiaoming Shen, and Zeyang Cui. Edgeshard: Efficient llm
inference via collaborative edge computing. arXiv preprint arXiv:2405.14371, 2024b.
Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. Tinyllama: An open-source
small language model. arXiv preprint arXiv:2401.02385, 2024c.
Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan
Lu, Hongsheng Li, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models
with zero-init attention. arXiv preprint arXiv:2303.16199, 2023a.
Ruisi Zhang, Shehzeen Samarah Hussain, Paarth Neekhara, and Farinaz Koushanfar.
Remark-llm: A robust and efficient watermarking framework for generative large language models. arXiv preprint arXiv:2310.12362, 2023b.
Shiquan Zhang, Ying Ma, Le Fang, Hong Jia, Simon D’Alfonso, and Vassilis Kostakos.
Enabling on-device llms personalization with smartphone sensing. arXiv preprint
arXiv:2407.04418, 2024d.
Xiaojin Zhang, Yulin Fei, Yan Kang, Wei Chen, Lixin Fan, Hai Jin, and Qiang Yang. No
free lunch theorem for privacy-preserving llm inference. arXiv preprint arXiv:2405.20681,
2024e.
Xinyuan Zhang, Jiang Liu, Zehui Xiong, Yudong Huang, Gaochang Xie, and Ran Zhang.
Edge intelligence optimization for large language model inference with batching and
quantization. arXiv preprint arXiv:2405.07140, 2024f.
Haiyan Zhao, Fan Yang, Himabindu Lakkaraju, and Mengnan Du. Opening the black
box of large language models: Two views on holistic interpretability. arXiv preprint
arXiv:2402.10688, 2024a.
Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and
Yuandong Tian. Galore: Memory-efficient llm training by gradient low-rank projection.
arXiv preprint arXiv:2403.03507, 2024b.
Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, and Chuan Wu. Llm-pq: Serving llm
on heterogeneous clusters with phase-aware partition and adaptive quantization. arXiv
preprint arXiv:2403.01136, 2024c.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao
Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with
mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024a.
Zangwei Zheng, Xiaozhe Ren, Fuzhao Xue, Yang Luo, Xin Jiang, and Yang You. Response
length perception and sequence scheduling: An llm-empowered llm inference pipeline.
Advances in Neural Information Processing Systems, 36, 2024b.
Zoom. Zoom meeting summarizer. Zoom website, 2024. URL https://news.zoom.us/
zoom-iq-meeting-summary-chat-compose-free-trial/.