
Preprint, under review

On-Device Language Models: A Comprehensive Review

Jiajun Xu∗ (Meta, jjxu217@meta.com)
Zhiyuan Li∗ (Nexa AI, zack@nexa4ai.com)
Wei Chen∗ (Nexa AI, alexchen@nexa4ai.com)
Qun Wang∗ (Computer Science Department, San Francisco State University, qunwang@sfsu)
Xin Gao∗ (University of North Texas, xingao@my.unt.edu)
Qi Cai∗ (University of North Texas, qicai@my.unt.edu)
Ziyuan Ling∗ (Nexa AI, rita@nexa4ai.com)

∗ Equal contribution

arXiv:2409.00088v2 [cs.CL] 14 Sep 2024

Abstract

The advent of large language models (LLMs) revolutionized natural language processing applications, and running LLMs on edge devices has become increasingly attractive for reasons including reduced latency, data localization, and personalized user experiences. This comprehensive review examines the challenges of deploying computationally expensive LLMs on resource-constrained devices and explores innovative solutions across multiple domains. The paper investigates the development of on-device language models, their efficient architectures, including parameter sharing and modular designs, as well as state-of-the-art compression techniques like quantization, pruning, and knowledge distillation. Hardware acceleration strategies and collaborative edge-cloud deployment approaches are analyzed, highlighting the intricate balance between performance and resource utilization. Case studies of on-device language models from major mobile manufacturers demonstrate real-world applications and potential benefits. The review also addresses critical aspects such as adaptive learning, multi-modal capabilities, and personalization. By identifying key research directions and open challenges, this paper provides a roadmap for future advancements in on-device language models, emphasizing the need for interdisciplinary efforts to realize the full potential of ubiquitous, intelligent computing while ensuring responsible and ethical deployment. For a comprehensive review of research work and educational resources on on-device large language models (LLMs), please visit https://github.com/NexaAI/Awesome-LLMs-on-device. To download and run on-device LLMs, visit https://www.nexaai.com/models.

1 Introduction

The emergence of Large Language Models (LLMs) has catalyzed a transformative shift in
natural language processing (NLP) applications. By leveraging the transformer architecture
(Vaswani et al., 2017), LLMs such as OpenAI’s GPT series (Radford et al., 2019; Brown
et al., 2020; Achiam et al., 2023) and Meta’s LLaMA series (Touvron et al., 2023a;b; Meta,
2024; Dubey et al., 2024) have demonstrated unparalleled proficiency in understanding
and generating human-like text, profoundly influencing fields ranging from automated
customer support to advanced content creation. The ability of these models to seamlessly
perform a variety of NLP tasks has positioned them as the backbone of modern AI-driven
applications (Wu et al., 2023b; Ge et al., 2024; Nam et al., 2024; Zheng et al., 2024a; Yang
et al., 2024b).
However, the traditional deployment of LLMs predominantly on cloud servers presents
several challenges, particularly in terms of latency, security, and the need for continuous
Internet connectivity. These concerns are driving the burgeoning interest in deploying
LLMs on edge devices—a shift that promises reduced response times, and personalized
user experiences directly on user devices such as smartphones, automotive systems, and
personal wearables. This paradigm shift not only aligns with the increasing user demand
for immediate and personalized assistance but also mitigates the bandwidth and energy
costs associated with cloud computing.

[Figure 1 (bar chart): global on-device edge AI market size in USD billion by end-use industry (manufacturing, automotive, government, IT & telecom, consumers & goods, healthcare, other end-use industries), growing from $15.2B in 2022 to $143.6B in 2032.]

Figure 1: The global market size for on-device edge AI, by end-user, from 2022 to 2032, in USD billion. The market is projected to grow at a CAGR of 25.9%, reaching a forecasted size of $143.6B in 2032 (Market.us, 2024).

The growing interest in on-device AI deployment is reflected in the rapidly expanding edge
AI market. As illustrated in Figure 1, the edge AI market is projected to experience substan-
tial growth across various sectors from 2022 to 2032. The market size is expected to increase
from $15.2 billion in 2022 to $143.6 billion by 2032, representing a nearly tenfold growth over
a decade (Market.us, 2024). This growth spans multiple industries, with manufacturing, au-
tomotive, and government sectors showing significant contributions. The projected market
expansion underscores the increasing demand for edge AI solutions, including on-device
language models, driven by the need for faster, more private, and efficient AI capabilities
across diverse applications. This market trend aligns with the technological push towards
more localized AI processing, further emphasizing the importance of developing efficient
on-device LLM solutions.
Despite the compelling advantages, integrating computationally intensive language models
within the constraints of edge devices poses significant challenges. The primary obstacles
include limited computational power, reduced memory capacity, and energy constraints,
which collectively complicate the direct adoption of cloud-based LLM architectures. For instance, executing a state-of-the-art 405-billion-parameter model (Dubey et al., 2024) on a smartphone would be infeasible without substantial compromises in model performance and energy efficiency.

[Figure 2 (taxonomy diagram) lays out the structure of the survey: Introduction; Foundations and Preliminaries (evolution of on-device LLMs, LLM architecture foundations for traditional text-based and multimodal LLMs, on-device LLM training via quantization-aware scaling, sparse update, the Tiny Training Engine, and contribution analysis, limitations of cloud-based inference and advantages of on-device inference, and performance indicators); Efficient Architectures for On-Device LLMs (architectural design principles, model compression and parameter sharing, collaborative and hierarchical approaches, memory and computational efficiency, Mixture-of-Experts architectures, and general efficiency improvements); Model Compression and Optimization Techniques (quantization, pruning, knowledge distillation, low-rank factorization); Hardware Acceleration and Deployment Strategies (frameworks such as Llama.cpp, MNN, PowerInfer, ExecuTorch, MediaPipe, MLC-LLM, VLLM, and OpenLLM by BentoML, plus GPU, NPU, and FPGA acceleration); Examples and Applications (Gemini Nano, Honor MagicLM, Nexa AI Octopus series, Apple OpenELM and Ferret-v2, Microsoft Phi series, MiniCPM, NOMI GPT, Gemma2-9B, Qwen2-0.5B, with applications spanning messaging, translation, meeting summarization, healthcare, scientific research support, companion robots, disability support, and autonomous vehicles); Future Directions and Open Challenges; and Conclusion.]
Figure 2: The architecture of this paper


This review paper provides a comprehensive exploration of the current strategies and
advancements in the deployment of LLMs on edge devices. We aim to critically analyze
the various techniques and architectures that have been developed to adapt LLMs to the
constraints of edge computing. This includes a detailed examination of model compression
techniques, energy-efficient computing strategies, and the development of novel lightweight
model architectures. Furthermore, the paper will delve into deployment strategies that
enable the effective use of LLMs in edge scenarios, highlighting key industry applications
and the resulting benefits.
Through this review, we intend to illuminate the pathways and challenges in transitioning
from cloud-based to on-device language models, providing insights into how this shift
could redefine the landscape of applications and AI accessibility. The structure of this paper
is illustrated in Fig. 2. We begin by exploring the foundations and preliminaries in Section
2, including the evolution of LLMs on-device, architectural foundations, and on-device
training techniques. Section 3 delves into efficient architectures for on-device language
models, discussing innovative design principles, model compression, and collaborative
approaches. Section 4 continues with an in-depth examination of model compression and
optimization techniques, covering quantization, pruning, knowledge distillation, and low-
rank factorization. Section 5 investigates hardware acceleration and deployment strategies,
highlighting popular on-device LLM frameworks and hardware-specific optimizations. To
contextualize these advancements, in Section 6, we present examples of existing on-device
language models and their real-world applications across various domains. Finally, Section
7 discusses future directions and open challenges in the field, and Section 8 concludes
our review. By focusing on the intersection of LLM capabilities and edge computing
requirements, this paper contributes to the ongoing discourse in AI research, offering a
comprehensive perspective on achieving the delicate balance between model performance
and computational efficiency in resource-constrained environments.

2 Foundations and Preliminaries

2.1 Evolution of On-Device LLMs

The evolution of on-device LLMs is a process closely linked to technological progress. Figure
3 provides a comprehensive timeline of on-device language model development since 2023,
illustrating the rapid advancement in this field. As shown in the figure, the exploration and
experimentation of large language models on the edge began in earnest in 2023. We saw
the emergence of several influential model series with parameters below 10B, making it
possible for LLMs to run on edge devices. Notable examples include:

• Meta’s LLaMA series (Touvron et al. (2023a;b); Meta (2024); Dubey et al. (2024))
• Microsoft’s Phi series (Gunasekar et al. (2023); Li et al. (2023c); Abdin et al. (2024))
• Zhipu AI’s ChatGLM series (GLM et al. (2024))
• Alibaba’s Qwen series (Bai et al. (2023a); Qwen Team (2024))
• 01.AI’s Yi series (Young et al. (2024); 01.AI (2024))
• Mistral’s series (Jiang et al. (2023; 2024a))
• Shanghai AI Laboratory’s InternLM series (Team (2023); Cai et al. (2024b))

In addition, models such as Falcon, released by TII (Almazrouei et al., 2023), and MPT, released by Mosaic ML (MosaicML, 2023), also compete in this space. Although the performance of these small-parameter models does not match that of traditional large-parameter models, they make it possible for LLMs to run on edge devices. Their emergence signals how much importance the language-model industry now places on edge-device application scenarios. At the same time, with the adoption of techniques such as mixture-of-experts, quantization, and compression, the performance of small-parameter models continues to improve markedly while their parameter counts stay fixed.


[Figure 3 (timeline, February 2023 to July 2024) charts the release of sub-10B text and multimodal models: Nexa AI (Octopus v2/v3, Octo-planner; 0.5B-3B), Meta (Llama 1/2/3; 7B-8B), Zhipu AI (ChatGLM 1/2/3, GLM-4/4v; 6B-9B), Mosaic ML (MPT; 7B), Microsoft (Phi-1, Phi-1.5, Phi-2, Phi-3 mini/vision/small; 1.3B-7B), TII (Falcon, Falcon 2 and Falcon 2 11B VLM; 7B-11B), Baichuan AI (Baichuan 1/2; 7B), Mistral (Mistral, Mistral 8x; 7B-7.3B), Alibaba Cloud (Qwen 1/1.5/2, Qwen-VL; 0.5B-9.6B), University of Wisconsin-Madison (LLaVA 1.0/1.5/NeXT; 6.7B-13B), 01.AI (Yi, Yi VL, Yi 1.5; 6B), Google (Gemini Nano 1.8B, Gemma 1/2; 2B-9B), ModelBest (MiniCPM, MiniCPM V2.0, MiniCPM Llama 3 V2.5; 2B-8B), Apple (OpenELM 1.1B, DCLM 0.4B-6.9B), AI2 (OLMo; 1B, 7B), Shanghai AI Laboratory (InternLM2, InternLM2.5; 7B), M-A-P (MAP Neo; 7B), and Hugging Face (SmolLM; 135M-1.7B).]
Figure 3: Summary of on-device LLMs’ evolution

Figure 3 also highlights the emergence of multimodal models since 2023, such as the LLaVa
series (Liu et al., 2024a;b), QwenVL (Bai et al., 2023b), Gemini Nano (Team et al., 2023), and
Yi VL (Young et al., 2024). These models represent valuable attempts to deploy multimodal
LLMs on the edge, adapting to more complex and changing user scenarios on mobile
devices.
Entering 2024, the pace of innovation accelerated, as evident from the dense cluster of new
models in the figure’s rightmost section. This period saw the introduction of:

• Nexa AI’s Octopus series (Chen & Li, 2024a;b;c)


• ModelBest’s MiniCPM series (Hu et al., 2024b; Tsinghua University, 2024)
• Google’s Gemma series (Team et al., 2024; Google, 2024a)
• Apple’s OpenELM (Mehta et al., 2024) and DataComp-LM (Li et al., 2024a)
• AI2’s OLMo (Soldaini et al., 2024; Groeneveld et al., 2024)

Figure 3 clearly shows an increased focus on multimodal capabilities in 2024, with many new
models offering both text and multimodal functionalities to address diverse task-processing
scenarios. As illustrated by the variety and progression of models, on-device language
models are rapidly evolving and diversifying. This trend, coupled with the continuous
maturation of intelligent hardware and software technologies, enables the integration of
these models into smartphones, Internet-connected cars, computers, robots, and other
terminal equipment, showcasing their growing application potential and value.


2.2 LLM Architecture Foundations


1. Traditional text-based LLMs: The Transformer is a deep learning model based on an attention mechanism (Vaswani et al., 2017), widely used to process sequential data, especially in natural language processing tasks. It consists of two parts: an encoder and a decoder. Today's popular large language models mainly use a decoder-only architecture (Fu et al., 2023), with representative models such as GPT (Generative Pre-trained Transformer) and LLaMA (Large Language Model Meta AI). The GPT model consists of multiple decoder layers (Radford et al., 2018; 2019; Brown et al., 2020), each built around a self-attention mechanism, and applies layer normalization after each sub-layer (Floridi & Chiriatti, 2020). In contrast, LLaMA applies normalization (Ioffe & Szegedy, 2015; Zhang & Sennrich, 2019; Xiong et al., 2020) before each sub-layer operation, which helps to improve the stability of the training process (Touvron et al., 2023a). In terms of attention mechanisms, GPT uses standard self-attention, which allows the model to consider information from all positions in the input sequence when generating output, while LLaMA uses Grouped-Query Attention (GQA) (Ainslie et al., 2023), an optimization that reduces the computational and memory footprint of the model and improves efficiency.
The concept of MoE (Mixture of Experts), which originated in 1991 (Jacobs et al., 1991), plays a key role in the pre-training of today's language models. It enables efficient pre-training with far less computational resources than are required for dense models. The mechanism consists of two key components: a sparse MoE layer containing a number of “experts”, each of which is a separate neural network in its own right (Shazeer et al., 2017; Chen et al., 2022; Du et al., 2022), and a gating network (or router) that determines which tokens are sent to which expert for processing. An MoE architecture replaces each feed-forward network (FFN) layer in a traditional Transformer with an MoE layer consisting of a gating network and a number of experts (Masoudnia & Ebrahimpour, 2014); a minimal sketch of such a layer is given after this list.
2. Multimodal LLMs: With the powerful learning architecture of the Transformer, large multimodal models can process multiple different modalities at the same time, such as text, images, audio, and data tables (Xie et al., 2024; Wu et al., 2023a). Their internal operating mechanisms fall into the following categories:
A) Use standard cross-attention layers to perform deep fusion of multimodal inputs
in the internal layers of the model (such as MultiModal-GPT (Gong et al., 2023))
B) Use custom-designed layers to perform deep fusion of multimodal inputs in the
internal layers of the model (LLaMA-Adapter (Zhang et al. (2023a)), MoE-LLaVa
(Lin et al. (2024a)))
C) Perform early fusion of multimodal inputs at the input stage of the model, using
modality-specific encoders (LLaVa (Liu et al., 2024b), Qwen-VL (Bai et al., 2023a))
D) Perform early fusion at the input stage, but use tokenization techniques (such as
tokenizers) to handle modalities (Wadekar et al., 2024).
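
To make the MoE mechanism above concrete, the following is the minimal sketch referenced in the MoE discussion: a sparsely gated MoE layer with top-k routing, written in PyTorch. It is illustrative only; the layer sizes, expert count, and routing rule are assumptions, and production MoE systems add load-balancing losses, capacity limits, and expert parallelism.

# Minimal sketch of a sparsely gated MoE layer with top-k routing (assumed sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert is an independent feed-forward network (replacing the dense FFN).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # The gating network scores every expert for every token.
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x):                      # x: (batch, seq, d_model)
        scores = self.gate(x)                  # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the selected experts only
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)      # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

moe = SparseMoE()
tokens = torch.randn(2, 16, 512)
print(moe(tokens).shape)   # torch.Size([2, 16, 512])

Because only two experts are activated per token, a fraction of the total parameters does work for any given input, which is what makes MoE attractive under tight compute budgets.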

2.3 On-Device LLMs Training

Deploying large language models (LLMs) on resource-constrained devices poses challenges such as limited memory and computational power (Loukas et al., 2023). To address these issues, collaborative and hierarchical model approaches offer innovative solutions by distributing computational load and utilizing models with varying capabilities.
Classic methods for training on resource-constrained devices include the following (a minimal sparse-update sketch is given after the list):

1. Quantization-aware scaling: Stabilize the training process by automatically scaling the gradients of tensors with different bit precisions, solve the problem of inconsistent gradient scales of tensors with different bit widths in the quantization graph, and make the training accuracy of the quantized model comparable to that of the floating-point model (Nagel et al., 2022; Huang et al., 2024a).

2. Sparse update: Selectively update the weights of a portion of the layers in the
network, skip the gradient calculations of less important layers and sub-tensors,
thereby reducing memory usage and computational costs (Liu et al., 2023; Ansell
et al., 2024).
3. Tiny Training Engine (TTE): Prune redundant nodes in the backward graph, such as gradient computations for frozen weights, and reorder operations to achieve in-place updates (Lin et al., 2023a; Khouas et al., 2024).
4. Contribution analysis: Automatically determine the sparse update scheme, that is,
determine which parameters (weights/biases) contribute the most to downstream
accuracy, so as to select which layers or parts of tensors should be updated under a
limited memory budget (Lin et al., 2022; Ren et al., 2024; Zeng et al., 2023a).
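
As a concrete illustration of the sparse update idea (item 2), the sketch below freezes most of a toy PyTorch model and fine-tunes only a subset of layers; deciding which layers to keep trainable is exactly what contribution analysis (item 4) addresses. The model, layer choice, and hyperparameters are illustrative assumptions, not a method from the source.

# Minimal sketch of a sparse update: freeze most layers, train only a chosen subset.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(*[nn.Linear(128, 128) for _ in range(8)])

# Assumption: a contribution analysis has ranked layers 6 and 7 as most important.
trainable_layers = {6, 7}

for i, layer in enumerate(model):
    for p in layer.parameters():
        p.requires_grad = i in trainable_layers   # skip gradients elsewhere

# Only the selected parameters go to the optimizer, cutting gradient and optimizer-state
# memory roughly in proportion to the frozen fraction.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)

x, y = torch.randn(32, 128), torch.randn(32, 128)
loss = F.mse_loss(model(x), y)
loss.backward()          # gradients are computed only for the trainable layers
optimizer.step()
print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable params")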

2.4 Limitations of Cloud-Based LLM Inference and Advantages of On-Device Inference

Edge-cloud (local-remote) collaborative deployment of LLMs is preferred by users, whereas existing cloud-only (remote-only) solutions (e.g., ChatGPT) are not widely accepted. As shown in Figure 4, 88% of surveyed participants prefer an edge-cloud collaborative architecture, 58.33% support local deployment, and 81.82% are not satisfied with existing cloud-only solutions. Their main concerns are 1) the high latency of remote LLM services, 2) the risk of transmitting personal data to the cloud, and 3) the cost of cloud-based LLM services (Li et al., 2024c).

Figure 4: Vote distribution of different LLM deployment strategies in a user survey (Li et al., 2024c)

Although cloud-based LLMs offer powerful capabilities, they come with certain drawbacks,
including potential latency issues (Wang et al., 2024b) and data concerns due to their depen-
dency on networks. Hence, the concept of on-device deployment through edge computing
has emerged to reduce latency and safeguard user data (Gerganov, 2023). Processing oc-
curs locally, eliminating the need for data transmission. Moreover, the proliferation of
customized hardware accelerators on mobile devices has made it feasible to run large LLMs
with billions of parameters directly on devices.
On-device inference provides a compelling case for reducing latency because it allows
models to run directly on the user’s device without sending data to a cloud server. This
approach is particularly beneficial for applications that require real-time responses. In the
case of GPT-4, which serves responses from the cloud, each token is generated in roughly 200 ms, while common on-device models can already generate tokens faster than this (taivo, 2023).
The ability to run models offline reduces reliance on network connectivity, making applica-
tions more accessible in areas with poor network coverage or other offline environments. For
example, Google’s Gemini Nano-based TalkBack, a feature that uses multimodal capabilities
to recognize image content to provide voice broadcasts to people with disabilities, can work
properly even when completely offline (Google, 2024b). On-device inference also optimizes the use of limited computing resources through techniques such as model quantization, allowing language models to run efficiently even on devices with limited memory.
The deployment of LLMs on mobile devices is further facilitated by user-friendly interfaces
that abstract away the complexities of AI, making the technology accessible to users without
specialized knowledge. Moreover, these applications are not just limited to text generation
but can extend their functionality to interact with device features, such as making calls,
conducting web searches, and managing calendar events, through innovative text-to-actions
features.

2.5 The Performance Indicator of On-Device LLMs

Latency is the time it takes from the user inputting a request to the system starting to
respond. It usually refers to the time from when the model receives the input text to when it
starts generating the first output. We generally use TTFT (Time-to-First-Token) to measure
this metric (Hu et al., 2024a; Agrawal et al., 2024b;a).
Inference speed refers to the rate at which the LLM autoregressively predicts the next token given all previously seen tokens. Beyond the initial prompt processing, generation must decode one token at a time, because each new token depends on the tokens before it and cannot be produced in advance. This decoding step takes up most of the time in LLM inference, so its speed largely determines whether a dialogue with the model feels smooth and thus directly affects the user experience (Çöplü et al., 2023; Cai et al., 2024a; Zheng et al., 2024b).
The amount of RAM/VRAM used is another performance indicator. Because of how language models operate, they consume memory in proportion to their parameter count during inference; for example, it is impractical to deploy a 70B-parameter model on a typical office laptop. This is crucial for many edge devices with limited RAM, and engineers must use various model compression technologies to minimize the memory occupied by language model inference (Kwon et al., 2023; Zhao et al., 2024b;c).
In addition, the storage space occupied by models and the energy consumed during inference are important indicators on edge devices; they largely determine whether LLMs can run on a given device and for how long. In most cases, LLM inference keeps the processor fully loaded, and if it runs for too long it seriously drains the battery of the mobile device, creating new problems. For example, inference with a 7B-parameter LLM consumes about 0.7 J per token. For an iPhone with a battery capacity of about 50 kJ, this means a conversation with the model can last at most around two hours, without even accounting for issues such as device heating caused by model inference (Liu et al., 2024c; Stojkovic et al., 2024; Jiang et al., 2024b). A quick check of this estimate is sketched below.
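
A back-of-the-envelope check of the battery estimate above, using the stated figures (about 0.7 J per token and a roughly 50 kJ battery) plus an assumed decoding speed of about 10 tokens per second, which is an assumption rather than a figure from the source:

# Rough check of the battery-life estimate under the stated assumptions.
energy_per_token_j = 0.7          # from the text: ~0.7 J per generated token (7B model)
battery_j = 50_000                # from the text: ~50 kJ iPhone battery
tokens_per_s = 10                 # assumed decoding speed, not from the source

total_tokens = battery_j / energy_per_token_j        # ~71,000 tokens on one charge
runtime_hours = total_tokens / tokens_per_s / 3600   # ~2 hours of continuous decoding
print(f"{total_tokens:.0f} tokens, about {runtime_hours:.1f} h")

At these rates the battery supports roughly 71,000 generated tokens, or about two hours of continuous decoding, consistent with the estimate above.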


3 Efficient Architectures for On-Device LLMs

3.1 Architectural Design Principles and Innovations for On-Device LLMs

Designing language models for on-device deployment involves several architectural princi-
ples and innovations aimed at overcoming the resource constraints typical of mobile and
edge devices. Key strategies include 1) parameter sharing (Lin et al., 2023b; Cao et al., 2024),
which involves reusing weights across different parts of the model to reduce the overall
parameter count; 2) modular architectures (Ning et al., 2023; Ostapenko et al., 2024; Shen
et al., 2024), which break down the LLM into smaller, independent components or modules
that can be processed separately or in parallel; and 3) compact representations, which focus
on reducing the memory footprint of LLMs through techniques like quantization and weight
pruning (Liu et al., 2024c; Zhang et al., 2024b; Xu et al., 2023). To provide a comprehensive
comparison of these architectures, we consider their performance, computational efficiency,
and memory requirements, which are summarized in Table 1.

Table 1: Comparative Analysis of State-of-the-Art On-Device LLM Architectures


Model | Performance | Computational Efficiency | Memory Requirements
MobileLLM (Liu et al., 2024c) | High accuracy, optimized for sub-billion parameter models | Embedding sharing, grouped-query attention | Reduced model size due to deep and thin structures
EdgeShard (Zhang et al., 2024b) | Up to 50% latency reduction, 2× throughput improvement | Collaborative edge-cloud computing, optimal shard placement | Distributed model components reduce individual device load
LLMCad (Xu et al., 2023) | Up to 9.3× speedup in token generation | Generate-then-verify, token tree generation | Smaller LLM for token generation, larger LLM for verification
Any-Precision LLM (Park et al., 2024) | Supports multiple precisions efficiently | Post-training quantization, memory-efficient design | Substantial memory savings with versatile model precisions
Breakthrough Memory (Kim et al., 2024c) | Up to 4.5× performance improvement | PIM and PNM technologies enhance memory processing | Enhanced memory bandwidth and capacity
MELTing Point (Laskaridis et al., 2024) | Provides systematic performance evaluation | Analyzes impacts of quantization, efficient model evaluation | Evaluates memory and computational efficiency trade-offs
LLMaaS on MD (Yin et al., 2024) | Reduces context-switching latency significantly | Stateful execution, fine-grained KV cache compression | Efficient memory management with tolerance-aware compression and swapping
LocMoE (Li et al., 2024b) | Reduces training time per epoch by up to 22.24% | Orthogonal gating weights, locality-based expert regularization | Minimizes communication overhead with group-wise All-to-All and recompute pipeline
EdgeMoE (Yi et al., 2023) | Significant performance improvements on edge devices | Expert-wise bitwidth adaptation, preloading experts | Efficient memory management through expert-by-expert computation reordering
JetMoE (Shen et al., 2024) | Outperforms Llama2-7B and 13B-Chat with fewer parameters | Reduces inference computation by 70% using sparse activation | 8B total parameters, only 2B activated per input token

3.2 Model Compression and Parameter Sharing

Efficient deployment of LLMs on resource-constrained devices such as smartphones and edge devices often requires reducing the model size without significantly sacrificing performance. Model compression and parameter-sharing techniques play a critical role in achieving this balance. This subsection reviews key research works that focus on optimizing sub-billion parameter LLMs through innovative compression and parameter-sharing methods.
Lin et al. (2024b) introduce Activation-aware Weight Quantization (AWQ), a weight-only quantization method built on the observation that weights in LLMs are not equally important. AWQ protects a small fraction of crucial weights (0.1%-1%), reducing quantization loss and preserving the generalization ability of LLMs across different domains and modalities. Unlike traditional methods, AWQ does not require backpropagation or reconstruction, thus maintaining efficiency and performance. The accompanying TinyChat inference framework implements AWQ, achieving significant speedup (up to 3×) over traditional FP16 implementations on both desktop and mobile GPUs.
MobileLLM addresses the need for efficient LLMs on mobile devices by proposing a deep
and thin architecture optimized for sub-billion parameter counts (Liu et al., 2024c). This
approach challenges the common belief that wider models are better, demonstrating that
deep and thin structures can capture complex patterns effectively. Key techniques include
embedding sharing, grouped-query attention, and block-wise immediate weight sharing.
MobileLLM achieves significant accuracy improvements over previous state-of-the-art
models (e.g., 2.7% and 4.3% accuracy boost over 125M and 350M models, respectively). The
enhanced version, MobileLLM-LS, further increases accuracy while maintaining a small
model size, making it ideal for on-device applications.
Both AWQ and MobileLLM showcase the potential of model compression and parameter-
sharing techniques in making LLMs feasible for deployment on mobile and edge devices.
AWQ focuses on weight quantization to reduce model size and improve inference speed,
while MobileLLM emphasizes architectural optimizations and weight sharing to create
efficient sub-billion parameter models. These innovations are crucial for enhancing the
performance and accessibility of LLMs in resource-constrained environments, enabling
advanced AI capabilities on personal devices without compromising accuracy or efficiency.
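
As a small illustration of the parameter-sharing idea that MobileLLM builds on, the sketch below ties a toy model's output projection to its input embedding matrix so the vocabulary-sized weight matrix is stored only once. This is a simplified PyTorch example, not MobileLLM's actual architecture; sizes and layer counts are arbitrary assumptions.

# Minimal sketch of embedding (weight) sharing: the output head reuses the input embedding.
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=32_000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight   # shared: one matrix serves both ends

    def forward(self, ids):
        return self.lm_head(self.backbone(self.embed(ids)))

model = TinyLM()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")   # the shared vocab matrix is counted once

For sub-billion parameter models the embedding matrix is a large fraction of the total weight budget, which is why this kind of sharing matters more on-device than it does for very large models.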

3.3 Collaborative and Hierarchical Model Approaches

Deploying language models on resource-constrained devices faces significant challenges, such as limited memory and computational power. Collaborative and hierarchical model
approaches offer innovative solutions to overcome these limitations by distributing the
computational load and leveraging multiple models with varying capabilities. This subsec-
tion reviews key research works that implement collaborative and hierarchical strategies to
enhance the efficiency and scalability of on-device LLMs.
EdgeShard (Zhang et al., 2024b) partitions large LLMs into smaller segments (shards) and strategically distributes them across edge devices and cloud servers. This method reduces latency and improves throughput by utilizing the computational power of multiple devices simultaneously. A dynamic programming algorithm optimizes shard placement, balancing the computational load and minimizing communication overhead. Experimental results show significant improvements in latency reduction (up to 50%) and throughput enhancement (up to 2×) compared to traditional cloud-based methods.
LLMCad presents a novel inference engine that combines a smaller, memory-resident LLM
with a larger, more accurate LLM (Xu et al., 2023). The smaller LLM generates candidate
tokens, while the larger LLM verifies and corrects these tokens. This “generate-then-verify” approach leverages the efficiency of the smaller model and maintains the accuracy of the
larger model. LLMCad introduces several techniques, including token tree generation and
verification, self-adaptive fallback strategy, and speculative generation pipeline. These
innovations enable LLMCad to achieve up to 9.3× speedup in token generation without
compromising accuracy, making it suitable for real-time applications on mobile devices.
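
The sketch below illustrates the generate-then-verify idea in its simplest form, close to speculative decoding: a small draft model proposes a few tokens, the larger model scores all of them in one forward pass, and tokens are accepted only up to the first disagreement. It assumes Hugging Face-style causal language models whose forward call returns logits; the greedy acceptance rule and fixed draft length are illustrative simplifications, not LLMCad's actual token-tree algorithm.

# Minimal generate-then-verify sketch (assumes HF-style models returning .logits).
import torch

@torch.no_grad()
def generate_then_verify(draft, target, ids, n_draft=4, max_len=32):
    while ids.shape[1] < max_len:
        # 1) The small, memory-resident model drafts a short continuation.
        draft_ids = ids
        for _ in range(n_draft):
            nxt = draft(draft_ids).logits[:, -1].argmax(-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, nxt], dim=1)
        proposed = draft_ids[:, ids.shape[1]:]

        # 2) The large model verifies all drafted positions in a single forward pass.
        target_logits = target(draft_ids).logits[:, ids.shape[1] - 1:-1]
        target_pick = target_logits.argmax(-1)

        # 3) Accept drafted tokens up to the first disagreement, then take the large
        #    model's token at that position so every iteration makes progress.
        n_ok = int((proposed == target_pick).int().cumprod(dim=1).sum())
        keep = proposed[:, :n_ok]
        fix = target_pick[:, n_ok:n_ok + 1]
        ids = torch.cat([ids, keep, fix], dim=1)
    return ids

The gain comes from step 2: verifying several tokens costs one large-model pass instead of one pass per token, so the expensive model runs far less often when the draft model agrees with it.
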
WDMoE (Xue et al., 2024a) proposes a new paradigm for deploying LLMs in a wireless communication system. Through MoE layer decomposition, the gating network is deployed at the base station while expert networks are distributed across mobile devices, optimizing performance and reducing latency. In addition, an expert selection policy dynamically adjusts expert selection based on wireless channel conditions to ensure optimal performance.
Collaborative and hierarchical model approaches, such as those proposed in EdgeShard
and LLMCad, offer effective solutions to the challenges of deploying LLMs on resource-
constrained devices. By distributing the computational load across multiple devices and
using smaller models for preliminary tasks, these methods enhance the scalability and
efficiency of LLM inference. The EdgeShard framework demonstrates the benefits of collabo-
rative edge-cloud computing, while LLMCad showcases the potential of hierarchical model
collaboration in maintaining accuracy and improving inference speed. These approaches
are crucial for enabling advanced AI capabilities on mobile and edge devices, providing
real-time performance and efficient resource utilization.

3.4 Memory and Computational Efficiency

Efficient memory and computational resource utilization are critical for deploying large
language models (LLMs) on mobile and edge devices. Various techniques and innovations
aim to optimize the use of limited resources to ensure that LLMs can perform effectively
without overwhelming the device’s capabilities. This subsection reviews key research works
focusing on enhancing memory and computational efficiency for on-device LLMs.
Researchers from Samsung Electronics propose innovative memory solutions to address the memory bottlenecks in LLM deployment (Kim et al., 2024c). The authors introduce Processing-in-Memory (PIM) and Processing-near-Memory (PNM) technologies:
• Aquabolt-XL (Kim et al., 2021) and LPDDR-PIM (Kim et al., 2024a): these PIM devices embed logic within the memory core, boosting internal memory bandwidth and supporting high-performance computing tasks, including LLM acceleration.
• AXDIMM (Ke et al., 2021) and CXL-PNM: these PNM solutions place computational logic near the memory core, enhancing memory bandwidth and capacity. CXL-PNM integrates computational logic into the CXL memory controller, significantly improving memory capacity and performance.
Experimental results show that these memory solutions achieve up to 4.5× performance improvement and 71% energy reduction compared to traditional memory architectures, making them highly suitable for LLM inference on resource-constrained devices.
MELTing Point introduces the MELT infrastructure, designed to facilitate the execution and
benchmarking of LLMs on mobile devices (Laskaridis et al., 2024). The MELT framework
supports Android, iOS, and Nvidia Jetson devices and provides detailed performance and
energy metrics. MELT systematically evaluates on-device LLM execution, providing insights
into performance, energy efficiency, and memory usage across various models. The paper
examines the impact of model quantization on performance and accuracy, demonstrating
that while quantization reduces memory requirements, it incurs an accuracy cost. The
results highlight the importance of balancing memory and computational efficiency with
performance to make LLMs viable for mobile applications.
Memory and computational efficiency are paramount for deploying LLMs on mobile and
edge devices. The research works reviewed in this subsection present innovative solutions
to overcome the memory wall and optimize resource usage. Samsung’s memory solutions,
such as PIM and PNM, significantly enhance memory bandwidth and capacity, enabling
efficient LLM inference. The MELT infrastructure provides a comprehensive evaluation
framework, offering valuable insights into the trade-offs between performance, energy
efficiency, and memory usage. These advancements are crucial for ensuring that LLMs can
operate effectively on resource-constrained devices, paving the way for more practical and
efficient AI applications in mobile and edge environments.

3.5 Mixture-of-Experts (MoE) Architectures

Mixture-of-Experts (MoE) architectures offer a promising approach for deploying LLMs on edge devices by leveraging sparse activation and dynamic routing to improve efficiency.
This subsection reviews key research works focusing on MoE-based models designed to
optimize performance and resource utilization in on-device deployments.


EdgeMoE (Yi et al., 2023) introduces a framework designed to efficiently execute MoE models on edge devices. The authors propose expert-wise bitwidth adaptation to reduce the size of expert weights with minimal accuracy loss using per-channel linear quantization. Using novel expert management methods, they preload expert weights into the compute-I/O pipeline to reduce I/O swapping overhead. Experimental results demonstrate significant memory savings and performance improvements compared to baseline solutions, achieving up to 2.78× speedup in inference.
LocMoE (Li et al., 2024b) introduces a routing strategy and communication optimization scheme to improve the efficiency of training MoE-based LLMs. An orthogonal gating weights method is employed to reduce computational costs and facilitate explicit routing decisions, while locality-based expert regularization encourages local experts to compete, reducing communication time and avoiding under-training. The authors also combine group-wise All-to-All operations with communication overlap, masking delays by overlapping computation with communication.
Yin et al. (2024) propose the LLMaaS paradigm, integrating large language models as a system service on mobile devices. In their design, stateful execution allows the system to maintain persistent state (the KV cache) across multiple invocations to improve performance, and a unified interface reduces memory usage by exposing LLMs and their infrastructure as a system feature to mobile apps. They also introduce techniques such as chunk-wise KV cache compression and swapping to minimize context-switching overhead.
JetMoE presents an efficient approach to large language model training using a Sparsely-
gated Mixture-of-Experts (SMoE) architecture (Shen et al., 2024). The authors apply sparse
activation to both attention and feed-forward layers, significantly reducing computational
costs while maintaining high performance. JetMoE-8B, trained with less than $0.1 million
using 1.25T tokens and 30,000 H100 GPU hours, outperforms Llama2-7B, while JetMoE-8B-
Chat surpasses Llama2-13B-Chat. The model’s 8B total parameters with only 2B activated
per input token reduces inference computation by about 70% compared to Llama2-7B.
MoE architectures offer innovative solutions to the challenges of deploying LLMs on edge
devices. These approaches leverage sparse activation and dynamic routing to improve
computational efficiency and resource utilization.

3.6 General Efficiency and Performance Improvements

Achieving efficient deployment of LLMs on edge devices involves a range of strategies aimed
at improving overall performance while managing computational and memory constraints.
This subsection reviews key research works that introduce innovative approaches to enhance
the efficiency and effectiveness of on-device LLMs.
Any-Precision LLM proposes a novel method to deploy various LLMs with different pre-
cisions in a memory-efficient manner (Park et al., 2024). Any-Precision model extends
any-precision deep neural networks to LLMs, allowing a single n-bit quantized model
to support multiple lower bit-width models down to 3 bits. This reduces memory usage
without significant performance loss. Post-training quantization (PTQ) creates low-bit
models and incrementally upscales them to higher bit widths. This avoids multiple training
phases for each precision, saving time and resources. A new software engine optimized for
any-precision support manages memory bandwidth and improves serving efficiency, ensur-
ing practical deployment of LLMs on edge devices. The experimental results demonstrate
substantial memory savings and improved serving efficiency, making any-precision LLMs
suitable for a variety of on-device applications.
Yan et al. (2023) explores the use of LLMs in software-hardware co-design to optimize the
development of compute-in-memory (CiM) deep neural network (DNN) accelerators. The
LCDA framework integrates LLMs into the design process of hardware and software, lever-
aging their extensive training on diverse datasets to speed up co-design. By incorporating
heuristic knowledge from pre-trained LLMs, the framework bypasses the cold start problem,
enabling faster convergence to optimal solutions. The framework shows a 25x speedup in
the design process compared to state-of-the-art methods while maintaining comparable
performance levels in designing efficient DNN models and hardware architectures. This
approach highlights the potential of LLMs to enhance the co-design process, improving
both software and hardware efficiency for advanced AI applications.
General efficiency and performance improvements are crucial for the practical deployment
of LLMs on edge devices. The research works reviewed in this subsection introduce innova-
tive methods to enhance memory efficiency, computational speed, and overall performance.
The Any-Precision LLM approach offers a flexible and memory-efficient solution for deploy-
ing multiple LLMs with different precisions, while the LCDA framework demonstrates the
benefits of integrating LLMs into the co-design process for optimizing both software and
hardware. These advancements contribute to making LLMs more accessible and effective in
resource-constrained environments, enabling a broader range of AI applications on mobile
and edge devices.

4 Model Compression and Optimization Techniques for On-Device LLMs

In the realm of LLMs, optimizing computational efficiency while preserving performance is crucial, particularly for deployment on edge devices. This section examines four key
model compression techniques: quantization, pruning, knowledge distillation, and low-
rank factorization. These methods enhance the operational efficiency of LLMs, ensuring
their viability for on-device applications by balancing performance, memory footprint, and
inference speed.

4.1 Quantization

Quantization in neural networks refers to the process of transforming high-precision (floating-point) weights and activations into lower bit-widths (integers). This technique
substantially reduces the model size and computational demands, enabling faster inference
and decreased memory consumption while preserving accuracy.

1. Post-training quantization (PTQ): PTQ is applied after model training, requiring no retraining and thus being faster and less resource-intensive than quantization-
aware training (QAT). There are a few notable PTQ methods. GPTQ (Frantar et al.,
2022) utilizes second-order information for error compensation, effectively reducing
bit width to 3 or 4 bits per weight. This method maintains high accuracy with
minimal perplexity increase, enabling language models like OPT-175B to run on
a single high-end GPU. Activation-aware Weight Quantization (AWQ) (Lin et al.,
2024c) is based on the observation that a small fraction (0.1%-1%) of weights are
crucial for LLMs’ performance. By selectively skipping quantization of these salient
weights, AWQ significantly reduces quantization loss.
(a) Weight-only quantization: Only the weights of the neural network are quantized. This approach simplifies the quantization process and can be particularly effective when activations do not vary significantly in range or when computational resources are severely limited (a minimal per-channel sketch follows this list).
(b) Weight-activation co-quantization: Both weights and activations are quantized, further reducing computational complexity. This method is advantageous in hardware implementations due to efficient matrix multiplication (Dettmers et al., 2022), vital in neural computations. BitNet b1.58 (Ma et al., 2024) uses ternary quantization ({-1, 0, 1}) for each parameter, significantly improving latency, memory, throughput, and energy consumption metrics.
2. Quantization-aware training (QAT): QAT incorporates quantization directly into the training process, allowing the model to accommodate the reduced precision constraints inherently. This integration generally yields higher accuracy post-quantization, as the model proactively learns to compensate for potential quantization errors during its training phase.
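
The sketch below shows the weight-only, post-training idea referenced in item 1(a) in its most basic form: symmetric per-output-channel quantization to int8 in NumPy. Real 3-bit and 4-bit methods such as GPTQ and AWQ are considerably more sophisticated; this is only meant to make the quantize/dequantize round trip and the memory saving tangible, and all sizes are illustrative.

# Minimal per-channel, weight-only post-training quantization sketch (int8, symmetric).
import numpy as np

def quantize_per_channel(w, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1                          # e.g. 127 for int8
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax   # one scale per output channel
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4096, 4096)).astype(np.float32)      # stand-in for one FFN weight matrix
q, scale = quantize_per_channel(w)

print("fp32 size:", w.nbytes // 2**20, "MiB; int8 size:", q.nbytes // 2**20, "MiB")
print("mean abs error:", float(np.abs(dequantize(q, scale) - w).mean()))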


4.2 Pruning

Pruning in neural networks involves selectively removing weights or neurons to reduce complexity and enhance computational efficiency without significantly compromising
performance. This process targets the less crucial components of a model, focusing on
efficiency and functional integrity.

1. Structured Pruning: This approach removes entire subsets of parameters like lay-
ers, channels, or filters, which is beneficial for hardware optimization due to more
regular memory access patterns and simplified computations. The ‘LLM-Pruner’
(Kaddour et al., 2023) employs structured pruning to eliminate non-essential groups
based on gradient data, thus maintaining critical functionalities. It also facilitates
performance recovery through techniques such as LoRA, allowing efficient restora-
tion with minimal data.
2. Unstructured Pruning: Unlike structured pruning, unstructured pruning removes individual weights across the model, offering finer granularity and potentially higher compression rates (Li et al., 2023a). However, this method typically results in sparse matrices, which can be less compatible with traditional hardware architectures, compromising computational efficiency. It is most suitable where maximum compression is needed without constraints on structural preservation (a minimal magnitude-pruning sketch follows this list).
3. Contextual Pruning: This advanced method prunes based on the operational con-
text of the model, targeting weights or neurons that are only relevant under specific
conditions or for particular tasks. Contextual pruning ensures that reductions align
dynamically with the model’s operational needs, thereby preserving performance
where it matters most.
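
The sketch referenced in item 2 applies global unstructured magnitude pruning with PyTorch's pruning utilities: the smallest-magnitude weights across the selected layers are masked to zero at a target sparsity. The model, layer sizes, and 60% sparsity level are illustrative assumptions, and actually exploiting the resulting sparse matrices for speed still depends on hardware and kernel support.

# Minimal global unstructured magnitude-pruning sketch using torch.nn.utils.prune.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

# Prune 60% of all Linear weights, ranked globally by absolute value.
params_to_prune = [(m, "weight") for m in model if isinstance(m, nn.Linear)]
prune.global_unstructured(params_to_prune, pruning_method=prune.L1Unstructured, amount=0.6)

total = sum(m.weight.numel() for m, _ in params_to_prune)
zeros = sum(int((m.weight == 0).sum()) for m, _ in params_to_prune)
print(f"sparsity: {zeros / total:.2%}")   # roughly 60%

# prune.remove(module, "weight") would fold the mask into the weights permanently.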

4.3 Knowledge Distillation

Knowledge Distillation (KD) is a technique for transferring knowledge from a large, com-
putationally intensive model (teacher) to a smaller, more efficient model (student). This
method is crucial for condensing the capabilities of large language models (LLMs) into more
manageable forms without significantly impacting performance.

1. Black-box Knowledge Distillation: This approach involves the student model learning solely from the outputs of the teacher model, without access to its internal
mechanics or parameters. It is particularly advantageous when the teacher model’s
details are proprietary or when the architectures of the teacher and student models
differ markedly. For instance, Gu et al. (2023) demonstrated that black-box KD
could effectively train models using only the output data from LLM APIs like
ChatGPT. The student model trains to emulate the teacher’s output distribution
based on input-output pairs, a process that, while effective, limits learning to
external behaviors without tapping into the teacher’s deeper internal states.
2. White-box Knowledge Distillation: In contrast, White-box Knowledge Distillation
allows the student model to access the internal states and workings of the teacher,
facilitating a deeper and more precise learning process. This method enables the
student to mimic not just the outputs but also the internal state distributions of
the teacher, enhancing learning efficacy and depth. The increased access to the
teacher’s detailed workings helps guide the student’s learning, resulting in more
accurate and robust models. However, this technique requires a careful alignment
of model architectures to ensure effective knowledge transfer and is generally more
complex to implement (a minimal logit-distillation loss sketch follows this list).
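
The sketch referenced in item 2 shows a common logit-distillation loss: the student matches the teacher's temperature-softened output distribution (KL term) while also fitting the ground-truth labels (cross-entropy term). Temperature, mixing weight, and tensor shapes are illustrative assumptions; white-box methods often add further terms that align hidden states or attention maps.

# Minimal logit-distillation loss sketch (soft KL term plus hard cross-entropy term).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 32000, requires_grad=True)
teacher_logits = torch.randn(8, 32000)
labels = torch.randint(0, 32000, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(float(loss))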

4.4 Low-Rank Factorization

Low-Rank Factorization (LRF) is a technique utilized to decompose matrices into smaller components, significantly reducing computational complexity without substantially im-
pacting model accuracy. Leveraging the inherent low-rank structure prevalent in many
real-world matrices, LRF facilitates the approximation of these matrices by products of
low-rank factors, which has proven indispensable in applications such as image processing,
dimensionality reduction in machine learning models, and data compression (Saha et al.,
2023). This methodology not only maintains essential data characteristics but also ensures
efficient storage and processing, highlighting its crucial role in modern computational
tasks. Further extending its application, a study by Yao et al. (2024b) integrates LRF with
Post-training Quantization (PTQ) in Large Language Models. This innovative approach,
termed Low-Rank Compensation (LoRC), enhances model efficiency by significantly reduc-
ing model size and preserving accuracy, effectively mitigating the detrimental effects of
activation quantization. This synthesis of LRF and PTQ demonstrates a significant advance-
ment in optimizing computational efficiency while maintaining performance integrity in
complex models.
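
As a small illustration of the factorization itself, the sketch below approximates a matrix with a rapidly decaying spectrum by the product of two thin factors obtained from a truncated SVD. The matrix construction, size, and rank are illustrative assumptions; methods such as LoRC apply low-rank factors to quantization error rather than to the raw weights.

# Minimal low-rank factorization sketch via truncated SVD.
import numpy as np

rng = np.random.default_rng(0)
# Build a matrix with approximately low-rank structure, standing in for a weight matrix
# whose singular values decay quickly.
W = rng.normal(size=(1024, 64)) @ rng.normal(size=(64, 1024))
W += 0.01 * rng.normal(size=W.shape)

def low_rank_factors(W, rank):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank]      # factors of shape (m, r) and (r, n)

A, B = low_rank_factors(W, rank=64)
err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"params: {W.size:,} -> {A.size + B.size:,}, relative error: {err:.4f}")

Storing the two thin factors instead of the full matrix cuts the parameter count by roughly a factor of eight in this example while keeping the reconstruction error small, which is the trade-off LRF exploits.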

5 Hardware Acceleration and Deployment Strategies


Hardware accelerators such as GPUs, TPUs, and specialized AI chips play a crucial role in
enabling efficient on-device inference of LLMs by offering substantial computational capa-
bilities and high memory bandwidth. The selection between GPUs, TPUs, FPGAs, and other
AI-specific chips involves careful consideration of trade-offs involving performance, power
consumption, and cost. For instance, GPUs are favored for their parallel processing prowess,
TPUs for their specialized matrix operations, and FPGAs for their customizable hardware
tailored to specific tasks, which can be more power-efficient. Software-hardware co-design
approaches, including quantization-aware training and model compression, further en-
hance efficiency, making LLMs feasible on a range of devices from high-power servers to
low-power edge devices. Optimization strategies like parameter sharing and advanced
memory management techniques are vital for reducing the footprint of LLMs, ensuring
faster and more cost-effective deployments across diverse computing environments. These
strategies collectively improve the deployment and execution of LLMs, catering to various
application needs and hardware constraints.

5.1 Popular On-Device LLM Frameworks

Deployment strategies for LLMs can vary significantly depending on the use case and the
available infrastructure, ranging from fully cloud-based solutions to edge-only deployments.

1. Edge-only
(a) Llama.cpp
• Description: Llama.cpp (Gerganov, 2023) is a C/C++ library designed for efficient inference of large language models on a broad range of hardware platforms. It supports integer quantization, GPU acceleration, and CPU+GPU hybrid inference (a minimal usage sketch via its Python bindings follows this list).
• Training: Supports fine-tuning LoRA adapters on-device.
• Inference: Supports CPU and CPU+GPU hybrid inference across ARM and
x86 architectures.
(b) MNN
• Description: MNN (Alibaba, 2024) leverages Mobile Neural Network tech-
nology for efficient LLM inference on various platforms, optimized for
mobile devices with dynamic inputs and multimodal interactions.
• Training: Supports full-sized fine-tuning and LoRA fine-tuning on-device.
• Inference: Supports model deployment for ONNX and MNN formats
across diverse backends including CPU, CUDA, and OpenCL.
(c) PowerInfer
• Description: PowerInfer (Song et al., 2023) and PowerInfer2 (Xue et al., 2024b) are high-speed inference engines optimized for deploying LLMs on PCs with consumer-grade GPUs, utilizing a locality-centric design.
• Training: No built-in training capabilities.
• Inference: Supports various computing platforms including x86-64 CPUs
and Apple M Chips, optimized for Windows and Linux.


(d) ExecuTorch
• Description: ExecuTorch (PyTorch, 2024) is part of the PyTorch Edge ecosys-
tem, designed for deploying PyTorch models efficiently on edge devices
like mobile phones and wearables.
• Training: No built-in training capabilities.
• Inference: Leverages full hardware capabilities like CPUs, NPUs, and DSPs
across various computing platforms.
(e) MediaPipe
• Description: Developed by Google, MediaPipe (AI, 2024b) is a framework
for building and deploying multimodal machine learning pipelines involv-
ing video, audio, and other time-series data.
• Training: No built-in training capabilities.
• Inference: Supports multiple platforms including Android, iOS, macOS,
Windows, and Linux, leveraging CPU and GPU resources.
2. Edge-cloud
(a) MLC-LLM
• Description: MLC-LLM (team, 2023) is a machine learning compiler and
high-performance deployment engine, supporting universal LLM deploy-
ment on edge devices and in cloud environments.
• Training: No built-in training capabilities.
• Inference: Supports inference on various platforms including CPUs and
GPUs across ARM and x86 architectures.
(b) VLLM
• Description: VLLM (Team, 2024) is optimized for edge-cloud environments,
supporting advanced quantization methods for efficient key and value
memory management during inference.
• Training: No built-in training capabilities.
• Inference: Supports multiple GPU platforms and integrates with Vulkan,
CUDA, Metal, and WebGPU technologies.
(c) OpenLLM by BentoML
• Description: OpenLLM (BentoML, 2024) enables the deployment of various
open-source LLMs as OpenAI-compatible API endpoints, optimized for
high throughput and streamlined cloud deployment.
• Training: No built-in training capabilities.
• Inference: Compatible with various model architectures and backend im-
plementations for efficient deployment in production settings.
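
As a concrete starting point for the edge-only route, the sketch below (referenced in the Llama.cpp entry) loads a quantized GGUF model through llama.cpp's Python bindings (llama-cpp-python) and generates a short completion. The model path and parameter values are placeholders, and the exact options available depend on the installed version and platform, so treat this as an assumed usage pattern rather than authoritative API documentation.

# Minimal llama-cpp-python usage sketch; the model file path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

out = llm(
    "Q: Why is on-device inference attractive for latency? A:",
    max_tokens=64,
    temperature=0.7,
)
print(out["choices"][0]["text"])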

5.2 Hardware Acceleration

The continuous advancement in hardware technologies significantly impacts the deployment and performance of on-device LLMs.

1. GPU: Graphics Processing Units (GPUs) have become the standard for training
and accelerating large language models due to their massive parallelism and high
memory bandwidth. NVIDIA’s Tensor Cores, introduced in the Volta architecture
and improved in subsequent generations, offer specialized hardware for mixed-
precision matrix multiply-accumulate operations, which are crucial for transformer-
based models. Recent advancements like NVIDIA’s A100 GPU with 80GB HBM2e
memory enable training of models with billions of parameters on a single device.
Techniques such as tensor parallelism and pipeline parallelism, implemented in
frameworks like Megatron-LM, allow efficient scaling of LLMs across multiple
GPUs. The use of mixed-precision training, particularly FP16 and BF16 formats,
significantly reduces memory footprint and increases computational throughput on
modern GPUs (a short mixed-precision sketch follows this list).
2. NPU: Neural Processing Units (NPUs), also known as AI accelerators, are special-
ized chips designed for machine learning workloads. Google’s Tensor Processing


Units (TPUs) are a prominent example, with the latest v4 offering 275 TFLOPS of
BF16 performance per chip. TPUs utilize a systolic array architecture for efficient
matrix multiplications, which is particularly well-suited for transformer layers in
LLMs. The TPU Pod configuration allows scaling to thousands of chips, enabling
training of models like GPT-3 and PaLM. Huawei’s Ascend AI processors and
Apple’s Neural Engine are other examples of NPUs that offer on-device acceleration
for inference of smaller LLMs, utilizing techniques like quantization and pruning
to reduce model size and computational requirements.
3. FPGA: Field-Programmable Gate Arrays (FPGAs) offer a flexible hardware platform
for accelerating LLMs, particularly for inference. Recent work has demonstrated
efficient implementations of transformer layers on FPGAs, utilizing techniques such
as sparse matrix multiplication and quantization. For example, Microsoft’s Project
Brainwave uses Intel Stratix 10 FPGAs to accelerate BERT inference, achieving low
latency and high throughput. FPGAs excel in energy efficiency and can be optimized
for specific model architectures, making them suitable for edge deployment of
smaller LLMs. However, their lower computational density compared to GPUs and
ASICs limits their application in training large-scale models.
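
As referenced in the GPU item above, mixed-precision execution in FP16/BF16 underpins much of this hardware's efficiency. Below is a minimal sketch of a BF16 autocast training step in PyTorch on a toy model; real LLM training would layer tensor or pipeline parallelism on top, and the layer sizes here are arbitrary.

```python
# Minimal sketch of BF16 mixed-precision execution in PyTorch.
# The toy two-layer model and random batch are placeholders for a real LLM workload.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device=device)
target = torch.randn(8, 1024, device=device)

# Autocast runs matrix multiplications in bfloat16 while master weights stay in FP32.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(model(x), target)

loss.backward()   # gradients flow back in full precision
optimizer.step()
optimizer.zero_grad()
```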

6 Examples and Applications

In recent years, the rapid development of artificial intelligence technology and the con-
tinuous upgrading of mobile device hardware have made the deployment of large language
models on edge devices a reality. Smartphones are among the most commonly used devices
in daily life, and the language models running on them have attracted particular attention. At
present, major mobile phone manufacturers have developed and released a number of
advanced models that are deployed on-device or adopt device-cloud collaboration strategies,
as displayed in Table 2. These models not only mark a major leap
forward in mobile computing but also bring users a series of advantages that traditional
cloud deployments cannot match.

Table 2: State-of-the-Art On-Device LLMs released by mobile phone manufacturers

Year   Model Name           Model Size   Edge   Cloud
2023   Google Gemini Nano   7B           ✓
2023   OPPO AndesGPT        7B           ✓      ✓
2024   Honor MagicLM        7B           ✓
2024   VIVO BlueLM          7B           ✓      ✓
2024   XiaoMi MiLM          6B           ✓
2024   Apple OpenELM        1.1B         ✓      ✓

6.1 Examples of on-device language models


1. Gemini Nano: The mobile operating system exposes an LLM and its inference
infrastructure as a system feature to mobile apps, much like location or notification
services. Users access AICore through the Google AI Edge SDK. Inside AICore,
Google provides the Gemini Nano model, which is smaller than the Gemini models
that run inference in the cloud but offers faster responses and lower inference latency.
AICore handles the distribution of the Gemini Nano model so that memory can be
managed well, and it runs at the best possible speed because it leverages on-device
hardware to accelerate inference. Gemini Nano is trained by distilling from larger
Gemini models; it is 4-bit quantized for deployment and provides best-in-class
performance (Team et al., 2023). A generic 4-bit loading sketch for an open small
model appears after this list.
2. Nexa AI Octopus series model: A 2-billion-parameter model running on edge
devices surpasses GPT-4 in accuracy and latency and reduces context length by
95%. By tokenizing the names of core functions and fine-tuning the model with
functional tokens, the model can understand the functionality of a software
application and learn to map function descriptions to specific tokens. Deployment of


the Octopus model on mobile devices demonstrated fast response times, completing
function calls in 1.1 to 1.7 seconds for a typical query of 20 to 30 tokens, even on a
standard Android phone (Chen et al., 2024b; Chen & Li, 2024a;b;c).
3. Apple OpenELM and Ferret-v2: Apple has developed OpenELM (Mehta et al.,
2024), a substantial large language model integrated within iOS to augment ap-
plication functionalities, analogous to essential system services such as location
tracking. OpenELM employs a layer-wise scaling architecture, efficiently deploying
its 1.1 billion parameters to achieve a 2.36% increase in accuracy compared to prior
models, while requiring only half the pre-training tokens. Moreover, it is compatible
with the MLX library, facilitating direct fine-tuning on Apple devices. In parallel,
Ferret-v2 (Zhang et al., 2024a) marks a significant upgrade over its predecessor,
incorporating features such as any-resolution grounding, multi-granularity visual
encoding through the integration of a DINOv2 encoder, and a sophisticated three-
stage training regimen. These enhancements markedly improve performance by
advancing high-resolution image processing and enriching visual comprehension,
thereby ensuring robust, on-device functionality for iOS users.
4. Microsoft Phi series: Microsoft's latest Phi-3-mini (Abdin et al., 2024) is a compact
yet powerful 3.8-billion-parameter language model, trained on an extensive 3.3
trillion token dataset. Despite a small size suitable for mobile deployment, Phi-3-
mini delivers performance competitive with larger models like Mixtral 8x7B and
GPT-3.5, achieving 69% on MMLU and 8.38 on MT-bench. The model benefits
from a unique training dataset, an expanded version of the one used for Phi-2,
which combines heavily filtered publicly available web data with synthetic data,
enhancing robustness, safety, and chat functionality. The Phi-3 report also presents
initial results for the scaled-up models Phi-3-small and Phi-3-medium, trained on
4.8 trillion tokens with 7 billion and 14 billion parameters respectively, showing
superior capabilities (75% and 78% on MMLU, and scores of 8.7 and 8.9 on MT-
bench). Expanding further, Microsoft introduced Phi-3-vision, a 4.2-billion-parameter
model derived from Phi-3-mini and designed with enhanced reasoning abilities for
both image and text prompts.
5. MiniCPM: The MiniCPM-Llama3-V 2.5, a recent addition to the open-source
MiniCPM-V lineup crafted by the collaborative efforts of Tsinghua University and
ModelBest, boasts a substantial parameter count of 8.5 billion (Tsinghua University,
2024). This model has demonstrated exceptional performance across the Open-
Compass assessment platform, which encompasses a wide array of 11 multimodal
benchmarks. With a noteworthy average score of 65.1, MiniCPM-Llama3-V 2.5 has
surpassed leading industry models, including GPT-4V-1106 at 63.5, Gemini Pro at
62.9, Claude 3, and Qwen-VL-Max, even though it possesses only a fraction of the
parameters these models have.
In specific evaluations focusing on Optical Character Recognition (OCR) and scene
text comprehension, MiniCPM-Llama3-V 2.5 has excelled, securing a score surpass-
ing the 700-point mark on OCRBench, thereby outdoing its counterparts such as
GPT-4 and Gemini Pro. Moreover, it has attained remarkable accuracy rates of
76.6% on the TextVQA benchmark and an impressive 84.8% on DocVQA, effectively
establishing a new standard for the performance of open-source models in these
domains.
6. Gemma2-9B: Gemma is a lightweight, state-of-the-art family of open models from
Google. Gemma2 is Google's upgraded version of Gemma, available in two sizes,
9B and 27B. The 9B version of Gemma2 is trained on 8 trillion tokens of web, code,
and math data. The authors take a novel approach to attention, interleaving layers
of sliding-window attention with layers of global attention, and also employ
techniques such as knowledge distillation and model merging. The Gemma2-9B
model performs well in its size category, outperforming Llama 3-8B and other
similar open models in several domains such as reasoning, math, and code. The
model also has good compatibility with major AI frameworks such as Hugging Face,
as well as Keras 3.0, vLLM, Gemma.cpp, and Llama.cpp (Google, 2024a).


7. Qwen2-0.5B: The Qwen team at Alibaba Cloud has upgraded the Qwen model series
to Qwen2, which is available in five sizes. Among them, Qwen2-0.5B has the
smallest number of parameters and a context length of 32K. In multiple tests,
Qwen2-0.5B performs similarly to Gemma-2B and Phi-2 (Qwen Team, 2024) but
with fewer parameters, which positions it to play a significant role in the smart
home industry. In addition, to address the limited context length, the Qwen-Agent
framework adopts an agentic RAG approach that extends the effective context to
1M tokens, enabling long-text understanding (Bai et al., 2023a).
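
As noted in the Gemini Nano entry above, small on-device models are usually shipped with low-bit weight quantization. The following is a minimal sketch of loading an open small model with 4-bit weights, assuming the Hugging Face transformers and bitsandbytes packages, a CUDA-capable device, and the public Qwen/Qwen2-0.5B-Instruct checkpoint; it is not the deployment pipeline of any proprietary model listed here.

```python
# Minimal sketch: 4-bit weight-only loading of a small open model with
# transformers + bitsandbytes (assumes a CUDA-capable GPU is available).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2-0.5B-Instruct"
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,  # weights stored in 4 bits, compute in FP16
    device_map="auto",
)

inputs = tokenizer("What are the advantages of on-device LLMs?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```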

6.2 Applications of On-Device LLMs

On-device language models are ushering in a new era of intelligent, responsive, and per-
sonalized applications. By bringing the power of advanced natural language processing
directly to end-user devices, these models are transforming how we interact with technology
in our daily lives and professional endeavors. From instantaneous message suggestions
to real-time language translation, from confidential medical consultations to cutting-edge
autonomous vehicles, on-device LLMs are proving to be versatile tools with far-reaching
implications. The following examples, as summarized in Figure 5, illustrate the breadth and
depth of on-device LLM applications, showcasing how this technology is not only enhanc-
ing existing services but also enabling entirely new categories of intelligent, responsive, and
secure applications across diverse domains.

Figure 5: Different application domains of on-device LLMs (messaging, translation, meeting summarization, healthcare, scientific research, disability support, companion robots, and automobiles)

1. Text Generation for Messaging: In the past, quick-reply features backed by cloud
LLMs were limited by generation speed and network latency, so replies were slow to
appear, which is inefficient in fast-paced instant conversations. Thanks to on-device
LLMs, Gboard (Google's keyboard app) can use Gemini Nano, an on-device LLM by
Google (AI, 2024a). When it detects that the user is chatting online, Gemini Nano
quickly generates conversation-aware quick replies for the user to choose from based
on the chat content. Because the language model does not need to connect to the
Internet and wait for a server response, this feature reflects the model's true response
speed.
2. Translation: LLMs have been widely used for language translation. They can
translate with terminology and style suited to a specific field, which is not possible
with traditional machine translation methods. However, cloud-based LLMs still
suffer from slow response times and the need to upload information. On-device
LLMs address these problems: they have fewer parameters, respond faster, and can
run offline, which also provides data security in many scenarios. In terms of
translation quality, using small models does not significantly reduce accuracy; token
generation accuracy with the T5-small model is only 4% lower than with larger T5
language models (Xu et al., 2023). In addition, faster response times make on-device
models better suited to immediate translation settings such as simultaneous
interpretation (a small-model translation sketch appears after this list).
3. Meeting Summarization: Distill-CLI, a cloud-based solution released by Amazon's
CTO, uses Anthropic's Claude 3 Sonnet model and Amazon Transcribe to generate
real-time meeting summaries (Vogels, 2024). Similar applications include Plaud Note
with the GPT-4o model (Plaud, 2024) and Zoom IQ (Zoom, 2024). The disadvantages
of cloud-based models are subscription fees and the network latency introduced by
connectivity. By employing an on-device model, the data remains localized and does
not need to be uploaded to a cloud server.
4. Healthcare Applications: Current medical models, such as Med-PaLM Multimodal
(Tu et al., 2024), can combine and analyze patient statements, electronic health
records, X-rays, and other medical images to generate long-form responses with
high accuracy. Edge deployment lets patients get answers offline, ensuring the model
is available in emergencies and keeping the patient's condition localized.
Encouragingly, models fine-tuned from pre-trained models for specialized medical
fields have emerged, such as BioMistral-7B (Labrak et al., 2024) and HuatuoGPT-7B-II
(Chen et al., 2023). These low-parameter models have the potential to be deployed
on end devices.
5. Scientific Research Support: Traditional research-support LLMs like GatorTronGPT
(Peng et al., 2023) are trained on large amounts of domain-specific data, which
enables them to generate high-quality professional text and thereby accelerate
scientific research, especially in areas where data is scarce or sensitive. Switching to
on-device LLMs reduces the hardware cost of using language models for research
tasks, yields faster responses, and protects the confidentiality of research information.
6. Companion Robots: Several studies have used language models to enhance the
capabilities of robots and Internet of Things (IoT) devices (Ahn et al., 2022; Xu et al.,
2024a). An LLM's planning and reasoning capabilities can decompose human
instructions into a series of textual subtasks, allowing robots to better understand
natural-language instructions (Zeng et al., 2023b). For example, the Figure 01 robot,
built on OpenAI's multimodal language models, can hold in-depth conversations
with people and make independent decisions and take actions based on the content
of the conversation (AI, 2024c). With the rise of small models, robots that deploy
on-device language models can outperform traditional cloud-based robots in
response generation speed, and the on-device model ensures that the robot retains
its intelligent capabilities when offline.
7. Disability Support: For visually impaired users, converting images into text is a
basic and important function. There are now many on-device large multimodal
models, such as Octopus v3 (Chen & Li, 2024b) and MiniCPM-Llama3-V 2.5
(Tsinghua University, 2024), that provide this function through their multimodal
capabilities. With them, blind users can easily access the information in pictures
and videos shared in a conversation.


Google is about to launch a TalkBack feature based on Gemini Nano, helping people
who are blind or have low vision receive richer and clearer descriptions of what is
happening in an image (Google, 2024b). Because Gemini Nano is deployed on the
edge, these descriptions appear quickly and work even without a network connection.
Similar capabilities can also be applied to sign language recognition; there are
already projects that use the ChatGPT model for sign language translation (Sincan
et al., 2024). In comparison, an on-device model can generate text translations of
sign language with lower latency and remain available offline.
8. Autonomous Vehicles: Using language models to drive autonomous cars may sound
like a distant goal, but examples already exist today. DriveVLM Dual is a system that
combines autonomous driving technology with a large vision-language model (VLM)
to improve the understanding of complex and long-tail scenes in urban environments.
The system uses language to describe the driving environment and identify key
objects in the scene, then progressively develops a plan from meta-actions and
decision descriptions to waypoints. DriveVLM surpasses existing state-of-the-art
methods on both public benchmarks and the researchers' own benchmark, especially
in handling complex and dynamic scenes. Notably, DriveVLM can be deployed
locally on the vehicle, which also enables immediate responses (Tian et al., 2024).
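
As referenced in the translation example above, small models such as T5-small can already handle on-device-scale translation. A minimal sketch using the Hugging Face transformers pipeline is shown below; the language pair and example sentence are illustrative choices.

```python
# Minimal sketch: small-model translation with the transformers pipeline.
# t5-small supports English-to-French translation out of the box.
from transformers import pipeline

translator = pipeline("translation_en_to_fr", model="t5-small")
result = translator("On-device language models reduce latency and keep data local.")
print(result[0]["translation_text"])
```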

7 Future Directions and Open Challenges

Figure 6: Future Directions and Open Challenges for on-device LLMs (data security techniques, adaptive edge-cloud collaboration, multi-modal and cross-modal learning, resource-efficient solutions, hardware-software co-design, robustness and reliability, scalability and deployment optimization, and continual learning and personalization)

As on-device LLMs continue to evolve, several vital areas emerge as promising future
research and development directions. The field of on-device LLMs is rapidly advancing,
driven by the increasing demand for 1) data security, 2) low-latency, and 3) personalized AI
experiences on edge devices. This progress is exemplified by recent developments such as


TinyLlama (Zhang et al., 2024c), MobileVLM (Murthy et al., 2024; Chu et al., 2024), and novel
approaches like OpenELM (Mehta et al., 2024). However, deploying LLMs on resource-
constrained devices presents unique challenges that differ significantly from traditional
cloud-based implementations. These challenges span multiple areas, including model
compression, efficient inference, security, energy efficiency, and seamless integration with
diverse hardware platforms. Moreover, the dynamic nature of edge environments and the
need for continuous adaptation introduce additional complexities that must be considered.
We outline the most pressing challenges and opportunities in advancing the field of LLMs
on-device. By identifying these key areas and stimulating innovation in developing more
capable, efficient, and reliable on-device language models, we aim to provide insights for
future research efforts. We note that the challenges and opportunities discussed
here are interconnected: the progress in one area often has implications for others. Therefore,
a holistic approach that considers the interplay between different aspects of on-device LLM
deployment is essential for achieving significant advancements in the field. We delve into
the current state of research, identifying key challenges and proposing potential directions
for future work, summarized in Fig. 6. By addressing these challenges, researchers and
practitioners can push the boundaries of what is possible with on-device LLMs, ultimately
leading to more intelligent, efficient, and user-centric computing experiences across various
applications and domains.

7.1 Data Security Techniques

On-device language models may offer inherent data security advantages, since all the data
can remain localized. Future work should focus on:

• Developing efficient privacy-preserving techniques, including query obfuscation
(Yuan et al., 2024), prompt tuning (Li et al., 2023b), and advanced randomization
techniques (Zhang et al., 2024e) that balance data security guarantees with model
utility and computational constraints.
• Enhancing risk assessment and monitoring, by creating sophisticated benchmarking
systems (Yuan et al., 2024), implementing real-time monitoring (Das et al., 2024),
and designing systems to detect and mitigate potential PII leakage during inference
(Kim et al., 2024d).
• Optimizing model architectures and communication strategies, focusing on efficient
model sharding (Yang et al., 2024a), security-enhancing architectures (Yao et al.,
2024a), and minimizing data transmission (Wang et al., 2023).
• Addressing security challenges in collaborative and distributed learning scenarios,
through secure multi-party computation (Das et al., 2024), data protection for long
conversations (Yuan et al., 2024), and extending frameworks like PFID to support a
wider range of LLM architectures and tasks (Yang et al., 2024a).

7.2 Adaptive Edge-Cloud Collaboration

As on-device language models continue to evolve, the synergy between edge computing
and cloud infrastructure presents both opportunities and challenges. Future research in
adaptive edge-cloud collaboration for on-device LLMs should explore:

• Inventing advanced caching and request analysis techniques, including sophisticated
vector database caching strategies, feature extraction models for diverse LLM
requests (Yao et al., 2024c), and uncertainty-guided token sampling methods to
optimize data transmission between edge devices and cloud servers (Wang et al.,
2024a); a toy routing sketch appears after this list.
• Designing intelligent scheduling and resource allocation algorithms, incorporating
personalized inference scheduling (Yao et al., 2024c), adaptive resource allocation for
heterogeneous infrastructures (Yang et al., 2024c), and batch size-aware optimization
techniques to efficiently distribute LLM components and workloads across edge-
cloud environments (Zhang et al., 2024b).


• Creating efficient knowledge transfer and model compression methods, such as
adapter-based knowledge distillation for multimodal LLMs (Zhang et al., 2024f),
dynamic quantization techniques for various LLM architectures, and adaptive
weight update compression strategies to enable effective deployment of language
models on resource-constrained devices (Wang et al., 2024a).
• Improving performance optimization in collaborative systems by developing adap-
tive control mechanisms for token-level collaboration (Yang et al., 2024c), efficient
constraint satisfaction algorithms for real-time decision-making, and innovative
techniques to reduce latency and improve pipeline execution in hybrid edge-cloud
systems (Hao et al., 2024; Zhang et al., 2024b).
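
To make the edge-cloud collaboration ideas above more concrete, the following is a toy routing sketch that answers short, high-confidence queries with a local model and escalates the rest to a cloud endpoint; all function names, thresholds, and the confidence proxy are hypothetical illustrations, not mechanisms taken from the cited works.

```python
# Toy sketch of an edge-cloud routing policy. run_local_model and call_cloud_api
# are hypothetical callables supplied by the application.
from dataclasses import dataclass

@dataclass
class RoutingPolicy:
    max_local_tokens: int = 256        # rough prompt budget the on-device model handles well
    min_local_confidence: float = 0.6  # below this, fall back to the cloud model

def route_query(prompt: str, policy: RoutingPolicy, run_local_model, call_cloud_api) -> str:
    # Cheap proxy for prompt complexity: whitespace token count.
    if len(prompt.split()) > policy.max_local_tokens:
        return call_cloud_api(prompt)

    text, confidence = run_local_model(prompt)  # e.g., mean token probability in [0, 1]
    if confidence >= policy.min_local_confidence:
        return text                   # keep the data on-device, no network round trip
    return call_cloud_api(prompt)     # escalate uncertain answers to the larger cloud model
```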

7.3 Multi-Modal and Cross-Modal Learning

As LLMs expand to incorporate multiple modalities, there is a growing need for efficient
multi-modal architectures suitable for on-device deployment (Carreira et al., 2023; Liu et al.,
2024c). Key research directions include:

• Developing efficient multi-modal processing and compression techniques, including
advanced uncertainty-guided token sampling methods, dynamic weight update
compression strategies for cloud-to-device model updates (Wang et al., 2024a;
McKinzie et al., 2024), and innovative approaches to efficiently combine multiple
modalities like audio, text, and video for on-device models (Wagner et al., 2024).
• Enhancing knowledge transfer and adaptation capabilities, such as exploring ad-
vanced adapter-based knowledge distillation methods for transferring knowledge
from larger cloud models to smaller on-device models, improving few-shot and
zero-shot capabilities across modalities (Chen et al., 2024a; Han et al., 2024; McK-
inzie et al., 2024), and investigating hybrid approaches that combine generative and
retrieval-based methods for multimodal content generation (Wu et al., 2023c).
• Expanding modality support and improving multi-modal understanding, through
the development of large-scale datasets for non-image modalities, design of new
encoders for fine-grained multi-modal understanding of high-resolution images,
long video sequences, and complex audio inputs (Han et al., 2024), and incorpora-
tion of support for additional modalities and tasks like web pages, 3D vision, heat
maps, and tables/figures (Wu et al., 2023c).
• Advancing temporal and contextual processing abilities, by investigating longer
context windows that incorporate features from previous interactions, developing
sophisticated techniques for processing and understanding temporal and sequential
information across modalities, and exploring tasks useful during interactions with
virtual assistants, such as audio captioning and acoustic scene classification (Wagner
et al., 2024).

7.4 Resource-Efficient Solutions

The deployment of LLMs on edge devices raises concerns about energy consumption and
environmental impact. Future research should prioritize:

• Creating efficient model compression and execution algorithms: Develop advanced
pruning, quantization, and knowledge distillation techniques for LLMs (a distillation-
loss sketch appears after this list). Explore methods to optimize execution for
larger-than-memory models. Investigate dynamic and adaptive inference techniques
to adjust model complexity based on input and available resources (Bai et al., 2024).
• Exploiting model sparsity: Investigating techniques to take advantage of the run-
time activation sparsity of language models, where only a small portion of the
model is activated for a given task. This could lead to significant reductions in
inference time and memory footprint, enabling more efficient scaling of model sizes
(Xu et al., 2024b).


• Developing energy-aware training and deployment strategies, including energy-
efficient algorithms and runtime optimizations (Bai et al., 2024). Explore adaptive
parameter-efficient fine-tuning methods that balance security, energy efficiency, and
performance on edge devices (He et al., 2024).
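
As referenced in the first bullet of this subsection, knowledge distillation is one of the core compression techniques. The following is a minimal sketch of a temperature-scaled distillation loss in PyTorch; the teacher and student logits are random placeholders standing in for real model outputs.

```python
# Minimal sketch of a temperature-scaled knowledge-distillation loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    # Soften both distributions, then match them with KL divergence.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2

student_logits = torch.randn(4, 32000)   # (batch, vocab) placeholders
teacher_logits = torch.randn(4, 32000)
print(distillation_loss(student_logits, teacher_logits).item())
```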

7.5 Hardware-Software Co-Design

Closer integration between hardware and software development is crucial for optimizing
on-device LLM performance. Future research directions include:

• Advancing PIM/PNM architectures for various memory types, including optimizations
for CXL-based systems and low-power solutions for edge devices (Kim et al.,
2024b).
• Developing hardware-aware optimization techniques, such as pruning-aware quan-
tization, contextual sparsity exploitation (Wan et al., 2024), and dynamic sparse
attention optimization (Kachris, 2024).
• Enhancing AI-specific compilers and runtime systems to automatically identify and
optimize operations for PIM/PNM hardware (Huang et al., 2024b), considering
both graph-level and hardware-specific optimizations (Kim et al., 2024b; Wan et al.,
2024).
• Designing efficient strategies for edge computing and multi-device systems, in-
cluding dynamic sparse tree optimization (Luk et al., 2024), adaptive bit-width
techniques, and energy-aware co-design approaches.

7.6 Robustness and Reliability

Ensuring the robustness and reliability of on-device language models under various operat-
ing conditions is paramount for their widespread adoption. Future work should address:

• Investigating methods for detecting and mitigating potential biases and hallucina-
tions in on-device LLM outputs, particularly in safety-critical applications (Ailem
et al., 2024).
• Exploring formal verification and validation frameworks for assessing the reliability
of on-device language models in real-world scenarios (Zhang et al., 2023b).
• Leveraging ensemble methods for variance and bias reduction (Xu & Sen, 2023;
2024). Exploring probabilistic inference methods to quantify and propagate uncer-
tainty through the LLM pipeline.

7.7 Scalability and Deployment Optimization

Efficiently scaling on-device LLMs to support a growing number of users and applications
presents significant challenges. Future research should explore:

• Developing dynamic resource allocation and load balancing techniques for dis-
tributed LLM inference across heterogeneous edge devices (Yang et al., 2024c;
Wilkins et al., 2024).
• Investigating optimization strategies for reducing latency and improving through-
put in collaborative edge computing scenarios, potentially leveraging techniques
such as model sharding and pipelined inference (Zhang et al., 2024b; Dhar et al.,
2024).
• Exploring efficient methods for managing and updating multiple LLM versions
across diverse edge devices, considering factors such as network constraints and
device capabilities. Building cyber-infrastructure to enhance the reusability and
reproducibility of models and datasets (Wolf et al., 2019; Lhoest et al., 2021; Deng
et al., 2019).


7.8 Continual Learning and Personalization

The deployment of on-device LLMs offers unprecedented opportunities for personalized AI
experiences. However, it also presents unique challenges in maintaining model relevance
and adapting to new information and user preferences over time. Future research should
focus on:
• Implementing controllable knowledge retention and forgetting, such as selectively
retaining or forgetting information as the model encounters new data streams. This
is crucial for managing misinformation and ensuring ongoing accuracy. Enhance the
model’s ability to autonomously learn new skills and improve existing capabilities
based on user interactions and local data (Li et al., 2024d). Develop effective history-
tracking mechanisms to understand the evolution of the LLM through various
learning phases (Qi et al., 2024).
• Advancing theoretical foundations and practical optimizations by developing ro-
bust theoretical foundations for understanding and predicting the behavior of
continually learning LLMs in on-device settings. This also includes conducting
large-scale user studies to refine personalization frameworks and determine effec-
tive service delivery across diverse user groups and scenarios (Zhang et al., 2024d),
as well as improving key generation and retrieval processes for better representation
of task distributions in the vector space (Peng et al., 2024).
• Developing efficient continual learning mechanisms, including sophisticated data
mixing strategies and efficient replay sample selection (Shi et al., 2024). This in-
cludes exploring controllable memory systems and designing adaptive fine-tuning
mechanisms for continuous model adaptation (Wu et al., 2024; Li et al., 2024d).
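
As one concrete mechanism for the replay-based continual learning mentioned in the last bullet, the sketch below keeps a uniform random subset of past samples with reservoir sampling; the buffer capacity and the notion of a "sample" are illustrative choices rather than a method from the cited works.

```python
# Toy sketch: replay-sample selection with reservoir sampling for continual learning.
import random

class ReservoirReplayBuffer:
    def __init__(self, capacity: int = 512, seed: int = 0):
        self.capacity = capacity
        self.samples = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample) -> None:
        """Keep a uniform random subset of every sample seen so far."""
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.samples[j] = sample

    def sample_batch(self, batch_size: int):
        """Draw a replay batch to mix with newly collected on-device data."""
        return self.rng.sample(self.samples, min(batch_size, len(self.samples)))

buffer = ReservoirReplayBuffer(capacity=4)
for record in ["a", "b", "c", "d", "e", "f"]:
    buffer.add(record)
print(buffer.sample_batch(2))
```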
Looking ahead at these future pathways and unresolved issues (Gao et al., 2024; Su et al.,
2024; Schwartz et al., 2023; Mahmood et al., 2023; Zhao et al., 2024a), researchers and
practitioners have the opportunity to propel on-device LLMs to new heights and trans-
form the landscape of edge computing. The effective progression and integration of these
technologies hold the potential to unlock innovative frameworks for intelligent and tai-
lored applications, all the while tackling crucial issues surrounding security, efficiency,
and dependability. The impact of these advancements reaches well beyond theoretical
enhancements, offering the potential for substantial transformation across a broad spectrum
of fields. In the realm of mobile computing, enhanced on-device LLM-based AI agents
(Chen & Li, 2024c) have the potential to facilitate advanced natural language interfaces
and context-aware services, thereby significantly enhancing user experiences. In the con-
text of IoT applications, these advancements empower more autonomous and adaptable
systems capable of processing complex linguistic inputs in real time, even within resource-
constrained environments. Within the automotive sector, improved on-device LLMs could
elevate human-machine interactions in autonomous vehicles. Moreover, these technologies
could enable more personalized and responsive AI-assisted patient care in healthcare.
These advancements promise to democratize access to sophisticated AI capabilities,
making them more accessible and efficient across a wide range of devices and use cases.
Therefore, continued research and development in this field is both technologically impera-
tive and socially significant, promising to herald a new era of more accessible, efficient, and
reliable AI-powered applications poised to impact various facets of society and industry
positively.

8 Conclusion
This comprehensive review has illuminated the state-of-the-art in on-device language
models. The extensive analysis presented herein has highlighted significant advancements
in model compression techniques, efficient architectural designs, and hardware-software co-
optimization strategies, all of which collectively facilitate the deployment of sophisticated
language models on resource-constrained edge devices. The potential impact of these
improvements is extensive, enabling improved data protection, reduced latency, and equitable
access to advanced AI capabilities across different industries and applications.


The transition from cloud-centric to edge-based LLM deployment signifies more than a
mere technological progression; it represents a shift in human-AI interaction paradigms. By
bringing advanced natural language processing capabilities directly to end-user devices, this
transformation opens new avenues for personalized, context-aware, and instant AI experi-
ences. On-device LLMs will revolutionize user interactions and facilitate more intelligent,
responsive technologies, from mobile phones and the IoT to healthcare and autonomous
systems.
However, the trajectory towards ubiquitous on-device LLMs has significant challenges.
Striking an optimal balance between model performance and the inherent resource limi-
tations of edge devices remains a critical research problem. Ensuring model robustness
across heterogeneous operating conditions and developing effective continual learning
mechanisms present additional hurdles. Furthermore, as the boundaries of on-device AI
are pushed, questions about energy efficiency, sustainability, and responsible deployment
become increasingly salient, necessitating innovative solutions and careful ethical consider-
ations.
Realizing the full potential of on-device language models requires a concerted, multidis-
ciplinary effort. The research community must continue advancing the frontiers of model
compression techniques and efficient architecture design while concurrently addressing po-
tential issues of data security and system reliability. Practitioners in the field should explore
novel hardware-software co-design methodologies and adaptive edge-cloud collaboration
strategies to optimize real-world deployments. Industry stakeholders play a pivotal role in
developing specialized hardware accelerators and promoting open standards for on-device
AI deployment.
As research in this area evolves, on-device language models are positioned at the forefront
of imminent technological breakthroughs. The convergence of increasingly efficient models,
more powerful edge hardware, and innovative deployment strategies promises to unlock
unprecedented possibilities in human-AI interaction. By addressing the challenges and
capitalizing on the opportunities in this survey, the research community can work towards a
future where sophisticated AI capabilities are seamlessly integrated into daily life, augment-
ing human abilities while respecting personalization and individuality. The journey towards
ubiquitous, intelligent computing is well underway, and on-device LLMs are poised to play
a pivotal role in shaping this exciting future.
In conclusion, this review serves as a comprehensive resource for researchers and practition-
ers, thoroughly analyzing the current state of on-device LLMs and illuminating critical areas
for future research and development. As the field of on-device LLMs continues to evolve
rapidly, it is imperative that the research community remains committed to addressing the
challenges and embracing the opportunities presented by this transformative technology.

References
01.AI. Yi 1.5. https://github.com/01-ai/Yi-1.5, 2024.
Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah,
Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl,
Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Qin Cai, Martin Cai, Caio
César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Dong Chen, Dongdong Chen,
Yen-Chun Chen, Yi-Ling Chen, Parul Chopra, Xiyang Dai, Allie Del Giorno, Gustavo
de Rosa, Matthew Dixon, Ronen Eldan, Victor Fragoso, Dan Iter, Mei Gao, Min Gao,
Jianfeng Gao, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng
Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos
Karampatziakis, Dongwoo Kim, Mahoud Khademi, Lev Kurilenko, James R. Lee, Yin Tat
Lee, Yuanzhi Li, Yunsheng Li, Chen Liang, Lars Liden, Ce Liu, Mengchen Liu, Weishung
Liu, Eric Lin, Zeqi Lin, Chong Luo, Piyush Madan, Matt Mazzola, Arindam Mitra, Hardik
Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet,
Reid Pryzant, Heyang Qin, Marko Radmilac, Corby Rosset, Sambudha Roy, Olatunji
Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning
Shang, Hiteshi Sharma, Swadheen Shukla, Xia Song, Masahiro Tanaka, Andrea Tupini,


Xin Wang, Lijuan Wang, Chunyu Wang, Yu Wang, Rachel Ward, Guanhua Wang, Philipp
Witte, Haiping Wu, Michael Wyatt, Bin Xiao, Can Xu, Jiahang Xu, Weijian Xu, Sonali Yadav,
Fan Yang, Jianwei Yang, Ziyi Yang, Yifan Yang, Donghan Yu, Lu Yuan, Chengruidong
Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang,
and Xiren Zhou. Phi-3 technical report: A highly capable language model locally on your
phone, 2024.
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al.
Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Amey Agrawal, Anmol Agarwal, Nitin Kedia, Jayashree Mohan, Souvik Kundu, Nipun
Kwatra, Ramachandran Ramjee, and Alexey Tumanov. Metron: Holistic performance
evaluation framework for llm inference systems. arXiv preprint arXiv:2407.07000, 2024a.
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S
Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency
tradeoff in llm inference with sarathi-serve. arXiv preprint arXiv:2403.02310, 2024b.
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David,
Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can,
not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691,
2022.
Google AI. Gboard smart reply. Google AI Developer Website, 2024a. URL https://
developer.android.com/ai/aicore#gboard-smart.
Google AI. Mediapipe solutions guide. Google AI Developer Website, 2024b. URL https:
//ai.google.dev/edge/mediapipe/solutions/guide.
Open AI. Figure 01 robot. Figure website, 2024c. URL https://www.figure.ai/.
Melissa Ailem, Katerina Marazopoulou, Charlotte Siska, and James Bono. Examining the
robustness of llm evaluation to the distributional assumptions of benchmarks. arXiv
preprint arXiv:2404.16966, 2024.
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón,
and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from
multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
Alibaba. Mnn: A lightweight deep neural network inference engine. https://github.com/alibaba/MNN, 2024.
Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxan-
dra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay,
Quentin Malartic, et al. The falcon series of open language models. arXiv preprint
arXiv:2311.16867, 2023.
Alan Ansell, Ivan Vulić, Hannah Sterz, Anna Korhonen, and Edoardo M Ponti. Scaling
sparse fine-tuning to large language models. arXiv preprint arXiv:2401.16405, 2024.
Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi,
Ziyang Yu, Mengdan Zhu, Yifei Zhang, et al. Beyond efficiency: A systematic survey of
resource-efficient large language models. arXiv preprint arXiv:2401.00625, 2024.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin
Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609,
2023a.
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang
Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding,
localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023b.


BentoML. Openllm: Open-source library for language model lifecycle management. https:
//github.com/bentoml/OpenLLM, 2024.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.
Language models are few-shot learners. Advances in neural information processing systems,
33:1877–1901, 2020.
Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and
Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding
heads. arXiv preprint arXiv:2401.10774, 2024a.
Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui
Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297,
2024b.
Zouying Cao, Yifei Yang, and Hai Zhao. Head-wise shareable attention for large language
models. arXiv preprint arXiv:2402.11819, 2024.
Samuel Carreira, Tomás Marques, José Ribeiro, and Carlos Grilo. Revolutionizing mobile in-
teraction: Enabling a 3 billion parameter gpt llm on mobile. arXiv preprint arXiv:2310.01434,
2023.
Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He,
Chenlong Wang, Huichi Zhou, Yiqiang Li, et al. Gui-world: A dataset for gui-oriented
multimodal llm-based agents. arXiv preprint arXiv:2406.10819, 2024a.
Junying Chen, Xidong Wang, Anningzhe Gao, Feng Jiang, Shunian Chen, Hongbo Zhang,
Dingjie Song, Wenya Xie, Chuyi Kong, Jianquan Li, et al. Huatuogpt-ii, one-stage training
for medical adaption of llms. arXiv preprint arXiv:2311.09774, 2023.
Wei Chen and Zhiyuan Li. Octopus v2: On-device language model for super agent. arXiv
preprint arXiv:2404.01744, 2024a.
Wei Chen and Zhiyuan Li. Octopus v3: Technical report for on-device sub-billion multi-
modal ai agent. arXiv preprint arXiv:2404.11459, 2024b.
Wei Chen and Zhiyuan Li. Octopus v4: Graph of language models. arXiv preprint
arXiv:2404.19296, 2024c.
Wei Chen, Zhiyuan Li, and Mingyuan Ma. Octopus: On-device language model for function
calling of software apis. arXiv preprint arXiv:2404.01549, 2024b.
Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. Towards understanding
the mixture-of-experts layer in deep learning. Advances in neural information processing
systems, 35:23049–23062, 2022.
Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun,
Yiming Hu, Xinyang Lin, Bo Zhang, et al. Mobilevlm v2: Faster and stronger baseline for
vision language model. arXiv preprint arXiv:2402.03766, 2024.
Tolga Çöplü, Marc Loedi, Arto Bendiken, Mykhailo Makohin, Joshua J Bouw, and Stephen
Cobb. A performance evaluation of a quantized large language model on various smart-
phones. arXiv preprint arXiv:2312.12472, 2023.
Badhan Chandra Das, M Hadi Amini, and Yanzhao Wu. Security and privacy challenges of
large language models: A survey. arXiv preprint arXiv:2402.00888, 2024.
Yunxiao Deng, Carl Kesselman, Suvrajeet Sen, and Jiajun Xu. Computational operations
research exchange (core): A cyber-infrastructure for analytics. In 2019 Winter Simulation
Conference (WSC), pp. 3447–3456. IEEE, 2019.
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Gpt3.int8(): 8-bit
matrix multiplication for transformers at scale. Advances in Neural Information Processing
Systems, 35:30318–30332, 2022.


Nobel Dhar, Bobin Deng, Dan Lo, Xiaofeng Wu, Liang Zhao, and Kun Suo. An empirical
analysis and resource footprint study of deploying large language models on edge devices.
In Proceedings of the 2024 ACM Southeast Conference, pp. 69–76, 2024.
Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu,
Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of
language models with mixture-of-experts. In International Conference on Machine Learning,
pp. 5547–5569. PMLR, 2022.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle,
Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3
herd of models. arXiv preprint arXiv:2407.21783, 2024.
Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences.
Minds and Machines, 30:681–694, 2020.
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate
post-training quantization for generative pre-trained transformers. arXiv preprint
arXiv:2210.17323, 2022.
Zihao Fu, Wai Lam, Qian Yu, Anthony Man-Cho So, Shengding Hu, Zhiyuan Liu, and Nigel
Collier. Decoder-only or encoder-decoder? interpreting language model as a regularized
encoder-decoder. arXiv preprint arXiv:2304.04052, 2023.
Mingqi Gao, Xinyu Hu, Jie Ruan, Xiao Pu, and Xiaojun Wan. Llm-based nlg evaluation:
Current status and challenges. arXiv preprint arXiv:2402.01383, 2024.
Yingqiang Ge, Wenyue Hua, Kai Mei, Juntao Tan, Shuyuan Xu, Zelong Li, Yongfeng Zhang,
et al. Openagi: When llm meets domain experts. Advances in Neural Information Processing
Systems, 36, 2024.
Georgi Gerganov. llama.cpp: LLM inference in C/C++. https://github.com/ggerganov/llama.cpp, 2023.
Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas,
Guanyu Feng, Hanlin Zhao, Hanyu Lai, et al. Chatglm: A family of large language
models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793, 2024.
Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun
Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language
model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023.
Google. Gemma 2-9b. Google website, 2024a. URL https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf.
Google. Google talkback. Google website, 2024b. URL https://store.google.com/intl/en/ideas/articles/gemini-nano-google-pixel/.
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord,
Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. Olmo: Acceler-
ating the science of language models. arXiv preprint arXiv:2402.00838, 2024.
Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large
language models. In The Twelfth International Conference on Learning Representations, 2023.
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno,
Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi,
et al. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023.
Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin,
Yu Qiao, Peng Gao, and Xiangyu Yue. Onellm: One framework to align all modalities
with language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 26584–26595, 2024.


Zixu Hao, Huiqiang Jiang, Shiqi Jiang, Ju Ren, and Ting Cao. Hybrid slm and llm for
edge-cloud collaborative inference. In Proceedings of the Workshop on Edge and Mobile
Foundation Models, pp. 36–41, 2024.
Yongjun He, Yao Lu, and Gustavo Alonso. Deferred continuous batching in resource-
efficient large language model serving. In Proceedings of the 4th Workshop on Machine
Learning and Systems, pp. 98–106, 2024.
Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao
Feng, Chenxi Wang, Sa Wang, Yungang Bao, et al. Inference without interference: Disag-
gregate llm inference for mixed downstream workloads. arXiv preprint arXiv:2401.11181,
2024a.
Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei
Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small
language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024b.
Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele
Magno, and Xiaojuan Qi. Billm: Pushing the limit of post-training quantization for llms.
arXiv preprint arXiv:2402.04291, 2024a.
Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li,
Xiaofan Zhang, and Deming Chen. New solutions on llm acceleration, optimization, and
application. arXiv preprint arXiv:2406.10903, 2024b.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network
training by reducing internal covariate shift. In International conference on machine learning,
pp. 448–456. pmlr, 2015.
Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive
mixtures of local experts. Neural computation, 3(1):79–87, 1991.
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh
Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile
Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary,
Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian
Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024a.
Peng Jiang, Christian Sonne, Wangliang Li, Fengqi You, and Siming You. Preventing the
immense increase in the life-cycle energy and carbon footprints of llm-powered intelligent
chatbots. Engineering, 2024b.
Christoforos Kachris. A survey on hardware accelerators for large language models. arXiv
preprint arXiv:2401.09890, 2024.
Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and
Robert McHardy. Challenges and applications of large language models. arXiv preprint
arXiv:2307.10169, 2023.
Liu Ke, Xuan Zhang, Jinin So, Jong-Geon Lee, Shin-Haeng Kang, Sukhan Lee, Songyi Han,
YeonGon Cho, Jin Hyun Kim, Yongsuk Kwon, et al. Near-memory processing in action:
Accelerating personalized recommendation with axdimm. IEEE Micro, 42(1):116–127,
2021.
Aymen Rayane Khouas, Mohamed Reda Bouadjenek, Hakim Hacid, and Sunil Aryal.
Training machine learning models at the edge: A survey. arXiv preprint arXiv:2403.02619,
2024.
Byeongho Kim, Sanghoon Cha, Sangsoo Park, Jieun Lee, Sukhan Lee, Shin-haeng Kang,
Jinin So, Kyungsoo Kim, Jin Jung, Jong-Geon Lee, et al. The breakthrough memory
solutions for improved performance on llm inference. IEEE Micro, 2024a.


Byeongho Kim, Sanghoon Cha, Sangsoo Park, Jieun Lee, Sukhan Lee, Shin-haeng Kang,
Jinin So, Kyungsoo Kim, Jin Jung, Jong-Geon Lee, et al. The breakthrough memory
solutions for improved performance on llm inference. IEEE Micro, 2024b.
Byeongho Kim, Sanghoon Cha, Sangsoo Park, Jieun Lee, Sukhan Lee, Shin-haeng Kang,
Jinin So, Kyungsoo Kim, Jin Jung, Jong-Geon Lee, et al. The breakthrough memory
solutions for improved performance on llm inference. IEEE Micro, 2024c.
Jin Hyun Kim, Shin-haeng Kang, Sukhan Lee, Hyeonsu Kim, Woongjae Song, Yuhwan
Ro, Seungwon Lee, David Wang, Hyunsung Shin, Bengseng Phuah, et al. Aquabolt-xl:
Samsung hbm2-pim with in-memory processing for ml accelerators and beyond. In 2021
IEEE Hot Chips 33 Symposium (HCS), pp. 1–26. IEEE, 2021.
Siwon Kim, Sangdoo Yun, Hwaran Lee, Martin Gubri, Sungroh Yoon, and Seong Joon Oh.
Propile: Probing privacy leakage in large language models. Advances in Neural Information
Processing Systems, 36, 2024d.
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu,
Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large
language model serving with pagedattention. In Proceedings of the 29th Symposium on
Operating Systems Principles, pp. 611–626, 2023.
Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier,
and Richard Dufour. Biomistral: A collection of open-source pretrained large language
models for medical domains. arXiv preprint arXiv:2402.10373, 2024.
Stefanos Laskaridis, Kleomenis Kateveas, Lorenzo Minto, and Hamed Haddadi. Melting
point: Mobile evaluation of language transformers. arXiv preprint arXiv:2403.12844, 2024.
Quentin Lhoest, Albert Villanova Del Moral, Yacine Jernite, Abhishek Thakur, Patrick
Von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall,
et al. Datasets: A community library for natural language processing. arXiv preprint
arXiv:2109.02846, 2021.
Chenyang Li, Jihoon Chung, Biao Cai, Haimin Wang, Xianlian Zhou, and Bo Shen. On model
compression for neural networks: Framework, algorithm, and convergence guarantee.
arXiv preprint arXiv:2303.06815, 2023a.
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal,
Etash Guha, Sedrick Keh, Kushal Arora, et al. Datacomp-lm: In search of the next
generation of training sets for language models. arXiv preprint arXiv:2406.11794, 2024a.
Jing Li, Zhijie Sun, Xuan He, Li Zeng, Yi Lin, Entong Li, Binfan Zheng, Rongqian Zhao,
and Xin Chen. Locmoe: A low-overhead moe for large language model training. arXiv
preprint arXiv:2401.13920, 2024b.
Yansong Li, Zhixing Tan, and Yang Liu. Privacy-preserving prompt tuning for large lan-
guage model services. arXiv preprint arXiv:2305.06212, 2023b.
Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu,
Wenxing Xu, Xiang Wang, Yi Sun, et al. Personal llm agents: Insights and survey about
the capability, efficiency and security. arXiv preprint arXiv:2401.05459, 2024c.
Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu,
Wenxing Xu, Xiang Wang, Yi Sun, et al. Personal llm agents: Insights and survey about
the capability, efficiency and security. arXiv preprint arXiv:2401.05459, 2024d.
Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat
Lee. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463,
2023c.
Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning,
and Li Yuan. Moe-llava: Mixture of experts for large vision-language models. arXiv
preprint arXiv:2401.15947, 2024a.


Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, and Song Han. On-
device training under 256kb memory. Advances in Neural Information Processing Systems,
35:22941–22954, 2022.
Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, and Song Han. Tiny machine learning:
progress and futures [feature]. IEEE Circuits and Systems Magazine, 23(3):8–34, 2023a.
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangx-
uan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight
quantization for on-device llm compression and acceleration. Proceedings of Machine
Learning and Systems, 6:87–100, 2024b.
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangx-
uan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight
quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978, 2024c.
Ye Lin, Mingxuan Wang, Zhexi Zhang, Xiaohui Wang, Tong Xiao, and Jingbo Zhu. Under-
standing parameter sharing in transformers. arXiv preprint arXiv:2306.09380, 2023b.
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual
instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 26296–26306, 2024a.
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.
Advances in neural information processing systems, 36, 2024b.
Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov,
Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, et al. Mo-
bilellm: Optimizing sub-billion parameter language models for on-device use cases. arXiv
preprint arXiv:2402.14905, 2024c.
Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali
Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. Deja vu: Contextual
sparsity for efficient llms at inference time. In International Conference on Machine Learning,
pp. 22137–22176. PMLR, 2023.
Lefteris Loukas, Ilias Stogiannidis, Odysseas Diamantopoulos, Prodromos Malakasiotis,
and Stavros Vassos. Making llms worth every penny: Resource-limited text classification
in banking. In Proceedings of the Fourth ACM International Conference on AI in Finance, pp.
392–400, 2023.
Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I Venieris, Hongxi-
ang Fan, et al. Hardware-aware parallel prompt decoding for memory-efficient accelera-
tion of llm inference. arXiv preprint arXiv:2405.18628, 2024.
Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang,
Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei. The era of 1-bit llms: All large
language models are in 1.58 bits. arXiv preprint arXiv:2402.17764, 2024.
Amama Mahmood, Junxiang Wang, Bingsheng Yao, Dakuo Wang, and Chien-Ming Huang.
Llm-powered conversational voice assistants: Interaction patterns, opportunities, chal-
lenges, and design guidelines. arXiv preprint arXiv:2309.13879, 2023.
Market.us. Edge ai market. Market.us Online Report, July 2024. Accessed on 2024-07-28.
Saeed Masoudnia and Reza Ebrahimpour. Mixture of experts: a literature survey. Artificial
Intelligence Review, 42:275–293, 2014.
Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp
Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, et al. Mm1: Methods,
analysis & insights from multimodal llm pre-training. arXiv preprint arXiv:2403.09611,
2024.
Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin,
Chenfan Sun, Seyed Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal,
et al. Openelm: An efficient language model family with open training and inference
framework. In Workshop on Efficient Systems for Foundation Models II, 2024.
Meta. Meta llama 3. https://ai.meta.com/blog/meta-llama-3/, 2024.
MosaicML. Mpt-7b. https://www.databricks.com/blog/mpt-7b, 2023.
Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby
Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, et al. Mobileaibench: Benchmarking
llms and lmms for on-device use cases. arXiv preprint arXiv:2406.10290, 2024.
Markus Nagel, Marios Fournarakis, Yelysei Bondarenko, and Tijmen Blankevoort. Over-
coming oscillations in quantization-aware training. In International Conference on Machine
Learning, pp. 16318–16330. PMLR, 2022.
Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers.
Using an llm to help with code understanding. In Proceedings of the IEEE/ACM 46th
International Conference on Software Engineering, pp. 1–13, 2024.
Xuefei Ning, Zinan Lin, Zixuan Zhou, Huazhong Yang, and Yu Wang. Skeleton-of-thought:
Large language models can do parallel decoding. arXiv preprint arXiv:2307.15337, 2023.
Oleksiy Ostapenko, Zhan Su, Edoardo Maria Ponti, Laurent Charlin, Nicolas Le Roux,
Matheus Pereira, Lucas Caccia, and Alessandro Sordoni. Towards modular llms by
building and reusing a library of loras. arXiv preprint arXiv:2405.11157, 2024.
Yeonhong Park, Jake Hyun, SangLyul Cho, Bonggeun Sim, and Jae W Lee. Any-precision llm:
Low-cost deployment of multiple, different-sized llms. arXiv preprint arXiv:2402.10517,
2024.
Bohao Peng, Zhuotao Tian, Shu Liu, Mingchang Yang, and Jiaya Jia. Scalable language
model with generalized continual learning. arXiv preprint arXiv:2404.07470, 2024.
Cheng Peng, Xi Yang, Aokun Chen, Kaleb E Smith, Nima PourNejatian, Anthony B Costa,
Cheryl Martin, Mona G Flores, Ying Zhang, Tanja Magoc, et al. A study of generative
large language model for medical research and healthcare. NPJ digital medicine, 6(1):210,
2023.
Plaud. Plaud note summarizer. Plaud website, 2024. URL https://www.plaud.ai/.
PyTorch. executorch: Overview. PyTorch Official Website, 2024. URL https://pytorch.org/executorch-overview.
Biqing Qi, Xinquan Chen, Junqi Gao, Dong Li, Jianxing Liu, Ligang Wu, and Bowen Zhou.
Interactive continual learning: Fast and slow thinking. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 12882–12892, 2024.
Ali Cloud Qwen Team. Qwen 2-0.5b. Github, 2024. URL https://github.com/QwenLM/Qwen2.
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language
understanding by generative pre-training. 2018.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al.
Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive
multimodal large language model for long video understanding. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14313–14323, 2024.
Rajarshi Saha, Varun Srivastava, and Mert Pilanci. Matrix compression via randomized low
rank and low precision factorization. Advances in Neural Information Processing Systems, 36,
2023.
Sivan Schwartz, Avi Yaeli, and Segev Shlomov. Enhancing trust in llm-based ai automation
agents: New considerations and future challenges. arXiv preprint arXiv:2308.05391, 2023.
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey
Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-
of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
Yikang Shen, Zhen Guo, Tianle Cai, and Zengyi Qin. Jetmoe: Reaching llama2 performance with 0.1m dollars. arXiv preprint arXiv:2404.07413, 2024.
Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, and Hao
Wang. Continual learning of large language models: A comprehensive survey. arXiv
preprint arXiv:2404.16789, 2024.
Ozge Mercanoglu Sincan, Necati Cihan Camgoz, and Richard Bowden. Using an llm to turn
sign spottings into spoken language sentences. arXiv preprint arXiv:2403.10434, 2024.
Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell
Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, et al. Dolma: An open
corpus of three trillion tokens for language model pretraining research. arXiv preprint
arXiv:2402.00159, 2024.
Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. Powerinfer: Fast large language model
serving with a consumer-grade gpu. arXiv preprint arXiv:2312.12456, 2023.
Jovan Stojkovic, Esha Choukse, Chaojie Zhang, Inigo Goiri, and Josep Torrellas. Towards
greener llms: Bringing energy-efficiency to the forefront of llm inference. arXiv preprint
arXiv:2403.20306, 2024.
Jing Su, Chufeng Jiang, Xin Jin, Yuxin Qiao, Tingsong Xiao, Hongda Ma, Rong Wei, Zhi
Jing, Jiajun Xu, and Junhong Lin. Large language models for forecasting and anomaly
detection: A systematic literature review. arXiv preprint arXiv:2402.10350, 2024.
taivo. Gpt4 response time. OpenAI community, 2023. URL https://community.openai.com/t/gpt-3-5-and-gpt-4-api-response-time-measurements-fyi/237394/.
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui
Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family
of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju,
Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al.
Gemma: Open models based on gemini research and technology. arXiv preprint
arXiv:2403.08295, 2024.
InternLM Team. Internlm: A multilingual language model with progressively enhanced
capabilities, 2023.
MLC team. MLC-LLM, 2023. URL https://github.com/mlc-ai/mlc-llm.
VLLM Project Team. Vllm documentation. VLLM Documentation Website, 2024. URL
https://docs.vllm.ai/en/stable/.
Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Jia,
Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and
large vision-language models. arXiv preprint arXiv:2402.12289, 2024.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux,
Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al.
Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,
2023a.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2:
Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
Modelbest Inc. and Tsinghua University. Minicpm-llama3-v 2.5. Hugging Face, 2024. URL https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5.
Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan
Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, et al. Towards generalist
biomedical ai. NEJM AI, 1(3):AIoa2300138, 2024.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural informa-
tion processing systems, 30, 2017.
Werner Vogels. Distill-cli meeting summarizer. Github, 2024. URL https://github.com/awslabs/distill-cli.
Shakti N Wadekar, Abhishek Chaurasia, Aman Chadha, and Eugenio Culurciello. The
evolution of multimodal model architectures. arXiv preprint arXiv:2405.17927, 2024.
Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mir-
samadi, Aarshee Mishra, and Erik Marchi. A multimodal approach to device-directed
speech detection with large language models. In ICASSP 2024-2024 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 10451–10455. IEEE, 2024.
Lily Jiaxin Wan, Yingbing Huang, Yuhong Li, Hanchen Ye, Jinghua Wang, Xiaofan Zhang,
and Deming Chen. Software/hardware co-design for llm and its application for design
verification. In 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC),
pp. 435–441. IEEE, 2024.
Guanqun Wang, Jiaming Liu, Chenxuan Li, Yuan Zhang, Junpeng Ma, Xinyu Wei, Kevin
Zhang, Maurice Chong, Renrui Zhang, Yijiang Liu, et al. Cloud-device collaborative
learning for multimodal large language models. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pp. 12646–12655, 2024a.
Yiming Wang, Yu Lin, Xiaodong Zeng, and Guannan Zhang. Privatelora for efficient privacy
preserving llm. arXiv preprint arXiv:2311.14030, 2023.
Yuxin Wang, Yuhan Chen, Zeyu Li, Zhenheng Tang, Rui Guo, Xin Wang, Qiang Wang,
Amelie Chi Zhou, and Xiaowen Chu. Towards efficient and reliable llm serving: A
real-world workload study. arXiv preprint arXiv:2401.17644, 2024b.
Grant Wilkins, Srinivasan Keshav, and Richard Mortier. Offline energy-optimal llm serving:
Workload-based energy models for llm inference on heterogeneous systems. arXiv preprint
arXiv:2407.04014, 2024.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony
Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s
transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771,
2019.
Jiayang Wu, Wensheng Gan, Zefeng Chen, Shicheng Wan, and Philip S Yu. Multimodal large
language models: A survey. In 2023 IEEE International Conference on Big Data (BigData),
pp. 2247–2256. IEEE, 2023a.
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin
Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications
via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023b.
Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any
multimodal llm. arXiv preprint arXiv:2309.05519, 2023c.
Tongtong Wu, Linhao Luo, Yuan-Fang Li, Shirui Pan, Thuy-Trang Vu, and Gholam-
reza Haffari. Continual learning for large language models: A survey. arXiv preprint
arXiv:2402.01364, 2024.
Junlin Xie, Zhihong Chen, Ruifei Zhang, Xiang Wan, and Guanbin Li. Large multimodal
agents: A survey. arXiv preprint arXiv:2402.15116, 2024.
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai
Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the trans-
former architecture. In International Conference on Machine Learning, pp. 10524–10533.
PMLR, 2020.
Daliang Xu, Wangsong Yin, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe
Liu. Llmcad: Fast and scalable on-device large language model inference. arXiv preprint
arXiv:2309.04255, 2023.
Jiajun Xu and Suvrajeet Sen. Compromise policy for multi-stage stochastic linear pro-
gramming: Variance and bias reduction. Computers & Operations Research, 153:106132,
2023.
Jiajun Xu and Suvrajeet Sen. Ensemble variance reduction methods for stochastic mixed-
integer programming and their application to the stochastic facility location problem.
INFORMS Journal on Computing, 36(2):587–599, 2024.
Jiajun Xu, Qun Wang, Yuhang Cao, Baitao Zeng, and Sicheng Liu. A general-purpose device
for interaction with llms. arXiv preprint arXiv:2408.10230, 2024a.
Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang
Wu, Yihao Zhao, Chen Yang, Shihe Wang, et al. A survey of resource-efficient llm and
multimodal foundation models. arXiv preprint arXiv:2401.08092, 2024b.
Nan Xue, Yaping Sun, Zhiyong Chen, Meixia Tao, Xiaodong Xu, Liang Qian, Shuguang Cui,
and Ping Zhang. Wdmoe: Wireless distributed large language models with mixture of
experts. arXiv preprint arXiv:2405.03131, 2024a.
Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, and Haibo Chen. Powerinfer-2:
Fast large language model inference on a smartphone. arXiv preprint arXiv:2406.06282,
2024b.
Zheyu Yan, Yifan Qin, Xiaobo Sharon Hu, and Yiyu Shi. On the viability of using llms for
sw/hw co-design: An example in designing cim dnn accelerators. In 2023 IEEE 36th
International System-on-Chip Conference (SOCC), pp. 1–6. IEEE, 2023.
Haoyan Yang, Zhitao Li, Yong Zhang, Jianzong Wang, Ning Cheng, Ming Li, and Jing Xiao.
Pfid: Privacy first inference delegation framework for llms. arXiv preprint arXiv:2406.12238,
2024a.
Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang,
Shaochen Zhong, Bing Yin, and Xia Hu. Harnessing the power of llms in practice: A
survey on chatgpt and beyond. ACM Transactions on Knowledge Discovery from Data, 18(6):
1–32, 2024b.
Zheming Yang, Yuanhao Yang, Chang Zhao, Qi Guo, Wenkai He, and Wen Ji. Perllm:
Personalized inference scheduling with edge-cloud collaboration for diverse llm services.
arXiv preprint arXiv:2405.14636, 2024c.
Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Zhibo Sun, and Yue Zhang. A survey
on large language model (llm) security and privacy: The good, the bad, and the ugly.
High-Confidence Computing, pp. 100211, 2024a.
Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, and Yuxiong He. Exploring post-training
quantization in llms from comprehensive study to low rank compensation. In Proceedings
of the AAAI Conference on Artificial Intelligence, volume 38, pp. 19377–19385, 2024b.
Zhi Yao, Zhiqing Tang, Jiong Lou, Ping Shen, and Weijia Jia. Velo: A vector database-assisted
cloud-edge collaborative llm qos optimization framework. arXiv preprint arXiv:2406.13399,
2024c.
Rongjie Yi, Liwei Guo, Shiyun Wei, Ao Zhou, Shangguang Wang, and Mengwei Xu. Edge-
moe: Fast on-device inference of moe-based large language models. arXiv preprint
arXiv:2308.14352, 2023.
Wangsong Yin, Mengwei Xu, Yuanchun Li, and Xuanzhe Liu. Llm as a system service on
mobile devices. arXiv preprint arXiv:2403.11805, 2024.
Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li,
Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. Yi: Open foundation models by 01.ai.
arXiv preprint arXiv:2403.04652, 2024.
Yizhen Yuan, Rui Kong, Yuanchun Li, and Yunxin Liu. Wip: An on-device llm-based
approach to query privacy protection. In Proceedings of the Workshop on Edge and Mobile
Foundation Models, pp. 7–9, 2024.
Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang.
Agenttuning: Enabling generalized agent abilities for llms. arXiv preprint arXiv:2310.12823,
2023a.
Fanlong Zeng, Wensheng Gan, Yongheng Wang, Ning Liu, and Philip S Yu. Large language
models for robotics: A survey. arXiv preprint arXiv:2311.07226, 2023b.
Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural
Information Processing Systems, 32, 2019.
Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen,
Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, et al. Ferret-v2: An im-
proved baseline for referring and grounding with large language models. arXiv preprint
arXiv:2404.07973, 2024a.
Mingjin Zhang, Jiannong Cao, Xiaoming Shen, and Zeyang Cui. Edgeshard: Efficient llm
inference via collaborative edge computing. arXiv preprint arXiv:2405.14371, 2024b.
Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. Tinyllama: An open-source
small language model. arXiv preprint arXiv:2401.02385, 2024c.
Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan
Lu, Hongsheng Li, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models
with zero-init attention. arXiv preprint arXiv:2303.16199, 2023a.
Ruisi Zhang, Shehzeen Samarah Hussain, Paarth Neekhara, and Farinaz Koushanfar.
Remark-llm: A robust and efficient watermarking framework for generative large lan-
guage models. arXiv preprint arXiv:2310.12362, 2023b.
Shiquan Zhang, Ying Ma, Le Fang, Hong Jia, Simon D’Alfonso, and Vassilis Kostakos.
Enabling on-device llms personalization with smartphone sensing. arXiv preprint
arXiv:2407.04418, 2024d.
Xiaojin Zhang, Yulin Fei, Yan Kang, Wei Chen, Lixin Fan, Hai Jin, and Qiang Yang. No
free lunch theorem for privacy-preserving llm inference. arXiv preprint arXiv:2405.20681,
2024e.
Xinyuan Zhang, Jiang Liu, Zehui Xiong, Yudong Huang, Gaochang Xie, and Ran Zhang.
Edge intelligence optimization for large language model inference with batching and
quantization. arXiv preprint arXiv:2405.07140, 2024f.
Haiyan Zhao, Fan Yang, Himabindu Lakkaraju, and Mengnan Du. Opening the black
box of large language models: Two views on holistic interpretability. arXiv preprint
arXiv:2402.10688, 2024a.
Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and
Yuandong Tian. Galore: Memory-efficient llm training by gradient low-rank projection.
arXiv preprint arXiv:2403.03507, 2024b.
Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, and Chuan Wu. Llm-pq: Serving llm
on heterogeneous clusters with phase-aware partition and adaptive quantization. arXiv
preprint arXiv:2403.01136, 2024c.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao
Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with
mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024a.
Zangwei Zheng, Xiaozhe Ren, Fuzhao Xue, Yang Luo, Xin Jiang, and Yang You. Response
length perception and sequence scheduling: An llm-empowered llm inference pipeline.
Advances in Neural Information Processing Systems, 36, 2024b.
Zoom. Zoom meeting summarizer. Zoom website, 2024. URL https://news.zoom.us/zoom-iq-meeting-summary-chat-compose-free-trial/.