CoActo: CoActive Neural Network Inference Offloading with Fine-Grained and Concurrent Execution

Kyungmin Bin, Seoul National University, [email protected]
Jongseok Park, Seoul National University, [email protected]
Chanjeong Park, Seoul National University, [email protected]

ABSTRACT
[…] mobile-server neural network inference offloading. However, we find that existing collaborative inference solutions only focus on partitioning the DNN computation, which is only a small part of achieving an efficient DNN offloading system. What ultimately determines the performance of DNN offloading is how the execution system utilizes the characteristics of the given DNN offloading task on the mobile, network, and server resources of the offloading environment. To this end, we design CoActo, a DNN execution system […] which is a new, improved concept of DNN offloading that adds two properties, 1) fine-grained expression of DNNs and 2) concurrency of runtime resources, to existing collaborative inference. In CoActo, system components go beyond simple model splitting of existing approaches and operate more proactively to achieve the coactive execution of inference workloads. CoActo dynamically schedules concurrent interleaving of the mobile, server, and network operations to actively increase resource utilization, enabling lower end-to-end latency. We implement CoActo for various mobile devices and server environments and evaluate our system with distinct environment settings and DNN models. The experimental results show that our system achieves up to 2.1 times speed-up compared to the state-of-the-art collaborative inference solutions.

Figure 1: Illustration of (a) conventional collaborative DNN inference offloading and (b) the coactive DNN inference offloading of CoActo. The proposed coactive approach enables concurrent execution of computation and communications, enabling a novel opportunity of latency reduction in inference offloading.
CCS CONCEPTS
• Computing methodologies → Parallel algorithms.

KEYWORDS
Convolutional Neural Networks; Parallel Computing Algorithms

ACM Reference Format:
Kyungmin Bin, Jongseok Park, Chanjeong Park, Seyeon Kim, and Kyunghan Lee. 2024. CoActo: CoActive Neural Network Inference Offloading with Fine-grained and Concurrent Execution. In The 22nd Annual International Conference on Mobile Systems, Applications and Services (MOBISYS '24), June 3–7, 2024, Minato-ku, Tokyo, Japan. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3643832.3661885

This work is licensed under a Creative Commons Attribution International 4.0 License.
MOBISYS '24, June 3–7, 2024, Minato-ku, Tokyo, Japan
© 2024 Copyright is held by the owner/author(s).
ACM ISBN 979-8-4007-0581-6/24/06.
https://doi.org/10.1145/3643832.3661885

1 INTRODUCTION
With the rapid development of neural networks, AI-based mobile applications [19, 24] are now providing high-quality services that are comparable to those of human experts. The user-centric nature of these mobile applications often puts emphasis on user interaction, such as understanding voice commands, analyzing visual input, and interpreting human actions, which enable user experiences to be more relevant to their requests. However, user interactions can also significantly deteriorate user experience when a high latency of the response introduces a delay to the operation of the application. As such, low-latency responses to user requests are a crucial factor for a good user experience of AI-based mobile services. To achieve this, these services generally strive for a latency service-level objective (SLO) [7, 14] instead of a throughput objective, usually on the order of milliseconds.
However, with the ever-increasing complexity of modern AI models and services, meeting the latency SLO solely with local processing in resource-constrained mobile devices has proven to be extremely challenging. Even with the help of modern mobile processors, executing a DNN solely on a mobile device consumes a considerable amount of time, power, and memory [20, 37], and the execution becomes near-impossible with large-scale DNNs, such as generative NLPs. Instead, many services opted to offload the DNN inference to remote high-performance cloud servers, by transmitting locally collected user data to the servers. For many mobile AI services with low latency SLOs, this mobile-server inference offloading has become the dominating, if not only, solution. As such, the significance of mobile-server inference offloading naturally led to a series of works that aim to expand upon its design.

One of the most significant improvements over the traditional mobile-server inference offloading is achieved by mobile-server collaborative inference [15]. Unlike the traditional offloading approach, collaborative inference actively leverages the increasing AI processing capabilities of modern mobile devices, by splitting the DNN computation in a way that balances the computation between the mobile and server computation resources. By carefully profiling the characteristics of the three main resources of inference offloading, i.e., mobile, server, and network resources, collaborative inference approaches search and generate a partitioning and scheduling scheme that achieves the minimum combined latency of both the computation and the communications within its solution space, enabling much improved end-to-end latency in inference offloading. As such, many collaborative inference approaches [6, 10, 11, 13, 16, 17] have been proposed in recent years, each providing different solutions to finding the optimal partitioning and scheduling scheme in mobile-server collaborative inference.

Despite these efforts, we find existing collaborative inferences to be far from complete. Existing works focus on the question of how to split the DNN computation between the mobile and server resources for efficient DNN offloading, but this is only a part of a bigger question. What ultimately determines the end-to-end latency is the design of the execution system, of how the system can efficiently support the characteristics of the given workload under the highly stochastic characteristics of the mobile offloading environment [22, 34], such as channel dynamics in mobile networks [22, 34] and bursty requests in cloud servers [33, 36]. For DNN offloading, this not only includes the partitioning and scheduling of the workload, but also the modeling of the workload, execution algorithm, dynamic load-balancing, possibility of multi-tenant execution, and many more. Therefore, to find a more complete solution to collaborative inference, we ask the more fundamental question of "How should a DNN execution system be designed for efficient mobile DNN offloading?"

We identify two key properties that a DNN execution system must have to realize efficient DNN inferences in a mobile-server offloading environment: 1) a fine-grained expression of DNNs, and 2) flexibility of the system resource utilization. In current DNN computation frameworks, DNNs are often expressed in units of layers. However, we find that layer granularity is often too large for DNN offloading scenarios, and does not supply enough parallelism to efficiently utilize all available resources. Existing solutions attempt to alleviate this by partitioning the layers [35, 39, 40], but we find expressing DNNs in fine-grained units of tiles, instead of layers, to be the more fundamental solution that can be applied to any DNN offloading problem, regardless of the DNN and device used. Also, considering the dynamic nature of DNN offloading, the runtime components of the system must be flexible enough to support dynamic changes in the execution, such as changes in available computation resources, network conditions, and the presence of competing inference offloads. To enable such dynamic behaviors, we find that the system components for computation and networking should actively operate in parallel to dynamically allocate the resources to DNN offloading workflows that require attention depending on the system status.

To this end, we design CoActo, a DNN execution system built from the ground up for mobile-server inference offloading. The key design philosophy behind our system is Coactive DNN Inference Offloading, which is a new, improved concept of DNN offloading that adds fine-grained expression of DNNs and concurrency of runtime resources to the existing collaborative inference. In the coactive DNN inference offloading of CoActo, the DNN workload is expressed as a fine-grained tile-based dataflow graph. Using this fine-grained DNN graph, the computation and network resources of the given offloading environment are dynamically assigned and utilized to maximally leverage not only the concurrent activation of multiple offloaded workloads but also the concurrent activation of computation and communications within a single offloaded workload, allowing higher resource utilization and lower end-to-end latency in DNN offloading. For instance, in Figure 1(a), conventional collaborative DNN inference offloading simply searches the best DNN partitioning point to achieve the lowest combined latency of the mobile computation, data communications, and server computation. In contrast, as depicted in Figure 1(b), the coactive DNN inference offloading of CoActo can dynamically schedule concurrent interleaving of the mobile, server, and network operations on a fine-grained expression of DNNs, to actively increase the utilization of the resources in the offloading environment and enable lower end-to-end latency.

We find three design challenges to enable such a novel DNN execution system: 1) Devising a general model partitioner that expresses an arbitrary layer-wise DNN graph as a tile-wise DNN graph. 2) Designing an execution system that allows flexible allocation and utilization of system resources. 3) Designing an efficient scheduler that can react to the dynamic changes in the runtime environment. We propose three corresponding design concepts for the system design of CoActo, each overcoming the three challenges mentioned above. We implement CoActo for various mobile devices and server environments and evaluate CoActo in various network environments and DNN models, including recent Transformer-based DNN models. The experimental results show that our framework achieves up to 2.1 times the speed-up compared to the state-of-the-art conventional collaborative inference frameworks.

2 BACKGROUND AND MOTIVATION

2.1 Collaborative Inference
In this section, we explain the concepts and limitations of two representative approaches in collaborative inference: split computing and fused-layer offloading.
[…] how the given mobile, network, and server resources are best utilized at runtime. Concurrent, rather than sequential, use of these resources is necessary to maximize parallel resource utilization. Furthermore, this concurrency provides adaptability in dynamic environments by enabling the handling of multiple workloads simultaneously.

To realize fine-grained and concurrent execution in DNN offloading, we identify three design challenges that must be addressed.

C1) Tile-based expression: As mentioned in Section 2.2, the tiling technique is suitable for the fine-grained expression of DNNs. However, devising a general model partitioner that expresses an arbitrary layer-wise computation graph as a tile-wise computation graph poses many challenges. These challenges include determining the efficient tile dimensions and size for the given environment, automatically parsing and generating the independent data dependency flow graph between the tiles, and designing these processes in a general manner for all DNNs.

C2) Concurrent execution system: Current DNN inference frameworks such as Tensorflow [1] or PyTorch [23] execute computation at the layer level, necessitating synchronization of the resources between each computation or communication operation. Although tile-wise computational graphs allow independent computational paths, this layer-wise execution system restricts the concurrency only to the intra-layer level, leading to a serialized execution. Designing a concurrent execution system that enables overlapping the computation and communications of tiles in the independent computational paths presents a challenging task.

C3) Dynamic scheduling of tiles: The third challenge comes from the design philosophy of existing collaborative inference offloading of balancing the model executions between the mobile, network, and server resources of the given environment. Unlike the existing approach, however, our concurrent execution model does not allow the simple performance modeling of adding the individual profiled latencies of the resources. As such, a novel scheduling solution that allows dynamic adaptation and balancing of a complex fine-grained DNN between the concurrently operating resources must be realized for coactive inference.

3.2 Overview
By addressing the challenges mentioned above, we design CoActo, a novel coactive inference framework that enables fine-grained and concurrent execution for DNN offloading. CoActo comprises three components: Tile-based Partitioner (TP) (Section 3.3), Asynchronous Execution Engines (AEEs) (Section 3.4), and Dynamic Scheduler (DS) (Section 3.5). Each component has been designed to address the three challenges identified in the previous section. In the following sections, we discuss how each component tackles its corresponding challenge.

Figure 3 presents an overview of CoActo. TP transforms a layer-wise computational graph to a fine-grained tile-wise computational graph by iteratively dividing the computation of each DNN layer into multiple fine-grained tiles and graphing the data dependencies among the tiles using the characteristics of each layer. The tile-wise graph outputs of TP are saved, to be later used in the CoActo runtime, composed of AEEs and DS. The same CoActo runtime structure is used for both the mobile and the server during DNN offloading. DNN offloading is initiated by the runtime of the mobile device transferring its partitioned tile-wise computation graphs to the server runtime. AEEs are composed of three separate engines, namely the Graph Management Engine, Computing Engine, and Communication Engine, which asynchronously execute tiles in independent computational paths concurrently without any synchronization. DS dynamically decides whether or not to transmit the data of tiles at the mobile device during runtime, using the profiled network condition and the server's current computation load to estimate the completion time of the offloaded tile.

Figure 3: The overview of CoActo. Tile-based Partitioner (TP) transforms the layer-wise computation graph into a tile-wise computation graph by partitioning a layer into several fine-grained tiles. At runtime, Asynchronous Execution Engines (AEEs) concurrently compute and transmit partitioned tiles. Dynamic Scheduler (DS) makes offload decisions for each tile.

3.3 Tile-based Partitioner (TP)
The objective of TP is to automatically convert an arbitrary layer-wise DNN graph into a fine-grained computation graph of tiles. In doing so, TP combines the insight from Section 2.2 that matrix multiplications are composed of many smaller tiles which can be computed independently with the observation that many DNN layer computation kernels are executed using matrix multiplication. This leads us to tile-level computation graphs that allow fine-grained dependency management and concurrent communication and computation of independent tiles during DNN offloading.

In CoActo, a tile is an abstract unit of scheduling that each references a certain submatrix of the original tensor data. As tiles are only references, CoActo is able to construct and execute fine-grained dependency graphs on top of the existing tensor-based DNN execution model while keeping the original DNN tensors and their executions unmodified. That is, the tile abstraction only holds the metadata needed to define and access the tile, such as tile dimension sizes, memory stride per dimension, dependency relationship with other tiles, and the pointer reference to its tensor data, while the
memory structure of the original tensor data remains unmodified and contiguous in memory. Thus, by leveraging this tile abstraction, CoActo requires no additional duplication or transformation of tensor data for its tensor-parallel scheduling and execution¹ of the DNN, unlike the fused-layer offloading approaches in Section 2.1.

¹The individual tile executions may require data duplication at the computation kernel level, depending on the implementation of the kernel. For instance, a convolution kernel that combines im2col with matrix multiplication may require duplication of the input data, while loop-based convolution may not require such duplication.

Figure 4: An illustration of Tile-based Partitioner (TP) operation. An original DNN is transformed into a partitioned computation graph through 4 steps: (a) parsing the target layer's input and output tensor sizes, (b) converting input and output tensors to matrices, (c) column-wise partitioning, and (d) generating tiles by merging multiple columns, recursively in every layer.

Figure 4 shows the overall process of TP. The first step (Figure 4(a)) is to analyze the size of each layer's output tensor. For CNN tensors, TP flattens the output tensors into a matrix (Figure 4(b)) whose height equals the number of channels and whose width equals the product of the tensor's width and height (i.e., the number of elements per channel); for Transformer-based DNNs, the tensors are already in the form of matrices. After that, the matrices are decomposed into column-wise vectors, and tiles are created by merging the partitioned columns, which are uniformly shaped submatrices (Figure 4(c)). The computation graph is then reconstructed by graphing the tiles into a directed acyclic graph (DAG), using the data dependency relationship between the tiles of input and output layers (Figure 4(d)). In this step, the granularity of the tiles is determined by the number of merged columns. Merging the tiles allows for decreased scheduling overhead and an increase in weight data reuse through decreased computation granularity. TP merges the […] layer. TP evaluates the performance of the current graph under the given network and computing environment and heuristically adjusts the number of tiles until it achieves satisfactory performance. Through these steps, a fine-grained tile-wise computation graph is formed, which represents the DNN using fine-grained tiles as the graph nodes and the data dependencies among them as the graph edges.

This transformation to a fine-grained dependency graph allows a flexible and concurrent execution in DNN inference offloading. Figure 5 illustrates the difference between the tile-wise computation graph and the conventional layer-wise computation graph for DNN inference offloading. With the layer-wise computation graph, the server can start its computation of the subsequent layers only after the delivery of the whole input data. This results in the powerful server remaining idle during the data transfer, and this underutilization becomes even worse when network conditions degrade. On the contrary, the tile-wise computation graph enables concurrent computation and transmission of independent nodes, as independent tiles are allowed concurrent execution and the processing time of each tile is much shorter than the computation of the whole layer. As a result, the completion time is greatly reduced compared to the conventional layer-wise collaborative inference approaches.

Figure 5: Examples of the collaborative inference with the tile-wise computation graph (top) and the layer-wise computation graph (bottom).
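To make the flatten-and-merge step of TP concrete, the following C sketch shows one way a single convolutional output, viewed as a channels-by-elements matrix, could be split into uniformly shaped column-group tiles that only reference the original tensor memory. This is a minimal illustration of the column-wise partitioning idea only; the names (tile_t, partition_layer_columns) are hypothetical, and the dependency-edge construction and granularity search described above are intentionally omitted.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative tile descriptor: a tile only references a submatrix of the
 * layer's output tensor; the tensor memory itself is left untouched. */
typedef struct {
    float *data;       /* pointer into the original (contiguous) output tensor */
    int    rows;       /* number of channels covered by this tile              */
    int    col_start;  /* first column (element index within a channel)        */
    int    cols;       /* number of merged columns in this tile                */
    int    row_stride; /* elements per channel row in the flattened matrix     */
} tile_t;

/* Partition one layer's output, viewed as a C x (H*W) matrix, into
 * num_tiles column-group tiles of (roughly) uniform width. */
static int partition_layer_columns(float *output, int channels, int height,
                                   int width, int num_tiles, tile_t *tiles) {
    int total_cols = height * width;      /* elements per channel */
    int base = total_cols / num_tiles;
    int rem  = total_cols % num_tiles;    /* spread remainder over the first tiles */
    int col = 0;
    for (int t = 0; t < num_tiles; t++) {
        int w = base + (t < rem ? 1 : 0);
        tiles[t].data       = output + col;  /* reference, no copy */
        tiles[t].rows       = channels;
        tiles[t].col_start  = col;
        tiles[t].cols       = w;
        tiles[t].row_stride = total_cols;
        col += w;
    }
    return num_tiles;
}

int main(void) {
    int C = 4, H = 8, W = 8, n = 3;
    float *out = calloc((size_t)C * H * W, sizeof(float));
    tile_t tiles[3];
    partition_layer_columns(out, C, H, W, n, tiles);
    for (int t = 0; t < n; t++)
        printf("tile %d: %d channels, cols [%d, %d)\n",
               t, tiles[t].rows, tiles[t].col_start,
               tiles[t].col_start + tiles[t].cols);
    free(out);
    return 0;
}

In the full pipeline, tiles produced this way for adjacent layers would additionally be linked by dependency edges derived from each layer's data-access pattern, yielding the tile-wise DAG described above.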
3.4 Asynchronous Execution Engines (AEEs)
Our approach to achieving a concurrent DNN execution system involves designing Asynchronous Execution Engines (AEEs), which consist of three types of execution engines, namely the Graph Management Engine, Computing Engine, and Communication Engine, where each engine can asynchronously and independently operate without waiting for each other. This asynchronous design maximizes concurrency by executing the tiles in parallel, but also requires sophisticated coordination of the tasks to avoid potential race conditions. We now explain how the engines asynchronously operate to achieve concurrency.

3.4.1 Graph Management Engine. Managing the complex fine-grained computation graph separately in each parallel computing engine requires frequent synchronization of the graph states, resulting in severe serialization of execution. To avoid this, we separate the role of managing the complex computation graph to the Graph Management Engine, which contains the entire computation graph information and its state. With a separate graph management engine, the computing and communication engines are guaranteed to access data exclusively for every node without requiring any synchronization.

Graph management: To manage the data flow of computation graphs, we define three transition states of a node: completed, ready, and not-ready. If a node is executed or if its outputs are received from another device (mobile or server), it is in a completed state. Note that input nodes, which hold the input data for the inference, are always considered completed. If all parents of a node are in the completed state, the node can be executed, and this is represented as a ready state. On the other hand, if one or more parents are not completed, the node is in a not-ready state. Whenever the graph engine receives the completed nodes from the computing engines or the communication engine, it updates the state of the completed node's child nodes and pushes any child node that becomes ready to the workload queue, as depicted in the left of Figure 6. This update process is performed atomically to prevent any race conditions. Also, this data dependency update is performed asynchronously by a data exchange between the graph management engine and only a single communication or computation engine, minimizing the communication overhead within a single device.

Workload queue: The purpose of the workload queue is to act as a barrier between the nodes that are ready for execution and those that are not. This ensures that computing engines fetch only the ready nodes from the tile-based DNN graph, without having to consider the data dependency between other nodes. This approach ensures that new computations are readily available to be dynamically scheduled to the resources at any point of system execution, enabling the maximal utilization of the given computation resources.

Figure 6: The overall execution workflows of Asynchronous Execution Engines.

3.4.2 Computing & Communication Engines. All computing and communication engines operate concurrently and asynchronously without any synchronization among them. Each computing engine continuously fetches the ready nodes from the workload queue in parallel; resources are saturated if a larger number of ready nodes than the number of parallel execution engines are prepared in the workload queue. To maximize concurrency, the communication engine also operates asynchronously without synchronization with the computing engines. It continuously transmits the completed nodes in the send queue to the target device (e.g., a cloud server or a mobile device). Whenever it receives completed nodes from the other devices, it also returns the nodes to the graph management engine.

3.5 Dynamic Scheduler (DS)
Our concurrent approach makes finding the optimal offloading decisions non-trivial, as the completion time cannot be modeled by simply adding the profiled computing times and communication times. Furthermore, the fine-grained tile-wise computation graph makes the problem more complex. With the tile-based partitioning technique, scheduling the tiles is interpreted as a complex DAG scheduling problem, which is a well-known NP-complete problem [31]. Static DAG scheduling approaches [28, 31] are available for adoption, yet their effectiveness diminishes when deployed in dynamic environments. For instance, unexpected network interference or an extreme burst of requests on the server reduces the efficacy of the statically derived solution. Therefore, we suggest a dynamic offloading decision algorithm that dynamically schedules the nodes at runtime based on the estimated completion time of each node at the moment.

3.5.1 Task Model. We define the partitioned computation graph as a DAG $G = \langle V, E \rangle$, where the vertex set $V$ represents the set of the nodes. The edge set $E$ contains the directed edges $e_{i,j} \in E$ for the data dependency between the nodes $v_i$ and $v_j$. A node $v_i$, which serves as the starting point of an edge, is referred to as the parent node, and a node $v_j$ that serves as the endpoint of the edge is referred to as the child node. A node without any child nodes is called an exit node $v_{exit}$. A child node is dependent on its parent nodes and can only be executed when all the output data of the parent nodes are ready. Each node $v_i$ has a computation cost (FLOPs) of $c_i$ and an output tile data size of $d_i$.

3.5.2 Dynamic Offloading Decision. The goal of our dynamic offloading decision on the mobile device is to find the optimal offloading policy $O = \{o_1, ..., o_N\}$ that minimizes the maximum completion time of the exit nodes $v_{exit}$. Note that $o_i$ denotes the offloading decision of a node $v_i$, where 1 represents server offloading and 0 represents local computation. Here, $CT(v_i)$ represents the completion time of a node $v_i$, measured from the beginning of the inference.
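Using only the quantities defined above, the scheduling objective can be stated compactly as the following optimization problem. This is a formalization sketch: the set notation $V_{exit}$ (the set of exit nodes) and the dependency constraint below are our shorthand rather than equations taken verbatim from the paper.

\[
\min_{O=\{o_1,\dots,o_N\},\; o_i \in \{0,1\}} \;\; \max_{v_i \in V_{exit}} CT(v_i)
\quad \text{s.t.} \quad CT(v_j) \ge CT(v_i) \;\; \forall\, e_{i,j} \in E,
\]

where the constraint simply restates that a child node can complete only after all of its parents, and $CT(\cdot)$ depends on where each node is computed (mobile or server) and on the transmission times induced by the choice of $O$.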
Figure 7: An example of the dynamic offloading decision with (a) a sample DAG. It is performed by calculating $ECT(v_i, \text{offloaded})$ and $ECT(v_i, \text{server})$. (Panels: (a) sample DAG, (b) timeline of the scheduled DAG, (c) $ECT(v_8, \text{offloaded})$, (d) $ECT(v_8, \text{server})$.)

Basic idea: The main idea of our approach is to first send all the input nodes to the server and dynamically decide to compute and send the outputs of the subsequent nodes at runtime. Sharing the input data is very cheap, as input data are often much smaller than intermediate DNN tensors. Through this, we guarantee that the end-to-end execution latency of our coactive inference is at worst the minimum value between a full server offloading and a full mobile on-device computation. Our system is designed in this way to guarantee minimum performance, even with unforeseen network or server degradation during the execution. If incomplete input data were to be sent, the powerful server computations may wait for the transmission of intermediate nodes from the mobile, which may not be latency-guaranteed in wireless or user mobility scenarios. Furthermore, it always ensures that the worst-case execution time is the time taken for on-device inference, if the computation gain from the powerful server is lost by an offloading service disconnection.

Dynamic offloading decision: We explain how the dynamic offloading decision operates by using the example DAG in Figure 7. As explained before, all input nodes are transmitted and duplicate computation between server and mobile is allowed. However, to minimize unnecessary duplicated computation, the mobile and server start their computation from a ready node pair with the largest diameter in between, iterating in opposite directions. For example, as illustrated in Figure 7(b), the mobile transmits $v_3$ while computing $v_4$. The server starts computation of $v_7$ as soon as it receives $v_3$. Then, the offloading decision of each node is performed using a greedy approach on the mobile device, starting from $v_4$. This decision is made using the estimated completion times of the node on the server when it is either 1) offloaded from the mobile or 2) computed solely on the server, denoted as $ECT(v_i, \text{offloaded})$ and $ECT(v_i, \text{server})$, respectively. A node $v_i$ is only offloaded (i.e., $o_i = 1$) if $ECT(v_i, \text{offloaded}) < ECT(v_i, \text{server})$, indicating that computation of the node $v_i$ on the server alone will be delayed when compared to the case when it is assisted by the mobile device over the network. For example, in Figure 7(b), the node $v_8$ is offloaded based on the estimations from the mobile. The server is then allowed to skip the computation of $v_8$ by using the computation result from the mobile device. This mobile-assisted greedy approach enables the maximal utilization of the powerful server resources, even if the estimation is inaccurate. Nevertheless, we also suggest a profile-based estimation approach for an accurate estimation of these completion times, detailed in the subsequent paragraphs.

Estimating the completion times: Figure 7(c) and Figure 7(d) show the examples of $ECT(v_8, \text{offloaded})$ and $ECT(v_8, \text{server})$. $ECT(v_8, \text{offloaded})$ can be decomposed into the completion time of the node $v_8$ ($CT(v_8)$), the queuing latency (zero in the example of Figure 7(c)), and the transmission time of the node $v_8$. The queuing latency can be obtained from the total data size of the nodes in the send queue of the communication engine (Section 3.4) and the profiled bandwidth. The transmission time of the node $v_8$ is obtained by $\frac{d_8}{BW} + L$, where $L$ is the network latency between the server and the mobile. Note that $BW$ is estimated by calculating the moving average of the division between the size of transmitted data and the time taken for the transmission during the previous inferences. $ECT(v_8, \text{server})$ is decomposed into the transmission time of all input nodes, the computation time of the ancestors of $v_8$, and the delayed time $\alpha_k$ of the mobile device $k$ caused by the resource contention from the other computation graphs, as in Figure 7(d). The transmission time of all input nodes is calculated by dividing the data size of the nodes by the profiled network bandwidth. Then, the computation time of the ancestors of $v_8$ is estimated by dividing the summation of the computation time of the ancestors of $v_8$ by the number of computation resources in the server. Meanwhile, estimating $\alpha_k$ in the mobile device $k$ is non-trivial, as the mobile devices cannot know the server's computational load. To estimate $\alpha_k$, the server informs each mobile device of the number of its computed nodes within the predefined time interval $T_s$ among the multiple computation graphs. Then, $\alpha_k$ is estimated by $\frac{T_s \cdot n_k}{\Sigma n}$, with $n_k$ being the number of computed tiles of the computation graph of the mobile device $k$ and $\Sigma n$ being the summation of the computed tiles of all computation graphs in the server. This allows the mobile devices to find the average time taken in the server for the scheduling of its node among multiple competing computations.

4 IMPLEMENTATION
As current DNN inference frameworks rely on layer-wise expression and do not support the execution of tile-wise computational graphs, we implement CoActo from scratch using approximately 21,000 lines of C code, aimed at CPUs.

Tile-based Partitioner: We implement custom C structs for both tiles and the tile-wise computation graph. The tile object contains the variables associated with the tiles, such as the pointers of the input and output data memory addresses. The tile-wise computation graph struct generated by TP then holds the pointer list of the tiles in the graph. As tiles are only references to tensor data, the memory overhead for each tile object is around 100 bytes. As DNNs are usually compiled into fine-grained DNN graphs with a few hundred to a few thousand nodes each, the memory overhead for a tile-wise graph is in the range of a few hundred kilobytes on average. To generate the tile-wise graph, TP parses the Darknet [25] cfg and weight files and creates the tile structure by parsing and analyzing the DNN structures layer by layer. After that, it locates the child tiles of each tile and graphs them by storing the child tiles' pointer references in the tile struct.
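As a concrete illustration of the kind of data structures this paragraph describes, the C sketch below declares a tile descriptor and a graph struct that would be consistent with the stated per-tile footprint. The field and type names (coacto_tile, coacto_graph, and so on) are our own assumptions for illustration, not the actual definitions from the CoActo source.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical tile descriptor (illustrative only). A tile references a
 * submatrix of the backing tensors and carries the metadata needed for
 * scheduling: dimensions, strides, dependencies, cost, and state. */
typedef enum { NODE_NOT_READY, NODE_READY, NODE_COMPLETED } node_state_t;

typedef struct coacto_tile {
    float   *input;                /* pointer into the parent layer's tensor  */
    float   *output;               /* pointer into this layer's output tensor */
    int32_t  rows, cols;           /* submatrix dimensions                    */
    int32_t  row_stride;           /* elements per row in the backing tensor  */
    int64_t  flops;                /* computation cost c_i                    */
    int64_t  data_bytes;           /* output data size d_i                    */
    node_state_t state;            /* not-ready / ready / completed           */
    int32_t  num_parents;          /* remaining uncompleted parents           */
    int32_t  num_children;
    struct coacto_tile **children; /* dependency edges to child tiles         */
    uint8_t  offload;              /* offloading decision o_i (0 or 1)        */
} coacto_tile;

typedef struct {
    int           num_tiles;
    coacto_tile **tiles;           /* pointer list of all tiles in the graph  */
} coacto_graph;

int main(void) {
    /* On a typical 64-bit build this prints a value in the same ballpark as
     * the ~100-byte per-tile overhead reported above. */
    printf("sizeof(coacto_tile) = %zu bytes\n", sizeof(coacto_tile));
    return 0;
}

Because each tile holds only pointers into the original tensors plus a small amount of metadata, a graph of a few thousand such nodes stays within the few-hundred-kilobyte range reported above.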
[…]

Figure 8: End-to-end latency using Jetson AGX Xavier with different numbers of available server cores, under a 100Mbps WiFi network. (Panels (a)–(h): VGG16, ResNet50, Bert-base, and YOLOv3 at batch sizes 1 and 8. Legend: Cloud-only, SPINN, CoActo, FL-offloading, Transmission, On-device. Axes: latency (ms) vs. available number of server cores.)

Figure 9: End-to-end latency using Jetson AGX Xavier under different network bandwidths and 8 cores available in the server. (Panels (a)–(h): VGG16, ResNet50, Bert-base, and YOLOv3 at batch sizes 1 and 8. Axes: latency (ms) vs. bandwidth (Mbps).)

[…] the different channel dynamics of wireless mobile networks. Figure 9 shows the end-to-end latency of each baseline using Jetson in different network bandwidth settings. Similar to the situation with a computing bottleneck, the sequential methods of Cloud-only and SPINN result in an increase in end-to-end latency under lower network bandwidth, caused by longer transmission times exacerbated by a decrease in network bandwidth. In contrast, CoActo and FL-offloading hide the transmission time within the computing time, thereby reducing latency. In the case of models such as ResNet50 and YOLOv3, when network bandwidth decreases, the […]
Figure 12: End-to-end latency of CoActo with different numbers of tiles per layer at 100 Mbps and 64 cores in the server. Note that the batch size is 1 for all the tested DNNs. (Panels: (a) VGG16, (b) ResNet50, (c) BERT-base, (d) YOLOv3; each panel breaks latency into computation and transmission against the number of tiles per layer.)

[…] and server setting. In theory, as the number of tiles increases (i.e., becomes more fine-grained), more opportunities arise for concurrent computation and communication, allowing for a decrease in the end-to-end latency. This expectation can be validated using VGG16 and YOLOv3 as in Figure 12. However, BERT-base and ResNet50 surprisingly exhibit increased latency after 100 and 50 tiles, respectively. The intricate layered structure of BERT-base and ResNet50, such as Transformers and residual connections, prevents the creation of long concurrent computational paths across multiple layers, unlike the simpler VGG16 and YOLOv3 structures. This reduces the effectiveness of concurrency while increasing the overhead from smaller tiles. This suggests the need for a thorough design of the partitioning strategy for concurrency over various DNN models. To achieve this, tile configurations of the aforementioned evaluations are automatically optimized during the granularity adjustment steps of TP.

6 DISCUSSION
Limitations of current design: While CoActo's fine-grained and flexible design provides a novel opportunity for latency reduction in inference offloading, its current design has several limitations. Firstly, it only supports CPUs, which restricts its ability to utilize the powerful computational resources of DNN accelerators like GPUs or TPUs, particularly for server-side resources. However, with the implementation of tile-based computation kernels on those processors, CoActo is able to easily support such computation resources, as the contributions of CoActo focus mainly on efficient parallelism and scheduling in mobile offloading scenarios, which is orthogonal to the choice of computation accelerators. Moreover, as the use of a computation accelerator adds extra communication overheads, such as transferring tile data from global memory to the dedicated accelerator memory, […] be further applied between the global memory, network interface, and the accelerator memory to further increase the system utilization.

Secondly, as the benefit of CoActo is ideally maximized when the transmission and computation can be perfectly interleaved, CoActo's increase in resource utilization from interleaving concurrent executions may be limited in situations where there is an imbalance between computation and network resources. When server resources are plentiful but mobile network bandwidth is limited, the performance gain is still bounded by the network bandwidth, even if computation and communication are perfectly pipelined. This highlights the need for additional research into the adoption of other scheduling methods for mobile offloading, such as dynamic tile batching or prioritized scheduling.

CoActo in modern large models: Large models have tens to hundreds of billions of parameters, resulting in high computational and memory costs for inference. Given the limited memory and computational power of mobile SoCs, it is challenging to achieve satisfactory service QoE with on-device inference. In such cases, offloading inference to a powerful cloud server is often the only realistic solution. Unlike existing system designs, CoActo's fine-grained tile-based DNN expression greatly increases the number of independent computation paths within the graph. This leads to more opportunities for flexible offloading decisions that help the system adapt to the dynamic computing and network resources of mobile offloading environments, which existing systems cannot achieve. In addition, as the increased depth and width of such large models allow an even greater number of independent paths, we expect CoActo to have much-improved adaptability to dynamic mobile offloading environments compared to existing solutions.

Early decision: Our tile-wise fine-grained DNN expression has another potential for DNN inference offloading acceleration: an early decision that can terminate the offloading by predicting the inference output with only the partially received input data, when it estimates that enough partial data has been received to predict the results, which is not feasible in current layer-wise expressions. We suggest further increasing the use of powerful server resources during input transmission by allowing the server to compute the inference output using only the partial input data received. By using a confidence level of the intermediate output, the server may stop the offloading and return the output with even lower latency. This approach allows for reducing communication and computing time simultaneously by halting extraneous tile transmission, with a trade-off of accuracy. We leave the early decision method that finds an optimal point in the trade-off between accuracy and end-to-end latency for future work.

Multi-machine inference: Existing layer-wise expression-based approaches only allow for multi-machine cooperative inference on certain models, like Inception-v3 [30], where multiple paths are available. Furthermore, parallelism is still restricted due to synchronization needs in several layers, and therefore extending their runtime execution system to multi-machine is non-trivial. In contrast, our coactive approach can facilitate cooperative inference by generating many independent computational paths through tile-based fine-grained expression and computing independent paths in parallel on multiple machines by extending our runtime execution […]
REFERENCES
[…]
[20] Park, J., Bin, K., and Lee, K. mGEMM: Low-latency convolution with minimal memory overhead optimized for mobile devices. In Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services (2022), pp. 222–234.
[21] Park, J., Bin, K., Park, G., Ha, S., and Lee, K. Aspen: Breaking operator barriers for efficient parallelization of deep neural networks. Advances in Neural Information Processing Systems 36 (2024).
[22] Park, S., Lee, J., Kim, J., Lee, J., Ha, S., and Lee, K. ExLL: An extremely low-latency congestion control for mobile cellular networks. In Proceedings of the 14th International Conference on emerging Networking EXperiments and Technologies (2018), pp. 307–319.
[23] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019), 8026–8037.
[24] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. In International Conference on Machine Learning (2021), PMLR, pp. 8821–8831.
[25] Redmon, J. Darknet: Open source neural networks in C. http://pjreddie.com/darknet/, 2013–2016.
[26] Redmon, J., and Farhadi, A. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
[27] Shi, Y., Yang, Z., Xue, J., Ma, L., Xia, Y., Miao, Z., Guo, Y., Yang, F., and Zhou, L. Welder: Scheduling deep learning memory access via tile-graph. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23) (2023), pp. 701–718.
[28] Sinnen, O., and Sousa, L. A. Communication contention in task scheduling. IEEE Transactions on Parallel and Distributed Systems 16, 6 (2005), 503–515.
[29] Smith, T. M., Van De Geijn, R., Smelyanskiy, M., Hammond, J. R., and Van Zee, F. G. Anatomy of high-performance many-threaded matrix multiplication. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (2014), IEEE, pp. 1049–1059.
[30] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 2818–2826.
[31] Topcuoglu, H., Hariri, S., and Wu, M.-Y. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems 13, 3 (2002), 260–274.
[32] Wang, M., Ding, S., Cao, T., Liu, Y., and Xu, F. AsyMo: Scalable and efficient deep-learning inference on asymmetric mobile CPUs. In Proceedings of the 27th Annual International Conference on Mobile Computing and Networking (2021), pp. 215–228.
[33] Weng, Q., Xiao, W., Yu, Y., Wang, W., Wang, C., He, J., Li, Y., Zhang, L., Lin, W., and Ding, Y. MLaaS in the wild: Workload analysis and scheduling in large-scale heterogeneous GPU clusters. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22) (2022), pp. 945–960.
[34] Zaki, Y., Pötsch, T., Chen, J., Subramanian, L., and Görg, C. Adaptive congestion control for unpredictable cellular networks. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication (2015), pp. 509–522.
[35] Zeng, L., Chen, X., Zhou, Z., Yang, L., and Zhang, J. CoEdge: Cooperative DNN inference with adaptive workload partitioning over heterogeneous edge devices. IEEE/ACM Transactions on Networking 29, 2 (2020), 595–608.
[36] Zhang, H., Tang, Y., Khandelwal, A., and Stoica, I. SHEPHERD: Serving DNNs in the wild. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23) (2023), pp. 787–808.
[37] Zhang, L. L., Han, S., Wei, J., Zheng, N., Cao, T., Yang, Y., and Liu, Y. nn-Meter: Towards accurate latency prediction of deep-learning model inference on diverse edge devices. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services (2021), pp. 81–93.
[38] Zhang, S., Zhang, S., Qian, Z., Wu, J., Jin, Y., and Lu, S. DeepSlicing: Collaborative and adaptive CNN inference with low latency. IEEE Transactions on Parallel and Distributed Systems 32, 9 (2021), 2175–2187.
[39] Zhao, Z., Barijough, K. M., and Gerstlauer, A. DeepThings: Distributed adaptive deep learning inference on resource-constrained IoT edge clusters. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 11 (2018), 2348–2359.
[40] Zhou, H., Li, M., Wang, N., Min, G., and Wu, J. Accelerating deep learning inference via model parallelism and partial computation offloading. IEEE Transactions on Parallel and Distributed Systems 34, 2 (2022), 475–488.
[41] Zhu, H., Wu, R., Diao, Y., Ke, S., Li, H., Zhang, C., Xue, J., Ma, L., Xia, Y., Cui, W., et al. ROLLER: Fast and efficient tensor compilation for deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) (2022), pp. 233–248.