
CoActo: CoActive Neural Network Inference Offloading with Fine-grained and Concurrent Execution

Kyungmin Bin (Seoul National University), Jongseok Park (Seoul National University), Chanjeong Park (Seoul National University), Seyeon Kim (University of Colorado Boulder), Kyunghan Lee (Seoul National University)
ABSTRACT
Collaborative inference is the current state-of-the-art solution for mobile-server neural network inference offloading. However, we find that existing collaborative inference solutions only focus on partitioning the DNN computation, which is only a small part of achieving an efficient DNN offloading system. What ultimately determines the performance of DNN offloading is how the execution system utilizes the characteristics of the given DNN offloading task on the mobile, network, and server resources of the offloading environment. To this end, we design CoActo, a DNN execution system built from the ground up for mobile-server inference offloading. Our key design philosophy is Coactive Inference Offloading, which is a new, improved concept of DNN offloading that adds two properties, 1) fine-grained expression of DNNs and 2) concurrency of runtime resources, to existing collaborative inference. In CoActo, system components go beyond simple model splitting of existing approaches and operate more proactively to achieve the coactive execution of inference workloads. CoActo dynamically schedules concurrent interleaving of the mobile, server, and network operations to actively increase resource utilization, enabling lower end-to-end latency. We implement CoActo for various mobile devices and server environments and evaluate our system with distinct environment settings and DNN models. The experimental results show that our system achieves up to 2.1 times speed-up compared to the state-of-the-art collaborative inference solutions.

Figure 1: Illustration of (a) conventional collaborative DNN inference offloading and (b) the coactive DNN inference offloading of CoActo. The proposed coactive approach enables concurrent execution of computation and communications, enabling a novel opportunity of latency reduction in inference offloading.

CCS CONCEPTS
• Computing methodologies → Parallel algorithms.

KEYWORDS
Convolutional Neural Networks; Parallel Computing Algorithms

ACM Reference Format:
Kyungmin Bin, Jongseok Park, Chanjeong Park, Seyeon Kim, and Kyunghan Lee. 2024. CoActo: CoActive Neural Network Inference Offloading with Fine-grained and Concurrent Execution. In The 22nd Annual International Conference on Mobile Systems, Applications and Services (MOBISYS '24), June 3–7, 2024, Minato-ku, Tokyo, Japan. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3643832.3661885

This work is licensed under a Creative Commons Attribution International 4.0 License. MOBISYS '24, June 3–7, 2024, Minato-ku, Tokyo, Japan. © 2024 Copyright is held by the owner/author(s). ACM ISBN 979-8-4007-0581-6/24/06. https://doi.org/10.1145/3643832.3661885

1 INTRODUCTION
With the rapid development of neural networks, AI-based mobile applications [19, 24] are now providing high-quality services that are comparable to those of human experts. The user-centric nature of these mobile applications often puts emphasis on user interaction, such as understanding voice commands, analyzing visual input, and interpreting human actions, which enable user experiences to be more relevant to their requests. However, user interactions can also significantly deteriorate user experience, when a high latency of the response introduces a delay to the operation of the application. As such, low-latency responses to user requests are a crucial factor for a good user experience of AI-based mobile services. To achieve this, these services generally strive for a latency service-level objective (SLO) [7, 14] instead of a throughput objective, usually on the order of milliseconds.


However, with the ever-increasing complexity of modern AI models and services, meeting the latency SLO solely with local processing in resource-constrained mobile devices has proven to be extremely challenging. Even with the help of modern mobile processors, executing a DNN solely on a mobile device consumes a considerable amount of time, power, and memory [20, 37], and the execution becomes near-impossible with large-scale DNNs, such as generative NLPs. Instead, many services opted to offload the DNN inference to remote high-performance cloud servers, by transmitting locally collected user data to the servers. For many mobile AI services with low latency SLOs, this mobile-server inference offloading has become the dominating, if not only, solution. As such, the significance of mobile-server inference offloading naturally led to a series of works that aim to expand upon its design.

One of the most significant improvements over the traditional mobile-server inference offloading is achieved by mobile-server collaborative inference [15]. Unlike the traditional offloading approach, collaborative inference actively leverages the increasing AI processing capabilities of modern mobile devices, by splitting the DNN computation in a way that balances the computation between the mobile and server computation resources. By carefully profiling the characteristics of the three main resources of inference offloading, i.e., mobile, server, and network resources, collaborative inference approaches search and generate a partitioning and scheduling scheme that achieves the minimum combined latency of both the computation and the communications within its solution space, enabling much improved end-to-end latency in inference offloading. As such, many collaborative inference approaches [6, 10, 11, 13, 16, 17] have been proposed in recent years, each providing different solutions to finding the optimal partitioning and scheduling scheme in mobile-server collaborative inference.

Despite these efforts, we find existing collaborative inferences to be far from complete. Existing works focus on the question of how to split the DNN computation between the mobile and server resources for efficient DNN offloading, but this is only a part of a bigger question. What ultimately determines the end-to-end latency is the design of the execution system, of how the system can efficiently support the characteristics of the given workload under the highly stochastic characteristics of the mobile offloading environment, such as channel dynamics in mobile networks [22, 34] and bursty requests in cloud servers [33, 36]. For DNN offloading, this not only includes the partitioning and scheduling of the workload, but also the modeling of the workload, execution algorithm, dynamic load-balancing, possibility of multi-tenant execution, and many more. Therefore, to find a more complete solution to collaborative inference, we ask the more fundamental question of "How should a DNN execution system be designed for efficient mobile DNN offloading?"

We identify two key properties that a DNN execution system must have to realize efficient DNN inferences in a mobile-server offloading environment: 1) a fine-grained expression of DNNs, and 2) flexibility of the system resource utilization. In current DNN computation frameworks, DNNs are often expressed in units of layers. However, we find that layer granularity is often too large for DNN offloading scenarios, and does not supply enough parallelism to efficiently utilize all available resources. Existing solutions attempt to alleviate this by partitioning the layers [35, 39, 40], but we find expressing DNNs in fine-grained units of tiles, instead of layers, to be the more fundamental solution that can be applied to any DNN offloading problem, regardless of the DNN and device used. Also, considering the dynamic nature of DNN offloading, the runtime components of the system must be flexible enough to support dynamic changes in the execution, such as changes in available computation resources, network conditions, and the presence of competing inference offloads. To enable such dynamic behaviors, we find that the system components for computation and networking should actively operate in parallel to dynamically allocate the resources to DNN offloading workflows that require attention depending on the system status.

To this end, we design CoActo, a DNN execution system built from the ground up for mobile-server inference offloading. The key design philosophy behind our system is Coactive DNN Inference Offloading, which is a new, improved concept of DNN offloading that adds fine-grained expression of DNNs and concurrency of runtime resources to the existing collaborative inference. In the coactive DNN inference offloading of CoActo, the DNN workload is expressed as a fine-grained tile-based dataflow graph. Using this fine-grained DNN graph, the computation and network resources of the given offloading environment are dynamically assigned and utilized to maximally leverage not only the concurrent activation of multiple offloaded workloads but also the concurrent activation of computation and communications within a single offloaded workload, allowing higher resource utilization and lower end-to-end latency in DNN offloading. For instance, in Figure 1(a), conventional collaborative DNN inference offloading simply searches for the best DNN partitioning point to achieve the lowest combined latency of the mobile computation, data communications, and server computation. In contrast, as depicted in Figure 1(b), the coactive DNN inference offloading of CoActo can dynamically schedule concurrent interleaving of the mobile, server, and network operations on a fine-grained expression of DNNs, to actively increase the utilization of the resources in the offloading environment and enable lower end-to-end latency.

We find three design challenges to enable such a novel DNN execution system: 1) devising a general model partitioner that expresses an arbitrary layer-wise DNN graph as a tile-wise DNN graph, 2) designing an execution system that allows flexible allocation and utilization of system resources, and 3) designing an efficient scheduler that can react to the dynamic changes in the runtime environment. We propose three corresponding design concepts for the system design of CoActo, each overcoming one of the three challenges mentioned above. We implement CoActo for various mobile devices and server environments and evaluate CoActo in various network environments and DNN models, including recent Transformer-based DNN models. The experimental results show that our framework achieves up to 2.1 times the speed-up compared to the state-of-the-art conventional collaborative inference frameworks.

2 BACKGROUND AND MOTIVATION

2.1 Collaborative Inference
In this section, we explain the concepts and limitations of two representative approaches in collaborative inference: split computing and fused-layer offloading.


Figure 2: Illustrations of two collaborative inference approaches, (a) split computing and (b) fused-layer offloading.

Split computing: This technique splits the DNN model into two submodels at the layer level, as in Figure 2(a). In this approach, the mobile device computes the earlier DNN layers and then transmits the intermediate output data to the server. Subsequently, the server computes the remaining layers after receiving the whole intermediate output data. In this scheme, the key control knob is the split point, which determines the ratio between mobile and server computation and the communication time in between, as each layer's intermediate output data size and computing cost are all different from one another. Many existing studies [6, 10, 13, 15, 17] have suggested solutions for finding the optimal split point in this partitioning scheme by profiling the computing and network resources of the given offloading environment. However, this approach has a critical flaw, as layer-wise data dependency necessitates two of the server, mobile, or network resources to idle during the execution of a resource. As a result, this sequential execution approach suffers from under-utilization of the available resources, leading to only a limited amount of latency improvement.

Fused-layer (FL) offloading: The key idea of this technique is to fuse multiple layers [2] by exploiting the spatial locality of layers, such as convolution and pooling layers. It allows the division of the neural network model into several submodels with fused layers, as shown in Figure 2(b), each with zero data dependencies to one another. This independent nature of the submodels allows the computation of each submodel to be executed without any synchronization to the computation of other submodels, enabling the available computation resources to be utilized without any idling. Many works have leveraged this technique for distributed inference [35, 38, 39] in edge devices such as IoT clusters, and a recent work [40] has demonstrated the potential of this scheme in collaborative inference by applying model parallelism and partial offloading to the mobile-server offloading environment. Unfortunately, this approach suffers from limited scalability and high computation overhead, as dividing a DNN into independent submodels inevitably introduces duplication of computation between the submodels. As the DNN becomes more complex, and as the number of submodels increases, the overlapping computational regions (the gray areas in Figure 2(b)) between the submodels expand, resulting in an exponential increase in the total computational cost. As a result, the efficacy of this method drops significantly as the DNN model gets more complex and the concurrency of the method increases, resulting in reduced benefits from collaborative inference.

2.2 Tiling for Collaborative Inference
The fundamental limitation of both approaches is that they use layers to express the DNN computation graph. Layers are simply too large of a unit for a DNN offloading environment. The large computation granularity significantly limits the number of independent computations in the workload, which leads to restricted use of parallelism between resources, or forced independence through duplicate computations, as seen in existing collaborative inference approaches. Furthermore, layer-wise expression also makes it very challenging to design a flexible system that can adapt to dynamic changes in environmental characteristics, due to the large amount of processing time required for each workload unit. Therefore, a fine-grained expression of DNN computation is necessary to provide a higher number of independent computations and better dynamic capabilities at runtime.

In this regard, we investigate the tiling technique, which is used in matrix multiplication to increase the parallelism of the operation [5, 29]. This technique decomposes a single matrix multiplication into several sub-matrix multiplications, where each sub-matrix computation is referred to as a tile. This tile-wise partitioning allows parallel computing resources to access and compute each tile concurrently, leading to greatly increased parallelism. We find this concept of tiles to be the ideal unit for fine-grained expression of DNN computation, as many DNN computations are already computed by mapping them to matrix multiplications. Also, the freedom in tile sizes, from a single element to the entire matrix, provides the opportunity to supply a sufficiently fine-grained DNN representation for any DNN computation task. Based on this observation, we conclude that integrating the tiling technique into a DNN offloading system would enable a highly concurrent and flexible collaborative inference system.
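To make the tiling idea concrete, the following is a minimal, deliberately unoptimized C sketch of decomposing the output of a GEMM into independent column tiles, in the spirit of the column-wise partitioning discussed above. The function names and the fixed tile width are illustrative choices for this sketch, not CoActo's actual kernels.

```c
#include <stddef.h>

/* C = A (M x K) * B (K x N), row-major. Each call computes one output
 * column tile [col_start, col_end): tiles share the read-only inputs but
 * write disjoint columns of C, so they can run concurrently without locks. */
static void gemm_column_tile(const float *A, const float *B, float *C,
                             size_t M, size_t K, size_t N,
                             size_t col_start, size_t col_end) {
    for (size_t i = 0; i < M; i++)
        for (size_t j = col_start; j < col_end; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}

/* Illustrative driver: split the N output columns into tiles of width
 * tile_n. A runtime could instead hand each (col_start, col_end) range to
 * a different worker thread, or transmit its inputs to a remote device. */
static void gemm_tiled(const float *A, const float *B, float *C,
                       size_t M, size_t K, size_t N, size_t tile_n) {
    for (size_t c = 0; c < N; c += tile_n) {
        size_t end = (c + tile_n < N) ? c + tile_n : N;
        gemm_column_tile(A, B, C, M, K, N, c, end);
    }
}
```

Because each column tile depends only on the shared inputs, the number of independently schedulable units grows with the number of tiles, which is exactly the parallelism that layer-granularity execution lacks.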


3 CoActo DESIGN

3.1 Design Philosophy & Challenges
3.1.1 Design Philosophy. Traditional approaches for collaborative inference primarily focus on model splitting, which allows the mobile, network, and server resources to achieve a minimum sum of latency. However, for optimal end-to-end performance, it is crucial not only to distribute the workload but also to ensure that runtime execution system components work in unison to make the best use of the available resources under the dynamic environments that exist during DNN offloading. Therefore, we propose the novel concept of Coactive Inference as an answer to the design of a DNN execution system bespoke for DNN offloading. In coactive inference, the execution system components go beyond simple model splitting and behave more proactively, ultimately achieving a coactive execution of the inference workloads. Coactive inference offloading adds the following two design philosophies to existing collaborative inference.

P1) Fine-grained DNN expression: To increase the parallelism and flexibility in the system, the large workload unit of current DNN expression must be decomposed into smaller, fine-grained sub-workloads. Smaller workload size allows faster unit processing times, which are suitable for dynamic scheduling of parallel resources at runtime. In addition, this enables a large increase in parallelism for both the computing and network resources, which is necessary to saturate the given system.

P2) Concurrency of runtime resources: As explained in Section 1, the end-to-end latency of DNN offloading highly relies on how the given mobile, network, and server resources are best utilized at runtime. Concurrent, rather than sequential, use of these resources is necessary to maximize parallel resource utilization. Furthermore, this concurrency provides adaptability in dynamic environments by enabling the handling of multiple workloads simultaneously.

3.1.2 Design Challenges. To create a DNN inference offloading framework that realizes the coactive inference concept, we find three design challenges that must be addressed.

C1) Tile-based expression: As mentioned in Section 2.2, the tiling technique is suitable for the fine-grained expression of DNNs. However, devising a general model partitioner that expresses an arbitrary layer-wise computation graph as a tile-wise computation graph poses many challenges. These challenges include determining the efficient tile dimensions and size for the given environment, automatically parsing and generating the independent data dependency flow graph between the tiles, and designing these processes in a general manner for all DNNs.

C2) Concurrent execution system: Current DNN inference frameworks such as TensorFlow [1] or PyTorch [23] execute computation at the layer level, necessitating synchronization of the resources between each computation or communication operation. Although tile-wise computational graphs allow independent computational paths, this layer-wise execution system restricts the concurrency only to the intra-layer level, leading to a serialized execution. Designing a concurrent execution system that enables overlapping the computation and communications of tiles in the independent computational paths presents a challenging task.

C3) Dynamic scheduling of tiles: The third challenge comes from the design philosophy of existing collaborative inference offloading of balancing the model executions between the mobile, network, and server resources of the given environment. Unlike the existing approach, however, our concurrent execution model does not allow the simple performance modeling of adding the individual profiled latencies of the resources. As such, a novel scheduling solution that allows dynamic adaptation and balancing of a complex fine-grained DNN between the concurrently operating resources must be realized for coactive inference.

3.2 Overview
By addressing the challenges mentioned above, we design CoActo, a novel coactive inference framework that enables fine-grained and concurrent execution for DNN offloading. CoActo comprises three components: the Tile-based Partitioner (TP) (Section 3.3), Asynchronous Execution Engines (AEEs) (Section 3.4), and Dynamic Scheduler (DS) (Section 3.5). Each component has been designed to address one of the three challenges identified in the previous section. In the following sections, we discuss how each component tackles its corresponding challenge.

Figure 3: The overview of CoActo. Tile-based Partitioner (TP) transforms the layer-wise computation graph into a tile-wise computation graph by partitioning a layer into several fine-grained tiles. At runtime, Asynchronous Execution Engines (AEEs) concurrently compute and transmit partitioned tiles. Dynamic Scheduler (DS) makes offload decisions for each tile.

Figure 3 presents an overview of CoActo. TP transforms a layer-wise computational graph into a fine-grained tile-wise computational graph by iteratively dividing the computation of each DNN layer into multiple fine-grained tiles and graphing the data dependencies among the tiles using the characteristics of each layer. The tile-wise graph outputs of TP are saved, to be later used in the CoActo runtime, composed of AEEs and DS. The same CoActo runtime structure is used for both the mobile and the server during DNN offloading. DNN offloading is initiated by the runtime of the mobile device transferring its partitioned tile-wise computation graphs to the server runtime. AEEs are composed of three separate engines, namely the Graph Management Engine, Computing Engine, and Communication Engine, which asynchronously execute tiles in independent computational paths concurrently without any synchronization. DS dynamically decides whether to transmit the data of tiles or not at the mobile device during runtime, using the profiled network condition and the server's current computation load to estimate the completion time of the offloaded tile.

3.3 Tile-based Partitioner (TP)
The objective of TP is to automatically convert an arbitrary layer-wise DNN graph into a fine-grained computation graph of tiles. In doing so, TP combines the insight from Section 2.2 that matrix multiplications are composed of many smaller tiles which can be computed independently with the observation that many DNN layer computation kernels are executed using matrix multiplication. This leads us to tile-level computation graphs that allow fine-grained dependency management and concurrent communication and computation of independent tiles during DNN offloading.

In CoActo, a tile is an abstract unit of scheduling that references a certain submatrix of the original tensor data. As tiles are only references, CoActo is able to construct and execute fine-grained dependency graphs on top of the existing tensor-based DNN execution model while keeping the original DNN tensors and their executions unmodified. That is, the tile abstraction only holds the metadata needed to define and access the tile, such as tile dimension sizes, memory stride per dimension, dependency relationships with other tiles, and the pointer reference to its tensor data, while the memory structure of the original tensor data remains unmodified and contiguous in memory. Thus, by leveraging this tile abstraction, CoActo requires no additional duplication or transformation of tensor data for its tensor-parallel scheduling and execution¹ of the DNN, unlike the fused-layer offloading approaches in Section 2.1.

¹The individual tile executions may require data duplication at the computation kernel level, depending on the implementation of the kernel. For instance, a convolution kernel that combines im2col with matrix multiplication may require duplication of the input data, while loop-based convolution may not require such duplication.
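To illustrate the tile-as-reference idea, here is a minimal C sketch of what such a tile descriptor might contain, assuming a 2-D submatrix view over a row-major tensor. The struct layout and field names are illustrative, not CoActo's actual definitions; only the kinds of metadata (dimensions, strides, dependencies, and a pointer into the unmodified tensor) follow the description above.

```c
#include <stddef.h>

/* A tile is only a view: it points into an existing, contiguous tensor
 * buffer and records where its submatrix starts and how large it is. */
typedef struct tile {
    float        *data;        /* base pointer of the underlying tensor     */
    size_t        row_offset;  /* first row of the submatrix                */
    size_t        col_offset;  /* first column of the submatrix             */
    size_t        rows, cols;  /* dimensions of the submatrix               */
    size_t        row_stride;  /* elements per row of the parent tensor     */
    struct tile **children;    /* tiles that consume this tile's output     */
    size_t        num_children;
    int           num_parents; /* tiles this tile depends on                */
} tile_t;

/* Access element (r, c) of the submatrix without copying the tensor data. */
static inline float tile_at(const tile_t *t, size_t r, size_t c) {
    return t->data[(t->row_offset + r) * t->row_stride + (t->col_offset + c)];
}
```

Because a descriptor like this is only a few pointers and sizes, many thousands of tiles can reference a model's tensors without duplicating any tensor memory.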


Figure 4: An illustration of Tile-based Partitioner (TP) operation. An original DNN is transformed into a partitioned computation graph through 4 steps: (a) parsing the target layer's input and output tensor sizes, (b) converting input and output tensors to matrices, (c) column-wise partitioning, and (d) generating tiles by merging multiple columns, recursively in every layer.

Figure 4 shows the overall process of TP. The first step (Figure 4(a)) is to analyze the size of each layer's output tensor. For CNN tensors, TP flattens the output tensors into a matrix (Figure 4(b)) whose height equals the number of channels and whose width equals the product of the tensor's width and height (i.e., the number of elements per channel); for Transformer-based DNNs, the tensors are already in the form of matrices. After that, the matrices are decomposed into column-wise vectors, and tiles are created by merging the partitioned columns, which are uniformly shaped submatrices (Figure 4(c)). The computation graph is then reconstructed by graphing the tiles into a directed acyclic graph (DAG), using the data dependency relationships between the tiles of input and output layers (Figure 4(d)). In this step, the granularity of the tiles is determined by the number of merged columns. Merging the tiles allows for decreased scheduling overhead and an increase in weight data reuse through decreased computation granularity. TP merges the columns until the number of tiles reaches a predefined number per layer. TP evaluates the performance of the current graph under the given network and computing environment and heuristically adjusts the number of tiles until it achieves satisfactory performance. Through these steps, a fine-grained tile-wise computation graph is formed, which represents the DNN using fine-grained tiles as the graph nodes and the data dependencies among them as the graph edges.

This transformation to a fine-grained dependency graph allows a flexible and concurrent execution in DNN inference offloading. Figure 5 illustrates the difference between the tile-wise computation graph and the conventional layer-wise computation graph for DNN inference offloading. With the layer-wise computation graph, the server can start its computation of the subsequent layers only after the delivery of the whole input data. This results in the powerful server remaining idle during the data transfer, and this underutilization becomes even worse when network conditions degrade. On the contrary, the tile-wise computation graph enables concurrent computation and transmission of independent nodes, as independent tiles are allowed concurrent execution and the processing time of each tile is much faster than the computation of the whole layer. As a result, the completion time is greatly reduced compared to the conventional layer-wise collaborative inference approaches.

Figure 5: Examples of the collaborative inference with the tile-wise computation graph (top) and the layer-wise computation graph (bottom). Panels: (a) computation graphs, (b) timelines.

3.4 Asynchronous Execution Engines (AEEs)
Our approach to achieving a concurrent DNN execution system involves designing Asynchronous Execution Engines (AEEs), which consist of three types of execution engines, namely the Graph Management Engine, Computing Engine, and Communication Engine, where each engine can asynchronously and independently operate without waiting for the others. This asynchronous design maximizes concurrency by executing the tiles in parallel, but also requires sophisticated coordination of the tasks to avoid potential race conditions. We now explain how the engines asynchronously operate to achieve concurrency.

3.4.1 Graph Management Engine. Managing the complex fine-grained computation graph separately in each parallel computing engine requires frequent synchronization of the graph states, resulting in severe serialization of execution. To avoid this, we separate the role of managing the complex computation graphs into the Graph Management Engine, which contains the entire computation graph information and its state. With a separate graph management engine, the computing and communication engines are guaranteed to access data exclusively for every node without requiring any synchronization.

Figure 6: The overall execution workflows of Asynchronous Execution Engines.

Graph management: To manage the data flow of computation graphs, we define three transition states of a node²: completed, ready, and not-ready. If a node is executed or if its outputs are received from another device (mobile or server), it is in a completed state. Note that input nodes, which hold the input data for the inference, are always considered completed. If all parents of a node are in the completed state, the node can be executed, and this is represented as a ready state. On the other hand, if one or more parents are not completed, the node is in a not-ready state. Whenever the graph engine receives completed nodes from the computing engines or the communication engine, it updates the state of the completed node's child nodes and pushes any child node that becomes ready to the workload queue, as depicted on the left of Figure 6. This update process is performed atomically to prevent any race conditions. Also, this data dependency update is performed asynchronously by a data exchange between the graph management engine and only a single communication or computation engine, minimizing the communication overhead within a single device.

²Tile and node terms are used interchangeably in this paper.
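As a concrete illustration of the state transitions described above, the following hedged C sketch shows how a completed node's children might be updated and pushed to the workload queue. The types and the workload_queue_push interface are hypothetical stand-ins for CoActo's internals; only the atomically incremented parent counter mirrors the mechanism described in the text and in Section 4.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef enum { NODE_NOT_READY, NODE_READY, NODE_COMPLETED } node_state_t;

typedef struct node {
    node_state_t  state;
    int           num_parents;
    atomic_int    parents_done;   /* incremented once per completed parent */
    struct node **children;
    size_t        num_children;
} node_t;

/* Placeholder stub so the sketch compiles standalone; a real runtime would
 * enqueue the node so that computing engines can fetch it. */
static void workload_queue_push(node_t *ready_node) { (void)ready_node; }

/* Called by the graph management engine whenever a computing or
 * communication engine returns a completed node. Each child's counter is
 * incremented atomically; a child becomes ready exactly once, when its
 * last parent completes, so no further locking is needed here. */
static void on_node_completed(node_t *done) {
    done->state = NODE_COMPLETED;
    for (size_t i = 0; i < done->num_children; i++) {
        node_t *child = done->children[i];
        int finished = atomic_fetch_add(&child->parents_done, 1) + 1;
        if (finished == child->num_parents) {
            child->state = NODE_READY;
            workload_queue_push(child); /* now visible to computing engines */
        }
    }
}
```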


Workload queue: The purpose of the workload queue is to act as a barrier between the nodes that are ready for execution and those that are not. This ensures that computing engines fetch only the ready nodes from the tile-based DNN graph, without having to consider the data dependencies between other nodes. This approach ensures that new computations are readily available to be dynamically scheduled to the resources at any point of system execution, enabling the maximal utilization of the given computation resources.

3.4.2 Computing & Communication Engines. All computing and communication engines operate concurrently and asynchronously without any synchronization among them. Each computing engine continuously fetches ready nodes from the workload queue in the graph engine whenever it is idle, as in Figure 6. Once it fetches a node, it dispatches the corresponding computation kernel of the node to a parallel computing resource. After finishing the computation of a node, it returns the results to the graph management engine. If the dynamic scheduler decides to transmit the completed node, it also pushes the node to the send queue in the communication engine. This execution cycle is asynchronously performed in each computing engine without synchronization; therefore, the parallel resources are saturated if a larger number of ready nodes than the number of parallel execution engines are prepared in the workload queue. To maximize concurrency, the communication engine also operates asynchronously without synchronization with the computing engines. It continuously transmits the completed nodes in the send queue to the target device (e.g., a cloud server or a mobile device). Whenever it receives completed nodes from the other device, it also returns the nodes to the graph management engine.
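The execution cycle of a computing engine described above can be pictured with the following hedged C sketch of a worker thread. The queue, kernel, and scheduler interfaces (workload_queue_pop, run_kernel, graph_engine_return, scheduler_should_transmit, send_queue_push) are hypothetical names introduced only for this sketch, not CoActo's actual API.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct node node_t;   /* opaque tile/node descriptor */

/* Hypothetical runtime interfaces assumed by this sketch. */
node_t *workload_queue_pop(void);             /* blocks until a ready node   */
void    run_kernel(node_t *n);                /* compute the node's tile     */
void    graph_engine_return(node_t *n);       /* report completion           */
bool    scheduler_should_transmit(node_t *n); /* dynamic offload decision    */
void    send_queue_push(node_t *n);           /* hand off to comm. engine    */

/* One computing engine: repeatedly fetch a ready node, execute its kernel,
 * return the result, and optionally enqueue it for transmission. Several of
 * these threads run in parallel without synchronizing with each other. */
static void *computing_engine_main(void *arg) {
    (void)arg;
    for (;;) {
        node_t *n = workload_queue_pop();
        run_kernel(n);
        graph_engine_return(n);
        if (scheduler_should_transmit(n))
            send_queue_push(n);
    }
    return NULL;
}
```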


3.5 Dynamic Scheduler (DS)
Our concurrent approach makes finding the optimal offloading decisions non-trivial, as the completion time cannot be modeled by simply adding the profiled computing times and communication times. Furthermore, the fine-grained tile-wise computation graph makes the problem more complex. With the tile-based partitioning technique, scheduling the tiles is interpreted as a complex DAG scheduling problem, which is a well-known NP-complete problem [31]. Static DAG scheduling approaches [28, 31] are available for adoption, yet their effectiveness diminishes when deployed in dynamic environments. For instance, unexpected network interference or extremely bursty requests on the server reduce the efficacy of the statically derived solution. Therefore, we suggest a dynamic offloading decision algorithm that dynamically schedules the nodes at runtime based on the estimated completion time of each node at the moment.

3.5.1 Task Model. We define the partitioned computation graph as a DAG $G = \langle V, E \rangle$, where the vertex set $V$ represents the set of nodes. The edge set $E$ contains the directed edges $e_{i,j} \in E$ for the data dependency between the nodes $v_i$ and $v_j$. A node $v_i$, which serves as the starting point of an edge, is referred to as the parent node, and a node $v_j$ that serves as the endpoint of the edge is referred to as the child node. A node without any child nodes is called an exit node $v_{exit}$. A child node is dependent on its parent nodes and can only be executed when all the output data of the parent nodes are ready. Each node $v_i$ has a computation cost (FLOPs) of $c_i$ and an output tile data size of $d_i$.

3.5.2 Dynamic Offloading Decision. The goal of our dynamic offloading decision in the mobile is to find the optimal offloading policy $O = \{o_1, ..., o_N\}$ that minimizes the maximum completion time of the exit nodes $v_{exit}$. Note that $o_i$ denotes the offloading decision of a node $v_i$, where 1 represents server offloading and 0 represents local computation. Here, $CT(v_i)$ represents the completion time of a node $v_i$, measured from the beginning of the inference.

$$\min_{o_i \in O} \; \max \; CT(v_{exit}) \qquad (1)$$

Figure 7: An example of the dynamic offloading decision, performed by calculating $ECT(v_i, \mathrm{offloaded})$ and $ECT(v_i, \mathrm{server})$. Panels: (a) sample DAG, (b) timeline of the scheduled DAG, (c) decomposition of $ECT(v_8, \mathrm{offloaded})$, (d) decomposition of $ECT(v_8, \mathrm{server})$.

Basic idea: The main idea of our approach is to first send all the input nodes to the server and dynamically decide whether to compute and send the outputs of the subsequent nodes at runtime. Sharing the input data is very cheap, as input data are often much smaller than intermediate DNN tensors. Through this, we guarantee that the end-to-end execution latency of our coactive inference is no worse than the minimum of a full server offloading and a full mobile on-device computation. Our system is designed in this way to guarantee minimum performance, even with unforeseen network or server degradation during the execution. If incomplete input data were to be sent, the powerful server computations might wait for the transmission of intermediate nodes from the mobile, which may not be latency-guaranteed in wireless or user-mobility scenarios. Furthermore, it always ensures that the worst-case execution time is the time taken for on-device inference if the computation gain from the powerful server is lost by an offloading service disconnection.

Dynamic offloading decision: We explain how the dynamic offloading decision operates by using the example DAG in Figure 7. As explained before, all input nodes are transmitted and duplicate computation between server and mobile is allowed. However, to minimize unnecessary duplicated computation, the mobile and server start their computation from a ready node pair with the largest diameter in between, iterating in opposite directions. For example, as illustrated in Figure 7(b), the mobile transmits $v_3$ while computing $v_4$. The server starts computation of $v_7$ as soon as it receives $v_3$. Then, the offloading decision of each node is performed using a greedy approach in the mobile device, starting from $v_4$. This decision is made using the estimated completion times of the node on the server when it is either 1) offloaded from the mobile or 2) computed solely on the server, denoted as $ECT(v_i, \mathrm{offloaded})$ and $ECT(v_i, \mathrm{server})$, respectively. A node $v_i$ is only offloaded (i.e., $o_i = 1$) if $ECT(v_i, \mathrm{offloaded}) < ECT(v_i, \mathrm{server})$, indicating that computation of the node $v_i$ on the server alone will be delayed compared to the case when it is assisted by the mobile device over the network. For example, in Figure 7(b), the node $v_8$ is offloaded based on the estimations from the mobile. The server is then allowed to skip the computation of $v_8$ by using the computation result from the mobile device. This mobile-assisted greedy approach enables the maximal utilization of the powerful server resources, even if the estimation is inaccurate. Nevertheless, we also suggest a profile-based estimation approach for an accurate estimation of these completion times, detailed in the subsequent paragraphs.

Estimating the completion times: Figures 7(c) and 7(d) show examples of $ECT(v_8, \mathrm{offloaded})$ and $ECT(v_8, \mathrm{server})$. $ECT(v_8, \mathrm{offloaded})$ can be decomposed into the completion time of the node $v_8$ ($CT(v_8)$), the queuing latency (zero in the example of Figure 7(c)), and the transmission time of the node $v_8$. The queuing latency can be obtained from the total data size of the nodes in the send queue of the communication engine (Section 3.4) and the profiled bandwidth. The transmission time of the node $v_8$ is obtained by $\frac{d_8}{BW} + L$, where $L$ is the network latency between the server and the mobile. Note that $BW$ is estimated by calculating the moving average of the division between the size of transmitted data and the time taken for the transmission during the previous inferences. $ECT(v_8, \mathrm{server})$ is decomposed into the transmission time of all input nodes, the computation time of the ancestors of $v_8$, and the delayed time $\alpha_k$ of the mobile device $k$ caused by the resource contention from the other computation graphs, as in Figure 7(d). The transmission time of all input nodes is calculated by dividing the data size of the nodes by the profiled network bandwidth. Then, the computation time of the ancestors of $v_8$ is estimated by dividing the summation of the computation times of the ancestors of $v_8$ by the number of computation resources in the server. Meanwhile, estimating $\alpha_k$ in the mobile device $k$ is non-trivial, as the mobile devices cannot know the server's computational load. To estimate $\alpha_k$, the server informs each mobile device of its number of computed nodes within a predefined time interval $T_s$ among the multiple computation graphs. Then, $\alpha_k$ is estimated by $\frac{T_s \cdot n_k}{\Sigma n}$, $n_k$ being the number of computed tiles of the computation graph of the mobile device $k$ and $\Sigma n$ being the summation of the computed tiles of all computation graphs in the server. This allows the mobile devices to find the average time taken in the server for the scheduling of its node among multiple competing computations.
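For readability, the estimates described above can be gathered into one place as follows. This is our consolidated restatement in the paper's notation, with convenience symbols that do not appear in the original text: $Q$ for the total data size currently waiting in the mobile send queue, $anc(v_i)$ for the set of ancestors of $v_i$ computed on the server, $t(v_j)$ for the estimated computation time of ancestor $v_j$, and $P_{server}$ for the number of server computation resources.

```latex
ECT(v_i,\mathrm{offloaded}) \;=\; CT(v_i) \;+\; \frac{Q}{BW} \;+\; \frac{d_i}{BW} + L,
\qquad
ECT(v_i,\mathrm{server}) \;=\; \sum_{v_j \in \mathrm{inputs}} \frac{d_j}{BW}
\;+\; \frac{1}{P_{server}} \sum_{v_j \in anc(v_i)} t(v_j) \;+\; \alpha_k,
\qquad
\alpha_k = \frac{T_s \cdot n_k}{\Sigma n}
```

The node $v_i$ is then offloaded ($o_i = 1$) only if $ECT(v_i,\mathrm{offloaded}) < ECT(v_i,\mathrm{server})$.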


4 IMPLEMENTATION
As current DNN inference frameworks rely on layer-wise expression and do not support the execution of tile-wise computational graphs, we implement CoActo from scratch using approximately 21,000 lines of C code, aimed at CPUs.

Tile-based Partitioner: We implement custom C structs for both tiles and the tile-wise computation graph. The tile object contains the variables associated with the tile, such as the pointers to the input and output data memory addresses. The tile-wise computation graph struct generated by TP then holds the pointer list of the tiles in the graph. As tiles are only references to tensor data, the memory overhead for each tile object is around 100 bytes. As DNNs are usually compiled into fine-grained DNN graphs with a few hundred to a few thousand nodes each, the memory overhead for a tile-wise graph is in the range of a few hundred kilobytes on average. To generate the tile-wise graph, TP parses the Darknet [25] cfg and weight files and creates the tile structures by parsing and analyzing the DNN structures layer by layer. After that, it locates the child tiles of each tile and graphs them by storing the child tiles' pointer references in the tile struct.

Graph Management Engine: To prevent frequent memory copies of tile data during runtime, the Graph Management Engine manages tiles by using references to the tiles. It keeps the array of pointers to the tile structs of the given graph, and the dependency update process is performed by traversing and updating the child pointers of each completed tile. To minimize the computational overhead of the child update process, only the number of completed parents of the child is atomically and asynchronously incremented. The workload queue is also implemented by storing the pointer references of tiles, and the fetching process of the computing engines is performed by passing the pointer reference of each tile.

Computing Engine: Each computing engine is implemented to asynchronously perform its function on a separate thread. Each engine continuously fetches a tile from the workload queue whenever it is idle. Then, it executes the tile by computing its computation kernel. We implemented our tile-wise computation kernels of GEMM (Generalized Matrix Multiplication) using the AVX2 x86 SIMD extensions for the server and the NEON ARM SIMD extensions for mobile devices. Using this tile-wise GEMM kernel, we implement other DNN operators, such as Conv2D. Non-GEMM-based kernels, such as add or pooling, are implemented as simple C for-loops.

Communication Engine: We also implement the Communication Engine by creating separate threads for receive and transmit. Before starting an inference, a TCP connection is established between the mobile and server devices. During runtime, the communication of the tiles is performed in full-duplex using the TCP socket. The receive and transmission queues also store pointer references of each tile to prevent frequent memory copies.
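As an illustration of the transmit side described above, the following is a hedged C sketch of a send thread draining a queue of tile references over an established TCP socket. The send_queue_pop interface and the on-wire framing (a tile identifier followed by the payload size) are illustrative assumptions for this sketch, not CoActo's actual protocol.

```c
#include <stdint.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

typedef struct {
    uint32_t    id;    /* tile identifier shared by both runtimes (assumed) */
    uint32_t    size;  /* payload size in bytes                             */
    const void *data;  /* pointer into the original tensor buffer           */
} tile_ref_t;

/* Hypothetical blocking pop from the send queue filled by the scheduler. */
tile_ref_t *send_queue_pop(void);

/* Write the full buffer, handling short writes on the TCP socket. */
static int send_all(int fd, const void *buf, size_t len) {
    const char *p = buf;
    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);
        if (n <= 0) return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Transmit thread: runs concurrently with the receive thread and the
 * computing engines, so completed tiles are sent while others compute. */
static void transmit_loop(int sock_fd) {
    for (;;) {
        tile_ref_t *t = send_queue_pop();
        uint32_t header[2] = { t->id, t->size };
        if (send_all(sock_fd, header, sizeof(header)) < 0) break;
        if (send_all(sock_fd, t->data, t->size) < 0) break;
    }
}
```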


5 EVALUATION
In this section, we evaluate the effectiveness of the flexible and fine-grained design of CoActo in a diverse set of dynamic mobile DNN offloading scenarios. The experiments are conducted under different network and server settings and in a multi-tenant scenario.

5.1 Experimental Setup
We conduct our evaluations using three different off-the-shelf mobile platforms, NVIDIA Jetson AGX Xavier, Raspberry Pi 4, and Pixel 5, and a server with an AMD Threadripper 3990X 64-core processor. The detailed specifications are in Table 1. To validate the efficacy of CoActo in various network conditions, we use WiFi-5GHz (802.11ac) and control the bandwidth and latency by using Traffic Control (TC) [12] on the mobile platforms. To emulate the server's computational loads, we control the number of cores by setting the processor affinity of CoActo with Linux taskset on the server platform. We evaluate CoActo using four popular DNN models, VGG16 [8], ResNet50 [9], YOLOv3 [26], and BERT-base [4], with batch sizes ranging from 1 to 8. To obtain benchmark results, we average the end-to-end latency of 50 runs.

Table 1: Specifications of the tested platforms.
| Platform | CPU | Memory |
| Server | 64-core AMD Threadripper 3990X | 128GB |
| Jetson AGX Xavier | 8-core Carmel ARMv8.2 | 32GB |
| Raspberry Pi 4 | 4-core ARM Cortex-A72 1.8GHz | 8GB |
| Pixel 5 | 1x ARM Cortex-A76 2.4GHz, 1x ARM Cortex-A76 2.2GHz, 6x ARM Cortex-A55 | 8GB |

Baselines: 1) Cloud-only: A status-quo approach that offloads whole DNN inference workloads to the cloud server by transmitting the input data. 2) On-device: An approach that executes local inference on mobile platforms without offloading. 3) SPINN [16]: The state-of-the-art split computing approach in collaborative inference, which vertically partitions a DNN model into two submodels, adjusting the split point based on the network bandwidth and computation time of previous inferences. 4) FL-offloading [39, 40]: A Fused-Layer (FL)-based collaborative inference approach that horizontally partitions a DNN model into multiple submodels by fusing several layers, adjusting the number of layers to fuse and the number of submodels. We find the optimal number of submodels and the number of layers to be fused by brute force offline, based on the profiled network conditions and the computation times of the mobile and the server. For a fair comparison, we implement all these baselines on top of our C-based execution backend of CoActo.

5.2 End-to-End Latency
We start by evaluating the efficacy of CoActo for different settings of server load and network conditions.

Figure 8: End-to-end latency using Jetson AGX Xavier with different numbers of available server cores, under a 100Mbps WiFi network. Panels (a)-(d): VGG16, ResNet50, BERT-base, and YOLOv3 at batch 1; panels (e)-(h): the same models at batch 8.

Effectiveness in computation bottleneck: We first evaluate CoActo under different available computational resources of the server. To simulate this, we configure the number of available server cores. Figure 8 presents the end-to-end latency of each baseline with Jetson under differing amounts of available computational resources of the server. We observe that CoActo outperforms the existing frameworks in all tested server load settings by hiding computation time within the communications time. Since both Cloud-only and SPINN are based on the sequential nature of layer-wise expression and execution, their end-to-end latency shows a linear summation of transmission and computation time. On the other hand, FL-offloading and CoActo operate concurrently, allowing interleaving of computation time and communications time, resulting in lower end-to-end latency. However, FL-offloading is still based on layer-wise expressions and system design, which significantly limits the concurrency of the system. Therefore, CoActo shows much-improved latency over FL-offloading.

Interestingly, there is a higher performance gain when batch sizes are large. The reason is the increase of independent computational paths that provide amplified opportunity for concurrent execution for CoActo. The detailed operations of the baselines and CoActo can be observed in Figure 10, which shows the timelines of each tile in the three baselines and CoActo with the same settings. While the server is idle until the whole input is delivered for Cloud-only and SPINN, FL-offloading and CoActo start computation during communications when only partial data is delivered to compute. As mentioned above, FL-offloading exhibits restricted granularity, and only a limited number of tiles can be executed during transmission. In contrast, in CoActo, through the fine-grained tile-wise computation graphs and the flexible execution system that fully leverages computational resources in the server, the server dynamically starts tile computations on the execution engines whenever corresponding data is received, allowing maximum overlap between communications and computation.

Figure 9: End-to-end latency using Jetson AGX Xavier under different network bandwidths and 8 cores available in the server. Panels (a)-(d): VGG16, ResNet50, BERT-base, and YOLOv3 at batch 1; panels (e)-(h): the same models at batch 8.

Effectiveness in network bottleneck: We also evaluate CoActo in different network bandwidth settings to simulate its efficacy under the different channel dynamics of wireless mobile networks. Figure 9 shows the end-to-end latency of each baseline using Jetson in different network bandwidth settings. Similar to the situation with a computing bottleneck, the sequential methods of Cloud-only and SPINN result in an increase in end-to-end latency under lower network bandwidth, caused by longer transmission times exacerbated by a decrease in network bandwidth. In contrast, CoActo and FL-offloading hide the transmission time within the computing time, thereby reducing latency. In the case of models such as ResNet50 and YOLOv3, when network bandwidth decreases, the offloading approach becomes even slower than On-device. However, SPINN can resolve it through profiling-based approaches, and CoActo can resolve it through duplication-based dynamic offloading decisions.


Figure 10: Timelines of each tile in the three baselines and CoActo with VGG16 batch size 8. The tested network bandwidth is 100Mbps and the available server cores are 8. Panels: (a) Cloud-only, (b) SPINN, (c) FL-offloading, (d) CoActo.

Figure 11: End-to-end latency in a multi-DNN inference scenario in which each device requests a distinct DNN inference query to the shared server, with 100Mbps bandwidth and 64 available server cores. Panels: (a) batch 1, (b) batch 8, for ResNet50 (Jetson Xavier), BERT-base (Raspberry Pi 4), and VGG16 (Pixel 5).

Overall, CoActo achieves up to 2.1x speed-up and 1.3x on average compared to the baselines in all the tested network and server settings. This speed-up is achieved by overlapping computation and communications time through the flexible and concurrent execution of CoActo. While FL-offloading enables concurrent execution, its layer-wise expression limits the concurrency of the system as mentioned earlier. To compare the concurrency of both systems, we measure the amount of computation time overlapped by communication time in comparison to Cloud-only. On average, we observed that FL-offloading hides only 3.2% of computation time compared to Cloud-only, while CoActo hides 49.7% of computation time. Therefore, CoActo demonstrates significantly improved latency with this enhanced concurrency compared to FL-offloading.

We confirm that the coactive inference approach provides greatly improved utilization of the runtime resources when either the computing or the network becomes the performance bottleneck, which is often the case in most computing scenarios. Ideally, the concurrency of computation and communications is maximized in an environment where equivalent computing and network performance is available, as the transmission and computation can be perfectly interleaved. Considering the future mobile network with ultra-low latency and extremely high network bandwidth, and the increasing AI processing power of both cloud and mobile, our coactive approach is well-designed to be the future of DNN offloading.

5.3 Concurrency in Multi-tenant Scenario
In Figure 11, we further evaluate the effectiveness of CoActo's concurrency in a multi-tenant inference scenario in which each mobile device sends a distinct inference query of a DNN model to the shared server. For a fair comparison, we conducted our tests with all available resources (WiFi 100Mbps and 64 server cores). Our coactive inference approach with the fine-grained DNNs outperforms other baselines by concurrently utilizing all runtime resources, including mobile devices, during inference. With the Cloud-only approach, all mobile devices transmit the input data to the shared server, and the server computes all DNNs at the same time. In this scenario, the server suffers excessive computational loads while not efficiently utilizing the computing resources of mobile devices. Therefore, the latency of ResNet50 on Jetson becomes even larger than the On-device latency. While FL-offloading enables concurrent execution, it also shows similar results due to the limited concurrency in its nature. In contrast, SPINN can find a solution for utilizing mobile computing resources. However, due to the lack of centralized control on the server, the Pixel 5 device overestimates the computational load of the server and decides to compute locally, resulting in higher latency than Cloud-only. Unlike the others, by dynamically utilizing resources during runtime, CoActo improves the latency of all offloading executions by concurrently processing DNN queries from different devices, as well as computation and communication from a single offloading execution. In addition, when the server becomes saturated, as illustrated for Jetson in Figure 11(b), the mobile CoActo runtime dynamically identifies this and carries out the inference locally, without the use of complicated centralized scheduling.

5.4 Effectiveness of the Granularity
To evaluate the efficacy of tile granularity in CoActo, we use a static number of tiles per layer during partitioning in TP. The higher the number of tiles per layer, the more fine-grained is the computation graph. Figure 12 demonstrates the end-to-end latency of CoActo with a different number of tiles per layer using the same network

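The local-fallback behavior described above can be viewed as a simple latency comparison made by the mobile runtime at offloading time. The sketch below illustrates one such rule under assumed inputs (estimated local latency, estimated server compute latency, and the server's current queueing delay); it is a simplified stand-in for exposition, not CoActo's actual duplication-based decision logic.

```python
def choose_execution(local_latency_est, server_latency_est,
                     upload_bytes, bandwidth_bps, server_queue_delay):
    """Decide whether a mobile device should offload or run locally.

    All estimates are in seconds; every name here is illustrative and not part
    of CoActo's real interface. The offloading estimate charges input upload
    time plus the server's current queueing delay on top of its compute time.
    """
    transmission = 8 * upload_bytes / bandwidth_bps
    offload_latency = transmission + server_queue_delay + server_latency_est
    return "offload" if offload_latency < local_latency_est else "local"


# A saturated server (long queue) pushes the decision back to the device.
print(choose_execution(local_latency_est=0.80,      # e.g. on-device estimate
                       server_latency_est=0.05,
                       upload_bytes=150_000,
                       bandwidth_bps=100e6,          # 100 Mbps WiFi
                       server_queue_delay=1.20))     # heavily loaded server
# -> local
```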

5.4 Effectiveness of the Granularity

Figure 12: End-to-End latency of CoActo with different numbers of tiles per layer in 100Mbps and 64 cores in the server. Note that the batch size is 1 for all the tested DNNs. (Panels: (a) VGG16, (b) ResNet50, (c) BERT-base, (d) YOLOv3; each panel breaks latency in ms into computation and transmission over 20 to 800 tiles per layer.)

To evaluate the efficacy of tile granularity in CoActo, we use a static number of tiles per layer during partitioning in TP. The higher the number of tiles per layer, the more fine-grained the computation graph becomes. Figure 12 demonstrates the end-to-end latency of CoActo with a different number of tiles per layer using the same network and server setting. In theory, as the number of tiles increases (i.e., becomes more fine-grained), more opportunities arise for concurrent computation and communication, allowing for a decrease in the end-to-end latency. This expectation can be validated using VGG16 and YOLOv3, as shown in Figure 12. However, BERT-base and ResNet50 surprisingly exhibit increased latency after 100 and 50 tiles, respectively. The intricate layered structure of BERT-base and ResNet50, such as Transformer blocks and residual connections, prevents the creation of long concurrent computational paths across multiple layers, unlike the simpler VGG16 and YOLOv3 structures. This reduces the effectiveness of concurrency while increasing the overhead from smaller tiles. This suggests the need for a thorough design of the partitioning strategy for concurrency over various DNN models. To achieve this, the tile configurations of the aforementioned evaluations are automatically optimized during the granularity adjustment steps of TP.
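As a concrete picture of what "tiles per layer" means, the sketch below splits a single fully-connected layer into independent row tiles, each of which can be computed or transmitted as soon as its input is ready; more tiles mean more schedulable units but also more per-tile overhead. This is an illustrative NumPy example under our own naming, not CoActo's TP partitioner.

```python
import numpy as np

def split_layer_into_tiles(weight, inp, tiles_per_layer):
    """Decompose one fully-connected layer (out = weight @ inp) into row tiles.

    Each tile is an independent sub-matrix multiplication that a scheduler
    could compute, transmit, or offload in any order. Illustrative sketch only.
    """
    row_blocks = np.array_split(np.arange(weight.shape[0]), tiles_per_layer)
    # One closure per tile; binding rows as a default argument freezes the slice.
    return [lambda rows=rows: weight[rows, :] @ inp for rows in row_blocks]

rng = np.random.default_rng(0)
W, x = rng.standard_normal((512, 256)), rng.standard_normal((256, 1))

coarse = split_layer_into_tiles(W, x, tiles_per_layer=4)    # few large tiles
fine = split_layer_into_tiles(W, x, tiles_per_layer=128)    # many small tiles

# Either granularity reproduces the full layer output; only the scheduling
# freedom and the per-tile bookkeeping overhead differ.
out = np.concatenate([tile() for tile in fine])
assert np.allclose(out, W @ x)
```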
grained tile-based DNN expression greatly increases the number
6 DISCUSSION

Limitations of current design: While CoActo's fine-grained and flexible design provides a novel opportunity for latency reduction in inference offloading, its current design has several limitations. Firstly, it only supports CPUs, which restricts its ability to utilize the powerful computational resources of DNN accelerators like GPUs or TPUs, particularly for server-side resources. However, with the implementation of tile-based computation kernels on those processors, CoActo is able to easily support such computation resources, as the contributions of CoActo focus mainly on efficient parallelism and scheduling in mobile offloading scenarios, which are orthogonal to the choice of computation accelerator. Moreover, as the use of a computation accelerator adds extra communication overheads, such as transferring tile data from global memory to the dedicated on-chip memory, the asynchronous system design and tile-level interleaving of communication and computation in CoActo may also be applied between the global memory, the network interface, and the accelerator memory to further increase system utilization.

Secondly, as the benefit of CoActo is ideally maximized when the transmission and computation can be perfectly interleaved, CoActo's increase in resource utilization from interleaving concurrent executions may be limited in situations where there is an imbalance between computation and network resources. When server resources are plentiful but mobile network bandwidth is limited, the performance gain is still bounded by the network bandwidth, even if computation and communication are perfectly pipelined, since the end-to-end latency can never fall below the larger of the total transmission time and the total computation time. This highlights the need for additional research into the adoption of other scheduling methods for mobile offloading, such as dynamic tile batching or prioritized scheduling.

CoActo in modern large models: Large models have tens to hundreds of billions of parameters, resulting in high computational and memory costs for inference. Given the limited memory and computational power of mobile SoCs, it is challenging to achieve satisfactory service QoE with on-device inference. In such cases, offloading inference to a powerful cloud server is often the only realistic solution. Unlike existing system designs, CoActo's fine-grained tile-based DNN expression greatly increases the number of independent computation paths within the graph. This leads to more opportunities for flexible offloading decisions that help the system adapt to the dynamic computing and network resources of mobile offloading environments, which existing systems cannot achieve. In addition, as the increased depth and width of such large models allow an even greater number of independent paths, we expect CoActo to have much-improved adaptability to dynamic mobile offloading environments compared to existing solutions.

Early decision: Our tile-wise fine-grained DNN expression has another potential for DNN inference offloading acceleration: an early decision that terminates the offloading by predicting the inference output from only the partially received input data, once it estimates that enough partial data has been received to predict the result, which is not feasible in current layer-wise expressions. We suggest further increasing the use of powerful server resources during input transmission by allowing the server to compute the inference output using only the partial input data received. By using a confidence level of the intermediate output, the server may stop the offloading and return the output with even lower latency. This approach allows for reducing communication and computing time simultaneously by halting extraneous tile transmission, with a trade-off in accuracy. We leave the early decision method that finds an optimal point in the trade-off between accuracy and end-to-end latency for future work.

Multi-machine inference: Existing layer-wise expression-based approaches only allow multi-machine cooperative inference on certain models, like Inception-v3 [30], where multiple paths are available. Furthermore, parallelism is still restricted due to synchronization needs in several layers, and therefore extending their runtime execution systems to multiple machines is non-trivial. In contrast, our coactive approach can facilitate cooperative inference by generating many independent computational paths through tile-based fine-grained expression and computing independent paths in parallel on multiple machines by extending our runtime execution design to multi-machines. It is worth mentioning that this aspect of our approach is left for future work.
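To illustrate the early-decision idea discussed above, a server-side loop could monitor the confidence of the output computed from the tiles received so far and stop the offloading once a threshold is crossed. The sketch below is hypothetical: partial_inference and tile_stream are assumed interfaces invented here for illustration, and no such feature exists in the current CoActo design.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def early_decision_offloading(partial_inference, tile_stream, confidence=0.9):
    """Run inference on progressively more input tiles; stop early when confident.

    partial_inference(tiles) returns class logits computed from the input
    tiles received so far; tile_stream yields tiles in arrival order. Both are
    hypothetical interfaces used only for illustration.
    """
    received, probs = [], None
    for tile in tile_stream:
        received.append(tile)
        probs = softmax(partial_inference(received))
        if probs.max() >= confidence:
            break   # enough evidence: halt the remaining tile transmission
    # Trade-off: fewer received tiles lowers latency but may lower accuracy.
    return (None if probs is None else int(probs.argmax())), len(received)
```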

7 RELATED WORKS

Split computing: To overcome the limitations of the cloud-only approach, Neurosurgeon [15] proposes the concept of mobile-cloud server collaborative inference. This approach balances mobile computing time, communication time, and server computing time through vertical model partitioning. Motivated by this, many researchers propose expanding the collaborative inference approach to multi-path models [10], adopting early exits [16], or uploading models [13]. However, these works only focus on determining the split point in a given environment, not on designing the entire offloading execution system, and therefore provide a limited solution.

Fused-layer (FL) offloading: The fused-layer (FL) technique was first proposed to design a CNN accelerator that reduces the off-chip memory access overhead by fusing multiple layers [2]. Motivated by this, some researchers introduce distributed inference offloading in IoT clusters [35, 38, 39] to overcome the limited memory size of IoT devices by partitioning a large DNN model into several independent submodels. A recent study suggests parallel partial offloading for mobile-server collaborative inference [40] and demonstrated the potential of adopting FL in collaborative inference through simulated results. However, this approach requires sophisticated handling of overlapped regions, thereby limiting scalability. Furthermore, this method only applies to DNNs with spatial locality (i.e., CNNs), and not to Transformer-based models like BERT [4] and GPT [3]. In contrast, tile-based expression can be employed in all DNNs and attains finer granularity, maximizing concurrency.

Tiling technique: Tiling is a well-known matrix multiplication acceleration technique [5, 29] that increases parallelism by decomposing matrix multiplication into multiple submatrix multiplications. Many studies adopt this tiling technique in various DNN computation topics such as DNN scheduling [18, 21], DNN compilers [41], reducing memory overhead [27], or heterogeneous computing [32]. To the best of our knowledge, we are the first to use this tiling technique to design a DNN offloading system.

8 CONCLUSION

In this paper, we design CoActo, a novel DNN execution system that realizes a new concept of collaborative inference, Coactive Inference Offloading. Coactive inference offloading adds two properties, fine-grained expression of DNNs and concurrency of runtime resources, to existing collaborative inference. In coactive inference, system components go beyond simple model splitting, operating more proactively and concurrently to achieve coactive execution of inference workloads. CoActo dynamically schedules concurrent interleaving of the mobile, server, and network operations to actively increase resource utilization in the offloading environment, enabling lower end-to-end latency. We implement CoActo for various mobile devices and server environments and demonstrate that our coactive approach achieves up to 2.1 times speed-up compared to the state-of-the-art collaborative inference approaches.

ACKNOWLEDGMENTS

This work was supported by the National Research Foundation of Korea (NRF) grant (2022R1A5A1027646) and the IITP grant (2022-0-00420) through the Ministry of Science and ICT (MSIT), Korea Government, and Samsung Electronics Co., Ltd (IO220808-01782-01). Kyunghan Lee is the corresponding author of this work.

A ARTIFACT APPENDIX

The research artifacts accompanying this paper are available via https://doi.org/10.5281/zenodo.11090490.

REFERENCES
[1] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: a system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (2016), pp. 265–283.
[2] Alwani, M., Chen, H., Ferdman, M., and Milder, P. Fused-layer cnn accelerators. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) (2016), IEEE, pp. 1–12.
[3] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[4] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[5] Dongarra, J. J., Du Croz, J., Hammarling, S., and Duff, I. S. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software 16, 1 (1990), 1–17.
[6] Eshratifar, A. E., Abrishami, M. S., and Pedram, M. Jointdnn: An efficient training and inference engine for intelligent mobile cloud computing services. IEEE Transactions on Mobile Computing 20, 2 (2019), 565–576.
[7] Gujarati, A., Karimi, R., Alzayat, S., Hao, W., Kaufmann, A., Vigfusson, Y., and Mace, J. Serving DNNs like clockwork: Performance predictability from the bottom up. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20) (2020), pp. 443–462.
[8] Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015).
[9] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (2016), pp. 770–778.
[10] Hu, C., Bao, W., Wang, D., and Liu, F. Dynamic adaptive dnn surgery for inference acceleration on the edge. In IEEE INFOCOM 2019-IEEE Conference on Computer Communications (2019), IEEE, pp. 1423–1431.
[11] Huang, K., and Gao, W. Real-time neural network inference on extremely weak devices: agile offloading with explainable ai. In Proceedings of the 28th Annual International Conference on Mobile Computing And Networking (2022), pp. 200–213.
[12] Hubert, B. Linux traffic control (tc). https://manpages.ubuntu.com/manpages/xenial/man8/tc.8.html.
[13] Jeong, H.-J., Lee, H.-J., Shin, C. H., and Moon, S.-M. Ionn: Incremental offloading of neural network computations from mobile devices to edge servers. In Proceedings of the ACM symposium on cloud computing (2018), pp. 401–411.
[14] Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th annual international symposium on computer architecture (2017), pp. 1–12.
[15] Kang, Y., Hauswald, J., Gao, C., Rovinski, A., Mudge, T., Mars, J., and Tang, L. Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. ACM SIGARCH Computer Architecture News 45, 1 (2017), 615–629.
[16] Laskaridis, S., Venieris, S. I., Almeida, M., Leontiadis, I., and Lane, N. D. Spinn: synergistic progressive inference of neural networks over device and cloud. In Proceedings of the 26th annual international conference on mobile computing and networking (2020), pp. 1–15.
[17] Li, E., Zeng, L., Zhou, Z., and Chen, X. Edge ai: On-demand accelerating deep neural network inference via edge computing. IEEE Transactions on Wireless Communications 19, 1 (2019), 447–457.
[18] Ma, L., Xie, Z., Yang, Z., Xue, J., Miao, Y., Cui, W., Hu, W., Yang, F., Zhang, L., and Zhou, L. Rammer: Enabling holistic deep learning compiler optimizations with rTasks. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20) (2020), pp. 881–897.
[19] OpenAI. Chatgpt. https://openai.com/chatgpt, 2023.


[20] Park, J., Bin, K., and Lee, K. mgemm: low-latency convolution with minimal memory overhead optimized for mobile devices. In Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services (2022), pp. 222–234.
[21] Park, J., Bin, K., Park, G., Ha, S., and Lee, K. Aspen: Breaking operator barriers for efficient parallelization of deep neural networks. Advances in Neural Information Processing Systems 36 (2024).
[22] Park, S., Lee, J., Kim, J., Lee, J., Ha, S., and Lee, K. Exll: An extremely low-latency congestion control for mobile cellular networks. In Proceedings of the 14th International Conference on emerging Networking EXperiments and Technologies (2018), pp. 307–319.
[23] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019), 8026–8037.
[24] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. In International Conference on Machine Learning (2021), PMLR, pp. 8821–8831.
[25] Redmon, J. Darknet: Open source neural networks in c. http://pjreddie.com/darknet/, 2013–2016.
[26] Redmon, J., and Farhadi, A. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
[27] Shi, Y., Yang, Z., Xue, J., Ma, L., Xia, Y., Miao, Z., Guo, Y., Yang, F., and Zhou, L. Welder: Scheduling deep learning memory access via tile-graph. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23) (2023), pp. 701–718.
[28] Sinnen, O., and Sousa, L. A. Communication contention in task scheduling. IEEE Transactions on Parallel and Distributed Systems 16, 6 (2005), 503–515.
[29] Smith, T. M., Van De Geijn, R., Smelyanskiy, M., Hammond, J. R., and Van Zee, F. G. Anatomy of high-performance many-threaded matrix multiplication. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (2014), IEEE, pp. 1049–1059.
[30] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition (2016), pp. 2818–2826.
[31] Topcuoglu, H., Hariri, S., and Wu, M.-Y. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems 13, 3 (2002), 260–274.
[32] Wang, M., Ding, S., Cao, T., Liu, Y., and Xu, F. Asymo: scalable and efficient deep-learning inference on asymmetric mobile cpus. In Proceedings of the 27th Annual International Conference on Mobile Computing and Networking (2021), pp. 215–228.
[33] Weng, Q., Xiao, W., Yu, Y., Wang, W., Wang, C., He, J., Li, Y., Zhang, L., Lin, W., and Ding, Y. MLaaS in the wild: Workload analysis and scheduling in Large-Scale heterogeneous GPU clusters. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22) (2022), pp. 945–960.
[34] Zaki, Y., Pötsch, T., Chen, J., Subramanian, L., and Görg, C. Adaptive congestion control for unpredictable cellular networks. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication (2015), pp. 509–522.
[35] Zeng, L., Chen, X., Zhou, Z., Yang, L., and Zhang, J. Coedge: Cooperative dnn inference with adaptive workload partitioning over heterogeneous edge devices. IEEE/ACM Transactions on Networking 29, 2 (2020), 595–608.
[36] Zhang, H., Tang, Y., Khandelwal, A., and Stoica, I. SHEPHERD: Serving DNNs in the wild. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23) (2023), pp. 787–808.
[37] Zhang, L. L., Han, S., Wei, J., Zheng, N., Cao, T., Yang, Y., and Liu, Y. Nn-meter: Towards accurate latency prediction of deep-learning model inference on diverse edge devices. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services (2021), pp. 81–93.
[38] Zhang, S., Zhang, S., Qian, Z., Wu, J., Jin, Y., and Lu, S. Deepslicing: collaborative and adaptive cnn inference with low latency. IEEE Transactions on Parallel and Distributed Systems 32, 9 (2021), 2175–2187.
[39] Zhao, Z., Barijough, K. M., and Gerstlauer, A. Deepthings: Distributed adaptive deep learning inference on resource-constrained iot edge clusters. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 11 (2018), 2348–2359.
[40] Zhou, H., Li, M., Wang, N., Min, G., and Wu, J. Accelerating deep learning inference via model parallelism and partial computation offloading. IEEE Transactions on Parallel and Distributed Systems 34, 2 (2022), 475–488.
[41] Zhu, H., Wu, R., Diao, Y., Ke, S., Li, H., Zhang, C., Xue, J., Ma, L., Xia, Y., Cui, W., et al. ROLLER: Fast and efficient tensor compilation for deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) (2022), pp. 233–248.
