
IEEE INTERNET OF THINGS JOURNAL, VOL. 11, NO. 8, 15 APRIL 2024

Dynamic Split Computing Framework in Distributed Serverless Edge Clouds

Haneul Ko, Senior Member, IEEE, Hyeonjae Jeong, Daeyoung Jung, and Sangheon Pack, Senior Member, IEEE

Abstract—Distributed serverless edge clouds and split computing are promising technologies to reduce the inference latency of large-scale deep neural networks (DNNs). In this article, we propose a dynamic split computing framework (DSCF) in distributed serverless edge clouds. In DSCF, the edge cloud orchestrator dynamically determines 1) the splitting point and 2) the warm status maintenance of container instances (i.e., whether or not to maintain each container instance in a warm status). For optimal decisions, we formulate a constrained Markov decision process (CMDP) problem to minimize the inference latency while maintaining the average resource consumption of the distributed edge clouds below a certain level. The optimal stochastic policy can be obtained by converting the CMDP model into a linear programming (LP) model. The evaluation results demonstrate that DSCF can achieve less than half the inference latency of the local computing scheme while maintaining sufficiently low resource consumption of the distributed edge clouds.

Index Terms—Distributed serverless edge cloud, joint optimization, split computing, warm start.

Manuscript received 22 June 2023; revised 28 November 2023; accepted 10 December 2023. Date of publication 13 December 2023; date of current version 9 April 2024. This work was supported by the Institute for Information and Communications Technology Planning and Evaluation (IITP) under Grant 2022-0-01015 and Grant 2022-0-00531. (Corresponding author: Sangheon Pack.)
Haneul Ko is with the Department of Electronic Engineering, Kyung Hee University, Yongin 17104, Gyeonggi, South Korea (e-mail: [email protected]).
Hyeonjae Jeong, Daeyoung Jung, and Sangheon Pack are with the School of Electrical Engineering, Korea University, Seoul 02841, South Korea (e-mail: [email protected]; [email protected]; [email protected]).
Digital Object Identifier 10.1109/JIOT.2023.3342438

I. INTRODUCTION

Deep neural networks (DNNs) are among the most widely used approaches in current intelligent mobile applications and have become increasingly popular due to their accurate and reliable inference ability [1], [2]. Basically, for inference using DNNs, a local computing-based inference method can be considered, in which a mobile device performs all necessary computations for inference [3], [4]. In this method, there is no help from the network side, and thus complex DNN models cannot be executed on the mobile device due to its limited resources.

Meanwhile, serverless computing can also be used for DNN inference.1 That is, a mobile device just defines a function to carry out the inference on the cloud side. Then, it transmits the request message that invokes the container instance carrying out the defined function, together with the raw input data (e.g., a video stream or speech), to the cloud. The container instance then performs the inference and returns the result to the mobile device.

1 Note that, in serverless computing, a mobile device does not need to manage dedicated servers and/or container instances for the inference of DNNs [5], [6], which implies increased resource efficiency and cloud manageability [7], [8].

Meanwhile, edge computing [9], [10] has been introduced as an emerging computing paradigm, in which multiple edge clouds are deployed near mobile devices and perform computing-intensive tasks (e.g., inference). As a result, it potentially addresses the high communication latency of cloud computing caused by the long distance between mobile devices and the cloud.

To take advantage of serverless computing and edge computing jointly, serverless edge computing-based inference models can be considered. However, standalone edge clouds are not sufficient to further reduce the inference latency of large-scale DNN models, and thus we consider distributed serverless edge clouds, where multiple edge clouds are deployed and an orchestrator distributes serverless tasks depending on the status of networking and computing resources. Specifically, a given DNN model is divided into two subnetworks at a certain splitting layer. The first subnetwork, from the first layer to the splitting layer, is called the head model, and the second subnetwork, from the layer after the splitting layer to the last layer, is called the tail model. These head and tail models are constructed as container instances and run at edge clouds. The DNN inference latency then depends on the splitting point and on whether the corresponding container instances are maintained as warm statuses.

In this article, we propose a dynamic split computing framework (DSCF) in which the orchestrator carefully decides the splitting point and the warm status maintenance of container instances (i.e., whether or not to maintain each container instance in a warm status). For optimal decisions, we formulate a constrained Markov decision process (CMDP) problem. The optimal stochastic policy can be obtained by converting the CMDP model into a linear programming (LP) model. The evaluation results show that DSCF can achieve less than half the inference latency of the local computing scheme while maintaining a sufficiently low resource consumption of edge clouds. Furthermore, DSCF can adaptively adjust its policies on the splitting point and the maintenance of warm statuses by considering its operational environment.

The key contributions of this article can be summarized as follows. First, when employing the split computing approach in distributed serverless edge clouds, the inference latency
depends on the splitting point and on whether the corresponding container instances are kept in warm statuses. Therefore, to minimize the DNN inference latency, it is essential to jointly optimize the splitting point and the status maintenance of container instances. Nevertheless, previous works have not addressed this joint optimization; that is, the existing work focused solely on determining the optimal splitting point without considering the status of container instances. Consequently, even with the optimal splitting point, a sufficiently low inference latency cannot be obtained due to the increased latency for container initialization. On the contrary, DSCF jointly optimizes the two decisions on the splitting point and the status maintenance of container instances; thus, it can efficiently reduce the inference latency. Second, to assess the effectiveness of the proposed framework in real-world scenarios, we measured the inference latency of head and tail models under different splitting points and computing environments. Using these measured inference latencies, we conducted comprehensive evaluations and analyzed the results, offering significant guidance for optimal performance in terms of the inference latency and resource consumption of distributed edge clouds.

The remainder of this article is organized as follows. Section II summarizes the related works. Section III presents DSCF, and Section IV describes the CMDP model for the optimal operation of DSCF. Section V presents the evaluation results, followed by the concluding remarks in Section VI.

II. RELATED WORK

Various works have been conducted to reduce the inference latency in split computing [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23]. These works can be classified into: 1) 3GPP standardization in split computing [11], [12]; 2) pure split computing strategies [13], [14], [15], [16], [17], [18]; 3) split computing with the concept of early exit [19], [20]; and 4) split computing with compression methods [21], [22], [23].

3GPP analyzed the performance of split computing in various environments [11] and derived some issues to support split computing in mobile networks [12].

Kang et al. [13] designed an automatic model splitting mechanism that consists of deployment and runtime phases. In the deployment phase, prediction models for each layer's performance (i.e., energy consumption and latency) are constructed. Based on the constructed prediction models, in the runtime phase, the splitting point is dynamically decided by considering which metric is more important than the other. Yan et al. [14] conducted a joint optimization on the model placement and the splitting point to minimize a cost function consisting of energy consumption and inference latency, considering the dynamics of wireless channels. In addition, they derived a closed-form solution for a specific DNN model. Eshratifar et al. [15] formulated an integer LP problem to decide on more than one splitting point. They obtained a suboptimal solution by converting the formulated problem to the well-known shortest-path problem. He et al. [16] formulated a joint optimization problem for the splitting point and resource allocation based on a queuing model that evaluates the inference latency. To solve the formulated problem in a practical manner, they decomposed the problem into subproblems and developed a heuristic algorithm. Tang et al. [17] investigated an optimization problem to split the DNN model while taking into account the resource-constrained edge cloud. In addition, they devised an algorithm that exploits the structural aspects of the defined problem to find a solution in polynomial time. Deb et al. [18] proposed a decentralized multiuser computation offloading mechanism using game theory to appropriately offload tasks to nearby edge clouds.

Li et al. [19] proposed a model splitting framework considering the early-exit concept, where the inference can be terminated at an appropriate intermediate layer. The framework jointly optimizes the exit and splitting points to maximize the accuracy of the inference while keeping the inference latency below a certain level. Laskaridis et al. [20] proposed a framework that continuously monitors network and mobile device resources and determines the splitting and early-exit points with consideration of the application service-level agreement (SLA).

Dehkordi et al. [21] formulated an optimization problem to decide the splitting point and the quantization level of the weights in the DNN to reduce the inference latency without degradation of accuracy. In addition, they proposed a heuristic algorithm using properties of a directed acyclic graph (DAG) to obtain a suboptimal solution. Krouka et al. [22] presented a method that performs pruning and compression to further reduce the energy consumption of the mobile device while guaranteeing the accuracy of the inference before splitting the DNN model. Zhou et al. [23] proposed a method using model pruning and compression of the feature map at the splitting point to reduce the inference latency.

However, there is no work considering the split computing approach for DNN inference in distributed serverless edge clouds.

III. DYNAMIC SPLIT COMPUTING FRAMEWORK

Fig. 1. System model.

Fig. 1 shows the proposed DSCF in distributed serverless edge clouds. We consider two edge clouds (i.e., the head and tail edge clouds) in which the head and tail models, respectively, are installed as container instances. Furthermore, these edge clouds are managed by an edge cloud orchestrator. DSCF exploits the split computing approach, in which the whole DNN model with L layers is split into head and tail models. Note that any DNN model can be considered in DSCF if it can be split into head and tail models. For dynamic splitting of the DNN, the head and tail edge clouds have all possible container instances for the head models from the first layer to the lth layer, I_{H,l} (for ∀l), and for the tail models after the lth layer, I_{T,l} (for ∀l), respectively.
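A minimal sketch of this head/tail construction is given below; the toy sequential network and the splitting point l = 4 are illustrative assumptions rather than an actual deployment.

```python
# Minimal sketch: splitting an L-layer sequential DNN into head/tail models at layer l.
import torch
import torch.nn as nn

def split_model(model: nn.Sequential, l: int):
    """Return (head, tail) covering layers 1..l and l+1..L, respectively."""
    layers = list(model.children())
    return nn.Sequential(*layers[:l]), nn.Sequential(*layers[l:])

# Toy stand-in for a DNN; a real deployment would use, e.g., VGG16.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)

head, tail = split_model(model, l=4)        # head model I_{H,4}, tail model I_{T,4}
x = torch.randn(1, 3, 32, 32)               # raw input sent by the mobile device
intermediate = head(x)                      # output of the head model (intermediate data)
result = tail(intermediate)                 # final inference result at the tail edge cloud
assert torch.allclose(result, model(x))     # splitting preserves the end-to-end output
```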
When the mobile device needs the result of an inference, it requests the execution of the DNN inference from the edge cloud orchestrator. The request message contains the input data for the inference. After receiving the inference request message, the orchestrator decides the splitting point and requests the
execution of the container instance for the corresponding head model from the head edge cloud.2 After receiving this request message, if the container instance for the corresponding head model, I_{H,l}, is not maintained as a warm status, the head edge cloud performs a cold start to initialize the container instance. Then, it conducts the inference and obtains the intermediate data (i.e., the output of the head model). After that, it sends a request message with the intermediate data to the tail edge cloud.3 After receiving the message, the tail edge cloud conducts a cold start to initialize the container instance when the container instance of the corresponding tail model, I_{T,l}, is not in a warm status. After completing the cold start, the tail edge cloud can obtain the final inference result by executing the tail model with the intermediate data as input. After obtaining the result, the tail edge cloud returns it to the mobile device.

2 Since a container instance is used by a single mobile device in general serverless computing [24], [25] and the mobile device is assumed to generate a single request, we do not need to consider the case of concurrent requests for the target container, and thus no queuing latency is considered in this article.

3 Note that if the head edge cloud executes the whole model, it returns the inference result to the mobile device.
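The request flow described above can be sketched as follows; the class and function names, the warm-instance bookkeeping, and the 2.0-s cold-start delay are illustrative assumptions rather than an actual DSCF implementation.

```python
# Illustrative sketch of the DSCF request flow (mobile -> orchestrator -> head -> tail edge cloud).
import time

COLD_START_DELAY = 2.0              # assumed cold-start latency in seconds (cf. [26])

class EdgeCloud:
    """Hosts one container instance per candidate splitting point l."""
    def __init__(self, instances):
        self.instances = instances  # dict: splitting point l -> callable sub-model
        self.warm = set()           # splitting points whose instances are kept warm

    def run(self, l, data):
        if l not in self.warm:      # cold start: initialize the instance before inference
            time.sleep(COLD_START_DELAY)
            self.warm.add(l)
        return self.instances[l](data)

def handle_request(choose_split, head_cloud, tail_cloud, x, L):
    """Orchestrator-side handling of one inference request from a mobile device."""
    l = choose_split(x)                          # decide the splitting point
    if l == 0:                                   # entire model runs at the tail edge cloud
        return tail_cloud.run(0, x)
    intermediate = head_cloud.run(l, x)          # head model I_{H,l}
    if l == L:                                   # entire model already ran at the head edge cloud
        return intermediate
    return tail_cloud.run(l, intermediate)       # tail model I_{T,l}

# Toy usage with dummy "models" that just transform a number.
L = 3
head = EdgeCloud({l: (lambda d, l=l: d + l) for l in range(1, L + 1)})
tail = EdgeCloud({l: (lambda d, l=l: d * 10) for l in range(0, L)})
print(handle_request(lambda x: 2, head, tail, x=1.0, L=L))   # -> (1.0 + 2) * 10 = 30.0
```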
As shown in this procedure, the total inference latency of DSCF consists of 1) the transmission latency of the input data; 2) the initialization and inference latency of the head model; 3) the transmission latency of the intermediate data; and 4) the initialization and inference latency of the tail model.

The inference latency of the head model, the transmission latency of the intermediate data, and the inference latency of the tail model depend on which layer of the DNN is split (i.e., the splitting point) [3]. In addition, they are influenced by the transmission rate between the head and tail edge clouds and by their available computing powers. Specifically, the inference latency of the head model ζ_{H,I} depends on the splitting point and the available computing power of the head edge cloud. Let F_H^l and C_H denote the number of floating-point operations (FLOPs) of the head model with the lth splitting point and the available computing power of the head edge cloud, respectively. Then, the inference latency of the head model can be calculated as ζ_{H,I} = F_H^l / C_H. Similarly, the inference latency of the tail model can be derived as ζ_{T,I} = F_T^l / C_T, where F_T^l and C_T denote the number of FLOPs of the tail model with the lth splitting point and the available computing power of the tail edge cloud, respectively. Meanwhile, the latency to transmit the intermediate data from the head edge cloud to the tail edge cloud, ζ_D, can be obtained as ζ_D = D_l / R, where D_l and R are the intermediate data size at the lth splitting point and the transmission rate between the head and tail edge clouds, respectively. Therefore, to decide the optimal splitting point, the orchestrator collects some information (i.e., the transmission rate between the head and tail edge clouds and the available computing powers of the head and tail edge clouds). Then, considering this information, it dynamically decides the optimal splitting point.
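A minimal sketch of this splitting-point decision is given below; the FLOP counts, intermediate data sizes, and resource values are placeholder assumptions, and cold starts are ignored here (they are handled through the warm-status decision in Section IV).

```python
# Sketch: estimated end-to-end latency for splitting point l and a greedy choice of l.
# zeta_{H,I} = F_H^l / C_H, zeta_D = D_l / R, zeta_{T,I} = F_T^l / C_T (cold starts omitted).

def total_latency(l, F_head, F_tail, D, C_H, C_T, R, zeta_I):
    """Estimated inference latency (s) when the DNN is split after layer l."""
    zeta_H_I = F_head[l] / C_H          # head model inference latency
    zeta_T_I = F_tail[l] / C_T          # tail model inference latency
    zeta_D = D[l] / R                   # intermediate data transmission latency
    return zeta_I + zeta_H_I + zeta_D + zeta_T_I

# Placeholder per-splitting-point profiles and resource status (assumed values).
F_head = {1: 2.0, 2: 6.0, 3: 11.0, 4: 14.0, 5: 15.4}    # cumulative head FLOPs (GFLOP)
F_tail = {l: 15.5 - f for l, f in F_head.items()}        # remaining tail FLOPs (GFLOP)
D      = {1: 12.8, 2: 6.4, 3: 3.2, 4: 1.6, 5: 0.02}      # intermediate data size (MB)
C_H, C_T = 50.0, 100.0                                    # available computing power (GFLOP/s)
R = 131.2 / 8                                             # transmission rate (MB/s)
zeta_I = 0.05                                             # input transmission latency (s)

best_l = min(F_head, key=lambda l: total_latency(l, F_head, F_tail, D, C_H, C_T, R, zeta_I))
print(best_l, round(total_latency(best_l, F_head, F_tail, D, C_H, C_T, R, zeta_I), 3))
```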
However, even if the optimal splitting point is applied, when the container instances for the corresponding head and tail models are not maintained as warm statuses, the edge cloud has to conduct a cold start, which takes a long time (over 2 s [26]). Due to this long cold-start latency (i.e., the initialization latency of the head and tail models), the inference result cannot be returned within a short duration. If the container instances for all head and tail models are maintained as warm statuses, the inference result can be obtained without any concern about cold-start latency, regardless of the chosen splitting point. However, keeping instances warm requires a considerable amount of resources. Thus, if too many container instances are kept in warm statuses, the edge cloud cannot provide more container instances for newly specified functions due to its limited resources [27], [28]. To avoid this situation, the resources consumed to maintain warm statuses should be kept below a particular threshold. That is, the edge cloud orchestrator should carefully decide which container instances are kept in warm statuses and which are not.

In summary, the inference latency is affected by the splitting point and by whether each container instance is maintained in a warm status or not. In addition, the resource consumption of the edge cloud is determined by whether each container instance is maintained in a warm status or not. Therefore, DSCF conducts a joint optimization on the splitting point and the warm status maintenance of container instances. The following section describes the CMDP problem formulation, whose objective is to minimize the inference latency while maintaining the average resource consumption of the distributed edge clouds below a certain level. In addition, the CMDP problem is converted to an equivalent LP model to obtain the optimal solution.

IV. CONSTRAINED MARKOV DECISION PROCESS

In this article, the edge cloud orchestrator decides the splitting point and whether to maintain each container instance in a warm status or not at the time epochs T = {1, 2, 3, . . .}. Meanwhile, the CMDP model is suitable for optimization problems where the agent performs a series of particular actions to minimize (or maximize) the cost (or reward) under constraints [29]. Therefore, we exploit the CMDP model to optimize the decisions of the edge cloud orchestrator. Table I summarizes the important notations.

A. State Space

The total state space S can be defined as

    S = C_H × C_T × R × ∏_l (W_{H,l} × W_{T,l})    (1)
TABLE I: Important Notations.

where C_H and C_T are the state spaces for the available computing powers of the head and tail edge clouds, respectively, and R is the state space for the transmission rate between the head and tail edge clouds. W_{H,l} (or W_{T,l}) denotes the state space indicating whether the container instance for the head model from the first layer to the lth layer, I_{H,l} (or the container instance for the tail model after the lth layer, I_{T,l}), is in a warm status or not.

Let C_E^max denote the maximum computing power of an edge cloud. Since the elements of C_H and C_T can be discretized with the minimum scale u_C (i.e., unit computing power), C_H and C_T can be represented by

    C_H = {u_C, 2u_C, . . . , C_E^max}    (2)

and

    C_T = {u_C, 2u_C, . . . , C_E^max}.    (3)

Similar to C_H and C_T, R can be discretized with the minimum scale u_R (i.e., unit transmission rate). Then, R can be expressed as

    R = {R_min, R_min + u_R, . . . , R_max}    (4)

where R_min and R_max are the minimum and maximum transmission rates, respectively.

A container instance can be in a warm or a cold status. Therefore, W_{H,l} and W_{T,l} can be described as

    W_{H,l} = {0, 1}    (5)

and

    W_{T,l} = {0, 1}    (6)

where W_{H,l} (or W_{T,l}) denotes whether or not I_{H,l} (or I_{T,l}) is in a warm status. That is, if W_{H,l} = 1, I_{H,l} is in a warm status; otherwise, I_{H,l} is in a cold status.

B. Action Space

The total action space can be defined as

    A = A_S × ∏_l (A_{W,H,l} × A_{W,T,l})    (7)

where A_S denotes the splitting point action space. In addition, A_{W,H,l} and A_{W,T,l} represent the warming action spaces for I_{H,l} and I_{T,l}, respectively.

Since the model has L layers, A_S can be represented as

    A_S = {0, 1, . . . , L}.    (8)

Note that A_S = 0 represents the case where the entire model is carried out at the tail edge cloud (i.e., the tail model is equal to the entire model). On the other hand, A_S = L describes the case where the whole model is conducted at the head edge cloud (i.e., the head model is equal to the whole model).

The head and tail edge clouds can make each container instance I_{H,l} or I_{T,l} into a warm status, and therefore A_{W,H,l} and A_{W,T,l} can be represented as

    A_{W,H,l} = {0, 1}    (9)

and

    A_{W,T,l} = {0, 1}.    (10)
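A minimal sketch of the resulting discrete state and action spaces is given below; the values of u_C, C_E^max, u_R, R_min, R_max, and L are illustrative assumptions.

```python
# Sketch: enumerating the discretized state space S and action space A.
import itertools

L = 5                                   # number of candidate splitting layers
u_C, C_E_max = 1, 4                     # unit and maximum computing power
u_R, R_min, R_max = 20, 20, 140         # unit, minimum, and maximum transmission rate

C_H = C_T = list(range(u_C, C_E_max + 1, u_C))          # (2), (3)
R = list(range(R_min, R_max + 1, u_R))                  # (4)
W = [0, 1]                                              # (5), (6): cold / warm

# S = C_H x C_T x R x prod_l (W_{H,l} x W_{T,l})        -- (1)
states = list(itertools.product(C_H, C_T, R, *([W] * (2 * L))))

# A = A_S x prod_l (A_{W,H,l} x A_{W,T,l})              -- (7)-(10)
A_S = list(range(0, L + 1))             # 0: all layers at the tail, L: all layers at the head
actions = list(itertools.product(A_S, *([W] * (2 * L))))

print(len(states), len(actions))        # size of the (S, A) grid used by the LP in Section IV-E
```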
C. Transition Probability

Based on the chosen warming actions A_{W,H,l} and A_{W,T,l}, the next states W'_{H,l} and W'_{T,l} (representing whether the container instance is in a warm status or not) can be decided. Meanwhile, the other state transitions are not affected by the chosen action, and all states change independently. Therefore, the transition probability from the current state S = [C_H, C_T, R, W_{H,l}, W_{T,l}] to the next state S' = [C'_H, C'_T, R', W'_{H,l}, W'_{T,l}] can be described as

    P[S'|S, A] = P[C'_H|C_H] × P[C'_T|C_T] × P[R'|R] × ∏_l P[W'_{H,l}|W_{H,l}, A_{W,H,l}] × P[W'_{T,l}|W_{T,l}, A_{W,T,l}].    (11)
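The factorization in (11) can be transcribed directly as a product of per-component kernels, as in the following sketch; the placeholder kernels stand in for the statistically defined C_H, C_T, and R dynamics and for (12)-(19).

```python
# Sketch: composing P[S'|S, A] as the product of per-component kernels, as in (11).

def transition_prob(s, s_next, a, p_C, p_R, p_W):
    """s, s_next: (C_H, C_T, R, W_H, W_T); W_H, W_T are tuples over splitting points l.
    a: (A_S, A_W_H, A_W_T); the warming actions are tuples over l."""
    C_H, C_T, R, W_H, W_T = s
    C_Hn, C_Tn, Rn, W_Hn, W_Tn = s_next
    _, A_W_H, A_W_T = a
    prob = p_C(C_Hn, C_H) * p_C(C_Tn, C_T) * p_R(Rn, R)
    for l in range(len(W_H)):
        prob *= p_W(W_Hn[l], W_H[l], A_W_H[l])   # P[W'_{H,l} | W_{H,l}, A_{W,H,l}]
        prob *= p_W(W_Tn[l], W_T[l], A_W_T[l])   # P[W'_{T,l} | W_{T,l}, A_{W,T,l}]
    return prob

# Tiny demo with placeholder kernels (uniform for C/R; action-following for warm status).
p_C = lambda c_next, c: 0.25                      # 4 computing-power levels, uniform
p_R = lambda r_next, r: 1.0 / 7                   # 7 transmission-rate levels, uniform
p_W = lambda w_next, w, a_w: 1.0 if w_next == a_w else 0.0
s      = (2, 3, 60, (0, 1), (1, 0))
s_next = (1, 3, 80, (1, 1), (0, 0))
print(transition_prob(s, s_next, (2, (1, 1), (0, 0)), p_C, p_R, p_W))
```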

It is assumed that the times required for container instances I_{H,l} and I_{T,l} to complete a cold start follow exponential distributions with means 1/γ_{H,l} and 1/γ_{T,l}, respectively. Then, if container instances I_{H,l} and I_{T,l} are maintained as cold statuses (i.e., W_{H,l} = 0 and W_{T,l} = 0) and the head and tail edge clouds make those container instances into warm statuses (i.e., A_{W,H,l} = 1 and A_{W,T,l} = 1), the probabilities that the cold starts for I_{H,l} and I_{T,l} are completed within the decision epoch duration τ can be obtained as γ_{H,l}τ and γ_{T,l}τ, respectively [30]. Thus, P[W'_{H,l}|W_{H,l}, A_{W,H,l} = 1] and P[W'_{T,l}|W_{T,l}, A_{W,T,l} = 1] can, respectively, be described as

    P[W'_{H,l}|W_{H,l} = 0, A_{W,H,l} = 1] = { γ_{H,l}τ, if W'_{H,l} = 1; 1 − γ_{H,l}τ, if W'_{H,l} = 0; 0, otherwise }    (12)

and

    P[W'_{T,l}|W_{T,l} = 0, A_{W,T,l} = 1] = { γ_{T,l}τ, if W'_{T,l} = 1; 1 − γ_{T,l}τ, if W'_{T,l} = 0; 0, otherwise }.    (13)

When container instances I_{H,l} and I_{T,l} are maintained as cold statuses (i.e., W_{H,l} = 0 and W_{T,l} = 0) and the head and tail edge clouds do not make I_{H,l} and I_{T,l} into warm statuses (i.e., A_{W,H,l} = 0 and A_{W,T,l} = 0), respectively, W'_{H,l} and W'_{T,l} are always 0. Therefore, the corresponding transition probabilities can be expressed by

    P[W'_{H,l}|W_{H,l} = 0, A_{W,H,l} = 0] = { 1, if W'_{H,l} = 0; 0, otherwise }    (14)

and

    P[W'_{T,l}|W_{T,l} = 0, A_{W,T,l} = 0] = { 1, if W'_{T,l} = 0; 0, otherwise }.    (15)

Meanwhile, if the container instance for the head model I_{H,l} is in a warm status (i.e., W_{H,l} = 1), the container instance can be turned off (i.e., become cold) or be kept in a warm status without any additional delay. This indicates that the next state W'_{H,l} is always the same as the warming action A_{W,H,l}. Consequently, the corresponding transition probabilities are given as

    P[W'_{H,l}|W_{H,l} = 1, A_{W,H,l} = 1] = { 1, if W'_{H,l} = 1; 0, otherwise }    (16)

and

    P[W'_{H,l}|W_{H,l} = 1, A_{W,H,l} = 0] = { 1, if W'_{H,l} = 0; 0, otherwise }.    (17)

Similarly, for the container instance for the tail model I_{T,l} in a warm status (i.e., W_{T,l} = 1), the corresponding transition probabilities are represented as

    P[W'_{T,l}|W_{T,l} = 1, A_{W,T,l} = 1] = { 1, if W'_{T,l} = 1; 0, otherwise }    (18)

and

    P[W'_{T,l}|W_{T,l} = 1, A_{W,T,l} = 0] = { 1, if W'_{T,l} = 0; 0, otherwise }.    (19)

Note that the transition probabilities of the other states (i.e., C_H, C_T, and R) can be defined statistically.
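The warm-status kernel in (12)-(19) can be collected into a single function, as in the following sketch with an assumed value of γτ.

```python
# Sketch: warm-status transition kernel P[W' | W, A_W] per (12)-(19).

def warm_transition_prob(w_next: int, w: int, a_w: int, gamma_tau: float) -> float:
    """w, w_next: current/next warm status (1 = warm, 0 = cold); a_w: warming action."""
    if w == 0 and a_w == 1:                    # (12), (13): cold start may finish within tau
        return gamma_tau if w_next == 1 else 1.0 - gamma_tau
    if w == 0 and a_w == 0:                    # (14), (15): the instance stays cold
        return 1.0 if w_next == 0 else 0.0
    # w == 1: (16)-(19): the next status follows the warming action with no extra delay
    return 1.0 if w_next == a_w else 0.0

gamma_tau = 0.4                                # assumed gamma_{H,l} * tau
for w in (0, 1):
    for a_w in (0, 1):
        probs = [warm_transition_prob(w_next, w, a_w, gamma_tau) for w_next in (0, 1)]
        assert abs(sum(probs) - 1.0) < 1e-9    # each row is a valid distribution
        print(f"W={w}, A_W={a_w}: P[W'=0]={probs[0]:.2f}, P[W'=1]={probs[1]:.2f}")
```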
D. Cost and Constraint Functions

1) Cost Function: We define the cost function so as to minimize the inference latency. The inference latency consists of the transmission latency of the input data ζ_I, the cold-start latency of the head model ζ_{H,C}, the inference latency of the head model ζ_{H,I}, the transmission latency of the intermediate data ζ_D, the cold-start latency of the tail model ζ_{T,C}, and the inference latency of the tail model ζ_{T,I}. Note that the cold-start latencies of the head and tail models, ζ_{H,C} and ζ_{T,C}, are incurred only when the container instances for the corresponding head and tail models are not maintained as warm statuses (i.e., W_{H,l=A_S} = 0 and W_{T,l=A_S} = 0). Therefore, the cost function r(S, A) can be represented by

    r(S, A) = ζ_I + δ[W_{H,l=A_S} = 0]ζ_{H,C} + ζ_{H,I} + ζ_D + δ[W_{T,l=A_S} = 0]ζ_{T,C} + ζ_{T,I}    (20)

where δ[·] is a function that returns 1 if the given condition is true and 0 otherwise.

2) Constraint Functions: The constraint functions c_H(S, A) and c_T(S, A) for the average resource consumption of the head and tail edge clouds can be defined as

    c_H(S, A) = Σ_l δ[W_{H,l} = 1]    (21)

and

    c_T(S, A) = Σ_l δ[W_{T,l} = 1]    (22)

respectively.

E. Optimization Formulation

The average inference latency ζ_L is defined as follows:

    ζ_L = lim sup_{t→∞} (1/t) Σ_t E[r(S_t, A_t)]    (23)

where S_t and A_t are the state and the chosen action at t ∈ T, respectively.

The average resource consumption of the head and tail edge clouds, ψ_H and ψ_T, can be represented as

    ψ_H = lim sup_{t→∞} (1/t) Σ_t E[c_H(S_t, A_t)]    (24)

and

    ψ_T = lim sup_{t→∞} (1/t) Σ_t E[c_T(S_t, A_t)]    (25)

respectively.
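A minimal sketch of evaluating (20)-(22) for a given state and action is given below; the latency components and warm-status vectors are assumed example values.

```python
# Sketch: evaluating the cost r(S, A) in (20) and the constraints c_H, c_T in (21)-(22).

def cost(W_H, W_T, A_S, zeta_I, zeta_H_C, zeta_H_I, zeta_D, zeta_T_C, zeta_T_I):
    """W_H, W_T: tuples of warm statuses per splitting point l (index 0 <-> l = 1)."""
    delta_H = 1 if W_H[A_S - 1] == 0 else 0     # cold start needed at the head edge cloud
    delta_T = 1 if W_T[A_S - 1] == 0 else 0     # cold start needed at the tail edge cloud
    return (zeta_I + delta_H * zeta_H_C + zeta_H_I
            + zeta_D + delta_T * zeta_T_C + zeta_T_I)

def c_H(W_H):                                   # (21): number of warm head instances
    return sum(1 for w in W_H if w == 1)

def c_T(W_T):                                   # (22): number of warm tail instances
    return sum(1 for w in W_T if w == 1)

# Example with assumed latency components (seconds) and splitting point A_S = 3.
W_H, W_T = (0, 0, 1, 0, 0), (0, 0, 0, 0, 1)
print(cost(W_H, W_T, A_S=3, zeta_I=0.05, zeta_H_C=2.0, zeta_H_I=0.4,
           zeta_D=0.2, zeta_T_C=2.0, zeta_T_I=0.3))          # head warm, tail cold -> 2.95
print(c_H(W_H), c_T(W_T))                                     # one warm instance on each side
```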

We can express the CMDP model as follows:

    min_π  ζ_L    (26)
    s.t.   ψ_H ≤ θ_{H,R} and ψ_T ≤ θ_{T,R}    (27)

where π is a policy that specifies the probabilities of choosing a specific action at each state. In addition, θ_{H,R} and θ_{T,R} denote the thresholds for the resource consumption of the head and tail edge clouds, respectively.

To convert the CMDP model to an equivalent LP model, we define the stationary probabilities of state S and action A, ϕ(S, A), as the decision variables of the LP model. Then, the LP model can be represented as

    min_{ϕ(S,A)}  Σ_S Σ_A ϕ(S, A)r(S, A)    (28)

subject to

    Σ_S Σ_A ϕ(S, A)c_H(S, A) ≤ θ_{H,R}    (29)
    Σ_S Σ_A ϕ(S, A)c_T(S, A) ≤ θ_{T,R}    (30)
    Σ_A ϕ(S', A) = Σ_S Σ_A ϕ(S, A)P[S'|S, A]    (31)
    Σ_S Σ_A ϕ(S, A) = 1    (32)

and

    ϕ(S, A) ≥ 0.    (33)

The objective function (28) is to minimize the inference latency. The constraints (29) and (30) correspond to the constraints of the CMDP model in (27). In addition, (31) describes the Chapman–Kolmogorov constraint, and the probability properties are satisfied by the constraints (32) and (33).

Based on the solution ϕ*(S, A) of the LP model above, we can derive the optimal stochastic policy π*(S, A) of the CMDP model. Using the optimal stochastic policy (i.e., the optimal probability distribution), an appropriate action A can be chosen in any specific state S.
Based on the solution ϕ ∗ (S, A) of the LP model mentioned Fig. 2 shows the effect of the threshold θT,R on the resource
above, we can derive the optimal stochastic policy π ∗ (S, A) consumption of the tail edge cloud. As shown in Fig. 2, DSCF
of the CMDP model. Using the optimal stochastic policy (i.e., can minimize the average inference latency ζL [see Fig. 2(a)]
optimal probability distribution), an appropriate action A can while maintaining the average resource consumption ψT of the
be chosen in a specific state S. tail edge clouds below the target threshold θT,R [see Fig. 2(b)].
This is because DSCF performs joint optimization on the
V. E VALUATION R ESULTS splitting point and container status. More precisely, DSCF
To show the effectiveness of DSCF on the inference latency dynamically determines the splitting point taking into account
and resource consumption of the edge cloud, we design the transmission rates and the computing power available from
the following comparison schemes: 1) MOBILE where the the head and tail edge clouds. For example, when the head
inference for the entire DNN model is conducted on the mobile edge cloud has a comparatively high computing power and
device; 2) OPT-ALL-WARM where the optimal splitting point the current transmission rate is insufficient for low-latency
is used4 and all container instances are kept in warm statuses; transmissions of intermediate data, DSCF decides that the
3) SM-ALL-COLD [32] where the DNN model is split at entire execution of the DNN model takes place in the head
a specific layer providing the smallest intermediate data size edge cloud. As another example, if the available computing
and all container instances are maintained as cold statuses; power of the tail edge cloud and the transmission rate are high,
and 4) SM-CO-WARM where the DNN model is split at a only the container instance for the whole DNN model can be
specific layer providing the smallest intermediate data size and maintained as a warm status in the tail edge cloud and conduct
the only corresponding container instances are maintained as the inference.
warm statuses. Meanwhile, from Fig. 2(a), it can be found that ζL of DSCF
decreases as θT,R increases. This is because a larger θT,R
4 The splitting point of OPT-ALL-WARM is the same as that of DSCF. indicates that more container instances for tail models can be

Authorized licensed use limited to: INDIAN INSTITUTE OF INFORMATION TECHNOLOGY. Downloaded on June 29,2024 at 13:38:52 UTC from IEEE Xplore. Restrictions apply.
KO et al.: DYNAMIC SPLIT COMPUTING FRAMEWORK IN DISTRIBUTED SERVERLESS EDGE CLOUDS 14529

maintained as warm statuses (i.e., a larger θ_{T,R} implies that the tail edge cloud can probably conduct the inference without any cold start), which leads to a decreased ζ_L [see Fig. 2(a)]. On the other hand, from Fig. 2(a), it can be seen that ζ_L of the other comparison schemes remains constant regardless of variations in θ_{T,R}. This is because these comparison schemes adhere to a fixed policy without accounting for the resource consumption threshold of the tail edge cloud.

Fig. 2. Effect of θ_{T,R}. (a) Average inference latency. (b) Average resource consumption of the tail edge cloud.

B. Effect of 1/γ

Fig. 3. Effect of 1/γ. (a) Average inference latency. (b) Average resource consumption of the head edge cloud.

Fig. 3(a) and (b) shows the effect of the average cold-start latency, denoted by 1/γ, on the average inference latency ζ_L and the average resource consumption of the head edge cloud, respectively. Interestingly, from Fig. 3(a), it can be seen that the average inference latency ζ_L of DSCF decreases slightly with increasing 1/γ. This can be explained as follows. When a large 1/γ is given, DSCF aggressively maintains container instances in warm statuses to avoid situations where a cold start causes a long average inference latency. This operation of DSCF can be observed in Fig. 3(b) (i.e., the average resource consumption of the edge cloud increases with increasing 1/γ). This result implies that DSCF can achieve a higher performance gain when more complex DNN models are used. Note that the containers for complicated DNN models generally have a longer cold-start latency.

Meanwhile, OPT-ALL-WARM maintains all container instances in warm statuses. Also, in SM-CO-WARM, the DNN model is split at the specific layer providing the smallest intermediate data size, and the corresponding container instances are always maintained as warm statuses. Thus, the average inference latencies of OPT-ALL-WARM and SM-CO-WARM are constant regardless of 1/γ [see Fig. 3(a)]. On the other hand, since no container instance is maintained as a warm status in SM-ALL-COLD, its average inference latency increases significantly as 1/γ increases [see Fig. 3(a)]. This result indicates that the system operators of a serverless architecture should exploit the warm start, especially when the cold-start latency is large.

C. Effect of Average Transmission Rate

Fig. 4 illustrates the effect of the average transmission rate on the average inference latency ζ_L. It can be found that, with
the exception of MOBILE, the average inference latency of all schemes decreases as the average transmission rate increases. This trend is attributed to the fact that higher transmission rates facilitate faster transmission of the intermediate data to the edge cloud.

Fig. 4. Effect of average transmission rate on ζ_L.

Since the intermediate data sizes of SM-ALL-COLD and SM-CO-WARM are the smallest, it can be seen that their inference latency decreases only slightly. Note that SM-ALL-COLD and SM-CO-WARM split the DNN model at the specific layer providing the smallest intermediate data size. Meanwhile, since there is no intermediate data in MOBILE, its inference latency is not affected by the average transmission rate.

VI. CONCLUSION

In this article, we proposed DSCF, in which the optimal splitting point and the status of each container instance are obtained using the CMDP model. The evaluation results demonstrate that 1) DSCF can achieve less than half the inference latency (around 3.5 s) of the local computing-based inference method and 2) DSCF can achieve a higher performance gain (i.e., shorter inference latency) when a longer average cold-start latency is expected (i.e., when more complex neural networks are used). However, DSCF does not simultaneously exploit inter- and intra-layer splitting for a single neural network. By employing both inter- and intra-layer splitting simultaneously, the resources of heterogeneous edge clouds can be leveraged more fully, resulting in a notable decrease in inference latency. Therefore, in our future work, we will extend DSCF to consider both inter- and intra-layer splitting for heterogeneous cloud environments.

REFERENCES

[1] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proc. IEEE, vol. 105, no. 12, pp. 2295–2329, Dec. 2017.
[2] P. K. Deb, A. Mukherjee, D. Singh, and S. Misra, "Loop-the-loops: Fragmented learning over networks for constrained IoT devices," IEEE Trans. Parallel Distrib. Syst., vol. 34, no. 1, pp. 316–327, Jan. 2023.
[3] Y. Matsubara, M. Levorato, and F. Restuccia, "Split computing and early exiting for deep learning applications: Survey and research challenges," ACM Comput. Surv., vol. 55, no. 5, pp. 1–30, Dec. 2022.
[4] S. Wang, X. Zhang, H. Uchiyama, and H. Matsuda, "HiveMind: Towards cellular native machine learning model splitting," IEEE J. Sel. Areas Commun., vol. 40, no. 2, pp. 626–640, Feb. 2022.
[5] C. Cicconetti, M. Conti, and A. Passarella, "A decentralized framework for serverless edge computing in the Internet of Things," IEEE Trans. Netw. Service Manag., vol. 18, no. 2, pp. 2166–2180, Jun. 2021.
[6] S. Sarkar, R. Wankar, S. N. Srirama, and N. K. Suryadevara, "Serverless management of sensing systems for fog computing framework," IEEE Sensors J., vol. 20, no. 3, pp. 1564–1572, Feb. 2020.
[7] S. Hendrickson, S. Sturdevant, T. Harter, V. Venkataramani, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, "Serverless computation with OpenLambda," in Proc. USENIX HotCloud, 2016, pp. 33–39.
[8] H. Ko, S. Pack, and V. C. M. Leung, "Performance optimization of serverless computing for latency-guaranteed and energy-efficient task offloading in energy-harvesting Industrial IoT," IEEE Internet Things J., vol. 10, no. 3, pp. 1897–1907, Feb. 2023.
[9] R. Xie, Q. Tang, C. Liang, F. R. Yu, and T. Huang, "Dynamic computation offloading in IoT fog systems with imperfect channel-state information: A POMDP approach," IEEE Internet Things J., vol. 8, no. 1, pp. 345–356, Jan. 2021.
[10] Q. Tang et al., "Decentralized computation offloading in IoT fog computing system with energy harvesting: A Dec-POMDP approach," IEEE Internet Things J., vol. 7, no. 6, pp. 4898–4911, Jun. 2020.
[11] "Study on traffic characteristics and performance requirements for AI/ML model transfer in 5GS, Version 18.2.0," 3GPP, Sophia Antipolis, France, Rep. TR 22.874, Dec. 2021.
[12] "Study on 5G system support for AI/ML-based services, Version 1.0.0," 3GPP, Sophia Antipolis, France, Rep. TR 23.700-80, Sep. 2022.
[13] Y. Kang et al., "Neurosurgeon: Collaborative intelligence between the cloud and mobile edge," in Proc. ASPLOS, 2017, pp. 615–629.
[14] J. Yan, S. Bi, and Y. A. Zhang, "Optimal model placement and online model splitting for device-edge co-inference," 2021, arXiv:2105.13618.
[15] A. E. Eshratifar, M. S. Abrishami, and M. Pedram, "JointDNN: An efficient training and inference engine for intelligent mobile cloud computing services," IEEE Trans. Mobile Comput., vol. 20, no. 2, pp. 565–576, Feb. 2021.
[16] W. He, S. Guo, S. Guo, X. Qiu, and F. Qi, "Joint DNN partition deployment and resource allocation for delay-sensitive deep learning inference in IoT," IEEE Internet Things J., vol. 7, no. 10, pp. 9241–9254, Oct. 2020.
[17] X. Tang, X. Chen, L. Zeng, S. Yu, and L. Chen, "Joint multiuser DNN partitioning and computational resource allocation for collaborative edge intelligence," IEEE Internet Things J., vol. 8, no. 12, pp. 9511–9522, Jun. 2021.
[18] P. K. Deb, C. Roy, A. Roy, and S. Misra, "DEFT: Decentralized multiuser computation offloading in a fog-enabled IoV environment," IEEE Trans. Veh. Technol., vol. 69, no. 12, pp. 15978–15987, Dec. 2020.
[19] E. Li, Z. Zhou, and X. Chen, "Edge intelligence: On-demand deep learning model co-inference with device-edge synergy," in Proc. ACM MECOMM, 2018, pp. 31–36.
[20] S. Laskaridis, S. I. Venieris, M. Almeida, I. Leontiadis, and N. D. Lane, "SPINN: Synergistic progressive inference of neural networks over device and cloud," in Proc. ACM MOBICOM, 2020, pp. 1–15.
[21] A. Dehkordi, N. Vedula, J. Pei, F. Xia, L. Wang, and Y. Zhang, "Auto-Split: A general framework of collaborative edge-cloud AI," in Proc. ACM SIGKDD Conf. KDD, 2021, pp. 2543–2553.
[22] M. Krouka, A. Elgabli, C. B. Issaid, and M. Bennis, "Energy-efficient model compression and splitting for collaborative inference over time-varying channels," 2021, arXiv:2106.00995.
[23] H. Zhou, W. Zhang, C. Wang, X. Ma, and H. Yu, "BBNet: A novel convolutional neural network structure in edge-cloud collaborative inference," MDPI Sens., vol. 21, no. 13, pp. 1–16, Jun. 2021.
[24] AWS Lambda Documentation, Amazon Web Services, Seattle, WA, USA. Accessed: Dec. 31, 2023. [Online]. Available: https://docs.aws.amazon.com/lambda/
[25] G. McGrath and P. Brenner, "Serverless computing: Design, implementation, and performance," in Proc. 37th IEEE ICDCSW, 2017, pp. 405–410.
[26] A. Fuerst and P. Sharma, "FaasCache: Keeping serverless computing alive with greedy-dual caching," in Proc. ACM ASPLOS, 2021, pp. 386–400.
[27] R. Xie, Q. Tang, S. Qiao, H. Zhu, F. R. Yu, and T. Huang, "When serverless computing meets edge computing: Architecture, challenges, and open issues," IEEE Wireless Commun., vol. 28, no. 5, pp. 126–133, Oct. 2021.


[28] O. Ascigil, A. G. Tasiopoulos, T. K. Phan, V. Sourlas, I. Psaras, and G. Pavlou, "Resource provisioning and allocation in function-as-a-service edge-clouds," IEEE Trans. Services Comput., vol. 15, no. 4, pp. 2410–2424, Jul./Aug. 2022.
[29] H. Ko, J. Lee, S. Seo, S. Pack, and V. C. M. Leung, "Joint client selection and bandwidth allocation algorithm for federated learning," IEEE Trans. Mobile Comput., vol. 22, no. 6, pp. 3380–3390, Jun. 2023.
[30] H. Ko and S. Pack, "A software-defined surveillance system with energy harvesting: Design and performance optimization," IEEE Internet Things J., vol. 5, no. 3, pp. 1361–1369, Jun. 2018.
[31] Z. Xu et al., "Energy-aware inference offloading for DNN-driven applications in mobile edge clouds," IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 4, pp. 799–814, Apr. 2021.
[32] Q. Yang, X. Luo, P. Li, T. Miyazaki, and X. Wang, "Computation offloading for fast CNN inference in edge computing," in Proc. ACM RACS, 2019, pp. 101–106.
[33] M. Shilkov, "Comparison of cold starts in serverless functions across AWS, Azure, and GCP," 2021. [Online]. Available: https://mikhail.io/serverless/coldstarts/big3/
[34] K. Kiela, M. Jurgo, V. Macaitis, and R. Navickas, "5G standalone and 4G multi-carrier network-in-a-box using a software defined radio framework," MDPI Sens., vol. 21, no. 16, pp. 1–18, Aug. 2021.

Haneul Ko (Senior Member, IEEE) received the B.S. and Ph.D. degrees from the School of Electrical Engineering, Korea University, Seoul, South Korea, in 2011 and 2016, respectively.
He is currently an Assistant Professor with the Department of Electronic Engineering, Kyung Hee University, Yongin, South Korea. From 2019 to 2022, he was an Assistant Professor with the Department of Computer and Information Science, Korea University (Sejong Campus), Sejong, South Korea. From 2017 to 2018, he was a Postdoctoral Fellow with the University of British Columbia, Vancouver, BC, Canada. From 2016 to 2017, he was a Postdoctoral Fellow of Mobile Network and Communications with Korea University. His research interests include 5G/6G networks, network automation, mobile cloud computing, SDN/NFV, and Future Internet.
Dr. Ko was the recipient of the Minister of Education Award in 2019, the IEEE ComSoc APB Outstanding Young Researcher Award in 2022, and the Korean Institute of Communications and Information Sciences Haedong Young Engineer Award in 2023.

Hyeonjae Jeong received the B.S. degree from Chungnam University, Daejeon, South Korea, in 2019. She is currently pursuing the M.S. and Ph.D. (integrated course) degrees with Korea University, Seoul, South Korea.
Her research interests include distributed computing, federated learning, 5G/6G mobile core networks, mobile-edge computing, and network automation.

Daeyoung Jung received the B.S. degree from the School of Electrical Engineering, Korea University, Seoul, South Korea, in 2020, where he is currently pursuing the M.S. and Ph.D. (integrated course) degrees.
His research interests include 5G/6G networks, network automation, mobile-edge computing, deep reinforcement learning, distributed computing, and future Internet.

Sangheon Pack (Senior Member, IEEE) received the B.S. and Ph.D. degrees in computer engineering from Seoul National University, Seoul, South Korea, in 2000 and 2005, respectively.
In 2007, he joined the faculty of Korea University, Seoul, where he is currently a Professor with the School of Electrical Engineering. From 2005 to 2006, he was a Postdoctoral Fellow with the Broadband Communications Research Group, University of Waterloo, Waterloo, ON, Canada. His research interests include softwarized networking (SDN/NFV), 5G/6G mobile core networks, mobile-edge computing/programmable data plane, and vehicular networking.
Prof. Pack was the recipient of the IEEE/Institute of Electronics and Information Engineers (IEIE) Joint Award for IT Young Engineers Award in 2017, the Korean Institute of Information Scientists and Engineers Young Information Scientist Award in 2017, the Korea University TechnoComplex Crimson Professor in 2015, the Korean Institute of Communications and Information Sciences Haedong Young Scholar Award in 2013, the LG Yonam Foundation Overseas Research Professor Program in 2012, and the IEEE ComSoc APB Outstanding Young Researcher Award in 2009. He served as the TPC Vice-Chair for Information Systems of IEEE WCNC 2020, the Track Chair for IEEE VTC 2020-Fall/2010-Fall and IEEE CCNC 2019, the TPC Chair for IEEE/IEIE ICCE-Asia 2018/2020, EAI Qshine 2016, and ICOIN 2020, the Publication Co-Chair for IEEE INFOCOM 2014 and ACM MobiHoc 2015, the Symposium Chair for IEEE WCSP 2013, the TPC Vice-Chair for ICOIN 2013, and the Publicity Co-Chair for IEEE SECON 2012. He is an Editor of IEEE Internet of Things Journal, Journal of Communications and Networks, and IET Communications, and he was a Guest Editor of IEEE Transactions on Emerging Topics in Computing and IEEE Transactions on Network Science and Engineering.
