Joint Optimization of Parallelism and Resource Configuration for Serverless Function Steps
Abstract—Function-as-a-Service (FaaS) offers a fine-grained resource provision model, enabling developers to build highly elastic cloud applications. User requests are handled by a series of serverless functions step by step, which forms a multi-step workflow. Developers are required to set proper configurations for functions to meet service level objectives (SLOs) and save costs. However, developing such a configuration strategy is challenging, mainly because the execution of serverless functions often suffers from cold starts and performance fluctuation, which calls for a dynamic configuration strategy to guarantee the SLOs. In this article, we present StepConf, a framework that automates the configuration as the workflow runs. StepConf optimizes the memory size for each function step in the workflow and takes inter- and intra-function parallelism into consideration, which has been overlooked by existing work. StepConf intelligently predicts the potential configurations for subsequent function steps and proactively prewarms function instances in a configuration-aware manner to reduce cold start overheads. We evaluate StepConf on AWS and Knative. Compared to existing work, StepConf improves performance by up to 5.6× under the same cost budget and achieves up to a 40% cost reduction while maintaining the same level of performance.

Index Terms—Serverless computing, resource management, resource configuration, function workflow.

I. INTRODUCTION

FUNCTION-AS-A-SERVICE (FaaS) is a new paradigm for serverless computing that allows developers to run code in the cloud without having to maintain and operate cloud resources [1], [2]. Developers need only submit function code to FaaS platforms, where compute resources are provisioned and functions are executed seamlessly. Therefore, developers only need to focus on business logic, which accelerates application development and saves operational costs. With FaaS platforms like AWS Lambda [3], developers can rapidly leverage hundreds of CPU cores by invoking massive numbers of functions simultaneously.

Benefiting from the fine-grained, high elasticity of FaaS, many applications have been developed based on serverless functions, including video processing [4], [5], [6], machine learning [7], [8], [9], [10], code compilation [11], big-data analytics [12], [13], [14], [15], [16], etc. To migrate applications to FaaS platforms, developers need to decouple monolithic applications into multiple functions, resulting in complex multi-step serverless workflows. However, developers face challenges in setting proper configurations for workflows to optimize cost and ensure performance. We summarize these challenges as follows:

Vague Impact of Resource Configurations: The cost and performance of functions in FaaS depend heavily on user-configured resource parameters. However, due to the unique resource allocation and pricing mechanisms of FaaS platforms, understanding the impact of resource configuration can be challenging. As a result, developers often struggle to determine the resource configuration that achieves high performance at low cost for their workflows [17], [18], [19].

Performance Fluctuations in Workflows: FaaS applications frequently depend on external storage services for extensive data exchange in workflows [20], [21]. However, these external services exhibit considerable variation in data transmission delay, ranging from a few hundred milliseconds to a couple of seconds. Additionally, function cold starts prevent concurrent mapping invocations from starting simultaneously, leading to different mapping delays. These factors introduce fluctuations in the performance of function steps, making it difficult to ensure end-to-end SLOs for workflows during configuration optimization.

Exponential Growth of Configuration Space: FaaS application developers are required to configure several parameters for each function. As the number of functions in the workflow grows, the decision space for parameters grows exponentially, making it difficult to find the optimal configuration. Adjusting the configuration according to different SLOs further complicates the resource configuration problem.

Challenges in Optimizing the Parallelism of Function Steps: The stateless nature of FaaS functions combined with their robust scalability simplifies and enhances parallel execution. FaaS
Fig. 7. CDF plots of data transmission time with different object sizes.
decompose the duration time of function steps. The execution process of a function step corresponds to the completion of all tasks in that function step during the execution of a workflow. As we consider inter-function and intra-function parallelism within each function step, the entire process of a function step is equivalent to the execution of multiple concurrent instances of a function, where each instance uses multiprocessing for parallel computation.

Through our tests, we find that concurrently invoked function instances do not start at the same time; the concurrent instances experience different mapping delays. The mapping delay of a function instance is the waiting time from the start of the function step until the function instance begins to execute. The mapping delays of different concurrent function instances within the same function step form a stair-like distribution, as shown in Fig. 2. This phenomenon is caused by the cold start time required for the elastic scaling of function instances, where function instances are created one by one rather than all at once.

Among these three parts, the time for downloading and uploading data is usually determined by the function's network bandwidth and is independent of the function's computational complexity. The computation time depends on the function's computational complexity and the resources used.

Based on the previous analysis, data transmission delay and mapping delay fluctuate significantly, while computation latency fluctuates relatively little. Therefore, we employ different estimation methods for these components.

C. Piece-Wise Fitting Model for Computation Latency

To accurately predict the computation latency of functions, we propose a piece-wise fitting model. Since the duration of function computation is usually influenced by the number of tasks in function instances and the memory size, we consider these two key factors in the model. Let γ_i represent the total number of tasks in a function step. Then, based on the inter-function parallelism p_i, we define γ̂_i = γ_i / p_i, which represents the number of tasks allocated to each function instance.
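To make the task-allocation notation concrete, here is a minimal sketch (our illustration, not StepConf's code) of spreading a step's γ_i tasks over p_i concurrent instances:

```python
# Sketch: partition a step's gamma_i tasks across p_i function instances.
# Each chunk would then be processed by one instance with q_i worker processes.
import math

def split_tasks(tasks, p_i):
    """Return p_i chunks of roughly gamma_i / p_i tasks each."""
    per_instance = math.ceil(len(tasks) / p_i)  # gamma_hat_i = gamma_i / p_i
    return [tasks[k:k + per_instance] for k in range(0, len(tasks), per_instance)]

chunks = split_tasks(list(range(120)), p_i=8)  # 8 instances, 15 tasks each
```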
Based on the experimental data shown in Fig. 6, we find that the execution time of functions is inversely proportional to the memory size. Since performance is the reciprocal of time, this indicates that the multi-core performance of function instances is nearly linearly related to the memory size. At the same time, the performance of functions is also influenced by the internal parallel optimization of the function code. We use m_s to represent the memory size corresponding to the single-core resources allocated by the cloud service provider (where s denotes single-core), and δ_s to represent the relative performance. We define min(m_i/m_s, q_i), where q_i represents the index of internal parallelism in the function. Here, m_i/m_s represents the multiple of resources allocated to the function compared to the maximum achievable by a single core, and q_i represents the multiple of performance improvement achievable by using internal function parallelism compared to a single thread. Taking the smaller of these two values gives the upper limit of the multi-core acceleration ratio for a single-thread-optimized function with internal parallelism.

It is worth noting that if the program has already implemented multi-core optimization, the impact of internal function parallelism is weakened. In this case, the inverse relationship between execution time and memory size changes, so we use an exponential function to improve prediction accuracy. Thus, we design a piece-wise fitting approach: for programs applicable to multi-core scenarios (case 1), we use an exponential function for fitting, while for programs that can only utilize single-thread performance or have memory sizes lower than m_s (case 2), we use an inverse function:

$$
t_{c_i} =
\begin{cases}
\left(a_{i,1}\,\hat{\gamma}_i + b_{i,1}\right) e^{-a_{i,2}\,\min\left(m_i/m_s,\; q_i\right)} + \varphi_{i,1}, & \text{case 1} \\[6pt]
\dfrac{a_{i,2}\,\hat{\gamma}_i}{\delta_s \,\min\left(m_i/m_s,\; q_i\right) + b_{i,2}} + \varphi_{i,2}, & \text{case 2}
\end{cases}
\quad (1)
$$

In this formula, a_{i,1}, a_{i,2}, b_{i,1}, b_{i,2}, φ_{i,1}, and φ_{i,2} are model parameters obtained through data training.
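As a rough illustration of how (1) could be fitted from profiling data, the sketch below uses scipy's curve_fit; the constants m_s and q_i, the sample points, and the latencies are all made-up assumptions for illustration, not values from the paper:

```python
# Sketch of fitting the piece-wise model (1); not StepConf's actual code.
import numpy as np
from scipy.optimize import curve_fit

M_S, Q_I, DELTA_S = 1769.0, 4.0, 1.0  # assumed single-core memory (MB), parallelism index

def case1(X, a1, a2, b1, phi):
    """Multi-core programs: exponential decay in the capped speedup factor."""
    gamma, m = X
    speedup = np.minimum(m / M_S, Q_I)
    return (a1 * gamma + b1) * np.exp(-a2 * speedup) + phi

def case2(X, a2, b2, phi):
    """Single-thread programs (or m < m_s): inverse relationship."""
    gamma, m = X
    speedup = np.minimum(m / M_S, Q_I)
    return a2 * gamma / (DELTA_S * speedup + b2) + phi

# Fake profiling samples (tasks per instance, memory MB) -> latency in seconds.
gamma = np.array([10, 10, 20, 20, 40, 40.0])
mem = np.array([512, 1024, 2048, 3072, 4096, 8192.0])
t_obs = np.array([9.1, 7.8, 8.3, 6.0, 7.2, 5.1])

params, _ = curve_fit(case1, (gamma, mem), t_obs, maxfev=10000)
```

The appropriate branch (case 1 or case 2) would be chosen per function based on whether its code benefits from multiple cores; case 2 is fitted analogously.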
D. Quantile Regression Model for Varying Delays

Considering the inherent variability in data transmission and mapping delays, making precise predictions poses a significant challenge. To address this issue, we adopt a quantile regression model for prediction. This model is particularly suitable for predicting response variables such as data transmission delay and mapping delay at specific quantiles. Imagine we have a dataset containing numerous variables observed under various parameters. Here, Y represents the response variable, while X = [X_1, X_2, ..., X_n] represents a set of explanatory variables.

To more accurately capture the potential nonlinear relationships between X and Y, our quantile regression model considers not only the effect of each individual X_i on Y but also includes the square terms of these variables and their possible interactions. The model can be expressed as:

$$
Q_{Y \mid X}(\omega) = \beta_0(\omega) + \sum_{i=1}^{n} \beta_i(\omega) X_i + \sum_{i=1}^{n} \gamma_i(\omega) X_i^2 + \sum_{i<j} \delta_{ij}(\omega) X_i X_j
\quad (2)
$$

where ω is the quantile, β_0(ω) is the intercept term, and β_i(ω), γ_i(ω), and δ_{ij}(ω) respectively represent the coefficients of the linear terms, square terms, and interaction terms. This model structure allows us to capture and predict the distribution of the response variable Y at the given quantile level, while considering the nonlinear relationships and interaction effects among the explanatory variables.

These coefficients are estimated by minimizing the quantile loss function, which focuses on absolute deviations and assigns different weights to errors above and below the ω quantile. The estimated parameters are used to predict data transmission time and mapping delay at the specific quantile ω. We use the number of objects and their sizes as features for predicting data transmission delay, and the number of functions experiencing a cold start for predicting mapping delay. Although the quantile regression model does not directly estimate the complete probability distribution, it provides significant insight into the conditional distribution of delays at the chosen quantile level.
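One concrete way to estimate such a model is statsmodels' quantile regression; the feature names and synthetic data below are illustrative assumptions, not StepConf's actual schema:

```python
# Sketch: fit (2) at the P90 quantile with linear, squared, and interaction terms.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "size_mb": rng.uniform(1, 64, 200),     # object size per transfer (assumed feature)
    "n_objects": rng.integers(1, 16, 200),  # number of objects (assumed feature)
})
# Synthetic delays with a heavy right tail, purely for illustration.
df["delay_ms"] = (30 + 8 * df["size_mb"] + 0.2 * df["size_mb"] ** 2
                  + 5 * df["n_objects"] + rng.gamma(2.0, 15.0, 200))

model = smf.quantreg(
    "delay_ms ~ size_mb + n_objects + I(size_mb**2) + I(n_objects**2) + size_mb:n_objects",
    df,
).fit(q=0.9)  # omega = 0.9, i.e., the P90 delay

p90_delay = model.predict(pd.DataFrame({"size_mb": [32.0], "n_objects": [8]}))
```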
E. Cost Model of Function Step

Let c_i denote the cost of the function step v_i in the workflow. Different cloud vendors have similar pricing models for function workflows. In the following, we take the pricing model of AWS as a representative [17]. We denote the price per GB-second of function execution as μ_0, and the price for function requests and orchestration as μ_1, where μ_0 and μ_1 are constants. For a function step with inter-function parallelism p_i, the cost of the function step is c_i = p_i · (t_i · m_i · μ_0 + μ_1).
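Plugging illustrative numbers into this cost model (the rates below mirror typical per-GB-second and per-request pricing but are placeholders, not quoted AWS prices):

```python
# Worked example of c_i = p_i * (t_i * m_i * mu_0 + mu_1); placeholder prices.
MU_0 = 0.0000166667  # $ per GB-second (illustrative)
MU_1 = 0.0000002     # $ per request/orchestration action (illustrative)

def step_cost(p_i, t_i, m_i_gb):
    """Cost of one function step: p_i instances, t_i seconds each, m_i GB memory."""
    return p_i * (t_i * m_i_gb * MU_0 + MU_1)

cost = step_cost(p_i=8, t_i=2.5, m_i_gb=2.0)  # ~$0.00067 for this step
```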
IV. DYNAMIC OPTIMIZATION OF FUNCTION STEP CONFIGURATION

In this section, we introduce how to dynamically determine the configuration for each function step.

A. Problem Formulation

A function workflow consists of a series of different function steps executed in sequence. We represent a serverless workflow using a directed acyclic graph (DAG) G = (V, E). The vertices V = {v_1, v_2, ..., v_n} represent the n function steps within the workflow. Vertices with an in-degree of 0 represent the starting point of the workflow, which corresponds to the first function step to be executed. Vertices with an out-degree of 0 represent the endpoint of the workflow, which corresponds to the last function step to be executed. The edges E = {v_i v_j | 1 ≤ i ≠ j ≤ n} represent dependencies between function steps, where v_i v_j denotes that step v_j cannot start until step v_i completes.
$$
\Theta_i = \{\, (m_i, p_i, q_i) \mid m_i \in M,\; p_i \in P,\; q_i \in Q \,\}. \quad (3)
$$
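Equation (3) defines the per-step configuration space as a Cartesian product, which is what makes the joint space grow exponentially with the number of steps; a small illustration with made-up candidate sets:

```python
# Sketch: enumerate one step's configuration space Theta_i = M x P x Q.
from itertools import product

M = [512, 1024, 2048, 4096]  # candidate memory sizes (MB), illustrative
P = [1, 2, 4, 8]             # inter-function parallelism candidates
Q = [1, 2, 4]                # intra-function parallelism candidates

theta_i = list(product(M, P, Q))   # 48 choices for a single step
total = len(theta_i) ** 10         # ~6.5e16 joint configurations for 10 steps
```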
steps, making it challenging to determine the weights needed to find the critical path. Based on our insights, each function step has a most cost-effective configuration, defined by the ratio of performance to cost. We rank configurations according to their cost-effectiveness and observe that configurations close

As our online strategy is essentially a distributed decision process, there is no need for each function step's configurator to obtain global information about the graph. As a result, we establish a shared global cache through which they access the necessary data, which also reduces the amount of local computation.
Fig. 10. Diagram of Configuration-Aware function prewarm.

argue that searching for the best configuration with (12) is hard when the configuration space is large. To deal with this, we maintain a local record of optimal configurations and replace a sub-optimal configuration whenever some configuration choice offers better performance at the same cost, or the same performance at a lower cost. Once this record is established, we no longer have to traverse all configuration choices every time, which improves the efficiency of our algorithm.
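The record can be viewed as keeping only the non-dominated (cost, performance) configurations. A minimal sketch of that pruning rule, assuming lower cost and lower latency are both better (our illustration, not StepConf's code):

```python
# Sketch: keep only configurations not dominated by a cheaper, faster choice.
def pareto_front(configs):
    """configs: iterable of (cost, latency, theta). Returns the non-dominated set."""
    front = []
    for cost, lat, theta in sorted(configs, key=lambda c: (c[0], c[1])):
        # After sorting by cost, a config survives only if it strictly
        # improves on the best latency seen so far.
        if not front or lat < front[-1][1]:
            front.append((cost, lat, theta))
    return front

record = pareto_front([(1.0, 9.0, "a"), (1.0, 7.0, "b"), (2.0, 8.0, "c"), (2.5, 4.0, "d")])
# -> [(1.0, 7.0, "b"), (2.5, 4.0, "d")]; "a" and "c" are dominated and dropped.
```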
TABLE II
EXPERIMENTAL ENVIRONMENT CONFIGURATIONS
D. Function Wrapper

The Function Wrapper is a Python class responsible for invoking the user-provided handler function within the function instance using multiprocessing, thereby enabling intra-function parallelism. Before executing each function step, the Configuration Optimizer optimizes the configuration selection for the current step, considering both inter-function and intra-function parallelism. The workflow engine concurrently invokes the specified number of function instances based on the inter-function parallelism. Within each function instance, the Function Wrapper takes over the entire process of data downloading, computation, and data uploading. It spawns the given number of processes based on the intra-function parallelism, splits the data, maps it to each process, and then waits for all processes to complete before merging the outputs. Additionally, we have rewritten the SDK for accessing storage media in the Function Wrapper using multithreading, enabling parallel data uploading and downloading to improve data transfer efficiency.
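A condensed sketch of this split-map-merge pattern follows; the handler, split, and merge callables and the storage access are placeholders, not StepConf's actual implementation:

```python
# Sketch of the wrapper's intra-function parallelism: split the step input,
# fan out to worker processes, and merge the partial results.
from multiprocessing import Pool

def run_step(handler, split, merge, step_input, intra_parallelism):
    """handler/split/merge are user-supplied callables; data I/O is elided."""
    shards = split(step_input, intra_parallelism)   # partition downloaded data
    with Pool(processes=intra_parallelism) as pool:
        partials = pool.map(handler, shards)        # one process per shard
    return merge(partials)                          # combine before uploading
```

In real use the handler must be importable (picklable) by the worker processes, which is why the wrapper invokes the user's top-level handler function rather than a closure.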
In particular, since current FaaS platforms do not support dynamically adjusting the resource configuration of a deployed function, how can we invoke functions with the desired memory size on demand? To address this, we deploy duplicate versions of the same function, each configured with a different memory size and tagged accordingly, such as "helloworld:1024 MB". In this way, we can select the desired function memory by invoking the function with the corresponding URL tag.
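For illustration, resolving such a tagged deployment could look like the following; the naming scheme and base URL are assumptions, not the paper's exact convention:

```python
# Sketch: route an invocation to the duplicate deployed with the chosen memory.
def tagged_url(base_url, func_name, memory_mb):
    """e.g., tagged_url(..., "helloworld", 1024) -> ".../helloworld:1024MB"."""
    return f"{base_url}/{func_name}:{memory_mb}MB"

url = tagged_url("http://functions.cluster.local", "helloworld", 1024)
```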
E. Function Manager

Within StepConf, the Function Manager plays a central role as middleware and assumes comprehensive control over the execution of function steps. It manages the entire lifecycle of function steps and facilitates parallel processing between functions by invoking a certain number of function instances simultaneously. After the workflow engine provides the step input, function name, and function configuration information, the Function Manager starts working.

To achieve this, we designed the Function Manager as an internal proxy service within the cluster. During its operation, the Function Manager, guided by the Workflow Engine's directives, invokes specific functions that were previously provisioned in the FaaS cluster. Importantly, by using the Function Manager, we ensure compatibility not only with AWS Lambda but also with other platforms, including various open-source and commercial platforms.

VII. EXPERIMENTAL EVALUATION

In this section, we demonstrate the advantages of the components of StepConf through a series of experiments. Our experimental evaluation aims to answer the following research questions (RQs):
• RQ1: Can StepConf accurately estimate the performance and cost under different configurations?
• RQ2: Can StepConf's dynamic resource allocation scheme guarantee the SLOs of workflows?
• RQ3: Can StepConf's resource allocation optimization algorithm reduce the cost of workflows?
• RQ4: Does considering parallelism in StepConf's optimization improve the performance and reduce the cost of workflows?
• RQ5: Can the StepConf heuristic algorithm approach the theoretical optimum with a small overhead?
• RQ6: Can the StepConf function prewarmer reduce the cold start overheads in workflows?

A. Experimental Setup

Cluster and Software Information: We build a highly available Kubernetes cluster and install Knative Serving as the serverless platform, with HAProxy [31] and Keepalived [32] installed. The cluster comprises 3 master nodes and 6 worker nodes. Each worker node is allocated 16 cores and 32 GB of memory, while each master node is allocated 8 cores and 16 GB of memory. For data exchange between functions, we utilize an AWS S3 storage bucket located in the region nearest to our experimental environment, ensuring ample public network bandwidth (over 1 Gbps) and low storage access latency, to replicate the resource conditions of mainstream cloud applications [33], [34]. Table II provides detailed information on the hardware facilities and software versions used in our experiments.

Workflow Applications: We select three different workflows implemented in Python 3.8 to evaluate the performance of StepConf, as follows:
• Machine Learning Workflow (ML): Machine learning is a typical use case of serverless workflow. We adopt the
TABLE III
COMPARISON OF STEPCONF WITH BASELINES
Fig. 17. Comparison of relative prediction error for different functions in different types.
data from a total of 100 recent invocations across different workflows on the cloud platform to initialize the quantile regression model. As more data is collected by the cloud platform, the quantile model can more accurately reflect the actual performance characteristics and fluctuation trends. The quantile is set to P90, and we repeat each experiment 10 times and report the average.

Fig. 16(a) shows a function that is optimized for a single thread and has no intra-function parallelism configured. Even though the function instances have access to multiple CPU cores, the function cannot utilize them once the memory size exceeds 1769 MB, which produces the nearly horizontal line in the figure.

However, when we configure intra-function parallelism for the single-thread-optimized function, the function can utilize the multiple core resources, as shown in Fig. 16(b). Compared to the previous case, enabling intra-function parallelism allows the function to make better use of the resources allocated to function instances, thereby improving performance under the same memory specification.

Fig. 16(c) shows a function that is already optimized for multiple cores. From the goodness of fit of the curve in the graph, we can see that the piece-wise fitting model of the performance estimator accurately fits the performance of the function under different configuration parameters.

Fig. 17 shows the prediction errors for the average end-to-end performance of three different function steps. From the color depth in the graph, we observe that across different memory sizes and numbers of tasks, the prediction errors are mostly within 4%, with the largest error below 8%. This accuracy is achieved with relatively low profiling cost, owing to our combined use of piece-wise and quantile models, which provide accurate performance estimations.

C. Performance Guarantee and Cost Saving

To answer RQ2 (SLO) and RQ3 (Cost Saving), we present experimental results of our workflow executions.

First, we select four different SLOs for each workflow and run the workflows multiple times, measuring the end-to-end completion time. We repeat each workflow 20 times to observe how the SLO attainment fluctuates. As shown in Fig. 18, the x-axis represents the different SLOs configured for the workflow, and the y-axis represents the normalized end-to-end runtime, i.e., the actual workflow execution time divided by the SLO. Values closer to 1 indicate that the actual duration of the workflow is closer to the expected SLO. We observe that, compared to the baselines, StepConf's workflow duration fluctuates less and mostly remains below 1. This indicates that StepConf, with its real-time dynamic workflow configuration, better meets the SLOs. Note that although we dynamically adjust the workflow configuration in real time to correct the workflow's execution, fluctuations caused by the last function step cannot be corrected afterwards, which is why StepConf cannot completely eliminate performance fluctuations.

Next, we compare the cost of running the workflows under different SLOs. We use Vanilla's performance and cost as the baseline and compare the performance and cost of the other solutions.
Fig. 19. Performance and cost compared with different baselines (lower is better).
As shown in Fig. 19, StepConf achieves cost savings of up to 40% while improving performance by up to 5.6× compared to the vanilla approach. In comparison to the other baselines, StepConf saves 5% to 22.3% in cost while improving performance by 1.91× to 3.3×.

D. Impact of Optimizing Parallelism

To answer RQ4 (Parallelism), we investigate the impact of optimizing parallelism in the function steps of workflows and conduct experiments comparing StepConf with partial parallelism optimization. Fig. 20 shows that individually optimizing intra-function parallelism or inter-function parallelism has less impact than optimizing both simultaneously. The key point of StepConf is to fully utilize both intra-function and inter-function parallelism, resulting in higher resource utilization efficiency and better performance under the same cost.

E. Algorithm Effectiveness and Overhead

To answer RQ5 (Heuristic), we compare our algorithm with the theoretical optimal configuration, namely Oracle. Since online algorithms cannot guide the real-time execution of actual workflows, the theoretical optimal solution exists only for offline algorithms. Therefore, we adopt the offline version of the StepConf algorithm, ignoring performance fluctuations in workflow function steps, to make a fair comparison with the Oracle algorithm. The Oracle algorithm uses exhaustive traversal to find the optimal solution. As shown in Fig. 21, our algorithm
Fig. 21. Avg cost of workflows compared with oracle.

Fig. 23. Reduction of average mapping delay with function prewarmer.
[46], [47]. Some researchers focus on accelerating the startup time of sandboxes [48], [49], [50], [51], [52], [53], while others aim to reduce the frequency of cold starts [54], [55], [56], [57]. Defuse [58] utilizes historical invocation data to analyze function relationships, prewarming functions to avoid cold starts. Xanadu [59] has also designed a prewarming mechanism for cascading function chains. Unlike these approaches, StepConf introduces a Configuration-Aware function prewarming mechanism that considers memory configurations and determines the number of prewarmed container instances based on parallelism, thereby reducing cold start overhead more effectively.

IX. CONCLUSION

We propose StepConf, an SLO-aware dynamic resource configuration framework for serverless function workflows. We develop a heuristic algorithm to dynamically configure each function step, ensuring end-to-end SLOs for the workflow. Furthermore, StepConf utilizes piece-wise fitting models and quantile regression models to accurately estimate the performance of function steps under different configuration parameters. In addition, we design a workflow engine and Function Manager that support various FaaS platforms and reduce cold start overhead through Configuration-Aware function prewarming. Compared to existing strategies, StepConf improves performance by up to 5.6× under the same cost budget and achieves up to a 40% cost reduction while maintaining the same level of performance.

REFERENCES

[1] I. Baldini et al., "Serverless computing: Current trends and open problems," in Research Advances in Cloud Computing. Berlin, Germany: Springer, 2017, pp. 1–20.
[2] Z. Li, L. Guo, J. Cheng, Q. Chen, B. He, and M. Guo, "The serverless computing survey: A technical primer for design architecture," ACM Comput. Surv., vol. 54, no. 10s, pp. 1–34, 2022.
[3] Amazon AWS Lambda, 2024. [Online]. Available: https://aws.amazon.com/lambda/
[4] F. Romero, M. Zhao, N. J. Yadwadkar, and C. Kozyrakis, "Llama: A heterogeneous & serverless framework for auto-tuning video analytics pipelines," in Proc. ACM Symp. Cloud Comput., 2021, pp. 1–17.
[5] L. Ao, L. Izhikevich, G. M. Voelker, and G. Porter, "Sprocket: A serverless video processing framework," in Proc. ACM Symp. Cloud Comput., 2018, pp. 263–274.
[6] S. Fouladi et al., "Encoding, fast and slow: Low-latency video processing using thousands of tiny threads," in Proc. USENIX Conf. Netw. Syst. Des. Implementation, 2017, pp. 363–376.
[7] J. Carreira, P. Fonseca, A. Tumanov, A. Zhang, and R. Katz, "Cirrus: A serverless framework for end-to-end ML workflows," in Proc. ACM Symp. Cloud Comput., 2019, pp. 13–24.
[8] F. Xu, Y. Qin, L. Chen, Z. Zhou, and F. Liu, "λDNN: Achieving predictable distributed DNN training with serverless architectures," IEEE Trans. Comput., vol. 71, no. 2, pp. 450–463, Feb. 2022.
[9] J. Thorpe et al., "Dorylus: Affordable, scalable, and accurate GNN training with distributed CPU servers and serverless threads," in Proc. USENIX Symp. Operating Syst. Des. Implementation, 2021, pp. 495–514.
[10] V. Sreekanti, H. Subbaraj, C. Wu, J. E. Gonzalez, and J. M. Hellerstein, "Optimizing prediction serving on low-latency serverless dataflow," 2020, arXiv:2007.05832.
[11] S. Fouladi et al., "From laptop to lambda: Outsourcing everyday jobs to thousands of transient functional containers," in Proc. USENIX Annu. Tech. Conf., 2019, pp. 475–488.
[12] H. Zhang, Y. Tang, A. Khandelwal, J. Chen, and I. Stoica, "Caerus: Nimble task scheduling for serverless analytics," in Proc. USENIX Conf. Netw. Syst. Des. Implementation, 2021, pp. 653–669.
[13] B. Carver, J. Zhang, A. Wang, A. Anwar, P. Wu, and Y. Cheng, "Wukong: A scalable and locality-enhanced framework for serverless parallel computing," in Proc. ACM Symp. Cloud Comput., 2020, pp. 1–15.
[14] Z. Li et al., "FaaSFlow: Enable efficient workflow execution for function-as-a-service," in Proc. ACM Int. Conf. Architectural Support Program. Lang. Operating Syst., 2022, pp. 782–796.
[15] E. Jonas, Q. Pu, S. Venkataraman, I. Stoica, and B. Recht, "Occupy the cloud: Distributed computing for the 99%," in Proc. ACM Symp. Cloud Comput., 2017, pp. 445–451.
[16] V. Shankar et al., "Serverless linear algebra," in Proc. ACM Symp. Cloud Comput., 2020, pp. 281–295.
[17] AWS Lambda pricing, 2021. [Online]. Available: https://aws.amazon.com/lambda/pricing
[18] A. Casalboni, "AWS Lambda power tuning," 2020. [Online]. Available: https://github.com/alexcasalboni/aws-lambda-power-tuning
[19] F. Liu and Y. Niu, "Demystifying the cost of serverless computing: Towards a win-win deal," IEEE Trans. Parallel Distrib. Syst., vol. 35, no. 1, pp. 59–72, Jan. 2024.
[20] Databases on AWS, 2024. [Online]. Available: https://aws.amazon.com/products/databases
[21] Amazon S3: Object storage built to retrieve any amount of data from anywhere, 2024. [Online]. Available: https://aws.amazon.com/s3
[22] N. Akhtar, A. Raza, V. Ishakian, and I. Matta, "COSE: Configuring serverless functions using statistical learning," in Proc. IEEE Conf. Comput. Commun., 2020, pp. 129–138.
[23] A. Mahgoub, E. B. Yi, K. Shankar, S. Elnikety, S. Chaterji, and S. Bagchi, "ORION and the three rights: Sizing, bundling, and prewarming for serverless DAGs," in Proc. USENIX Symp. Operating Syst. Des. Implementation, 2022, pp. 303–320.
[24] C. Lin and H. Khazaei, "Modeling and optimization of performance and cost of serverless applications," IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 3, pp. 615–632, Mar. 2021.
[25] Z. Wen, Y. Wang, and F. Liu, "StepConf: SLO-aware dynamic resource configuration for serverless function workflows," in Proc. IEEE Conf. Comput. Commun., 2022, pp. 1868–1877.
[26] Y.-K. Kwok and I. Ahmad, "Dynamic critical-path scheduling: An effective technique for allocating task graphs to multiprocessors," IEEE Trans. Parallel Distrib. Syst., vol. 7, no. 5, pp. 506–521, May 1996.
[27] Serverless warmup plugin, 2022. [Online]. Available: https://github.com/juanjoDiaz/serverless-plugin-warmup
[28] AWS Step Functions: Visual workflows for modern applications, 2024. [Online]. Available: https://aws.amazon.com/step-functions
[29] Azure Logic Apps, 2024. [Online]. Available: https://learn.microsoft.com/zh-cn/azure/logic-apps/logic-apps-overview
[30] E. Bernhardsson et al., "Luigi," Spotify, 2022. [Online]. Available: https://github.com/spotify/luigi
[31] HAProxy load balancer, 2023. [Online]. Available: https://github.com/haproxy/haproxy
[32] Keepalived: Loadbalancing and high-availability, 2023. [Online]. Available: https://github.com/acassen/keepalived
[33] E. Jonas et al., "Cloud programming simplified: A Berkeley view on serverless computing," 2019, arXiv:1902.03383.
[34] F. Wu, Q. Wu, and Y. Tan, "Workflow scheduling in cloud: A survey," J. Supercomput., vol. 71, no. 9, pp. 3373–3418, 2015.
[35] R. S. Kannan, L. Subramanian, A. Raju, J. Ahn, J. Mars, and L. Tang, "GrandSLAm: Guaranteeing SLAs for jobs in microservices execution frameworks," in Proc. 14th ACM EuroSys Conf., 2019, pp. 1–16.
[36] Y. Zhang, W. Hua, Z. Zhou, G. E. Suh, and C. Delimitrou, "Sinan: ML-based and QoS-aware resource management for cloud microservices," in Proc. ACM Int. Conf. Architectural Support Program. Lang. Operating Syst., 2021, pp. 167–181.
[37] H. Mao, M. Schwarzkopf, S. B. Venkatakrishnan, Z. Meng, and M. Alizadeh, "Learning scheduling algorithms for data processing clusters," in Proc. ACM Special Int. Group Data Commun., 2019, pp. 270–288.
[38] Y. Li, F. Liu, Q. Chen, Y. Sheng, M. Zhao, and J. Wang, "MarVeLScaler: A multi-view learning based auto-scaling system for MapReduce," IEEE Trans. Cloud Comput., vol. 10, no. 1, pp. 506–520, First Quarter 2022.
[39] F. Romero, Q. Li, N. J. Yadwadkar, and C. Kozyrakis, "INFaaS: Automated model-less inference serving," in Proc. USENIX Annu. Tech. Conf., 2021, pp. 397–411.
[40] A. Singhvi, A. Balasubramanian, K. Houck, M. D. Shaikh, S. Venkataraman, and A. Akella, "Atoll: A scalable low-latency serverless platform," in Proc. ACM Symp. Cloud Comput., 2021, pp. 138–152.
[41] O. Alipourfard, H. H. Liu, J. Chen, S. Venkataraman, M. Yu, and M. Zhang, "CherryPick: Adaptively unearthing the best cloud configurations for big data analytics," in Proc. USENIX Conf. Netw. Syst. Des. Implementation, 2017, pp. 469–482.
[42] S. Venkataraman, Z. Yang, M. Franklin, B. Recht, and I. Stoica, "Ernest: Efficient performance prediction for large-scale advanced analytics," in Proc. USENIX Conf. Netw. Syst. Des. Implementation, 2016, pp. 363–378.
[43] S. Eismann, J. Grohmann, E. Van Eyk, N. Herbst, and S. Kounev, "Predicting the costs of serverless workflows," in Proc. ACM/SPEC Int. Conf. Perform. Eng., 2020, pp. 265–276.
[44] T. Elgamal, "Costless: Optimizing cost of serverless computing through function fusion and placement," in Proc. IEEE Symp. Edge Comput., 2018, pp. 300–312.
[45] J. Kijak, P. Martyna, M. Pawlik, B. Balis, and M. Malawski, "Challenges for scheduling scientific workflows on cloud functions," in Proc. IEEE 11th Int. Conf. Cloud Comput., 2018, pp. 460–467.
[46] J. M. Hellerstein et al., "Serverless computing: One step forward, two steps back," 2018, arXiv:1812.03651.
[47] Q. Pei, Y. Yuan, H. Hu, Q. Chen, and F. Liu, "AsyFunc: A high-performance and resource-efficient serverless inference system via asymmetric functions," in Proc. ACM Symp. Cloud Comput., 2023, pp. 324–340.
[48] D. Du et al., "Catalyzer: Sub-millisecond startup for serverless computing with initialization-less booting," in Proc. ACM Int. Conf. Architectural Support Program. Lang. Operating Syst., 2020, pp. 467–481.
[49] E. Oakes et al., "SOCK: Rapid task provisioning with serverless-optimized containers," in Proc. USENIX Annu. Tech. Conf., 2018, pp. 57–70.
[50] Z. Li et al., "RunD: A lightweight secure container runtime for high-density deployment and high-concurrency startup in serverless computing," in Proc. USENIX Annu. Tech. Conf., 2022, pp. 53–68.
[51] A. Mohan, H. S. Sane, K. Doshi, S. Edupuganti, N. Nayak, and V. Sukhomlinov, "Agile cold starts for scalable serverless," in Proc. 11th USENIX Conf. Hot Top. Cloud Comput., 2019.
[52] A. Agache et al., "Firecracker: Lightweight virtualization for serverless applications," in Proc. USENIX Conf. Netw. Syst. Des. Implementation, 2020, pp. 419–434.
[53] J. Cadden, T. Unger, Y. Awad, H. Dong, O. Krieger, and J. Appavoo, "SEUSS: Skip redundant paths to make serverless fast," in Proc. 15th Eur. Conf. Comput. Syst., 2020, pp. 1–15.
[54] M. Shahrad et al., "Serverless in the wild: Characterizing and optimizing the serverless workload at a large cloud provider," in Proc. USENIX Annu. Tech. Conf., 2020, pp. 205–218.
[55] R. B. Roy, T. Patel, and D. Tiwari, "IceBreaker: Warming serverless functions better with heterogeneity," in Proc. ACM Int. Conf. Architectural Support Program. Lang. Operating Syst., 2022, pp. 753–767.
[56] A. Fuerst and P. Sharma, "FaasCache: Keeping serverless computing alive with greedy-dual caching," in Proc. ACM Int. Conf. Architectural Support Program. Lang. Operating Syst., 2021, pp. 386–400.
[57] L. Pan, L. Wang, S. Chen, and F. Liu, "Retention-aware container caching for serverless edge computing," in Proc. IEEE Conf. Comput. Commun., 2022, pp. 1069–1078.
[58] J. Shen, T. Yang, Y. Su, Y. Zhou, and M. R. Lyu, "Defuse: A dependency-guided function scheduler to mitigate cold starts on FaaS platforms," in Proc. IEEE 41st Int. Conf. Distrib. Comput. Syst., 2021, pp. 194–204.
[59] N. Daw, U. Bellur, and P. Kulkarni, "Xanadu: Mitigating cascading cold starts in serverless function chain deployments," in Proc. 21st Int. Middleware Conf., 2020, pp. 356–370.

Zhaojie Wen is currently working toward the PhD degree with the School of Computer Science and Technology, Huazhong University of Science and Technology, China. His research interests include serverless computing, resource allocation, and task scheduling.

Qiong Chen received the BEng and MEng degrees from the School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China. He is currently a research staff with the Central Software Institute of Huawei. His research interests include applied machine learning and serverless computing. He received the Best Paper Award of the ACM International Conference on Future Energy Systems (ACM e-Energy) in 2018.

Yipei Niu received the BEng degree from Henan University, and the PhD degree from the Huazhong University of Science and Technology. His research interests include cloud computing, serverless computing, container networking, and FPGA acceleration.

Zhen Song is currently working toward the master's degree with the School of Computer Science and Technology, Huazhong University of Science and Technology, China. His research interests include serverless computing and WebAssembly.

Quanfeng Deng is currently working toward the PhD degree with the School of Computer Science and Technology, Huazhong University of Science and Technology, China. His research interests include serverless computing and cloud-native networking.

Fangming Liu (Senior Member, IEEE) received the BEng degree from Tsinghua University, Beijing, and the PhD degree from the Hong Kong University of Science and Technology, Hong Kong. He is currently a full professor with the Huazhong University of Science and Technology, Wuhan, China. His research interests include cloud computing and edge computing, data center and green computing, SDN/NFV/5G, and applied ML/AI. He received the National Natural Science Fund (NSFC) for Excellent Young Scholars and the National Program Special Support for Top-Notch Young Professionals. He is a recipient of the Best Paper Award of IEEE/ACM IWQoS 2019, ACM e-Energy 2018, and IEEE GLOBECOM 2011, the First Class Prize of Natural Science of the Ministry of Education in China, as well as the Second Class Prize of the National Natural Science Award in China.