
IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)

Beyond Load Balancing:


Package-Aware Scheduling for Serverless Platforms
Gabriel Aumala∗ , Edwin F. Boza∗ , Luis Ortiz-Avilés∗ , Gustavo Totoy∗ , Cristina L. Abad
Escuela Superior Politécnica del Litoral, ESPOL
Email: {gaumala, eboza, luiaorti, gtotoy, cabad}@fiec.espol.edu.ec

Abstract—Fast deployment and execution of cloud functions in Function-as-a-Service (FaaS) platforms is critical, for example, for microservices architectures. However, functions that require large packages or libraries are bloated and start slowly. An optimization is to cache packages at the worker nodes instead of bundling them with the functions. However, existing FaaS schedulers are vanilla load balancers, agnostic of packages cached in response to prior function executions, and cannot properly reap the benefits of package caching. We study the case of package-aware scheduling and propose PASch, a novel scheduling algorithm that seeks package affinity during scheduling so that worker nodes can re-use execution environments with preloaded packages. PASch leverages consistent hashing and the power of 2 choices, while actively avoiding worker overload. We implement PASch in a new scheduler for the OpenLambda framework and evaluate it using simulations and real experiments. When using PASch instead of a least loaded balancer, tasks perceive an average speedup of 1.29x, and 80th percentile latency that is 23x faster. Furthermore, for the workload studied in this paper, PASch outperforms consistent hashing with bounded loads—a state-of-the-art load balancing algorithm—yielding a 1.3x average speedup, and a speedup of 1.5x at the 80th percentile.

Index Terms—serverless computing, cloud computing, function-as-a-service, scheduling, load balancing, data locality

Fig. 1. Example of function execution requests arriving at a FaaS scheduler. The color of each request represents the largest package that the function requires. The stack on the upper right corner of the worker nodes represents the local import (package) cache, available to all functions that run on the worker. Current schedulers have a load balancing goal and make no attempt to schedule a function where it could run faster; in this case, that would be at one of the nodes that has already cached and pre-imported the package.

I. INTRODUCTION

Function-as-a-Service (FaaS) cloud platforms let tenants deploy and execute functions on the cloud, without having to worry about server provisioning. In a FaaS platform, cloud functions are typically small, stateless tasks with a single functional responsibility, and are triggered by events. The FaaS provider manages the infrastructure and other operational concerns, enabling developers to easily deploy, monitor, and invoke the functions [1]. These functions or tasks run on any of a pool of servers or workers managed by the provider and potentially shared between the tenants.

The FaaS model holds good promise for future cloud applications, but raises new performance challenges that hinder its adoption [2]. One of these challenges is reducing the FaaS overhead. In particular, the provisioning overhead that results from deploying the cloud function on demand as part of unpredictable workloads can make launching functions slow and, as a result, exceed out of the box the service level objectives (SLOs) of applications from performance-critical domains like interactive web applications and IoT environments.

Cloud functions can launch rapidly, as they run in preallocated virtual machines (VMs) and containers. However, when these functions depend on large packages, they start slowly; this affects the elasticity of the application, as it reduces its ability to rapidly respond to sharp load bursts [3]1. Moreover, long function launch times have a direct negative impact on the performance of serverless applications using the FaaS model [2]. A solution is to cache pre-imported packages at the worker nodes, leading to speed-ups of up to 2000x when the packages are preloaded prior to function execution instead of having to bundle them with the cloud function [4].

However, existing FaaS schedulers are agnostic of any package caching implemented at the worker nodes and—when assigning functions to workers—fail to properly leverage the cached packages, as illustrated in Figure 1.

In this paper, we study the case of package-aware scheduling, which we define as a special case of near-data scheduling optimizations for FaaS platforms.

Partially funded by a Google Faculty Research Award.
Corresponding author: C. L. Abad (cabad@fiec.espol.edu.ec)
∗ These authors contributed equally to this work.
1 For simplicity, we talk about large packages, but the start-up time includes the time to download and install the package, and the run-time import processes; on average, all this can take more than four seconds [3], [4].

Existing FaaS schedulers, like those in OpenWhisk, Kubeless, Fission, OpenFaaS, and OpenLambda, are simple load balancers that make no attempt to target any code or artifacts cached at the worker nodes. With a regular load balancer, a function can be mapped to a worker node that does not contain a required package, even though one or more worker nodes that do have the package may be available (see Figure 1). A solution could be to use available techniques like consistent hashing [5], [6] to route all function requests that require a specific package p to the same worker node, thus maximizing the cache hit rate. However, this approach suffers from a problem of load imbalance, particularly under skewed workloads [7]—as is the case of the distribution of packages required by functions in FaaS platforms (see Table I and [3]).

Within the domain of stream processing engines and the general balls-and-bins model, recent work has tried to map requests to specific servers while keeping a balanced load [7]–[9]; however, these solutions avoid imbalance between the worker nodes at all costs, regardless of whether a worker could tolerate more work without sacrificing performance. We show that this is not necessary, and relax the balancing goal so that the ultimate goal is not keeping the workers balanced, but rather avoiding exceeding worker capacity.

To solve this problem, we propose a fast scheduling algorithm that seeks package affinity during scheduling, while actively avoiding worker overload. Our algorithm leverages the power of 2 choices technique [10] to map a task to the least loaded of two affinity nodes, based on the largest package required by the function. If, however, both nodes exceed a configurable overload threshold, then the scheduler reverts to simple load balancing, and maps the task to the least loaded worker node. To seek package affinity during task assignment, we use consistent hashing [5], [6] so that we minimize the expected number of movements in case worker nodes are added or removed by an auto-scaler or elasticity manager. As a result, worker nodes can re-use execution environments with preloaded packages, resulting in a faster task launch time and, consequently, a faster task turnaround time.

Our contributions consist of the following:
1) We carefully describe the problem and related work, including state-of-the-art algorithms (section II).
2) We formalize the problem of scheduling functions in FaaS platforms with the goal of maximizing code locality while actively avoiding worker overload (section III).
3) We propose PASch, a novel package-aware scheduling algorithm for FaaS platforms (section IV).
4) We provide a working implementation of PASch on olscheduler, a lightweight scheduler for the OpenLambda framework that we have released (section IV-B).
5) We evaluate PASch under realistic workloads (section V) and find that it considerably outperforms the least loaded scheduler: a 1.29x average speedup, and 80th percentile latency that is 23x faster. In our experiments, PASch also outperforms a state-of-the-art algorithm [9], yielding a 1.3x faster average latency and an 80th percentile latency that is 1.5x faster.

II. BACKGROUND AND RELATED WORK

A. FaaS and OpenLambda

A Function-as-a-Service (FaaS) platform supports the creation of distributed applications composed of small, single-task cloud functions. These functions run in lightweight sandbox environments, which run on top of virtual machines. The sandboxes, runtime environments, and virtual machines are managed by the cloud provider. Thus, a developer can create elastic cloud applications without having to worry about server provisioning and elasticity managers. Examples of FaaS platforms include OpenLambda,2 Fission,3 OpenWhisk,4 AWS Lambda,5 Google Cloud Functions,6 and Azure Functions7.

2 open-lambda.org
3 fission.io
4 openwhisk.apache.org
5 aws.amazon.com/lambda
6 cloud.google.com/functions
7 azure.microsoft.com/en-us/services/functions

OpenLambda: OpenLambda is a serverless computing platform that supports the FaaS execution model [11], [12]. Figure 2 shows the OpenLambda architecture. In OpenLambda, a developer must upload the code of their cloud functions to the code store or registry. When a cloud function is triggered, a request is sent to the load balancer (scheduler), which selects a worker based on the configured algorithm. When a worker receives a request, it runs the cloud function in a sandbox. OpenLambda currently supports Docker containers and lightweight SOCK containers [13]. The first time a function runs on a worker, the worker has to contact the code store to get the code of the function; the code is cached so that this step is not needed in future invocations.

Function scheduling in OpenLambda: The function scheduling—or mapping of functions to workers—is performed by the NGINX software load balancer. The request routing methods currently supported by NGINX are [14]:
• Round-robin: maps requests to servers in round-robin fashion.
• Least-connected: assigns a request to the server with the least number of active connections.
• IP-hash: uses a hash function to map all requests coming from the same IP address to the same server.
These methods distribute the load between the workers, but lack functionality to make intelligent decisions that seek to, for example, minimize data transfers between workers or with an external repository (e.g., a distributed file system or a repository of packages required by the cloud functions).

Caching to improve task launch times: Oakes et al. [3] proposed Pipsqueak, a shared import (package) cache available at each OpenLambda worker. Pipsqueak seeks to reduce the start-up time of cloud functions via supporting lean functions whose required packages are cached at the worker nodes. The cache maintains a set of Python interpreters with packages pre-imported, in a sleeping state. When a cloud function is assigned to a worker node, the worker checks if the required packages are cached.


Fig. 2. The OpenLambda architecture. The function scheduling (or the task of assigning functions to workers) is performed by the NGINX software load balancer.

To use a cached entry, Pipsqueak: (1) wakes up and forks the corresponding sleeping Python interpreter from the cache, (2) relocates its child process into the handler container, and (3) handles the request. If a cloud function requires two packages that are cached in different sleeping interpreters, then only one can be used, and the missing package must be loaded into the child of that container (created by step 2 above). To deal with multiple package dependencies, Pipsqueak uses a tree cache in which one entry can cache package A, another entry can cache package B, and a child of either of these entries can cache both A and B.
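To make the tree-cache idea concrete, the following Go sketch (our own illustration, not Pipsqueak's actual code, which manages sleeping Python interpreters) picks the cache entry that pre-imports the most required packages without pulling in any unneeded ones:

    // CacheNode is one sleeping interpreter in the import-cache tree;
    // Packages is the set it pre-imports (its ancestors' packages plus
    // its own), so children always cache supersets of their parents.
    type CacheNode struct {
        Packages map[string]bool
        Children []*CacheNode
    }

    // bestEntry returns the deepest entry that only pre-imports packages
    // the function actually needs, i.e., the largest usable subset of the
    // required set. The root is assumed to be an empty sentinel entry.
    func bestEntry(root *CacheNode, required map[string]bool) *CacheNode {
        best := root
        var walk func(n *CacheNode)
        walk = func(n *CacheNode) {
            for p := range n.Packages {
                if !required[p] {
                    return // entry imports a package we do not need; skip subtree
                }
            }
            if len(n.Packages) > len(best.Packages) {
                best = n
            }
            for _, c := range n.Children {
                walk(c)
            }
        }
        walk(root)
        return best
    }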
Having pre-initialized packages in sleeping containers speeds up function start-up time because it eliminates the following steps present in an unoptimized implementation: (1) downloading the package, (2) installing the package, and (3) importing the package. The last step also includes the time to initialize the module and its dependencies. Especially for cloud functions with large libraries, this process can be time consuming, as it can take 4.007s on average and as much as 12.8s for a large library like Pandas [4].

In addition to the Pipsqueak import (package) cache, OpenLambda workers have two other caches: an on-disk (package) install cache and a handler (function) cache.

B. Load Balancing and Scheduling

We build upon a large body of work in task scheduling and load balancing for server clusters. In this section, we discuss the most relevant prior work.

Affinity Scheduling: Early work in affinity scheduling sought to improve performance in multi-processor systems by reducing cache misses via preferential scheduling of a process on a CPU where it has recently run [15], [16]. However, the issue there is not how to map threads to CPUs, but how to re-schedule them soon enough to reap caching benefits, while avoiding unfairness and respecting thread priority.

Near-data scheduling: The near-data scheduling problem is a special case of affinity scheduling applicable to data-intensive computing frameworks like Hadoop8, where each type of task has different processing rates on different subsets of servers [17]. Tasks that process local data execute the fastest, followed by tasks that require data in the same rack, followed by tasks that read remote data. Several near-data schedulers have been proposed for Hadoop [18]–[20]. However, these are not directly applicable to the problem studied in this paper, as they require a centralized directory to keep track of the location of the data blocks (i.e., the namenode in Hadoop). In contrast, our proposed algorithm uses hash-based affinity mapping, a mechanism that requires no centralized directory and has minimal overhead. Implementing a directory of cached packages in a serverless computing platform would impose significant overhead on the infrastructure due to extra communication and storage requirements. Moreover, unlike data block storage in Hadoop, the contents of an import cache could change rapidly, as packages can enter and leave the cache frequently, leading to problems where the scheduler would assign tasks based on stale knowledge.

8 hadoop.apache.org

Task Scheduling in FaaS Platforms: In the FaaS domain, industry schedulers—like the generic NGINX and the custom function schedulers for OpenWhisk and Fission—are vanilla load balancers that do not seek to minimize data movement in the system. Other than earlier work from our own group [21], we are not aware of any publications in this emerging area. However, FaaS scheduling is expected to attract work in the near future, as it is an important serverless challenge [2].

Content-aware Request Routing for Web Servers: More closely related to our problem is the case of locality- or content-aware request distribution in Web-server farms [22]. In this context, the simplest solution is static partitioning of server files using URL hashing, to improve the cache hit rate; though this can lead to extreme server overload for skewed workloads [22].


Others have proposed algorithms that partition Web content to improve the cache hit rate using static tables or variants of hash-based affinity [22], [23] that are inadequate for the skewed and highly dynamic workloads of modern cloud systems. Moreover, some of these solutions assumed that the workloads are relatively stable. However, modern cloud workloads are more dynamic, making solutions that require offline workload analysis (e.g., see [24]) inadequate for this domain.

Nevertheless, some early ideas in Web request routing and traditional load balancing can be partially applied to the problem studied in this paper. Specifically, we leverage two classic techniques in our work: consistent hashing [5] and power of two choices load balancing [10]. We provide a brief description of these techniques next.

Consistent Hashing: Consistent hashing [5], [6] is an alternative to hash-based affinity that solves the problem of having a dynamic set of servers. This technique assigns a set of items (keys, client IPs, etc.) to servers so that each one receives roughly the same number of items. Unlike standard hashing schemes, a small change in the server set does not induce a total remapping of items to servers and instead yields only a slightly different assignment. However, this approach suffers from a problem of load imbalance, particularly under skewed workloads [7], [9], [10].

Power of two choices: The power of two choices [10] is a technique that achieves near-perfect load balance as follows: when a request arrives to the system, the balancer selects two random servers and assigns the request to the least loaded of those two servers. Our approach combines hash-based routing with the power of two choices, so that the two servers are not chosen at random, but are rather based on the affinity that a server has to a specific item (package). As this can lead to an unbalanced load, we add the concept of a maximum tolerable threshold, and thus avoid saturating the workers.
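As an illustration of the first building block, here is a minimal consistent hash ring in Go (a sketch under our own naming, not the Package consistent library used later in the paper): each worker is hashed to several points on a ring, and a package maps to the first worker clockwise from its hash, so adding or removing a worker only remaps nearby keys.

    package main

    import (
        "fmt"
        "hash/fnv"
        "sort"
    )

    // Ring is a minimal consistent hash ring. Each worker is hashed to
    // several virtual points so that keys spread evenly; removing a worker
    // only remaps the keys that pointed at its points.
    type Ring struct {
        points []uint32          // sorted hash points on the ring
        owner  map[uint32]string // point -> worker name
    }

    func hash32(s string) uint32 {
        h := fnv.New32a()
        h.Write([]byte(s))
        return h.Sum32()
    }

    func NewRing(workers []string, vnodes int) *Ring {
        r := &Ring{owner: make(map[uint32]string)}
        for _, w := range workers {
            for i := 0; i < vnodes; i++ {
                p := hash32(fmt.Sprintf("%s#%d", w, i))
                r.points = append(r.points, p)
                r.owner[p] = w
            }
        }
        sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
        return r
    }

    // Get returns the worker with affinity for the given key (package name).
    func (r *Ring) Get(key string) string {
        h := hash32(key)
        i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
        if i == len(r.points) { // wrap around the ring
            i = 0
        }
        return r.owner[r.points[i]]
    }

    func main() {
        ring := NewRing([]string{"w1", "w2", "w3"}, 64)
        fmt.Println(ring.Get("pandas")) // the same package always maps to the same worker
    }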
Recent related work: Others have recently looked into how to map items to servers while simultaneously balancing the load of the servers, even in the presence of skewed workloads and a dynamic server set [7]–[9].

Partial key grouping [7] and consistent grouping [8] are specific to streaming frameworks like Apache Storm. For this reason, they make the strong assumption that it is detrimental to the performance of the system to split keys between workers (strict affinity). Partial key grouping supports splitting keys only between two workers, while consistent grouping does not support key splitting at all. In contrast, as we deal with caching, we can implement soft affinity instead: if both of the affinity workers for a key (package) are overloaded, then the request can be sent to any other worker.

Consistent hashing with bounded loads [9] is framed as a generic hashing solution. Though similar to ours, the problem solved by this algorithm has stronger requirements: a deterministic algorithm for mapping keys to servers that achieves load balancing that is bounded within some ε. We relax both requirements as described in section III.

While these recent prior works could be used to solve the problem at hand, the slight differences in the problems that they try to solve (discussed in section III) yield solutions that are suboptimal (as shown in section V).

In a prior, work-in-progress publication [21], we proposed a set of package-aware scheduling algorithms for FaaS platforms, spanning different scheduler models: pull-based or push-based, and centralized or distributed. Our simulation-based evaluation suggested that this is a promising area of research, as our package-aware scheduling algorithm yielded improvements in latency of 66% (versus a least-loaded scheduler). In this paper, we propose PASch, a variant of this family of algorithms that is suitable for the OpenLambda scheduler model: a push-based, centralized scheduler with a processor-sharing service discipline (workers). We evaluate PASch with simulations and real experiments in AWS.

III. PROBLEM DEFINITION

We consider a Function-as-a-Service platform in which tasks are cloud functions that are executed on worker nodes, as a response to user-defined events.

A. Assumptions and Limitations

A worker node is capable of running multiple tasks simultaneously and can be, for example, a virtual machine managed by a container orchestration engine such as Kubernetes. In addition to worker nodes, there is a scheduler which assigns or maps tasks to workers. The workers use a processor-sharing service discipline in which tasks share the worker's local resources. The number of workers is dynamic; this is common in cloud platforms with auto-scalers or elasticity managers that add or remove nodes to the cluster as a response to changes in demand.

We assume there is a package or library caching mechanism implemented at the worker level, as described in section II-A. When a function depends on more than one package, we seek to achieve scheduling (soft) affinity for the largest one, as this is the package that is most useful to accelerate its loading time.

We assume package popularity in cloud functions is skewed, with a long tail of unpopular packages. To confirm our intuition, we performed an analysis of the GitHub repositories of projects that could potentially contain AWS Lambda functions. Specifically, we found Python files that contain the substring amazonaws and Java files that contain com.amazonaws.services.lambda. We analyzed the import statements of those files and found that the popularity distribution of the required packages is indeed skewed (see Table I).

Finally, the scheduler is agnostic of the contents of the worker caches. We believe that having the scheduler keep track of the contents of the import caches is not adequate in the FaaS setting, as this would impose additional overhead on the system (network communications to keep information updated, and resources to store and manage the caching directory).


TABLE I
Top ten required packages by Python and Java files in GitHub. We analyzed only those files that reference amazonaws (Python) or com.amazonaws.services.lambda (Java).

Python Package      %       Java Package            %
boto              27.0%     com.amazonaws         42.0%
ansible            6.3%     java.util             11.1%
os                 3.7%     java.io                9.3%
tests              3.3%     com.fasterxml          4.4%
time               2.4%     org.apache             3.3%
json               2.4%     org.junit              3.2%
xmodule            2.0%     javax.annotation       2.9%
django             1.9%     org.mockito            2.2%
sys                1.9%     com.visionarts         1.5%
datetime           1.8%     com.google             0.1%
All others        47.3%     All others            20.0%

B. System Model

Given an instance of a FaaS platform, let W be the set of workers in which tasks can run, where |W| = n. Each worker w ∈ W runs on a machine with limited capacity and known execution threshold tw. For simplicity, we assume that there is a single important resource on which machines are constrained, such as memory or processing. Each worker node w ∈ W can execute an unbounded number of concurrent tasks; however, the tw threshold (expressed in normalized units of the constrained resource) is the maximum number of resource units that can be in use at w without reducing the performance of the tasks at w.

The input to the scheduler is a sequence of task execution requests r = ⟨i, f, p, ti⟩, where i identifies the request, f is the function to execute, p ∈ P is the largest package required by the task, and ti is the timestamp at which the request arrives to the scheduler. The requests arrive in ascending order by ti. Upon receiving a request r, the scheduler makes a placement decision and chooses one of the workers in W to execute r.

As tasks arrive to the system, the scheduler maps and forwards the request to one of the workers. After the task is processed, it leaves the system. We define the task turnaround time or latency to be the difference between the time at which a task arrives to the system (ti) and when it leaves the system.

Given the system model and assumptions described above, we define the problem we are trying to solve as follows.

Problem: Given a sequence of task execution requests, each of which requires a package drawn from a skewed distribution P, and a set of workers w ∈ W, each with limited capacity and known execution threshold tw, find a scheduling function S : P → W that maximizes affinity to the import cache at each worker while avoiding exceeding the workers' execution thresholds.

Additional goal, stability: We also consider a stability goal that seeks to contain or minimize any changes in package-to-worker affinity when the set of workers is dynamic.
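In code, the model above reduces to a few declarations; a Go sketch with our own (hypothetical) names, where the paper's scheduling function S corresponds to Place:

    import "time"

    // Request is the task execution request r = <i, f, p, ti> from the
    // system model: a request id, the function to execute, the largest
    // package it requires (drawn from the skewed distribution P), and
    // its arrival timestamp.
    type Request struct {
        ID      string
        Fn      string
        Package string
        Arrival time.Time
    }

    // Scheduler places each request on one of the n workers, seeking
    // import-cache affinity while avoiding exceeding any worker's
    // execution threshold t_w.
    type Scheduler interface {
        Place(r Request) (worker int)
    }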
C. Naive Solutions

A naive solution would be to try to either balance the load or to make placement decisions seeking strict affinity from packages to workers. However, these solutions do not consider both goals together and make suboptimal placement decisions, as briefly explained next. For more detail on these and other related work, see section II.

To balance the load, round-robin or random assignment is not optimal, as the resource consumption of tasks may vary [25]. Better alternatives are Join-the-Shortest-Queue (JSQ) [26] and Join-Idle-Queue (JIQ) [25]; however, these are not applicable when the workers use a processor-sharing approach, as is the case in OpenLambda and Fission. When the workers use processor sharing, the least loaded policy balances the load, for example, as implemented by NGINX (least-connected policy). However, this policy does not do anything to maximize cache affinity.

To maximize cache affinity with an added goal of stability in the presence of a dynamic set of workers, we can use consistent hashing [5], [6] to assign all tasks that require a particular package to the same worker. However, as package popularity is not uniformly distributed, this approach would create hot spots, overloading workers that cache popular packages [7], [8].

D. Recent Applicable Solutions

As discussed in section II, three recent algorithms deal with problems similar to ours [7]–[9]. While we could adapt these for the FaaS scheduling problem, slight differences in the problems they solve make them suboptimal alternatives.

First, let's consider the algorithms proposed by Nasir et al. [7], [8]. These algorithms have been designed to solve the problem of balancing the load of workers (processing element instances) in a distributed stream processing engine like Apache Storm. These works make the strong assumption that splitting requests for a specific key between multiple workers is not desirable. In our case, it is acceptable to send tasks that require a package p to different workers; doing so may not be optimal in terms of performance, but it does not affect the correctness of the system.

The problem tackled by Mirrokni et al. [9] is more similar to ours in that requests (balls) to a key k can be sent to different workers (bins) in the cases where the first target worker is overloaded—under a definition of overload that means exceeding the average load by more than some ε. As they frame the problem as a classic hashing problem, their algorithm ensures that the assignment of balls to bins is deterministic (through the use of a linear probing technique). Our problem definition relaxes both assumptions. First, we tolerate high imbalance as long as individual workers are not overloaded. Second, if the affinity workers for an item are overloaded, we make no attempt to deterministically decide a fallback target, and revert to plain load balancing instead (by choosing the least loaded worker). Nevertheless, consistent hashing with bounded loads could potentially solve the problem outlined in this section. For this reason, we include this algorithm in our evaluations and empirically show that PASch yields better performance for our particular setting.

IV. PROPOSED SOLUTION


We describe a solution to the problem formalized in section III. Given a set of task execution requests and a set of worker nodes on which these can run, the goal is to implement a lightweight scheduler that maps requests to workers, seeking to minimize the overall task turnaround time while actively avoiding worker overload.

We propose PASch, a novel scheduling algorithm that leverages the power of 2 choices technique [10], consistent hashing [5], [6], and least loaded scheduling, adding the notion of a maximum per-worker load threshold. In this way, the restrictions of the FaaS scheduling problem are met.

PASch uses a greedy approach in which a mapping function M : P → W maps each request to two affinity workers for the largest package required by the task9. The least loaded of these workers will be the one to process the task, unless the load of the worker exceeds its execution threshold tw; this provides affinity while avoiding worker overload. In addition, if the worker exceeds its threshold, then the scheduler reverts to a least loaded scheduler, and forwards the request to the worker with the (normalized) least load, for some specific resource unit. Algorithm 1 shows the details of the proposed procedure.

9 For simplicity, we do not deal with collisions of the affinity workers; in case of collision, there is only one affinity worker for a function.
Algorithm 1: Package-aware scheduler algorithm (PASch)
  Global data: List of workers, W = {w1, ..., wn}, and their load thresholds, T = {t1, ..., tn}; mapping function M
  Input: Function, f; largest required package, p
 1 if (p is not null) then
       /* Get affinity workers */
 2     a1, a2 = M(p)
       /* Select target with least load */
 3     if (load(wa1) < load(wa2)) then
 4         A := a1
 5     else
 6         A := a2
       /* If target is not overloaded, we are done */
 7     if (load(wA) < tA) then
 8         Assign f to wA
 9         return
   /* Balance load */
10 Assign f to least loaded worker, wi
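The same logic in Go, as a minimal sketch of Algorithm 1 (the Worker struct and the mapping callback are our simplified stand-ins for olscheduler's internal structures):

    // Worker holds the load state the scheduler needs: the current number
    // of active requests and the overload threshold t_w.
    type Worker struct {
        Name      string
        Load      int // active requests
        Threshold int // t_w
    }

    // Schedule implements Algorithm 1: pick the least loaded of the two
    // affinity workers for package pkg; if that worker is overloaded,
    // fall back to the globally least loaded worker. An empty pkg means
    // the function requires no cacheable package.
    func Schedule(workers []*Worker, mapFn func(pkg string) (*Worker, *Worker), pkg string) *Worker {
        if pkg != "" {
            a1, a2 := mapFn(pkg) // two affinity workers for pkg
            target := a1
            if a2.Load < a1.Load {
                target = a2
            }
            if target.Load < target.Threshold { // not overloaded: done
                return target
            }
        }
        // Balance load: revert to the least loaded worker (linear search,
        // as in NGINX's least-connected policy).
        least := workers[0]
        for _, w := range workers[1:] {
            if w.Load < least.Load {
                least = w
            }
        }
        return least
    }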
rent implementation uses the number of active requests to
measure the per worker load and the threshold is defined
To provide stability, the mapping function M : P → W uses as a maximum number of active requests; this is a common
consistent hashing so that we minimize the expected number approach used by load balancers. However, the algorithm
of movements in case worker nodes are added or removed by works for other types of resource limits.
an auto-scaler or elasticity manager (see Algorithm 2). The To facilitate the implementation of future schedulers,
use of consistent hashing ensures stability of the package-to- olscheduler exposes the following information:
worker affinity mappings, even in the presence of a dynamic 1) Workers: Number and references to the workers.
set of workers. We add a salt to the package id, p, to map it 2) Per-worker load: Measured as the number of active
to a second affinity worker using the hashing function. requests that each worker is currently handling.
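Algorithm 2 is a few lines of Go on top of any consistent hash ring with a Get(key) lookup, such as the Ring sketch in section II-B; the salt value below is arbitrary:

    // affinityWorkers implements Algorithm 2: the first affinity worker
    // owns the package id on the consistent hash ring; the second owns
    // the package id with a fixed salt appended.
    const salt = "#2" // any fixed suffix works

    func affinityWorkers(ring *Ring, pkg string) (a1, a2 string) {
        a1 = ring.Get(pkg)
        a2 = ring.Get(pkg + salt)
        return a1, a2
    }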
3) Required packages: Each cloud function request includes
A. Analysis
the list of required packages, sorted by size.
The performance of the scheduling decisions is dominated 4) Function schemas: Obtained by querying the code repos-
by how efficient it is to find the loaded worker. This operation itory and cached in memory.
9 For simplicity, we do not deal with collisions of the affinity workers; in 10 To seek early feedback, we presented olscheduler at the Min-Move
case of collision, there is only one affinity worker for a function. workshop @ ASPLOS 2018, which has non-archival proceedings.
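Item 2 boils down to a counter incremented when a request is forwarded and decremented when its response returns; a minimal sketch (ours, not olscheduler's actual code):

    import "sync/atomic"

    // activeRequests tracks per-worker load as the number of in-flight
    // requests, the measure compared against the threshold.
    type activeRequests struct {
        counts []int64 // one counter per worker
    }

    func (a *activeRequests) begin(w int) { atomic.AddInt64(&a.counts[w], 1) }
    func (a *activeRequests) end(w int)   { atomic.AddInt64(&a.counts[w], -1) }
    func (a *activeRequests) load(w int) int64 {
        return atomic.LoadInt64(&a.counts[w])
    }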


To get the list of required packages (item 3 above), we extended the HTTP POST request (cloud function request in Figure 2) so that it supports receiving the list as an annotation on the function call. An alternative design would have been to query the code store to get this information; we rejected this idea to avoid an extra step on the critical path of the requests.
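For illustration, receiving such an annotation could look as follows in Go; the header name and comma-separated encoding here are hypothetical, since the paper does not specify the wire format:

    import (
        "net/http"
        "strings"
    )

    // requiredPackages extracts the function's package list from the
    // cloud function request. "X-Required-Pkgs" is a hypothetical header;
    // the list is expected sorted by size, largest package first.
    func requiredPackages(r *http.Request) []string {
        raw := r.Header.Get("X-Required-Pkgs") // e.g. "pandas,numpy,six"
        if raw == "" {
            return nil
        }
        return strings.Split(raw, ",")
    }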
olscheduler currently supports the following scheduling algorithms:
• Round robin: Distributes the requests uniformly between the workers, in round-robin fashion.
• Least loaded: Assigns a request to the worker that has the least number of active connections (akin to NGINX's least-connected).
• Random: Distributes requests randomly between the workers, according to user-defined worker weights.
• PASch: The algorithm proposed in this paper.
• Consistent hashing with bounded loads [9]: A recent variant of consistent hashing that provides a constant bound on the load of the maximum loaded worker.

To find the least loaded worker we use a linear search. For the implementation of consistent hashing with bounded loads, we use the Package consistent implementation [28], which has no configurable parameters and uses a balancing parameter of c = 1.25; this algorithm guarantees that the maximum load is at most ⌈cm/n⌉, where m is the number of clients (current requests) and n is the number of servers (workers).

The implementation of PASch on olscheduler took only 52 lines of code.

V. EXPERIMENTAL EVALUATION

We assess the performance of PASch using simulations and a real deployment in a public cloud. The simulations run large-scale experiments with 1,000 workers. We also run real experiments on a public cloud, using a small deployment with 5 workers. Together, these experiments enable us to obtain a broad understanding of the benefits and costs of using PASch. Our experiments seek to answer the following questions:
Q1: How effective is PASch in increasing the import cache hit rate?
Q2: What is the cost (in load imbalance) of using PASch?
Q3: How much does PASch speed up individual tasks?
Q4: Is PASch successful in reducing median and tail task turnaround times?
Q5: How sensitive is PASch to the threshold parameter?

A. Experimental setup

We implemented a simulator in Python, using the SimPy simulation framework11, and ran tests with the following configuration parameters:
• Arrivals are exponentially distributed, with a mean inter-arrival time of 0.1ms.
• The number of worker nodes is 1,000; each worker can run 100 tasks simultaneously (st = 100).
• The popularity of the packages follows a Zipf distribution, with parameter s = 1.1.
• The time to start the packages—time to download, install and import—is randomly sampled from an exponential distribution with an average time to start of 4.007s.
• Each function requires a random number of packages, sampled according to an exponential distribution, with an average of 3 required packages.
• Each worker has a 500MB LRU import cache.
• The sizes of the (cacheable) packages are modeled after the sizes of the packages in the PyPi repository.
• Time to launch a function that requires no packages: 1s.
• The running time of a task (after loading required packages) is exponentially distributed with mean = 100ms.
• Experiment duration: 30 minutes.
• Overload threshold of all nodes: t = st = 100.

11 https://pythonhosted.org/SimPy/

While the configuration described above represents an artificial scenario, the values model real observed behavior, as reported by prior work [3], [11], [13], [21], [30]. To isolate the effect of PASch on the import cache, the simulations do not use the install or handler caches. The simulator supports three scheduling algorithms: (1) least loaded, which optimizes for load balancing; (2) hash affinity, which optimizes for cache affinity; and (3) PASch, which combines both goals as described in this paper.

We also ran real experiments on AWS EC2. We use six virtual machines (VMs) with the 4.4.0-1072 Linux kernel. On one VM we run olscheduler and PipBench; on the other five, we launch five OpenLambda workers. We used m4.xlarge instances, with 4 vCPUs, 16 GB of RAM, and EBS storage. The experiments used OpenLambda's lightweight SOCK [13] containers. In the real experiments we used olscheduler which, as described in section IV-B, supports five scheduling algorithms: three algorithms that are common in software load balancers like NGINX (round robin, least loaded, and random) and two algorithms that consider scheduling affinity in addition to load management (PASch and consistent hashing with bounded loads). For PASch, we configure the threshold parameter to 80. The consistent hashing with bounded loads implementation we use (Package consistent) has no configurable parameters and uses a balancing parameter of c = 1.25, as recommended by Mirrokni et al. and an industry blog [9], [31].

All three caches were active during the experiments: the on-disk install cache, a 6GB import (package) cache, and a 1GB handler cache. While PASch seeks to improve the import cache hit rate, the hit rates of the other caches also increase as a result of better package affinity. The install cache also deals with packages, so affinity to the import cache also yields affinity to the install cache. The handler on-demand cache stores function handlers at each worker; affinity to the largest package required by a function means subsequent calls to the same function also end up going to the same worker.


For a study of what percentage of the performance improvement comes from using each of the caches, we refer the reader to the work of Oakes et al. [3], [13], who studied the performance of the caches on OpenLambda using vanilla NGINX (least-connected).

We use the PipBench benchmark [3] to issue requests to function handlers which import packages from a repository populated with packages generated to emulate the directory structure, file sizes, dependencies, and popularity of real packages hosted on the web. We used the default configuration of PipBench.

B. Experimental results

Q1: Effect on import cache hit rate: The main premise of our solution is that increasing package affinity during scheduling decisions can improve performance as a direct result of an increased import cache hit rate, even at the cost of a manageable node imbalance. To confirm that we can improve the cache hit rate, we ran simulations to assess the impact that PASch has on the import cache hit rate. The results show that the median hit rate increases from 51.2% with the vanilla least loaded scheduler to 64.1% with PASch (see Figure 3). Note that the median hit rate with PASch is slightly higher than with the hash affinity scheduler; this is due to the finite size of the cache. Without a per-worker request limit, the resulting working set may not fit in the cache.

Fig. 3. Box plots of the hit rates of the import caches at the workers. PASch improves the average hit rate by actively seeking to improve package affinity.

Q2: Effect on load balance: We can quantify how well each scheduling algorithm balances the load using the coefficient of variation, a measure of dispersion defined as the ratio of the standard deviation to the mean: cv = σ/μ. We ran simulations, counted the total number of tasks assigned to each worker, and show the coefficient of variation for every 1-second timeslot in the experiment in Figure 4. We can observe that PASch sacrifices some node balance to seek a higher hit rate (and a smaller task turnaround time). Least loaded achieves near-perfect balancing, while the hash-based affinity algorithm produces the most unbalanced task assignments. We show next that the use of the threshold parameter in PASch is able to effectively contain the imbalance to a level manageable by the worker nodes and, as a result, the task turnaround times improve, even in the presence of (moderately) unbalanced worker loads.

Fig. 4. Node imbalance: Coefficient of variation of the number of tasks assigned to each worker during each 1-second timeslot (smaller is better).
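The cv metric used above is simple to compute; a small Go helper (our own code, not the simulator's) over the per-worker task counts of one timeslot:

    import "math"

    // coeffVar returns the coefficient of variation c_v = sigma/mu of the
    // number of tasks assigned to each worker in one 1-second timeslot.
    func coeffVar(tasksPerWorker []int) float64 {
        n := float64(len(tasksPerWorker))
        var sum float64
        for _, t := range tasksPerWorker {
            sum += float64(t)
        }
        mean := sum / n
        var ss float64
        for _, t := range tasksPerWorker {
            d := float64(t) - mean
            ss += d * d
        }
        return math.Sqrt(ss/n) / mean // population standard deviation over mean
    }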
Q3: Effect on individual tasks: We ran real experiments to observe the effect of PASch on individual tasks. To do this, we first calculate a performance baseline for each task by executing it on an empty cluster, so that the task can finish as fast as possible, with no restrictions on resources at the worker, but without the benefit of caching. After obtaining the performance baseline for each task, we used PipBench to assess whether each task runs slower or faster than its baseline. For each task, we calculate the speedup. A task's speedup may be greater than one if, for example, the task executes faster through the benefit of better cache affinity. On the other hand, a task's speedup may be less than one if it runs on a node where other tasks are running and the concurrent tasks are competing for the use of the CPU. Figure 5 shows that PASch is able to improve the performance of most tasks through cache affinity, at the penalty of some tasks performing slower due to imperfect load balancing. Specifically, only 3.3% of the tasks have a speedup of less than one, while the median speedup is 3841. These speedups represent the maximum achievable speedups for the tasks when compared to an unoptimized baseline (no caching), but without incurring performance penalties due to worker overload.

Fig. 5. CDF of per-task speedups with PASch versus each task running on an empty cluster, with all resources available but no pre-cached packages.


Q4: Effect on median and tail task turnaround time: Having confirmed that PASch is able to improve the hit rate of the import caches of the worker nodes, and that the worker imbalance is not so excessive as to decrease task performance, we assess the effect of PASch on task turnaround time (latency) when compared to other scheduling algorithms. Towards this end, we ran real experiments with PipBench. Figure 6 shows the cumulative distribution functions (CDFs) of the task turnaround times when using different scheduling algorithms. On average, PASch achieves a 1.29x speedup compared to the least loaded scheduler. However, if you analyze the CDFs, you will notice that the improvement is most noticeable in the [61–87] percentile range. For example, the 80th percentile in task turnaround time has a 23x speedup. Consistent hashing with bounded loads performs better than least loaded, but PASch still outperforms it for the workload studied in this paper: PASch achieves a 1.3x average speedup, and speedups of 1.5x, 2.7x and 1.04x at the 80th, 90th and 99th percentiles, respectively.

Fig. 6. CDFs of the task turnaround times when using different scheduling algorithms (least loaded, random, round robin, consistent hashing with bounded loads, and PASch). PASch outperforms all other scheduling algorithms we evaluated.

Q5: Sensitivity to threshold parameter: PASch has one configurable parameter: the threshold value. Our model assumes that a worker has limited capacity and a known execution threshold t. To assess the sensitivity to this parameter, we run real experiments and evaluate the resulting task turnaround time for varying values of t. Figure 7 shows that PASch is not particularly sensitive to the threshold parameter.

Fig. 7. Effect of threshold on 50th, 75th, 80th and 99th percentile latency.

VI. CONCLUSIONS

The main reason for load balancing is to improve performance, as tasks assigned to overloaded workers are bound to be delayed in their completion. However, the moderate imbalance of our proposed algorithm is not an issue, as our experiments show that we actually improve performance: tasks that run on workers that have preloaded a required package take less time to finish; by improving the cache hit rate, we improve overall system performance.

Two key insights from our work are: (1) near-data scheduling techniques for FaaS platforms can yield significant performance improvements, yet all the open source FaaS platforms that we have looked into use vanilla load balancers; and (2) when redirecting requests to workers, balancing the load of workers may not be a necessary goal: our experiments showed that relaxing this requirement and changing it to "avoid worker overload" gives us more flexibility in mapping decisions that can lead to considerable performance gains.

In the future, we want to study the performance of PASch under different and dynamic workloads. We also plan to experiment with varying the balancing parameter of the consistent hashing with bounded loads algorithm, to better understand when one algorithm outperforms the other. Finally, PASch could work for other domains too, like redirecting web requests to a server in a server farm with the goal of increasing cache affinity; we want to explore these other domains in future work.

We have released the code of our scheduler on GitHub12.

12 https://github.com/disel-espol/olscheduler

ACKNOWLEDGMENTS

This work has been partially funded by a Google Faculty Research Award and AWS Cloud Credits for Research. We thank Erwin van Eyk, who helped shape some of our initial ideas on package-aware FaaS scheduling [21]. We thank the members of the serverless working group within the SPEC Cloud RG, who have helped establish a research agenda for interesting problems in the FaaS domain [2]. We thank Rafael Rivadeneira, who performed the analysis of popularity of packages on GitHub (section III-A). Finally, we thank Ed Oakes, Tyler Harter and Kevin Houck from the OpenLambda team, who helped us better understand and run OpenLambda.

REFERENCES

[1] E. van Eyk, A. Iosup, S. Seif, and M. Thömmes, "The SPEC cloud group's research vision on FaaS and serverless architectures," in Intl. Workshop on Serverless Comp. (WoSC), 2017.
[2] E. van Eyk, A. Iosup, C. Abad, J. Grohmann, and S. Eismann, "A SPEC RG cloud group's vision on the performance challenges of FaaS cloud architectures," in ACM/SPEC Intl. Conf. Perf. Eng. (ICPE), 2018.
[3] E. Oakes, L. Yang, K. Houck, T. Harter, A. Arpaci-Dusseau, and R. Arpaci-Dusseau, "Pipsqueak: Lean Lambdas with large libraries," in IEEE Intl. Conf. Distrib. Comp. Sys. Workshops (ICDCSW), 2017.
[4] ——, "Pipsqueak: Lean Lambdas with large libraries," 2017, (Presentation at) Intl. Workshop on Serverless Comp. (WoSC).
[5] D. Karger, A. Sherman, A. Berkheimer, B. Bogstad, R. Dhanidina, K. Iwamoto, B. Kim, L. Matkins, and Y. Yerushalmi, "Web caching with consistent hashing," Comp. Netw., vol. 31, no. 11, 1999.
not particularly sensitive to the threshold parameter.


[6] I. Stoica, R. Morris, D. Karger, M. Kaashoek, and H. Balakrishnan, "Chord: A scalable peer-to-peer lookup service for Internet applications," SIGCOMM Comput. Commun. Rev., vol. 31, no. 4, Aug. 2001.
[7] M. Nasir, G. Morales, D. Garcia-Soriano, N. Kourtellis, and M. Serafini, "The power of both choices: Practical load balancing for distributed stream processing engines," in IEEE Intl. Conf. Data Eng. (ICDE), 2015.
[8] M. Nasir, H. Horii, M. Serafini, N. Kourtellis, R. Raymond, S. Girdzijauskas, and T. Osogami, "Load balancing for skewed streams on heterogeneous cluster," arXiv preprint arXiv:1705.09073, 2017.
[9] V. Mirrokni, M. Thorup, and M. Zadimoghaddam, "Consistent hashing with bounded loads," in ACM-SIAM Symp. Discrete Algo. (SODA), 2018.
[10] M. Mitzenmacher, "The power of two choices in randomized load balancing," IEEE Trans. Par. Distrib. Sys., vol. 12, no. 10, 2001.
[11] S. Hendrickson, S. Sturdevant, T. Harter, V. Venkataramani, A. Arpaci-Dusseau, and R. Arpaci-Dusseau, "Serverless computation with OpenLambda," in Work. Hot Topics in Cloud Comp. (HotCloud), 2016.
[12] S. Hendrickson, S. Sturdevant, E. Oakes, T. Harter, V. Venkataramani, A. Arpaci-Dusseau, and R. Arpaci-Dusseau, "Serverless computation with OpenLambda," Usenix ;login:, vol. 41, 2016.
[13] E. Oakes, L. Yang, D. Zhou, K. Houck, T. Harter, A. Arpaci-Dusseau, and R. Arpaci-Dusseau, "SOCK: Rapid task provisioning with serverless-optimized containers," in USENIX ATC, 2018.
[14] "Using NGINX as HTTP load balancer," available at: http://nginx.org/en/docs/http/load_balancing.html. Last accessed: Dec. 13, 2018.
[15] J. Torrellas, A. Tucker, and A. Gupta, "Evaluating the performance of cache-affinity scheduling in shared-memory multiprocessors," Jrnl. Par. Distr. Comp. (JPDC), vol. 24, no. 2, 1995.
[16] D. Feitelson, "Job scheduling in multiprogrammed parallel systems," IBM Thomas J. Watson Research Center, Tech. Rep., 1997, IBM Research Report 19790.
[17] Q. Xie and Y. Lu, "Priority algorithm for near-data scheduling: Throughput and heavy-traffic optimality," in IEEE Conf. Comp. Comm. (INFOCOM), 2015.
[18] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling," in European Conf. Comp. Sys. (EuroSys), 2010.
[19] W. Wang, K. Zhu, L. Ying, J. Tan, and L. Zhang, "MapTask scheduling in MapReduce with data locality: Throughput and heavy-traffic optimality," IEEE/ACM Trans. Netw., vol. 24, no. 1, 2016.
[20] Q. Xie, M. Pundir, Y. Lu, C. Abad, and R. Campbell, "Pandas: Robust locality-aware scheduling with stochastic delay optimality," IEEE/ACM Trans. Netw., vol. 25, no. 2, 2017.
[21] C. Abad, E. Boza, and E. van Eyk, "Package-aware scheduling of FaaS functions," in HotCloudPerf workshop, co-located with ACM/SPEC Intl. Conf. Perf. Eng. (ICPE), 2018.
[22] V. Cardellini, E. Casalicchio, M. Colajanni, and P. Yu, "The state of the art in locally distributed Web-server systems," ACM Comput. Surv., vol. 34, no. 2, 2002.
[23] V. Pai, M. Aron, G. Banga, M. Svendsen, P. Druschel, W. Zwaenepoel, and E. Nahum, "Locality-aware request distribution in cluster-based network servers," SIGOPS Oper. Syst. Rev., vol. 32, no. 5, 1998.
[24] L. Cherkasova and S. Ponnekanti, "Optimizing a content-aware load balancing strategy for shared web hosting service," in Intl. Symp. Model., Anal. and Sim. of Comp. and Telecomm. Sys. (MASCOTS), 2000.
[25] Y. Lu, Q. Xie, G. Kliot, A. Geller, J. Larus, and A. Greenberg, "Join-Idle-Queue: A novel load balancing algorithm for dynamically scalable Web services," Perform. Eval., vol. 68, no. 11, 2011.
[26] V. Gupta, M. Harchol-Balter, K. Sigman, and W. Whitt, "Analysis of Join-the-Shortest-Queue routing for Web server farms," Perform. Eval., vol. 64, 2007.
[27] M. Fredman and R. Tarjan, "Fibonacci heaps and their uses in improved network optimization algorithms," Journal of the ACM (JACM), vol. 34, no. 3, Jul. 1987.
[28] K. Lafi, "Package consistent: A Golang implementation of consistent hashing and consistent hashing with bounded loads," available at: https://github.com/lafikl/consistent. Last accessed: Nov. 1, 2018.
[29] J. Aumasson, S. Neves, Z. Wilcox-O'Hearn, and C. Winnerlein, "BLAKE2: Simpler, smaller, fast as MD5," in Intl. Conf. Appl. Crypto. and Netw. Sec. (ACNS), 2013.
[30] W. Lloyd, S. Ramesh, S. Chinthalapati, L. Ly, and S. Pallickara, "Serverless computing: An investigation of factors influencing microservice performance," in IEEE Intl. Conf. Cloud Eng. (IC2E), 2018.
[31] A. Roland, "Improving load balancing with a new consistent-hashing algorithm," Dec. 2016, Vimeo Engineering Blog, Medium. Available at: https://link.medium.com/EaD2ZaHAAS. Last accessed: Dec. 11, 2018.


