
Elastic Cloud Services: Scaling Snowflake's Control Plane

Themis Melissaris, Kunal Nabar, Rares Radut, Samir Rehmtulla, Arthur Shi, Samartha Chandrashekar (Snowflake Inc.)
Ioannis Papapanagiotou (Gemini Trust)

ABSTRACT
Snowflake's "Data Cloud", provided as Software-as-a-Service (SaaS), enables data storage, processing, and analytic solutions in a performant, easy-to-use, and flexible manner. Although cloud service providers provide the foundational infrastructure to run and scale a variety of workloads, operating Snowflake on cloud infrastructure presents interesting challenges. Customers expect Snowflake to be available at all times and to run their workloads with high performance. Behind the scenes, the software that runs customer workloads needs to be serviced and managed. Additionally, failures in individual components such as Virtual Machines (VMs) need to be handled without disrupting running workloads. As a result, lifecycle management of compute artifacts, their scheduling and placement, software rollout (and rollback), replication, failure detection, automatic scaling, and load balancing become extremely important.
In this paper, we describe the design and operation of Snowflake's Elastic Cloud Services (ECS) layer, which manages cloud resources at global scale to meet the needs of the Snowflake Data Cloud. It provides the control plane to enable elasticity, availability, fault tolerance, and efficient execution of customer workloads. ECS runs on multiple cloud service providers and provides capabilities such as cluster management, safe code rollout and rollback, management of pre-started pools of running VMs, horizontal and vertical autoscaling, throttling of incoming requests, VM placement, load-balancing across availability zones, and cross-cloud and cross-region replication. We showcase the effect of these capabilities through empirical results on systems that execute millions of queries over petabytes of data on a daily basis.

ACM Reference Format:
Themis Melissaris, Kunal Nabar, Rares Radut, Samir Rehmtulla, Arthur Shi, Samartha Chandrashekar, and Ioannis Papapanagiotou. 2022. Elastic Cloud Services: Scaling Snowflake's Control Plane. In SoCC '22: ACM Symposium on Cloud Computing (SoCC '22), November 7-11, 2022, San Francisco, CA, USA. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3542929.3563483

1 INTRODUCTION
Snowflake's Data Cloud is a SaaS platform that supports multiple data workloads such as data warehousing, data lakes, data science, data engineering, and others. This paper focuses on Snowflake's control plane, called Elastic Cloud Services (ECS), which enables elasticity, availability, and performant operation of customer workloads at scale. We present key contributions of Snowflake's ECS to individual functions of control planes such as VM placement, cluster management, safe software rollout and rollback, management of pools of running VMs, autoscaling, throttling, and load-balancing. To the best of our knowledge, this is the only work that presents the perspective of realizing and operating a production-ready control plane.

Snowflake's architecture consists of (1) Database Storage, (2) Query Processing, and (3) Elastic Cloud Services (ECS) layers, summarized in Figure 1. The architecture of Snowflake relies on decoupling compute from persistent storage to ensure that each can be scaled independently of the other [10]. Data ingested into Snowflake is organized into an optimized, compressed, columnar format and stored in a cloud's object storage (e.g., S3 on AWS). Snowflake organizes, structures, and manages data and metadata, making them accessible to users via SQL queries. The query processing layer serves as the "data plane" made of "virtual warehouses". Massively Parallel Processing (MPP) compute clusters composed of multiple VMs provisioned by Snowflake handle the execution of customer jobs. These "virtual warehouses" come in T-shirt sizes (S, M, L, etc.) corresponding to the amount of compute power in the warehouse and consist of VMs with software stacks managed by Snowflake to run customers' jobs. Each virtual warehouse is an independent compute cluster that does not share compute resources with other virtual warehouses.

Figure 1: Presentation of the Snowflake architecture. The Elastic Cloud Services (ECS) layer is a collection of services that coordinate activities across Snowflake. ECS is responsible for cluster management, Availability Zone (AZ) balancing, autoscaling, and throttling. The Query Processing layer represents Snowflake's execution engine and the Database Storage layer orchestrates data management.

ECS is the "control plane" that provides the abstractions and connective tissue across cloud service providers to schedule and place customer jobs on the data plane as well as ensure elasticity and availability. ECS operates a fleet of VMs that help manage Snowflake's infrastructure and are managed independently of the data plane VMs that are responsible for running customer workloads. The data plane and the control plane interact with each other to avoid system overload. ECS includes a collection of services responsible for managing and orchestrating Snowflake components. It runs on compute VMs provisioned from cloud service providers, provides functionalities such as infrastructure management, metadata management, and query parsing, and ties together different components of the Snowflake Data Cloud to manage and process user requests. ECS VMs are organized in clusters, identified by the code version the VMs run on, the type of service they provide, and the customer accounts that the clusters serve. The ECS layer also supports multi-tenancy, the ability of an ECS cluster to provide a given control plane service to multiple customer accounts.
Leveraging cloud service providers to provide Software-as-a-Service at scale has many benefits but also poses a number of interesting challenges [19]:
Control plane that runs on multiple cloud providers: Snowflake operates in multiple regions of cloud service providers such as Amazon Web Services, Microsoft Azure, and Google Cloud Platform. By providing a consistent customer experience irrespective of region and underlying cloud provider, Snowflake enables customers to avoid the risks associated with using a single cloud service provider. To achieve this, behind the scenes Snowflake deploys ECS on Virtual Machines (VMs) as the atomic unit of compute, since they provide the lowest common denominator of functionality across cloud providers.
Automatic and safe software management: As software complexity and size rise, it is becoming increasingly challenging to ensure that a specific software release does not have unforeseen effects in production environments. At the same time, the system needs to ensure that software releases can happen online continuously without downtime, while also ensuring that the software release is transparent to Snowflake customers. While we apply a wide range of testing [4] and continuous integration and delivery methodologies [18] to minimize erroneous behaviors, it is impossible to test all permutations of possibilities at scale. It is therefore critical to manage code deployments automatically by rolling out and rolling back changes safely, without downtime to customer workloads. Having the ability to instantly roll back ensures continuity of service for customers and the ability to swiftly revert from production incidents. An additional desired property is to automatically update or revert code versions on different computing entities, such as VMs, independently, as workloads might be heterogeneous.
Managing resource lifecycle: It is certain that not all VMs under management will be healthy and functional at scale. To ensure correct operation, the system control plane manages the lifecycle of cloud services across a fleet of computing nodes and monitors individual nodes' health. The system needs to make decisions for every VM state change, for example, when it is necessary to terminate unhealthy VMs, to move VMs in and out of a cluster, and to scale the number of nodes in a cluster horizontally or vertically depending on the workload. This must be done in a graceful way, as Data Cloud workloads can last from seconds to days.
Ensuring high availability: Cloud service providers set a high standard for availability, typically bounded by a Service Level Agreement. To reduce the blast radius of failures to availability, cloud service providers use Availability Zones (AZs), which are logical data centers in each cloud region [1]. For SaaS platforms, the use of multiple cloud service providers in conjunction with multiple AZs across multiple cloud regions can be used to increase service availability, as AZs are in theory isolated and fail independently of each other [14]. Balancing VMs across availability zones is intended to limit customer impact in the presence of any zonal outage, as requests can be transparently redirected to a VM in another zone. All Snowflake production deployments exist in a cloud region with multiple availability zones, and the control plane is able to maintain zone balance at both the cluster level and the deployment level.
Enabling dynamic autoscaling and throttling: The resource demands of workloads can often fluctuate. To account for workloads with spiky or unpredictable behavior while maintaining responsiveness, it is desirable for a cloud service to have elasticity and automatically scale by adjusting resources allocated to each application based on its needs [11, 22]. Dynamic autoscaling automatically sizes clusters factoring in several system properties such as CPU load, network throughput, request rejection rate, or memory usage. To ensure smooth operation of VMs within pre-defined resource limits, throttling of resources (such as CPU, memory, or network) can also be enforced. Dynamic autoscaling and throttling operate in tandem in the presence of traffic. If a cluster is overloaded with traffic, throttling is temporarily activated and more VMs are subsequently activated. Likewise, if a cluster is underutilized, it is over-provisioned and the control plane will reduce the number of VMs in the cluster.

At Snowflake scale, providing customers with capacity instantly poses a challenge, as cluster load can have fluctuations of many orders of magnitude.
Section 2 presents the mechanisms for managing automatic and safe code rollout, rollback, and targeted code management that are used to release code to production weekly. Section 3 demonstrates how ECS manages availability across different cloud service providers while limiting skew between Availability Zones with global zone balancing, and Section 4 showcases how ECS makes Snowflake's Data Cloud elastic with autoscaling and throttling. Section 5 discusses related work and Section 6 concludes the paper.

2 ELASTIC CLOUD SERVICES
ECS ensures that Snowflake is resilient to failures and services customers seamlessly without interruptions. ECS is responsible for VM and cluster lifecycle, health management and self-healing automation, code management, query planning, account/service placement and topology, traffic control, and resource management, including autoscaling and throttling. It performs functions such as deciding where each customer job runs, keeping running VMs up to date with applicable configurations, receiving metering data, logs, and metrics emitted by the data plane, deploying new software to the data plane, scaling the data plane, as well as the creation and management of the data plane.
The VMs used in ECS can be run on cloud service providers such as Amazon Web Services, Microsoft Azure, or Google Cloud. Hence, ECS does not depend on specific cloud provider interfaces but is portable across the underlying virtualized cloud infrastructure. ECS is stateless: ECS VMs only manage query planning and only maintain pointers to external object storage. All ECS metadata (e.g., job metadata) are persisted to FoundationDB. ECS VMs do not persist or maintain customer data during their execution lifetime. All data processing happens on VMs in the data plane, which are customer specific, and no data is shared across control plane VMs. Snowflake queries only run on one data plane VM. While virtual warehouses in the data plane are sized and scaled by the customer based on their needs, ECS is scaled and managed by Snowflake in a manner that is not exposed to the customer. Customers can choose whether their accounts are routed to a multi-tenant or a dedicated ECS cluster of VMs in the ECS layer, and there will often be great disparities in incoming query volume throughout the day. In multi-tenant clusters, ECS maintains good customer experience by leveraging ECS's autoscaling and throttling mechanisms. In addition, as Snowflake deployments scale, the number of active ECS VMs has increased: there are more customer queries to run, more files to be cleaned up, more metadata to be purged, etc. A Snowflake deployment is a distinct Virtual Private Cloud (VPC) that contains the major parts of Snowflake software, including a scalable ECS tier.

2.1 ECS cluster management
The majority of ECS administration activities are performed by a service called the ECS Cluster Manager. The ECS Cluster Manager is responsible for code upgrade/downgrade of Snowflake deployments; creation, update, and cleanup of ECS clusters; enforcement of mapping manifests and constraints of accounts to clusters and accounts to versions; and interaction with our Cloud Provisioning Service to generate and terminate cloud VMs across all cloud providers, manage every VM in its lifecycle, and perform corrective actions in case of unhealthy VMs. Figure 2 shows how we create an ECS-managed cluster. Arrows indicate assignment of customer accounts and code versions to a cluster. The arrow of the reverse proxy (Nginx) indicates how queries are routed to VMs. The first step is to create an ECS cluster and register a package with the latest Snowflake release. Then we declaratively allocate a customer to the cluster and allow the Cluster Manager to converge the actual system to the declared topology. Once the customer is ready to execute a query workload, we scale out the number of VMs in the cluster to accommodate the needs of the corresponding workload and update the topology on our reverse proxy (currently we leverage Nginx). The reverse proxy is responsible for distributing load to ECS's healthy VMs. ECS continuously updates the control plane's topology so that new work is not routed via the reverse proxy to unhealthy VMs.

Figure 2: Creation and Registration of an ECS Cluster.

Most of these administrative actions happen through privileged SQL queries. Since the ECS Cluster Manager is declarative, human operators only need to declare the correct state of the world without understanding the system's internals to prescribe iterative steps to accomplish the goal. For example, operators do not have to know how to obtain more VMs, do not have to clean up excess VMs, and do not have to track VM health, etc. The ECS Cluster Manager fully automates these responsibilities.
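The convergence model described above can be pictured as a small reconciliation loop: compare the declared topology with the observed one and add or remove VMs until they match, then publish the result to the routing layer. The Java sketch below is illustrative only; the type and method names (DesiredClusterState, Provisioner, ReverseProxyTopology) and the version and account strings are assumptions made for the example, not Snowflake's internal interfaces.

```java
// A minimal sketch of a declarative, converge-to-desired-state loop.
// All names here are hypothetical and do not correspond to Snowflake's actual APIs.
import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;

public class ClusterManagerLoop {

    /** Operator-declared target: accounts, code version, and VM count for a cluster. */
    record DesiredClusterState(String cluster, String codeVersion, Set<String> accounts, int vmCount) {}

    /** Observed state: the healthy VMs currently serving the cluster. */
    record ActualClusterState(String cluster, Set<String> healthyVms) {}

    interface Provisioner {           // abstraction over the cloud-specific provisioning service
        String acquireVm(String cluster, String codeVersion);  // e.g. take one from the free pool
        void releaseVm(String vmId);                           // hand an excess/unhealthy VM back
    }

    interface ReverseProxyTopology {  // e.g. the Nginx topology that routes queries to healthy VMs
        void publish(String cluster, Set<String> vms);
    }

    private final Provisioner provisioner;
    private final ReverseProxyTopology topology;

    ClusterManagerLoop(Provisioner provisioner, ReverseProxyTopology topology) {
        this.provisioner = provisioner;
        this.topology = topology;
    }

    /** One reconciliation pass: compare declared and actual state, then add or remove VMs to converge. */
    void reconcile(DesiredClusterState desired, ActualClusterState actual) {
        Set<String> vms = new CopyOnWriteArraySet<>(actual.healthyVms());
        while (vms.size() < desired.vmCount()) {                // scale out toward the declared size
            vms.add(provisioner.acquireVm(desired.cluster(), desired.codeVersion()));
        }
        while (vms.size() > desired.vmCount()) {                // excess VMs are quiesced first in practice
            String victim = vms.iterator().next();
            vms.remove(victim);
            provisioner.releaseVm(victim);
        }
        topology.publish(desired.cluster(), vms);               // route new work only to the converged set
    }

    public static void main(String[] args) {
        Provisioner fakeProvisioner = new Provisioner() {
            private int next = 0;
            public String acquireVm(String cluster, String version) { return cluster + "-vm-" + (next++); }
            public void releaseVm(String vmId) { System.out.println("released " + vmId); }
        };
        ClusterManagerLoop loop = new ClusterManagerLoop(fakeProvisioner,
                (cluster, vms) -> System.out.println(cluster + " topology -> " + vms));
        loop.reconcile(new DesiredClusterState("analytics", "new-release", Set.of("acct1"), 3),
                       new ActualClusterState("analytics", Set.of("analytics-vm-old")));
    }
}
```

In this framing, the operator's privileged SQL commands only change the desired-state records; every iterative step needed to reach that state is derived by the loop.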
2.2 Automatic code management
Snowflake code rollout ensures fast delivery of new features and improved Snowflake service to customers. There is a single rollout for all Snowflake services. To achieve this, Snowflake's release process includes multiple layers of testing, including unit testing, regression testing, integration testing, and performance and load testing on pre-production and production-like environments, and includes a gradual release process to ensure that there is no impact on customers' service. ECS's role in the release process is to ensure that code rollout and rollback is fast and reliable to avoid impacting customers.

2.2.1 Automatic and safe code rollout. We roll out a new code version at Snowflake on a weekly cadence. We have fully automated the process to minimize human error. ECS's Java code upgrades are orchestrated safely, e.g., if a customer is running a query on an ECS VM, that VM cannot be shut down until all its queries finish execution, as the customer would otherwise see a query interruption. We manage online upgrades with a rollout process. First, new ECS VMs are provisioned with the new software version installed, and some initialization work such as cache warming is done. We then update the Nginx topology to route new queries to the new VMs. Since customer queries can run for hours, we keep the older VMs in the cluster running the previous version until the workload terminates. Once older VMs in the cluster have completed their work, they are removed, and the upgrade is complete. Figure 3 presents the cluster rollout process.

Figure 3: Cloud Services Cluster rollout.

A separate rollout process takes place for the data plane VMs. Once a binary containing a new code version is registered by the ECS Cluster Manager, the control plane will enqueue binary download jobs onto each data plane VM in batches. A query can be executed on the new code after the new version is downloaded.

2.2.2 Automatic and safe code rollback. Rolling out a new code version is inherently risky. While we hold a high bar for testing, there are cases where bugs are only seen in production. When this occurs, we have two ways to roll back: (a) The fast rollback is used in case of major issues related to the release. We have a grace period during releases in which both the old and new software versions run, with the old set kept idle and out of topology. This allows us to perform instantaneous rollback by updating the Nginx topology to map requests to the old, stable software version.

Figure 4 showcases the fast rollback path. (b) We can also perform a targeted rollback. Not all bugs hit customers equally, as many customers have unique workloads. For example, a bug may be released that affects only a handful of customers due to their particular workload, but the other thousands of accounts may be able to go forward with the new code release. It is unnecessary to do full rollbacks for all customers because of a single customer-facing issue, but we also do not want to provide any customer with a subpar experience. Thus, we support rollback on a per-account basis by explicitly mapping only affected accounts to the old software version.

Figure 4: Cloud Services Cluster Rollback.
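One way to picture the fast and targeted rollback paths together is as a routing table that maps code versions to VM sets and accounts to version overrides, as in the hedged Java sketch below. The ReleaseRouter class, its methods, and the account and version names are hypothetical; the real mechanism works by updating the Nginx topology described above.

```java
// A minimal sketch of version-aware routing with per-account rollback overrides.
// Names and structure are illustrative assumptions, not Snowflake's rollout machinery.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ReleaseRouter {
    private volatile String activeVersion;                        // version serving traffic by default
    private volatile String previousVersion;                      // kept idle during the grace period
    private final Map<String, String> accountOverrides = new ConcurrentHashMap<>();
    private final Map<String, List<String>> vmsByVersion;         // version -> VMs running that version

    ReleaseRouter(String activeVersion, String previousVersion, Map<String, List<String>> vmsByVersion) {
        this.activeVersion = activeVersion;
        this.previousVersion = previousVersion;
        this.vmsByVersion = vmsByVersion;
    }

    /** Fast rollback: flip all new traffic back to the old, still-provisioned version. */
    void fastRollback() {
        String broken = activeVersion;
        activeVersion = previousVersion;
        previousVersion = broken;                                  // old VMs re-enter topology instantly
    }

    /** Targeted rollback: pin only the affected accounts to the old version. */
    void rollbackAccount(String account) {
        accountOverrides.put(account, previousVersion);
    }

    /** New queries are routed to VMs running the version selected for the account. */
    List<String> targetsFor(String account) {
        String version = accountOverrides.getOrDefault(account, activeVersion);
        return vmsByVersion.getOrDefault(version, List.of());
    }

    public static void main(String[] args) {
        ReleaseRouter router = new ReleaseRouter("v2", "v1",
                Map.of("v1", List.of("vm-a", "vm-b"), "v2", List.of("vm-c", "vm-d")));
        router.rollbackAccount("acme");                   // only the affected account returns to v1
        System.out.println(router.targetsFor("acme"));    // [vm-a, vm-b]
        System.out.println(router.targetsFor("globex"));  // [vm-c, vm-d]
        router.fastRollback();                            // major incident: everyone back to v1
        System.out.println(router.targetsFor("globex"));  // [vm-a, vm-b]
    }
}
```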

2.2.3 ECS VM Lifecycle and Cluster Pools. At scale, several VMs can become unhealthy and need to be replaced to maintain high service availability. Snowflake ECS incorporates mechanisms to monitor VM health on a per-cluster level, with VMs self-reporting indicators of their health. These self-reported indicators enable ECS to determine if individual control plane VMs exceed load-related metrics and consequently need to be lifecycled. ECS also has mechanisms to isolate VMs that showcase abnormal behavior, e.g., in memory management, CPU, concurrency characteristics, JVM failures, hardware failures, and others, to enable further research and root cause analysis. Unhealthy VMs allow queries to run to completion and then restart. More broadly, each VM has a lifecycle beginning with its provisioning and ending with its release back to the cloud provider. ECS automatically quiesces, quarantines, and replaces unhealthy VMs to maintain a high level of availability. However, it is difficult to know exactly what caused a VM to enter a bad health status, and diagnosing it often requires an analysis of logs. Therefore, we have a holding mechanism to retain unhealthy VMs for diagnostics purposes. To recycle VMs after software upgrades, replace unhealthy VMs with healthy ones, and retain VMs encountering unusual health issues, we maintain a few "VM pools". The movement between the pools is automated based on a state machine presented in Figure 5.

Figure 5: ECS Cluster State Machine.

All VMs are provisioned by an abstraction layer across all supported cloud service providers (AWS, GCP, Azure, etc.) in our state machine. The provisioning layer abstracts away specific CSP functionality, allowing the control plane to submit requests to expand the VM fleet or decommission excess VMs. VMs are initially provisioned into a resource pool that retains VMs started, initialized, and ready to be used, collectively referred to as the Free Pool. The Free Pool capacity is dynamic and depends on the rate at which VMs become unhealthy and on cluster-defined policy. Since the fulfillment of VMs directly to working clusters to satisfy the required capacity is challenging, the Free Pool allows clusters to utilize available VMs instantly. The Quarantine Pool represents all ECS VMs that need to be removed from their clusters to self-resolve any pending tasks assigned to them and restart. Healthy VMs are part of the Active Pool, but can also belong to the Free Pool so that they can be quickly swapped into clusters if any active VMs enter quarantine. The last state of our state machine is the Graveyard, which includes VMs that have lived through the VM lifecycle and are released back to the cloud provider. We allow any non-Graveyard VM to enter the Holding Pool to provide flexibility to the operator when needing to debug something that was not in a working cluster. VMs move from Free Pool to Quarantine when they have been marked terminal. This usually happens when they fail to provision or start within a parameterized time limit or when ECS decides that the free pool is over-provisioned.

Figure 6: ECS Cluster VM recycling.

Figure 6 presents the process of ECS VM recycling. In the case of a software upgrade, ECS routes customer queries to VMs with the newer code version and moves the older-version VMs to the Quarantine Pool until the VMs complete any pending tasks. Old ECS VMs are then moved to the Graveyard state to be released to the cloud provider. To account for updates in VM health and transitions in cluster state, the ECS logic runs at set intervals.
We have observed that not all VMs will be healthy and functional at scale. Some VMs can enter a bad state (JVM Garbage Collection death spiral, full disk, broken disk, corrupt file system, too many file descriptors, etc.), at which point they become dysfunctional and need to be rotated out. These unhealthy VMs may still be running customer queries, some of which may take hours to complete. Unhealthy VMs appear sporadically as there is an ever-expanding number of Snowflake deployments, some of which are capable of running thousands of ECS VMs. Furthermore, since not all unhealthy states indicate VM issues, it is easier to restart the ECS server process than to procure a new VM from a cloud provider.

Figure 7: ECS Cluster Instance Isolation.

Figure 7 presents ECS's process of isolating an unhealthy VM. When a VM is identified by ECS or self-reports itself as unhealthy, it begins quiescing and a Free Pool instance will move into the active cluster. ECS will then move the VM into Quarantine.
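A compact way to express the pool-based lifecycle of Figure 5 is as a state machine over the pools with an explicit transition table, as in the sketch below. The enum values follow the pools named in the text, but the exact set of allowed transitions is an assumption made for illustration; the production state machine may differ in states and edges.

```java
// A compact sketch of the pool-based VM lifecycle (Figure 5). The transition table is an
// illustrative reading of the text, not the exact production state machine.
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

public class VmLifecycle {

    enum Pool { FREE, ACTIVE, QUARANTINE, HOLDING, GRAVEYARD }

    // Assumed transitions: Free -> Active (swapped into a cluster), Active -> Quarantine
    // (unhealthy or replaced after an upgrade), any non-Graveyard pool -> Holding (debugging),
    // Quarantine -> Graveyard (released back to the cloud provider).
    private static final Map<Pool, Set<Pool>> TRANSITIONS = new EnumMap<>(Map.of(
            Pool.FREE,       EnumSet.of(Pool.ACTIVE, Pool.QUARANTINE, Pool.HOLDING),
            Pool.ACTIVE,     EnumSet.of(Pool.QUARANTINE, Pool.HOLDING),
            Pool.QUARANTINE, EnumSet.of(Pool.GRAVEYARD, Pool.HOLDING, Pool.FREE),
            Pool.HOLDING,    EnumSet.of(Pool.QUARANTINE, Pool.GRAVEYARD),
            Pool.GRAVEYARD,  EnumSet.noneOf(Pool.class)));

    static boolean canMove(Pool from, Pool to) {
        return TRANSITIONS.getOrDefault(from, Set.of()).contains(to);
    }

    public static void main(String[] args) {
        // An unhealthy active VM is quarantined, finishes pending work, and is then released.
        System.out.println(canMove(Pool.ACTIVE, Pool.QUARANTINE));    // true
        System.out.println(canMove(Pool.QUARANTINE, Pool.GRAVEYARD)); // true
        System.out.println(canMove(Pool.GRAVEYARD, Pool.ACTIVE));     // false: the graveyard is terminal
    }
}
```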
3 BALANCING ACROSS AVAILABILITY ZONES
Customers expect Snowflake to be always available. This means designing ECS, which coordinates services and schedules warehouses to run queries, to be resilient to failures. One rare case is when a cloud service provider's datacenter suffers an unexpected outage.
All of the cloud service providers that Snowflake runs on provide the notion of Availability Zones (AZs), which are isolated datacenters in a single region where we can provision resources. By keeping ECS VMs balanced across these AZs, we ensure minimal customer impact in the event of zonal failures, as requests are transparently redirected to a VM in another zone, as presented in Figure 8. Because we balance ECS VMs, rather than data, across AZs, there is no need for data rebalancing: all the data is located in, and accessible from, object storage.

Figure 8: Availability zone outage and failover. Zone B has an outage. Requests get transparently redirected to zone C.

Figure 9 presents scenarios of regional (cluster) and global (deployment) load distribution across different AZs. On the left, the first scenario presents clusters in Zones A and B that are balanced. However, globally across the deployment there is imbalance, as Zone C has no VMs. The middle scenario illustrates a balanced deployment where individual clusters are not balanced (i.e., the green cluster only has VMs in Zone A). The scenario on the right strikes a balance between global and cluster balancing by calculating the difference between the number of VMs in the most loaded zone and the least loaded zone, which we will call AZ skew.

Figure 9: The scenario on the left shows balanced clusters, but we have a global imbalance since we have no VMs in zone C. The middle scenario illustrates a balanced deployment, but individual clusters are not balanced (i.e., the green cluster only has VMs in Zone A). The scenario on the right strikes a balance between global and cluster balancing.

Minimizing AZ skew is more than striping VMs across availability zones during provisioning. Within a single regional deployment, ECS implements a multi-cluster architecture where each cluster serves different groups of customers. Each cluster can scale independently to respond to the current load. To minimize the impact of an AZ outage on each cluster and the deployment, we must zone balance at both the cluster level and the deployment (global) level. A naive solution entails assigning n VMs from each AZ to each cluster. However, this falls apart because the number of VMs in a cluster is not always divisible by the number of AZs, the number of VMs in a cluster is small (typically less than eight), and the number of clusters in a deployment is on the order of hundreds.
Not only are these goals at times competing, but the free pool used to draw VMs to scale clusters may not have VMs of that type in the zone we want. Since cloud provisioning and preparing a new VM can take on the order of minutes, ECS maintains a free pool to trade VMs to and from clusters, allowing us to scale each cluster up or down within seconds.

An active VM may fail, and we may not have a free VM in that zone to replace it, or suddenly we may need to scale up a cluster to handle the increased load and we only have free VMs in zones that are heavily loaded by that cluster. Even if we start well-balanced, the cluster and deployment can become more imbalanced over time.
Furthermore, while we can rebalance a cluster by adding a VM in one zone and removing it from another, this incurs a cost overhead since the VM being removed needs to finish executing queries. Therefore, we would also like to keep the number of rebalancing actions to an acceptable level. At this stage, our objective is to minimize AZ skew for each cluster and the entire deployment, constrained by an acceptable number of rebalancing changes. We also want to prioritize minimizing cluster skew over minimizing global skew to avoid outage for any single cluster.
As shown in Figure 10, we modified our cluster manager to scale clusters in a zone-balanced manner. When we scale a cluster out, we pick the least loaded zone globally out of the set of least loaded zones for that cluster. Likewise, when we scale a cluster in, we pick the most loaded zone globally out of the set of most loaded zones for that cluster. However, there may be situations in which we are unable to maintain either global zone balancing or even cluster-level zone balancing, as demonstrated in Figure 11. Thus, an additional active zone rebalancing task runs in the background, which examines the current state of the deployment and executes a series of moves to balance it. It prioritizes moves in the following order:
(1) Moves that improve both cluster-level and global zone balancing
(2) Moves that improve cluster-level balancing
(3) Moves that improve global zone balancing

Figure 10: Autoscaling in a zone-balanced manner. (a) Zone C is the least loaded for the green cluster and globally. (b) Zone B is the most loaded for the yellow cluster and globally.

Figure 11: If the free pool is depleted, we cannot maintain zone balancing (left). There are also cases where cluster-level zone balancing can worsen global zone balance (right).

Figure 12: Moving a VM from overloaded zone A to underloaded zone C.

To generate the moves, we compute the balanced threshold for a cluster, which is the number of VMs in that cluster divided by the number of available zones. Then, any move from a zone with more VMs than that level to a zone with fewer VMs than that level cannot worsen cluster-level balancing, as is shown in Figure 12. After that, we evaluate which criteria each move improves for our deployment and select the best one. Moves are executed with minimal customer impact, as the old VM is allowed to finish currently running jobs while the new one accepts incoming queries.

Additionally, the cluster manager knows the number of VMs needed to reach a target skew and provisions free VMs of the correct category and zone.
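The skew metric, balanced threshold, and move ranking described above can be sketched in a few lines. The example below is a simplification under the stated definitions (skew is the most loaded zone minus the least loaded zone; the threshold is cluster size divided by zone count): it only generates cluster-safe moves and ranks them by whether they also improve global balance, and the class and method names are assumptions, not the production zone balancer.

```java
// A simplified sketch of AZ-skew bookkeeping and prioritized move generation.
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class ZoneBalancerSketch {

    record Move(String cluster, String fromZone, String toZone, int priority) {}

    /** AZ skew: difference between the most and least loaded zone. */
    static int skew(Map<String, Integer> vmsPerZone) {
        int max = vmsPerZone.values().stream().mapToInt(Integer::intValue).max().orElse(0);
        int min = vmsPerZone.values().stream().mapToInt(Integer::intValue).min().orElse(0);
        return max - min;
    }

    /** Balanced threshold: VMs in the cluster divided by the number of available zones. */
    static double balancedThreshold(Map<String, Integer> clusterVmsPerZone) {
        int total = clusterVmsPerZone.values().stream().mapToInt(Integer::intValue).sum();
        return (double) total / clusterVmsPerZone.size();
    }

    /**
     * Propose one move for a cluster: from a zone above the balanced threshold to a zone below it
     * (which cannot worsen cluster-level balance), preferring moves that also improve global balance.
     */
    static Optional<Move> proposeMove(String cluster,
                                      Map<String, Integer> clusterVmsPerZone,
                                      Map<String, Integer> globalVmsPerZone) {
        double threshold = balancedThreshold(clusterVmsPerZone);
        List<Move> candidates = clusterVmsPerZone.keySet().stream()
                .filter(from -> clusterVmsPerZone.get(from) > threshold)
                .flatMap(from -> clusterVmsPerZone.keySet().stream()
                        .filter(to -> clusterVmsPerZone.get(to) < threshold)
                        .map(to -> {
                            boolean improvesGlobal =
                                    globalVmsPerZone.get(from) > globalVmsPerZone.get(to);
                            // priority 1: improves cluster and global balance; priority 2: cluster only
                            return new Move(cluster, from, to, improvesGlobal ? 1 : 2);
                        }))
                .toList();
        return candidates.stream().min(Comparator.comparingInt(Move::priority));
    }

    public static void main(String[] args) {
        Map<String, Integer> cluster = Map.of("A", 4, "B", 2, "C", 1);
        Map<String, Integer> global = Map.of("A", 40, "B", 30, "C", 25);
        System.out.println("cluster skew = " + skew(cluster));       // 3
        System.out.println(proposeMove("green", cluster, global));   // e.g. a move out of the overloaded zone A
    }
}
```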
After enabling this change on some small deployments, we noticed that it was difficult to maintain full global zone balance (skew ≤ 1) without making too many moves within a certain period. For example, a cluster could scale up into a specific zone to maintain cluster-level zone balancing, but if that zone was already heavily loaded globally, we might need to rebalance another cluster to maintain global zone balancing. In addition, as mentioned above, churning through too many VMs incurs overhead for us as VMs need to finish executing running jobs. As a result, we added a global AZ skew leniency threshold below which the zone balancer will only consider cluster-level rebalancing moves. This parameter allows us to trade skew leniency for VM churn rate.
Our expectation for cluster zone balancing was that in steady state, the AZ skew should be at most 1. If AZ skew is greater than or equal to 2, one can move a VM from the most loaded zone to the least loaded zone and decrease AZ skew. Figure 13 illustrates that in a large Snowflake deployment, most clusters stay balanced with an AZ skew of 0 or 1, and the few that become unbalanced are rebalanced within the hour. Figure 14 illustrates the trend when global zone balancing is enabled on a highly AZ-skewed deployment. The AZ skew began above 45, but with the global zone balancing logic, we see the AZ skew shrink rapidly and remain at a value less than 5.

Figure 13: AZ skew across a large Snowflake deployment with many clusters remains less than or equal to one. When clusters end up with higher AZ skew, they are quickly rebalanced.

Figure 14: Enabling global zone balancing on a large Snowflake deployment; the global AZ skew decreases from over 45 down to less than 5.

3.1 Cross-cloud and cross-region replication
Snowflake also enables the replication of databases across different regions of the same cloud provider as well as across different cloud providers (AWS, Google Cloud Platform, or Microsoft Azure). When a Snowflake account in one region is unavailable, another can be promoted to be primary to continue business operations. This can happen across cloud providers or regions.

Figure 15: ECS enables cross-cloud and cross-region replication by setting up accounts and replicating primary databases to that account from Region A to Region B. Upon detection of an outage in Region A, recovery is achieved via failover to the secondary databases in Region B.

Figure 15 presents the mechanisms for enabling Snowflake's cross-cloud and cross-region replication. Data is replicated via egress networking across object storage across cloud providers (or regions of the same cloud provider) and kept in sync on an ongoing basis. To enable cross-cloud replication, every deployment participating in the replication mesh has a set of messaging queues, each of which maps to one replication peer deployment. New data from one deployment is transmitted to the deployment's replication peers via each one of the deployment's dedicated queues. The replication peers then process the new data delivered by the deployment. The replication mesh uses specific CSP abstractions to achieve data replication for each cloud provider. For example, in AWS deployments the replication mesh operates over replication S3 buckets distributed across Snowflake accounts. When a cloud region is unavailable, queries are routed to a new warehouse that is spun up at one of the replicated regions (on any cloud provider).
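As a rough illustration of the per-peer queue fan-out, the toy sketch below publishes references to newly written data onto one queue per replication peer. The queue type, message shape, and names are assumptions made only for the example; as noted above, the production mesh is built on CSP-specific abstractions such as replication S3 buckets on AWS.

```java
// A toy sketch of per-peer fan-out in a replication mesh. Everything here is illustrative.
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class ReplicationMeshSketch {

    /** A reference to newly written data that a peer needs to copy, e.g. an object-storage path. */
    record DataRef(String database, String objectPath) {}

    // One dedicated queue per replication peer deployment.
    private final Map<String, BlockingQueue<DataRef>> peerQueues = new ConcurrentHashMap<>();

    void addPeer(String peerDeployment) {
        peerQueues.putIfAbsent(peerDeployment, new LinkedBlockingQueue<>());
    }

    /** Fan out a new data reference to every peer's queue; peers consume and copy asynchronously. */
    void publish(DataRef ref) {
        peerQueues.values().forEach(queue -> queue.add(ref));
    }

    public static void main(String[] args) throws InterruptedException {
        ReplicationMeshSketch mesh = new ReplicationMeshSketch();
        mesh.addPeer("aws-region-a");
        mesh.addPeer("azure-region-b");
        mesh.publish(new DataRef("sales", "tables/sales/part-00001"));
        // Each peer drains its own queue and applies the change to its secondary databases.
        System.out.println(mesh.peerQueues.get("azure-region-b").take());
    }
}
```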

4 THROTTLING & AUTOSCALING
The traffic routed to each cluster varies widely by time of day, sometimes in unpredictable ways without clear patterns. Autoscaling and throttling are enabled by default for all customers and concern Snowflake's control plane. Providing a consistent service experience and ensuring platform availability necessitates an elastic service layer. To create elasticity, we implement Dynamic Throttling and Autoscaling based on the following tenets:
Responsiveness: Queries should be processed smoothly and immediately upon receipt. The period that queries block before beginning or run slowly due to a shortage of resources should be nearly non-existent in the aggregate. Snowflake's availability SLAs depend on the ability of our cloud services layer to add resources to clusters quickly enough to keep the retry-on-throttle duration short for rejected requests. This gives the appearance of instant capacity to customers.
Cost-efficiency: Deliver a reliable and performant service using as few physical resources (VMs) as possible.
Cluster volatility: We aim to minimize the frequency of configuration changes to clusters (oscillation). There is a system-operation burden to provisioning/de-provisioning compute resources in a cloud environment due to unproductive setup and teardown time, warming cold caches and losing cached data, etc. For example, an algorithm that does a scale-in (reduce VM count) and then immediately performs a scale-out is highly undesirable: provisioning the new VM for the scale-out, updating routing tables, warming caches, and other tasks tend to bear a higher system burden than is saved in terms of cost by removing a VM for only a short duration.
Throughput: The total number of successful queries that clients can submit to the cluster. The system should scale to meet demand.
Latency: The average duration for a query to complete. This should be as low as possible. Given a static cluster size, there is an inverse relationship between latency and throughput: generally, one query can finish quickly, but when submission throughput is sufficiently high, requests will start piling up. Beyond that point query latency increases due to queueing. A good autoscaling algorithm should keep latency low even as the request volume increases.

4.1 Dynamic Throttling
A naive approach to throttling defines static limits on the number of concurrent requests at various levels, such as per-host, per-account, and per-user, to avoid overwhelming individual host VMs. This approach is unreliable: not all requests have similar resourcing demands; some workloads with a low query cardinality can run within static limits but still cause load issues, while some high-concurrency workloads that would be throttled by static limits could be composed of easy-to-compute requests that underutilize the host's hardware.
When a VM accepts more work than it can handle, CPU and memory utilization become dangerously high and undesirable knock-on effects begin to occur. For example, CPU thrashing can limit request processing or cause commit failures for operational metadata that needs to be persisted. Additionally, memory pressure can slow down or halt program execution.
Our solution is to implement resource-aware throttling, which we call Dynamic Throttling. As opposed to setting static concurrency limits in our system, we dynamically calculate concurrency limits based on current host-level resource utilization every 30 seconds. When CPU load becomes high in a VM (as measured by the Linux /proc/loadavg), the host-local concurrency limits are immediately lowered and adjusted until the CPU load returns to and remains at an acceptable level. For example, if the load on a machine is read at 2.0 and we have 200 queries, we adjust our future incoming concurrency to be 100 queries, which will in the future get us closer to our desired 1.0 load.
Any further incoming requests will be rejected and retried on other VMs with free resources. This is transparent to customers and adds minimal latency, protecting our systems while preserving business continuity. If this VM-level throttling results in rejections, a signal is transmitted to the autoscaler and triggers a cluster scale-out. On the other hand, if VM load is low, we recognize that we can take on more work and increase limits to improve our cost-efficiency. To ensure fairness, concurrency limits are computed at the account and user level within each VM to prevent single users or accounts in multitenant clusters from ever saturating concurrency limits and temporarily denying service to other customers.
To complement this VM-level throttling, we run a centralized autoscaler. When the aggregate resource load is high across the active VMs in a cluster or if a quorum of the cluster's VMs is rejecting work, we will increase the number of VMs in the cluster. Likewise, we reduce the cluster size when no rejections are occurring and the cluster load is low to reduce costs.
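The recalculation rule above (rescale the concurrency limit by the ratio of the target load to the observed load, every 30 seconds, so that load 2.0 with 200 running queries and a 1.0 target yields a new limit of 100) can be sketched as follows. The class name, bounds, and scheduling details are assumptions made for illustration; only the /proc/loadavg source and the 30-second cadence come from the text.

```java
// A small sketch of host-local dynamic throttling based on the 1-minute load average.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DynamicThrottler {
    private static final double TARGET_LOAD = 1.0;   // desired normalized CPU load
    private static final int MIN_LIMIT = 1;

    private final AtomicInteger concurrencyLimit;

    DynamicThrottler(int initialLimit) {
        this.concurrencyLimit = new AtomicInteger(initialLimit);
    }

    /** The first field of /proc/loadavg is the 1-minute load average. */
    static double readOneMinuteLoad() throws IOException {
        String[] fields = Files.readString(Path.of("/proc/loadavg")).trim().split("\\s+");
        return Double.parseDouble(fields[0]);
    }

    /** Rescale the host-local limit toward the load target; called on a 30-second cadence. */
    void recalculate(double observedLoad, int runningQueries) {
        if (observedLoad <= 0) {
            return;
        }
        int newLimit = (int) Math.max(MIN_LIMIT, runningQueries * (TARGET_LOAD / observedLoad));
        concurrencyLimit.set(newLimit);
    }

    /** Gateways consult the limit: above it, requests are rejected (503) and retried elsewhere. */
    boolean admit(int currentlyRunning) {
        return currentlyRunning < concurrencyLimit.get();
    }

    public static void main(String[] args) {
        DynamicThrottler throttler = new DynamicThrottler(200);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                throttler.recalculate(readOneMinuteLoad(), 200);  // 200 stands in for the live query count
                System.out.println("new limit = " + throttler.concurrencyLimit.get());
            } catch (IOException e) {
                // Not on Linux, or /proc unavailable; keep the current limit.
            }
        }, 0, 30, TimeUnit.SECONDS);
    }
}
```

In the same spirit, the per-account and per-user fairness limits mentioned above would be additional, smaller limits computed inside each VM; they are omitted here to keep the sketch short.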

4.2 Autoscaling
The motivation for autoscaling from an availability and elasticity perspective has already been described in previous sections. Now we will examine the details of how Snowflake implements cluster-level autoscaling.
The primary job of autoscaling is to accurately determine the number of VMs that should comprise a cluster to serve its workload efficiently.

Figure 16: As the current CPU load average oscillates, ECS makes decisions to scale in or out based on current load. The red line represents "cluster size stability". As the load stays within that window, we will not change the number of VMs in the cluster. As the number of current ECS VMs increases, the cluster size stability window decreases. ECS has mechanisms to avoid cluster size oscillations.

With the current autoscaling approach, Snowflake's ECS considers signals around memory pressure and CPU load to make decisions. ECS's autoscaler provides a clear delineation of what constitutes unhealthy behavior, and, in comparison to the state of the art, its policies give ECS the flexibility to integrate with mechanisms for VM health and to evolve the autoscaler and the isolation/health management independently.
ECS leverages rejecting gateways that will return a 503 HTTP status code to the caller with an expectation to retry, and the rejection will be ingested into the autoscaler as an indicator to scale up. All customer queries and work must go through one of these gateways to provide the autoscaler with accurate information on how to scale the cluster.
The number of desired VMs is calculated as the number of current VMs multiplied by the ratio of the current load over the desired load. As the cluster size increases, we are more likely to scale on minor deviations in the normalized CPU load average (load per core). Figure 16 shows a graph of the scaling decisions that our autoscaling mechanism would make based on CPU load average. The red section is the "cluster size stability" section. As the load stays within that window, we will not change the number of VMs in the cluster, a mechanism that introduces stability. As we add more VMs, the window becomes tighter. For example, if the current load does not fall within some set acceptable window (e.g., 0.9 to 1.0), we use the desired VM count to compute a new size for the cluster. Finally, to scale out we only look at the 1-minute load average and check if it is high. To scale in, we look at the load average using different time windows (1-minute, 5-minute, 15-minute), all of which have to be low in order to avoid oscillation in the cluster.
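The sizing rule and load-average gating described above can be summarized in a short sketch: desired VMs = current VMs x (current load / desired load), applied only when the load leaves the stability window, with scale-out keyed to the 1-minute average and scale-in requiring the 1-, 5-, and 15-minute averages to all be low. The window bounds and names below are assumptions made for the example.

```java
// A sketch of the cluster-sizing rule; thresholds are illustrative, not production values.
public class AutoscalerSketch {

    private static final double TARGET_LOAD = 1.0;     // desired normalized load (per core)
    private static final double WINDOW_LOW = 0.9;      // stability window around the target
    private static final double WINDOW_HIGH = 1.1;

    /** Desired VM count: current count scaled by the ratio of current to desired load. */
    static int desiredVmCount(int currentVms, double currentLoad) {
        return Math.max(1, (int) Math.ceil(currentVms * (currentLoad / TARGET_LOAD)));
    }

    /**
     * Decide the next cluster size. load1/load5/load15 are the normalized 1-, 5-, and
     * 15-minute load averages aggregated across the cluster's active VMs.
     */
    static int nextClusterSize(int currentVms, double load1, double load5, double load15) {
        boolean scaleOut = load1 > WINDOW_HIGH;                                            // react quickly to spikes
        boolean scaleIn = load1 < WINDOW_LOW && load5 < WINDOW_LOW && load15 < WINDOW_LOW; // damp oscillation
        if (scaleOut) {
            return desiredVmCount(currentVms, load1);
        }
        if (scaleIn) {
            return desiredVmCount(currentVms, load15);
        }
        return currentVms;                                                                 // inside the stability window
    }

    public static void main(String[] args) {
        System.out.println(nextClusterSize(4, 1.6, 1.2, 1.0)); // overload: grows to 7 VMs
        System.out.println(nextClusterSize(8, 0.4, 0.5, 0.6)); // sustained low load: shrinks to 5
        System.out.println(nextClusterSize(6, 1.0, 1.0, 1.0)); // within the window: unchanged
    }
}
```

Gating scale-in on all three load-average windows, while scale-out reacts to the 1-minute window alone, is what keeps the sketch biased toward responsiveness without oscillating.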
4.3 Throttling & Autoscaling Evaluation
Our success criterion for evaluating throttling was to maintain responsiveness while ensuring the stability of our VMs. We used a combination of synthetic testing and analysis of results on real production servers to test this. Figure 17 illustrates a synthetic workload running a representative benchmark. In this test environment, we disabled autoscaling. Only Dynamic Throttling was active, and we configured the cluster to have two VMs. The generated workload exceeded the CPU capacity of two VMs. Dynamic Throttling reacted to the excessive load and reduced the gateway limits to maintain the healthy state of the available VMs. In Figure 17a, we report the CPU load of the two VMs. We observe that the load exceeded the available CPU capacity. This is correlated with the query count in Figure 17c. Figure 17b shows the throttle coefficient of the throttled VM being lowered to enforce a new gateway size, effectively applying a small multiplier to our gateways.

Figure 17: As CPU load increases, we deflect future load by dynamically throttling and reducing future concurrent requests to an amount estimated to keep load safe. (a) CPU load for two VMs over time. (b) VM throttle coefficient over time for two VMs. (c) Cluster throughput in queries/sec over time for the ECS cluster. (d) Account and user throttle coefficient over time for two VMs.

Finally, fewer queries are let through after the throttler determines the new gateway limit. In Figure 17c, the query throughput drops to a sustainable amount, which also causes a drop in the associated load the VMs face. Furthermore, after the initial estimation, the coefficient in Figure 17b adjusts incrementally. It initially drops lower to temper the workload further and helps keep CPU load at approximately 1.0, our target. Figure 17d displays our coefficient used for the account- and user-specific limits. To maintain fairness, we apply the limit evenly, causing all accounts to reduce evenly, and therefore the account with the highest job count will be the first to be limited. The limitation of dynamic throttling is transient, and upon rejecting these queries, we signal to the autoscaling framework to add VMs to overloaded clusters. Our intent is to temper transient load spikes and prevent them from causing downstream issues in our system until we can accommodate their workload.
Now that we have the base algorithm, we enter the optimization stage. There are a number of parameters we can tweak to attempt to optimize for the features we desire. These features effectively form the pillars of our throttling technique. The three pillars of throttling are:
• Lower the load as soon as possible, but do not overshoot;
• Minimize oscillation;
• Reduce throttling as soon as possible.
We reject queries as cheaply as possible and retry once more VMs are available. We are effectively deferring load to a later point in time, by which the autoscaling will have increased our computing capacity. Because we throttle to keep our CPU load at a healthy level, the throttler must surface metrics about its rejection behavior as a signal to the autoscaler: the CPU load signal ideally will not kick in if we throttle effectively. These rejections are persisted in FoundationDB [51], our operational metadata database, and aggregated globally by our Autoscaling framework to make decisions. Once capacity has been added, we incrementally revert the throttling coefficients to bring clusters back to a steady state where they are neither throttling nor overloaded, satisfying our three throttling design pillars.
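The feedback loop between the throttler and the autoscaler can be sketched as follows: gateways record rejections, the autoscaler treats an aggregated rejection count as a scale-out signal, and the throttle coefficient is stepped back toward 1.0 once capacity arrives. Persistence to FoundationDB and the real aggregation pipeline are elided, and all names and thresholds are illustrative assumptions.

```java
// A sketch of the rejection-signal loop between throttler and autoscaler.
import java.util.concurrent.atomic.AtomicLong;

public class RejectionFeedbackSketch {

    private final AtomicLong rejectionsInWindow = new AtomicLong();
    private double throttleCoefficient = 1.0;   // < 1.0 means reduced concurrency limits

    /** Called by a gateway whenever it returns a 503 to a caller. */
    void recordRejection() {
        rejectionsInWindow.incrementAndGet();
    }

    /** Aggregated periodically by the autoscaler: rejections above a threshold trigger scale-out. */
    boolean shouldScaleOut(long rejectionThreshold) {
        return rejectionsInWindow.getAndSet(0) >= rejectionThreshold;
    }

    /** After capacity is added, revert the throttle coefficient incrementally toward steady state. */
    void relaxThrottle(double step) {
        throttleCoefficient = Math.min(1.0, throttleCoefficient + step);
    }

    public static void main(String[] args) {
        RejectionFeedbackSketch feedback = new RejectionFeedbackSketch();
        for (int i = 0; i < 120; i++) {
            feedback.recordRejection();                    // a burst of throttled requests
        }
        System.out.println(feedback.shouldScaleOut(100));  // true: ask the autoscaler for more VMs
        feedback.throttleCoefficient = 0.5;                // cluster was heavily throttled
        feedback.relaxThrottle(0.1);                       // step back toward 1.0 once VMs arrive
        System.out.println(feedback.throttleCoefficient);  // 0.6
    }
}
```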
4.3.1 Example 1: Snowflake internal analysis cluster. We scaled in the cluster when enabling the features on our internal data analysis account. Figure 18 shows the cluster VM count in the bottom chart. We reduced the cluster to two VMs, as the scaling framework recognized we did not require the compute resources of all four VMs. In Figure 19, two of the lines, each representing a VM, drop to serving 0 queries per second as they are removed from topology. The queries per second and concurrent requests of the remaining two VMs increase, as we now have to maintain the same overall cluster throughput with two VMs. The throttling coefficients are expanded automatically with dynamic throttling, providing additional space for queries we can safely handle. This removed manual work to scale VMs or limits. In addition, it unlocked higher throughput on each VM. As a result, the system recognizes when clusters are overprovisioned and can reduce the number of VMs to optimize resource usage.

Figure 18: Autoscaling reduces VM count, thereby driving down unnecessary over-provisioning across Snowflake.

Figure 19: Dynamic throttling allows underutilized VMs to increase the amount of work they can take on. Two VMs bear the entire workload without excess rejections.

4.3.2 Example 2: Customer query concurrency increased. Figure 20 shows a customer's workload; the top chart shows our account coefficient, which in this case will be greater than 1 when we expand the respective customer's gateways. After the rollout around 3 pm, the rejection rate in the bottom chart was reduced and concurrent requests increased, as can be seen in the center chart.

Figure 20: Customer concurrency increases after enabling dynamic throttling.

4.3.3 Example 3: Noisy neighbor. We use autoscaling and dynamic throttling to work around "noisy neighbor" issues. Figure 21 is an example of a cluster we scaled up dramatically to handle the increased workload. Prior to the noisy neighbor, our traffic was low enough for the autoscaling system to decide to reduce VM counts to two or three. The successful queries per cluster showcase the effective throughput after the autoscaling system increased the number of VMs. Effectively, we were able to immediately quadruple the throughput of the cluster.

Figure 21: Autoscaling adapting to bursts in traffic, allowing the cluster to increase its VM size and resulting in a quadrupling of its initial throughput.

4.3.4 Example 4: Deployment load. Dynamic Throttling also reduced cases of high CPU load that VMs encounter. As shown in Figure 22, after the release of Dynamic Throttling, the 5-minute CPU load average charts lowered noticeably. In particular, almost no VM had a 5-minute CPU load higher than our target threshold of 1.

Figure 22: After Dynamic Throttling was rolled out, CPU overloading stops. The figure presents VM throttling coefficients (top) and CPU load averages (bottom) per VM. Coefficient values less than 1.0 on the logarithmic y-axis indicate hosts with reduced concurrency limits, and values greater than 1.0 indicate hosts with expanded concurrency limits.

4.4 Throttling and Autoscaling for Memory
Due to the heterogeneous nature of workloads, it is possible for operations such as complex query compilations to require significantly more memory than CPU; the worst of these extremes would be a single query requiring more memory than the VM even has. In this case, simply adding more VMs to the cluster, where none of the new VMs have more physical memory, will not accommodate the workload. The most intuitive solution is to move the workload to a new set of VMs, each with larger physical memory. This motivated a concept of "vertical scaling", in which clusters may migrate to new VM types with different levels of hardware resources to best accommodate a workload. This is different from the aforementioned model of adding or removing VMs of the same type within a cluster, which we call "horizontal scaling". Although it is highly unlikely that a single query uses the entire available memory of a VM, there are benefits to moving a workload to a larger VM. One of the largest advantages is the reduced cache redundancy within a cluster of VMs.

Metadata caches consume a significant portion of the available memory, and having a large number of VMs with small amounts of memory results in each VM having a relatively smaller cache that is more likely to duplicate cache entries that already exist on other VMs, compared to a cluster with a smaller number of high-memory VMs. The autoscaling system implements memory-based vertical scaling by reading JVM-level heap memory and throttling VMs that are close to exhaustion. This acts as a signal to quiesce the old VMs and migrate the cluster to a new set of larger VMs. This migration has sharply reduced occurrences of memory-related isolation events (VM replaced due to OOM, full garbage collection, etc.) by about 90%. Figure 23 illustrates that in the first sample of clusters to enable vertical autoscaling, beginning the week of November 8, 2021, we moved from a highly variable rate of 15-30 isolations per day, with spikes of as many as 100, to a much more consistent rate of 0-2 isolations per day.

Figure 23: Memory-based throttling and vertical scaling drastically reduce memory isolations.
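A minimal sketch of the memory-pressure signal follows, assuming a simple heap-utilization threshold read from the JVM's MemoryMXBean; the threshold value and the migration hook are illustrative assumptions, not the production policy.

```java
// A sketch of a heap-pressure check that would throttle the VM and signal vertical scaling.
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class MemoryPressureSketch {

    private static final double HEAP_EXHAUSTION_THRESHOLD = 0.90;  // fraction of max heap (assumed)

    /** Fraction of the maximum JVM heap currently in use. */
    static double heapUtilization() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        return heap.getMax() > 0 ? (double) heap.getUsed() / heap.getMax() : 0.0;
    }

    /** True when this VM should throttle new work and signal a migration to larger VMs. */
    static boolean needsVerticalScaling() {
        return heapUtilization() >= HEAP_EXHAUSTION_THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.printf("heap utilization = %.2f%n", heapUtilization());
        if (needsVerticalScaling()) {
            // In the system described above, this would quiesce the VM and trigger a cluster
            // migration to a VM type with more physical memory.
            System.out.println("signal: migrate cluster to larger VMs");
        }
    }
}
```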

5 RELATED WORK
This section presents related work around Snowflake and focuses on the architecture of Snowflake's Elastic Cloud Services, including aspects of automatic code management, cross-cloud operation, cloud elasticity, throttling, and resource management. Prior work discusses topics around Snowflake SQL and architecture [10] and recent changes in the design and implementation of Snowflake, including the storage architecture, data caching, query scheduling, and multitenancy [44].
Cloud Analytics Systems. Several systems have been designed to offer analytics solutions, including cloud databases like Amazon Redshift [17], Aurora [38], Athena [37], Google BigQuery [36] (based on the architecture of Dremel [25]), Vertica [43], Microsoft Azure Synapse Analytics [2], and the Relational Cloud [9]. Other works present cluster management systems [20, 24, 32] that can also operate across clouds [5, 6, 12, 48]. Managing the deployment of these systems safely at scale also presents challenges [21, 49]. While these papers present the system architecture and analytics-related capabilities of these systems, this is the only work, to the best of our knowledge, that presents a distributed control plane operating on multiple cloud service providers. This paper focuses on the topics of automatic code and resource management, in particular autoscaling and throttling.
Cloud service availability. Improving availability for cloud services at global scale poses many challenges [26]. Related work has used modeling of availability guarantees to investigate the efficacy of different techniques [42]. To achieve high availability, other works leveraged techniques like load balancing using a combination of multiple availability zones and multiple cloud providers [27, 50]. Our work presents AZ skew data across clusters of a Snowflake deployment and highlights the impact of rolling out global load balancing in production.
Cloud resource management and autoscaling. Resource management techniques have been very prevalent in academic literature over the past decades. With the proliferation of the cloud, resources like CPU [46, 47] and network bandwidth [15] can now be tightly coupled in resource pools and scaled independently to match workload demand [11, 28, 29, 35]. Unlike existing systems like Kubernetes that focus on containers and are limited in terms of vertical scaling and cluster stability, ECS's autoscaler was designed to enable VM clusters to scale both horizontally and vertically with high stability while enabling dynamic instance replacement reactive to VM health. To better enable resource management, multitenant systems introduced concepts for memory sharing in virtual machines, including ballooning and idle-memory taxation [45]. Memshare [8] introduced mechanisms for sharing cache capacity in multitenant systems to maximize the hit rate for applications. Other efforts on cloud resource allocation focus on performance, cost, and fairness optimizations [16, 23, 31, 40, 41]. Our paper adds to the existing work by introducing techniques that preserve ECS elasticity for orders of magnitude higher load and by providing empirical results from enabling autoscaling techniques in production.

Figure 22: After Dynamic Throttling was rolled out, CPU overloading stops. The figure presents VM throttling
coefficients (top) and CPU load averages (bottom) per VM. Coefficient values less than 1.0 on the logarithmic
y-axis indicate hosts with reduced concurrency limits, and values greater than 1.0 indicate hosts with expanded
concurrency limits.

to achieve a good trade-off between resource usage and per-


formance [30] and Raghavan et al. discusses design and im-
plementation considerations of distributed rate limiters [34].

6 CONCLUSION
We present the design and architecture of Snowflake’s con-
trol plane. Elastic Cloud Services manages the Snowflake
fleet at scale and is responsible for VM and cluster lifecy-
cle, health management, self-healing automation, topology,
account service placement, traffic control, cross-cloud and
cross-region replication, and resource management. ECS’s
goal is to support a stable Snowflake service, resilient to
failures that transparently manage dynamic resource utiliza-
tion in a cost-efficient manner. We showcase how we have
6 CONCLUSION
We present the design and architecture of Snowflake's control plane. Elastic Cloud Services manages the Snowflake fleet at scale and is responsible for VM and cluster lifecycle, health management, self-healing automation, topology, account service placement, traffic control, cross-cloud and cross-region replication, and resource management. ECS's goal is to support a stable Snowflake service that is resilient to failures and transparently manages dynamic resource utilization in a cost-efficient manner. We showcase how we have been able to support automatic code management through safe rollout/rollback, VM lifecycle and cluster pool management, as well as how we balance Snowflake's deployments equally across availability zones. We optimize resource management through horizontal and vertical autoscaling, and we have begun initiatives to predictively scale clusters before they hit any resource limits. After enabling throttling and autoscaling, we observed that many clusters have cyclical workloads with intervals ranging from minutes to weeks. We are currently exploring different statistical methods on the temporal data collected from different clusters in order to better serve our customers' needs while also minimizing costs.
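As one simple example of such a statistical check, shown purely for illustration and not as the method used in production, the dominant period of a cluster's load series can be estimated with a plain autocorrelation scan; the sample signal and parameters below are assumed values.

# Illustrative only: estimate the dominant period of a load series via
# autocorrelation over candidate lags.
def dominant_period(samples, min_lag=1, max_lag=None):
    n = len(samples)
    max_lag = max_lag or n // 2
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    def autocorr(lag):
        num = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        den = sum(c * c for c in centered) or 1.0
        return num / den
    # The lag with the highest autocorrelation is the candidate period.
    return max(range(min_lag, max_lag), key=autocorr)

# Example: a synthetic load signal that repeats every 24 samples.
load = [10 + 5 * ((i % 24) < 8) for i in range(240)]
print(dominant_period(load))  # 24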
We combine those capabilities with throttling so that customers are not impacted by underlying infrastructure changes. We evaluated ECS capabilities in production and present results at the scale of Snowflake's Data Cloud.

REFERENCES
[1] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. 2009. Above the Clouds: A Berkeley View of Cloud Computing. Technical Report UCB/EECS-2009-28.
[2] Microsoft Azure. Retrieved: 2022-5-31. Azure Synapse Analytics. https://azure.microsoft.com/en-us/services/synapse-analytics.
[3] Microsoft Azure. Retrieved: 2022-5-31. Throttling Resource Manager requests. https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/request-limits-and-throttling.
[4] V.R. Basili and R.W. Selby. 1987. Comparing the Effectiveness of Software Testing Strategies. IEEE Transactions on Software Engineering (1987). https://doi.org/10.1109/TSE.1987.232881
[5] Daniel Baur and Jörg Domaschka. 2016. Experiences from Building a Cross-Cloud Orchestration Tool. In CrossCloud '16. https://doi.org/10.1145/2904111.2904116
[6] José Carrasco, Francisco Durán, and Ernesto Pimentel. 2018. Trans-cloud: CAMP/TOSCA-based bidimensional cross-cloud. Comput. Stand. Interfaces 58 (2018), 167–179.
[7] Hsiang-Yun Cheng, Chung-Hsiang Lin, Jian Li, and Chia-Lin Yang. 2010. Memory Latency Reduction via Thread Throttling. In MICRO-43. 53–64. https://doi.org/10.1109/MICRO.2010.39
[8] Asaf Cidon, Daniel Rushton, Stephen M. Rumble, and Ryan Stutsman. 2017. Memshare: a Dynamic Multi-tenant Key-value Cache. In USENIX ATC '17. 321–334.
[9] Carlo Curino, Evan Jones, Raluca Popa, Nirmesh Malviya, Eugene Wu, Samuel Madden, Hari Balakrishnan, and Nickolai Zeldovich. 2011. Relational Cloud: A Database-as-a-Service for the Cloud. CIDR '11, 235–240.
[10] Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Steven Pelley, Peter Povinec, Greg Rahn, Spyridon Triantafyllis, and Philipp Unterbrunner. 2016. The Snowflake Elastic Data Warehouse. In Proceedings of the 2016 International Conference on Management of Data. 215–226. https://doi.org/10.1145/2882903.2903741
[11] Kubernetes Documentation. Retrieved: 2022-5-31. Horizontal Pod Autoscaling. https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/.
[12] Ravi Teja Dodda, Chris Smith, and Aad van Moorsel. 2009. An Architecture for Cross-Cloud System Management. In IC3.
[13] Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N. Patt. 2012. Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multicore Memory Systems. ACM Trans. Comput. Syst. 30, 2 (2012). https://doi.org/10.1145/2166879.2166881
[14] Daniel Ford, François Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan. 2010. Availability in Globally Distributed Storage Systems. In OSDI '10. 61–74.
[15] Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. 2016. Network Requirements for Resource Disaggregation. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation. 249–264.
[16] Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, and Ion Stoica. 2011. Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. In NSDI '11. 323–336.
[17] Anurag Gupta, Deepak Agarwal, Derek Tan, Jakub Kulesza, Rahul Pathak, Stefano Stefani, and Vidhya Srinivasan. 2015. Amazon Redshift and the Case for Simpler Data Warehouses. In SIGMOD '15. 1917–1923. https://doi.org/10.1145/2723372.2742795
[18] Jenkins. Retrieved: 2022-5-31. Jenkins continuous integration and continuous delivery. https://www.jenkins.io.
[19] Eric Jonas, Johann Schleier-Smith, Vikram Sreekanti, Chia-Che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, Joao Menezes Carreira, Karl Krauth, Neeraja Yadwadkar, Joseph Gonzalez, Raluca Ada Popa, Ion Stoica, and David A. Patterson. 2019. Cloud Programming Simplified: A Berkeley View on Serverless Computing. Technical Report UCB/EECS-2019-3.
[20] Vangelis Koukis, Constantinos Venetsanopoulos, and Nectarios Koziris. 2013. okeanos: Building a Cloud, Cluster by Cluster. IEEE Internet Computing 17, 3 (2013), 67–71. https://doi.org/10.1109/MIC.2013.43
[21] Ze Li, Qian Cheng, Ken Hsieh, Yingnong Dang, Peng Huang, Pankaj Singh, Xinsheng Yang, Qingwei Lin, Youjiang Wu, Sebastien Levy, and Murali Chintalapati. 2020. Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Large-Scale Cloud Infrastructure. In NSDI '20. 389–402.
[22] Tania Lorido-Botrán, Jose Miguel-Alonso, and Jose Lozano. 2014. A Review of Auto-scaling Techniques for Elastic Applications in Cloud Environments. Journal of Grid Computing 12 (2014). https://doi.org/10.1007/s10723-014-9314-7
[23] Jose Luis Lucas-Simarro, Rafael Moreno-Vozmediano, Rubén Montero, and Ignacio Llorente. 2013. Scheduling strategies for optimal service deployment across multiple clouds. Future Generation Computer Systems 29 (2013), 1431–1441. https://doi.org/10.1016/j.future.2012.01.007
[24] E. Michael Maximilien, Ajith Ranabahu, Roy Engehausen, and Laura C. Anderson. 2009. IBM altocumulus: a cross-cloud middleware and platform. In OOPSLA '09 Companion.
[25] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. 2010. Dremel: Interactive Analysis of Web-Scale Datasets. Proc. VLDB Endow. 3, 1–2 (2010), 330–339. https://doi.org/10.14778/1920841.1920886
[26] Jeffrey C. Mogul, Rebecca Isaacs, and Brent Welch. 2017. Thinking about Availability in Large Service Infrastructures. In Proc. HotOS XVI.
[27] R. Moreno-Vozmediano, R. S. Montero, E. Huedo, and I. M. Llorente. 2018. Orchestrating the Deployment of High Availability Services on Multi-Zone and Multi-Cloud Scenarios. J. Grid Comput. 16, 1 (2018), 39–53.
[28] Rafael Moreno-Vozmediano, Rubén S. Montero, Eduardo Huedo, and Ignacio M. Llorente. 2019. Efficient Resource Provisioning for Elastic Cloud Services Based on Machine Learning Techniques. J. Cloud Comput. 8, 1 (2019). https://doi.org/10.1186/s13677-019-0128-9
[29] Rafael Moreno-Vozmediano, Ruben S. Montero, and Ignacio M. Llorente. 2009. Elastic Management of Cluster-Based Services in the Cloud. In Proceedings of the 1st Workshop on Automated Control for Datacenters and Clouds (ACDC '09). 19–24. https://doi.org/10.1145/1555271.1555277
[30] Masoud Moshref, Minlan Yu, Abhishek Sharma, and Ramesh Govindan. 2012. vCRIB: Virtualized Rule Management in the Cloud. In HotCloud '12.
[31] Mihir Nanavati, Jake Wires, and Andrew Warfield. 2017. Decibel: Isolation and Sharing in Disaggregated Rack-Scale Storage. In NSDI '17. 17–33.
[32] Rene Peinl, Florian Holzschuher, and Florian Pfitzer. 2016. Docker Cluster Management for the Cloud - Survey Results and Own Solution. Journal of Grid Computing 14 (2016). https://doi.org/10.1007/s10723-016-9366-y
[33] Google Cloud Platform. Retrieved: 2022-5-31. Rate-limiting strategies and techniques. https://cloud.google.com/architecture/rate-limiting-strategies-techniques.
[34] Barath Raghavan, Kashi Vishwanath, Sriram Ramabhadran, Kenneth Yocum, and Alex C. Snoeren. 2007. Cloud Control with Distributed Rate Limiting. SIGCOMM Comput. Commun. Rev. 37, 4 (Aug 2007), 337–348. https://doi.org/10.1145/1282427.1282419
[35] Rodrigo da Rosa Righi, Vinicius Facco Rodrigues, Cristiano André da Costa, Guilherme Galante, Luis Carlos Erpen de Bona, and Tiago Ferreto. 2016. AutoElastic: Automatic Resource Elasticity for High Performance Applications in the Cloud. IEEE Transactions on Cloud Computing 4, 1 (2016), 6–19. https://doi.org/10.1109/TCC.2015.2424876
[36] Kazunori Sato. Retrieved: 2022-5-31. An Inside Look at Google BigQuery. https://cloud.google.com/files/BigQueryTechnicalWP.pdf.
[37] Amazon Web Services. Retrieved: 2022-5-31. Amazon Athena Serverless Interactive Query Service. https://aws.amazon.com/athena.
[38] Amazon Web Services. Retrieved: 2022-5-31. Amazon Aurora MySQL PostgreSQL Relational Database. https://aws.amazon.com/rds/aurora/.
[39] Amazon Web Services. Retrieved: 2022-5-31. Throttle API requests for better throughput. https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-request-throttling.html.
[40] David Shue, Michael J. Freedman, and Anees Shaikh. 2013. Fairness and Isolation in Multi-Tenant Storage as Optimization Decomposition. SIGOPS Oper. Syst. Rev. 47, 1 (2013), 16–21. https://doi.org/10.1145/2433140.2433145
[41] Ioan Stefanovici, Eno Thereska, Greg O'Shea, Bianca Schroeder, Hitesh Ballani, Thomas Karagiannis, Antony Rowstron, and Tom Talpey. 2015. Software-Defined Caching: Managing Caches in Multi-Tenant Data Centers. In SoCC '15. 174–181. https://doi.org/10.1145/2806777.2806933
[42] Astrid Undheim, Ameen Chilwan, and Poul Heegaard. 2011. Differentiated Availability in Cloud Computing SLAs. In 2011 IEEE/ACM 12th International Conference on Grid Computing. 129–136. https://doi.org/10.1109/Grid.2011.25
[43] Ben Vandiver, Shreya Prasad, Pratibha Rana, Eden Zik, Amin Saeidi, Pratyush Parimal, Styliani Pantela, and Jaimin Dave. 2018. Eon Mode: Bringing the Vertica Columnar Database to the Cloud. In SIGMOD '18. 797–809. https://doi.org/10.1145/3183713.3196938
[44] Midhul Vuppalapati, Justin Miron, Rachit Agarwal, Dan Truong, Ashish Motivala, and Thierry Cruanes. 2020. Building An Elastic Query Engine on Disaggregated Storage. In NSDI '20. 449–462.
[45] Carl A. Waldspurger. 2003. Memory Resource Management in VMware ESX Server. SIGOPS Oper. Syst. Rev. 36 (2003), 181–194. https://doi.org/10.1145/844128.844146
[46] Carl A. Waldspurger and William E. Weihl. 1994. Lottery Scheduling: Flexible Proportional-Share Resource Management. In OSDI '94.
[47] C. A. Waldspurger and W. E. Weihl. 1995. Stride Scheduling: Deterministic Proportional-Share Resource Management. Technical Report.
[48] Huaimin Wang, Peichang Shi, and Yiming Zhang. 2017. JointCloud: A Cross-Cloud Cooperation Architecture for Integrated Internet Service Customization. In ICDCS '17. 1846–1855. https://doi.org/10.1109/ICDCS.2017.237
[49] Ingo Weber, Hiroshi Wada, Alan Fekete, Anna Liu, and Len Bass. 2012. Automatic Undo for Cloud Management via AI Planning. In HotDep '12. https://www.usenix.org/conference/hotdep12/workshop-program/presentation/Weber
[50] Xin Xie, Chentao Wu, Junqing Gu, Han Qiu, Jie Li, Minyi Guo, Xubin He, Yuanyuan Dong, and Yafei Zhao. 2019. AZ-Code: An Efficient Availability Zone Level Erasure Code to Provide High Fault Tolerance in Cloud Storage Systems. In MSST '19. 230–243. https://doi.org/10.1109/MSST.2019.00004
[51] Jingyu Zhou, Meng Xu, Alexander Shraer, Bala Namasivayam, Alex Miller, Evan Tschannen, Steve Atherton, Andrew J. Beamon, Rusty Sears, John Leach, Dave Rosenthal, Xin Dong, Will Wilson, Ben Collins, David Scherer, Alec Grieser, Young Liu, Alvin Moore, Bhaskar Muppana, Xiaoge Su, and Vishesh Yadav. 2021. FoundationDB: A Distributed Unbundled Transactional Key Value Store. 2653–2666.
