Elastic Cloud Services: Scaling Snowflake’s Control Plane

Themis Melissaris, Kunal Nabar, Rares Radut
Snowflake Inc.
T-shirt sizes (S, M, L, etc.) corresponding to the amount of compute power in the warehouse, and consist of VMs with software stacks managed by Snowflake to run customers’ jobs. Each virtual warehouse is an independent compute cluster that does not share compute resources with other virtual warehouses.

ECS is the "control plane" that provides the abstractions and connective tissue across cloud service providers to schedule and place customer jobs on the data plane, as well as to ensure elasticity and availability. ECS operates a fleet of VMs that help manage Snowflake’s infrastructure and are managed independently of the data plane VMs that are responsible for running customer workloads. The data plane and the control plane interact with each other to avoid system overload. ECS includes a collection of services responsible for managing and orchestrating Snowflake components. It runs on compute VMs provisioned from cloud service providers, provides functionalities such as Infrastructure Management, Metadata Management, and Query Parsing, and ties together different components of the Snowflake Data Cloud to manage and process user requests. ECS VMs are organized in clusters, identified by the code version the VMs run on, the type of service they provide, and the customer accounts that the clusters serve. The ECS layer also supports multi-tenancy, the ability of an ECS cluster to provide a given control plane service to multiple customer accounts.

Leveraging cloud service providers to provide Software-as-a-Service at scale has many benefits but also poses a number of interesting challenges [19]:

Control plane that runs on multiple cloud providers: Snowflake operates in multiple regions of cloud service providers such as Amazon Web Services, Microsoft Azure, and Google Cloud Platform. By providing a consistent customer experience irrespective of region and underlying cloud provider, Snowflake enables customers to avoid the risks associated with using a single cloud service provider. To achieve this, behind the scenes Snowflake deploys ECS on Virtual Machines (VMs) as the atomic unit of compute, since they provide the lowest common denominator of functionality across cloud providers.

Automatic and safe software management: As software complexity and size rise, it becomes increasingly challenging to ensure that a specific software release does not have unforeseen effects in production environments. At the same time, the system needs to ensure that software releases can happen online continuously without downtime, while also ensuring that each release is transparent to Snowflake customers. While we apply a wide range of testing [4] and continuous integration and delivery methodologies [18] to minimize erroneous behaviors, it is impossible to test all permutations of possibilities at scale. It is therefore critical to manage code deployments automatically by rolling out and rolling back changes safely, without downtime to customer workloads. The ability to instantly roll back ensures continuity of service for customers and the ability to swiftly recover from production incidents. An additional desired property is to automatically update or revert code versions on different computing entities, such as VMs, independently, as workloads might be heterogeneous.

Managing resource lifecycle: Not all VMs under management will be healthy and functional at scale. To ensure correct operation, the control plane manages the lifecycle of cloud services across a fleet of computing nodes and monitors individual nodes’ health. The system needs to make decisions for every VM state change, for example, when it is necessary to terminate unhealthy VMs, to move VMs in and out of a cluster, and to scale the number of nodes in a cluster horizontally or vertically depending on the workload. This must be done gracefully, as Data Cloud workloads can last from seconds to days.

Ensuring high availability: Cloud service providers set a high standard for availability, typically bounded by a Service Level Agreement. To reduce the blast radius of failures to availability, cloud service providers use Availability Zones (AZs), which are logical data centers in each cloud region [1]. For SaaS platforms, the use of multiple cloud service providers in conjunction with multiple AZs across multiple cloud regions can be used to increase service availability, as AZs are in theory isolated and fail independently of each other [14]. Balancing VMs across availability zones is intended to limit customer impact in the presence of any zonal outage, as requests can be transparently redirected to a VM in another zone. All Snowflake production deployments exist in a cloud region with multiple availability zones, and the control plane is able to maintain zone balance at both the cluster level and the deployment level.

Enabling dynamic autoscaling and throttling: The resource demands of workloads can often fluctuate. To account for workloads with spiky or unpredictable behavior while maintaining responsiveness, it is desirable for a cloud service to be elastic and automatically scale by adjusting the resources allocated to each application based on its needs [11, 22]. Dynamic autoscaling automatically sizes clusters, factoring in several system properties such as CPU load, network throughput, request rejection rate, or memory usage. To ensure smooth operation of VMs within pre-defined resource limits, throttling of resources (such as CPU, memory, or network) can also be enforced. Dynamic autoscaling and throttling operate in tandem in the presence of traffic. If a cluster is overloaded with traffic, throttling is temporarily activated and more VMs are subsequently added. Likewise, if a cluster is underutilized, it is over-provisioned and the control plane will reduce the number of VMs in the cluster. At Snowflake scale, providing customers with capacity instantly
Pool instance will move into the active cluster. ECS will then move the VM into Quarantine.

Figure 6: ECS Cluster VM recycling.

Figure 7: ECS Cluster Instance Isolation.

3 BALANCING ACROSS AVAILABILITY ZONES

Customers expect Snowflake to be always available. This means designing ECS, which coordinates services and schedules warehouses to run queries, to be resilient to failures. One rare case is when a cloud service provider’s datacenter suffers an unexpected outage.

All of the cloud service providers that Snowflake runs on provide the notion of Availability Zones (AZs), which are isolated datacenters in a single region where we can provision resources. By keeping ECS VMs balanced across these AZs, we ensure minimal customer impact in the event of zonal failures, as requests are transparently redirected to a VM in another zone, as presented in Figure 8. In the case of balancing of ECS VMs across AZs, there is no need for data rebalancing, as all the data is located in, and accessible from, object storage.

Figure 8: Availability zone outage and failover. Zone B has an outage. Requests get transparently redirected to zone C.

Figure 9 presents scenarios of regional (cluster) and global (deployment) load distribution across different AZs. On the left, the first scenario presents clusters in Zones A and B that are balanced. However, globally across the deployment there is imbalance, as Zone C has no VMs. The middle scenario illustrates a balanced deployment where individual clusters are not balanced (i.e., the green cluster only has VMs in Zone A). The scenario on the right strikes a balance between global and cluster balancing by calculating the difference between the number of VMs in the most loaded zone and the least loaded zone, which we will call AZ skew.

Minimizing AZ skew is more than striping VMs across availability zones during provisioning. Within a single regional deployment, ECS implements a multi-cluster architecture where each cluster serves different groups of customers. Each cluster can scale independently to respond to the current load. To minimize the impact of an AZ outage on each cluster and the deployment, we must zone balance at both the cluster level and the deployment (global) level. A naive solution entails assigning 𝑛 VMs from each AZ to each cluster. However, this falls apart because the number of VMs in a cluster is not always divisible by the number of AZs, the number of VMs in a cluster is small (typically less than eight), and the number of clusters in a deployment is on the order of hundreds.
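To make the balancing objective concrete, the sketch below picks an availability zone for a newly provisioned cluster VM by minimizing the sum of the cluster-level and deployment-level AZ skew. This is an illustrative sketch only, not Snowflake’s implementation: the class, the method names, and the equal weighting of the two skew terms are our assumptions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: choose an AZ for a new VM so that both the cluster's
 *  and the whole deployment's AZ skew (max minus min VM count per zone) stay small. */
public class ZoneBalancer {

    static int skew(Map<String, Integer> vmsPerZone) {
        int max = Collections.max(vmsPerZone.values());
        int min = Collections.min(vmsPerZone.values());
        return max - min;
    }

    /** Returns the zone that minimizes clusterSkew + deploymentSkew after placement. */
    static String chooseZone(List<String> zones,
                             Map<String, Integer> clusterVms,
                             Map<String, Integer> deploymentVms) {
        String best = null;
        int bestCost = Integer.MAX_VALUE;
        for (String zone : zones) {
            Map<String, Integer> c = new HashMap<>(clusterVms);
            Map<String, Integer> d = new HashMap<>(deploymentVms);
            c.merge(zone, 1, Integer::sum);
            d.merge(zone, 1, Integer::sum);
            int cost = skew(c) + skew(d);   // equal weights; the real weighting is a policy choice
            if (cost < bestCost) {
                bestCost = cost;
                best = zone;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> zones = List.of("A", "B", "C");
        // The cluster currently has VMs only in zones A and B; the deployment is heaviest in A.
        Map<String, Integer> cluster = new HashMap<>(Map.of("A", 2, "B", 2, "C", 0));
        Map<String, Integer> deployment = new HashMap<>(Map.of("A", 40, "B", 35, "C", 30));
        System.out.println("Place next VM in zone " + chooseZone(zones, cluster, deployment)); // -> C
    }
}
```

In this example zone C wins because placing there reduces both the cluster’s skew (the cluster has no VM in C) and the deployment’s skew.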
Not only are these goals at times competing, but the free pool used to draw VMs to scale clusters may not have VMs of the required type in the zone we want. Since cloud provisioning and preparing a new VM can take on the order of minutes, ECS maintains a free pool to trade VMs to and from clusters,
4 THROTTLING & AUTOSCALING workloads that would be throttled by static limits could be
The traffic routed to each cluster varies widely by time of composed of easy-to-compute requests that underutilize the
day, sometimes in unpredictable ways without clear patterns. host’s hardware.
Autoscaling and throttling are enabled by default for all cus- When a VM accepts more work than it can handle, CPU
tomers and concern Snowflake’s control plane. Providing a and memory utilization become dangerously high and unde-
consistent service experience and ensuring platform avail- sirable knock-on effects begin to occur. For example, CPU
ability necessitates an elastic service layer. To create elas- thrashing can limit request processing or cause commit fail-
ticity, we implement Dynamic Throttling and Autoscaling ures for operational metadata that needs to be persisted. Ad-
based on the following tenets: ditionally, memory pressure can slowdown or halt program
Responsiveness: Queries should be processed smoothly execution.
and immediately upon receipt. The period that queries block Our solution is to implement resource-aware throttling,
before beginning or run slowly due to a shortage of resources which we call Dynamic Throttling. As opposed to setting
should be nearly non-existent in the aggregate. Snowflake’s static concurrency limits in our system, we dynamically cal-
availability SLAs depend on the ability of our cloud services culate concurrency limits based on current host-level re-
layer to to add resources to clusters quickly enough to keep source utilization every 30 seconds. When CPU load becomes
the retry-on-throttle duration short for rejected requests. high in a VM (as measured by the Linux /proc/loadavg), the
This gives the appearance of instant capacity to customers. host-local concurrency limits are immediately lowered and
Cost-efficiency: Deliver a reliable and performant service adjusted until the CPU load returns to and remains at an
using as few physical resources (VMs) as possible. acceptable level. For example, if the load on a machine is
Cluster volatility: We aim to minimize the frequency read at 2.0 and we have 200 queries, we adjust our future
of configuration changes to clusters (oscillation). There is incoming concurrency to be 100 queries which will in the
a system-operation burden to provisioning/de-provisioning future get us closer to our desired 1.0 load.
compute resources in a cloud environment due to unpro- Any further incoming requests will be rejected and re-
ductive setup and teardown time, warming cold caches and tried on other VMs with free resources. This is transparent
losing cached data, etc. For example, an algorithm that does to customers and adds minimal latency, protecting our sys-
a scale-in (reduce VM count) and then immediately performs tems while preserving business continuity. If this VM-level
a scale-out is highly undesirable - provisioning the new VM throttling results in rejections, a signal is transmitted to
for the scale-out, updating routing tables, warming caches, the autoscaler and triggers a cluster scale-out. On the other
and other tasks tend to bear a higher system burden than is hand, if VM load is low, we recognize that we can take more
saved in terms of cost by removing a VM for only a short work on and increase limits to improve our cost-efficiency.
duration To ensure fairness, concurrency limits are computed at the
Throughput: The total number of successful queries that account- and user-level within each VM to avoid single users
client(s) can submit to the cluster. The system should scale or accounts in multitenant clusters from ever saturating con-
to meet demand. currency limits and temporarily denying service from other
Latency: The average duration for a query to complete. customers.
This should be as low as possible. Given a static cluster size, To complement this VM-level throttling, we run a central-
there is an inverse relationship between latency and through- ized autoscaler. When the aggregate resource load is high
put: generally, one query can finish quickly, but when sub- across the active VMs in a cluster or if a quorum of the clus-
mission throughput is sufficiently high requests will start ter’s VMs is rejecting work, we will increase the number
piling up. Beyond that point query latency increases due to of VMs in the cluster. Likewise, we reduce the cluster size
queueing. A good autoscaling algorithm should keep latency when no rejections are occurring and the cluster load is low
low even as the request volume increases. to reduce costs.
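The load-proportional adjustment described above (a load of 2.0 with 200 concurrent queries yielding a new limit of roughly 100) can be sketched as follows. The class, the 1.0 target, and the clamping floor are assumptions made for illustration, not Snowflake’s actual code.

```java
/**
 * Hypothetical sketch of host-local Dynamic Throttling: every adjustment period the
 * concurrency limit is scaled by targetLoad / currentLoad, so a host at load 2.0 that
 * currently admits 200 concurrent queries would drop its limit to roughly 100.
 */
public class DynamicThrottle {

    private static final double TARGET_LOAD = 1.0;  // desired normalized CPU load (load per core)
    private static final int MIN_LIMIT = 1;

    private int concurrencyLimit;

    public DynamicThrottle(int initialLimit) {
        this.concurrencyLimit = initialLimit;
    }

    /** Called roughly every 30 seconds with the normalized load read from /proc/loadavg. */
    public void adjust(double currentNormalizedLoad) {
        if (currentNormalizedLoad <= 0) {
            return; // nothing to react to
        }
        double scaled = concurrencyLimit * (TARGET_LOAD / currentNormalizedLoad);
        concurrencyLimit = Math.max(MIN_LIMIT, (int) Math.floor(scaled));
    }

    /** Gateway check: requests beyond the limit are rejected and retried on another VM. */
    public boolean admit(int inFlightRequests) {
        return inFlightRequests < concurrencyLimit;
    }

    public int limit() {
        return concurrencyLimit;
    }

    public static void main(String[] args) {
        DynamicThrottle throttle = new DynamicThrottle(200);
        throttle.adjust(2.0);                 // overload: limit drops to ~100
        System.out.println(throttle.limit()); // 100
        throttle.adjust(0.5);                 // underload: limit expands again
        System.out.println(throttle.limit()); // 200
    }
}
```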
Figure 16: As the current CPU load average oscillates, ECS makes decisions to scale in or out based on current load. The red line represents "cluster size stability". As the load stays within that window, we will not change the number of VMs in the cluster. As the number of current ECS VMs increases, the cluster size stability window decreases. ECS has mechanisms to avoid cluster size oscillations.

Snowflake’s ECS considers signals around memory pressure and CPU load to make decisions. ECS’s autoscaler provides a clear delineation of what constitutes unhealthy behavior, and its policies give ECS flexibility, in comparison to the state of the art, to integrate with mechanisms for VM health and to evolve the autoscaler and the isolation/health management independently of each other.

ECS leverages rejecting gateways that return a 503 HTTP status code to the caller with an expectation to retry, and the rejection is ingested into the autoscaler as an indicator to scale up. All customer queries and work must go through one of these gateways to provide the autoscaler with accurate information on how to scale the cluster.

The number of desired VMs is calculated as the number of current VMs multiplied by the ratio of the current load over the desired load. As the cluster size increases, we are more likely to scale on minor deviations in the normalized CPU load average (load per core). Figure 16 shows a graph of the scaling decisions that our autoscaling mechanism would make based on the CPU load average. The red section is the "cluster size stability" section: as long as the load stays within that window, we will not change the number of VMs in the cluster, a mechanism that introduces stability. As we add more VMs, the window becomes tighter. If the current load does not fall within the acceptable window (e.g., between 0.9 and 1.0), we use the desired VM count to compute a new size for the cluster. Finally, to scale out we only look at the 1-minute load average and check whether it is high. To scale in, we look at the load averages over different time windows (1-minute, 5-minute, 15-minute), all of which have to be low in order to avoid oscillation in the cluster.
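A minimal sketch of this sizing rule is shown below, under simplifying assumptions of our own (a fixed 0.9–1.0 stability window and direct use of the 1-, 5-, and 15-minute load averages). The production autoscaler additionally tightens the window as the cluster grows and consumes rejection signals from the gateways.

```java
/**
 * Hypothetical sketch of the cluster-sizing rule: desired VMs = current VMs *
 * (current load / desired load), applied only when the load leaves a stability
 * window. Scale-out looks only at the 1-minute load average; scale-in requires
 * the 1-, 5-, and 15-minute averages to all be low, to avoid oscillation.
 */
public class ClusterAutoscaler {

    private static final double DESIRED_LOAD = 1.0;
    private static final double WINDOW_LOW = 0.9;   // assumed stability window
    private static final double WINDOW_HIGH = 1.0;

    /** Returns the new VM count for the cluster given normalized load averages. */
    static int resize(int currentVms, double load1m, double load5m, double load15m) {
        boolean scaleOut = load1m > WINDOW_HIGH;
        boolean scaleIn = load1m < WINDOW_LOW && load5m < WINDOW_LOW && load15m < WINDOW_LOW;
        if (!scaleOut && !scaleIn) {
            return currentVms;  // inside the cluster-size-stability window: do nothing
        }
        double drivingLoad = scaleOut ? load1m : Math.max(load1m, Math.max(load5m, load15m));
        int desired = (int) Math.ceil(currentVms * (drivingLoad / DESIRED_LOAD));
        return Math.max(1, desired);
    }

    public static void main(String[] args) {
        System.out.println(resize(4, 1.6, 1.2, 1.0));  // high 1-minute load -> scale out to 7
        System.out.println(resize(4, 0.95, 0.9, 0.8)); // inside the window -> stay at 4
        System.out.println(resize(8, 0.4, 0.45, 0.5)); // all windows low -> scale in to 4
    }
}
```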
4.3 Throttling & Autoscaling Evaluation

Our success criterion for throttling was to maintain responsiveness while ensuring the stability of our VMs. We used a combination of synthetic testing and analysis of results on real production servers to test this. Figure 17 illustrates a synthetic workload running a representative benchmark. In this test environment, we disabled autoscaling. Only Dynamic Throttling was active, and we configured the cluster to have two VMs. The generated workload exceeded the CPU capacity of two VMs. Dynamic Throttling reacted to the excessive load and reduced the gateway limits to maintain the healthy state of the available VMs. In Figure 17a, we report the CPU load of the two VMs. We observe that the load exceeded the available CPU capacity. This is correlated with the query count in Figure 17c. Figure 17b shows the throttle coefficient of the throttled VM being lowered to enforce a new gateway size, effectively applying a small multiplier to our gateways. Finally, fewer queries are let through after the throttler determines the new gateway limit.
In Figure 17c, the query throughput drops to a sustainable amount, which also causes a drop in the associated load the VMs face. Furthermore, after the initial estimation, the coefficient in Figure 17b adjusts incrementally. It initially drops lower to temper the workload further and helps keep the CPU load at approximately 1.0, our target. Figure 17d displays the coefficient used for the account- and user-specific limits. To maintain fairness, we apply the limit evenly, causing all accounts to reduce evenly, and therefore the account with the highest job count will be the first to be limited. The limitation of dynamic throttling is transient, and upon rejecting these queries, we signal to the autoscaling framework to add VMs to overloaded clusters. Our intent is to temper transient load spikes and prevent them from causing downstream issues in our system until we can accommodate their workload.

Now that we have the base algorithm, we enter the optimization stage. There are a number of parameters we can tweak to optimize for the features we desire; these features effectively form the pillars of our throttling technique. The three pillars of throttling are:
• Lower the load as soon as possible, but do not overshoot;
• Minimize oscillation;
• Reduce the throttle as soon as possible.
We reject queries as cheaply as possible and retry once more VMs are available. We are effectively deferring load to a later point in time, by which the autoscaling will have increased our computing capacity. Because we throttle to keep our CPU load at a healthy level, the throttler must surface metrics about its rejection behavior as a signal to the autoscaler; the CPU load signal ideally will not kick in if we throttle effectively. These rejections are persisted in FoundationDB [51], our operational metadata database, and aggregated globally by our autoscaling framework to make decisions. Once capacity has been added, we incrementally revert the throttling coefficients to bring clusters back to a steady state where they are neither throttling nor overloaded, satisfying our three throttling design pillars.
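These pillars imply an asymmetric control loop: cut the throttle coefficient quickly under overload, then walk it back in small increments once rejections stop. The sketch below illustrates that shape only; the step size, bounds, and method names are our assumptions, not measured Snowflake parameters.

```java
/**
 * Hypothetical sketch of the asymmetric throttle-coefficient loop implied by the
 * three pillars: drop the coefficient sharply under overload (but not to zero),
 * and revert it in small increments once rejections stop, so clusters return to
 * a steady state that is neither throttling nor overloaded.
 */
public class ThrottleCoefficient {

    private static final double MIN_COEFFICIENT = 0.1;   // never shut the gateway completely
    private static final double MAX_COEFFICIENT = 1.0;   // 1.0 = unthrottled baseline
    private static final double RECOVERY_STEP = 0.05;    // assumed gentle ramp-back per period

    private double coefficient = MAX_COEFFICIENT;

    /** One control period: overloaded hosts cut hard, healthy hosts ease back toward 1.0. */
    public void update(double currentLoad, double targetLoad, boolean rejectionsOccurring) {
        if (currentLoad > targetLoad) {
            // Pillar 1: lower the load as soon as possible, proportionally, without overshooting to zero.
            coefficient = Math.max(MIN_COEFFICIENT, coefficient * (targetLoad / currentLoad));
        } else if (!rejectionsOccurring) {
            // Pillar 3: reduce the throttle as soon as it is safe, in small steps (pillar 2: no oscillation).
            coefficient = Math.min(MAX_COEFFICIENT, coefficient + RECOVERY_STEP);
        }
    }

    public double value() {
        return coefficient;
    }

    public static void main(String[] args) {
        ThrottleCoefficient c = new ThrottleCoefficient();
        c.update(2.0, 1.0, true);     // overload: coefficient halves to 0.5
        System.out.println(c.value());
        c.update(0.8, 1.0, false);    // load recovered: ease back toward 1.0 (about 0.55)
        System.out.println(c.value());
    }
}
```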
4.3.1 Example 1: Snowflake internal analysis cluster. We scaled in the cluster when enabling the features on our internal data analysis account. Figure 18 shows the cluster VM count in the bottom chart. We reduced the cluster to two VMs, as the scaling framework recognized that we did not require the compute resources of all four VMs. In Figure 19, two of the lines, each representing a VM, drop to serving 0 queries per second as they are removed from the topology. The queries per second and concurrent requests of the remaining two VMs increase, as we now have to maintain the same overall cluster throughput with two VMs. The throttling coefficients are expanded automatically with dynamic throttling, providing additional space for queries we can safely handle. This removed manual work to scale VMs or limits. In addition, it unlocked higher throughput on each VM. As a result, the system recognizes when clusters are overprovisioned and can reduce the number of VMs to optimize resource usage.

4.3.2 Example 2: Customer query concurrency increased. Figure 20 shows a customer’s workload. The top chart shows our account coefficient, which in this case is greater than 1 when we expand the respective customer’s gateways. After the rollout around 3 pm, the rejection rate in the bottom chart was reduced and concurrent requests increased, which can be seen in the center chart.

4.3.3 Example 3: Noisy neighbor. We use autoscaling and dynamic throttling to work around "noisy neighbor" issues. Figure 21 is an example of a cluster we scaled up dramatically to handle an increased workload. Prior to the noisy neighbor, our traffic was low enough for the autoscaling system to decide to reduce VM counts to two or three. The successful queries per cluster showcase the effective throughput after the autoscaling system increased the number of VMs. Effectively, we were able to immediately quadruple the throughput of the cluster.

4.3.4 Example 4: Deployment load. Dynamic Throttling also reduced cases of high CPU load that VMs encounter. As shown in Figure 22, after the release of Dynamic Throttling the 5-minute CPU load average charts lowered noticeably. In particular, almost no VM had a 5-minute CPU load higher than our target threshold of 1.

4.4 Throttling and Autoscaling for Memory

Due to the heterogeneous nature of workloads, it is possible for operations such as complex query compilations to require significantly more memory than CPU; the worst of these extremes would be a single query requiring more memory than the VM even has. In this case, simply adding more VMs to the cluster, where none of the new VMs have more physical memory, will not accommodate the workload. The most intuitive solution is to move the workload to a new set of VMs, each with larger physical memory. This motivated a concept of "vertical scaling", in which clusters may migrate to new VM types with different levels of hardware resources to best accommodate a workload. This is different from the aforementioned model of adding or removing VMs of the same type within a cluster, which we call "horizontal scaling". Although it is highly unlikely that a single query uses the entire available memory of a VM, there are benefits to moving a workload to a larger VM. One of the largest advantages is the reduced cache redundancy within a cluster of VMs.
Figure 17: As CPU load increases, we deflect future load by dynamically throttling and reducing future concurrent requests to an amount estimated to keep load safe. (a) CPU load for two VMs over time. (b) VM throttle coefficient over time for two VMs. (c) Cluster throughput in queries/sec over time for the ECS cluster. (d) Account and user throttle coefficient over time for two VMs.

Figure 18: Autoscaling reduces VM count, thereby driving down unnecessary over-provisioning across Snowflake.

Metadata caches consume a significant portion of the available memory, and having a large number of VMs with small amounts of memory results in each VM having a relatively smaller cache that is more likely to duplicate cache entries that already exist on other VMs, compared to a cluster with a smaller number of high-memory VMs. The autoscaling system implements memory-based vertical scaling by reading JVM-level heap memory and throttling VMs that are close to exhaustion. This acts as a signal to quiesce the old VMs and migrate the cluster to a new set of larger VMs. This migration has sharply reduced occurrences of memory-related isolation events (a VM replaced due to OOM, full garbage collection, etc.) by about 90%. Figure 23 illustrates that in the first sample of clusters to enable vertical autoscaling, beginning the week of November 8, 2021, we moved from a highly variable rate of 15-30 isolations per day, with spikes of as many as 100, to a much more consistent rate of 0-2 isolations per day.
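As a rough illustration of this trigger, the sketch below reads JVM heap usage and converts sustained pressure into a signal to throttle the host and migrate the cluster to a larger VM type. The 90% threshold and the class and action names are assumptions for this example, not Snowflake’s implementation.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

/**
 * Hypothetical sketch of the memory-based vertical-scaling trigger: read JVM heap
 * usage, throttle the host when it nears exhaustion, and signal the control plane
 * to migrate the cluster to a larger VM type instead of just adding more VMs.
 */
public class MemoryPressureMonitor {

    private static final double HEAP_PRESSURE_THRESHOLD = 0.90;  // assumed threshold

    enum ScalingAction { NONE, THROTTLE_AND_SCALE_UP_VM_TYPE }

    static double heapUtilization() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        return heap.getMax() > 0 ? (double) heap.getUsed() / heap.getMax() : 0.0;
    }

    static ScalingAction evaluate(double heapUtilization) {
        // Horizontal scaling does not help when a single workload needs more memory than
        // any one VM has, so sustained heap pressure requests a larger VM type instead.
        return heapUtilization >= HEAP_PRESSURE_THRESHOLD
                ? ScalingAction.THROTTLE_AND_SCALE_UP_VM_TYPE
                : ScalingAction.NONE;
    }

    public static void main(String[] args) {
        double utilization = heapUtilization();
        System.out.printf("heap utilization: %.2f -> %s%n", utilization, evaluate(utilization));
    }
}
```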
Figure 22: After Dynamic Throttling was rolled out, CPU overloading stops. The figure presents VM throttling coefficients (top) and CPU load averages (bottom) per VM. Coefficient values less than 1.0 on the logarithmic y-axis indicate hosts with reduced concurrency limits, and values greater than 1.0 indicate hosts with expanded concurrency limits.

Figure 23: Memory-based throttling and vertical scaling drastically reduce memory isolations.

Resource throttling. Architectures that share resources across multiple cores in CPUs and multiple tenants in a virtual machine exacerbate resource management problems, since resources like CPU, memory, or network utilization are shared and lead to resource contention. Works like Cheng et al. [7] restrict the number of concurrent memory tasks to avoid interference among memory requests and throttle down cores after estimating unfair resource usage in the memory subsystem [13]. In the context of resource management in the cloud, cloud service providers expose APIs for resource throttling [3, 33, 39]. Other works automatically partition and place rules at both hypervisors and switches

6 CONCLUSION

We present the design and architecture of Snowflake’s control plane. Elastic Cloud Services manages the Snowflake fleet at scale and is responsible for VM and cluster lifecycle, health management, self-healing automation, topology, account service placement, traffic control, cross-cloud and cross-region replication, and resource management. ECS’s goal is to support a stable Snowflake service that is resilient to failures and transparently manages dynamic resource utilization in a cost-efficient manner. We showcase how we have been able to support automatic code management through safe rollout/rollback, VM lifecycle and cluster pool management, as well as how we balance Snowflake’s deployments equally across availability zones. We optimize resource management through horizontal and vertical autoscaling. We have begun initiatives to predictively scale clusters prior to hitting any resource limits. After enabling throttling and autoscaling, we observed many clusters that have cyclical workloads with intervals ranging from minutes to weeks. We are currently exploring applying different statistical methods to the temporal data collected from different clusters in order to better serve our customers’ needs while also minimizing costs.

We combine those capabilities with throttling such that customers do not get impacted by underlying infrastructure changes. We evaluated ECS capabilities in production and present results at the scale of Snowflake’s Data Cloud.
REFERENCES
[1] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. 2009. Above the Clouds: A Berkeley View of Cloud Computing. Technical Report UCB/EECS-2009-28.
[2] Microsoft Azure. Retrieved 2022-5-31. Azure Synapse Analytics. https://azure.microsoft.com/en-us/services/synapse-analytics.
[3] Microsoft Azure. Retrieved 2022-5-31. Throttling Resource Manager requests. https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/request-limits-and-throttling.
[4] V. R. Basili and R. W. Selby. 1987. Comparing the Effectiveness of Software Testing Strategies. IEEE Transactions on Software Engineering (1987). https://doi.org/10.1109/TSE.1987.232881
[5] Daniel Baur and Jörg Domaschka. 2016. Experiences from Building a Cross-Cloud Orchestration Tool. In CrossCloud ’16. https://doi.org/10.1145/2904111.2904116
[6] José Carrasco, Francisco Durán, and Ernesto Pimentel. 2018. Trans-cloud: CAMP/TOSCA-based bidimensional cross-cloud. Comput. Stand. Interfaces 58 (2018), 167–179.
[7] Hsiang-Yun Cheng, Chung-Hsiang Lin, Jian Li, and Chia-Lin Yang. 2010. Memory Latency Reduction via Thread Throttling. In MICRO-43. 53–64. https://doi.org/10.1109/MICRO.2010.39
[8] Asaf Cidon, Daniel Rushton, Stephen M. Rumble, and Ryan Stutsman. 2017. Memshare: a Dynamic Multi-tenant Key-value Cache. In USENIX ATC ’17. 321–334.
[9] Carlo Curino, Evan Jones, Raluca Popa, Nirmesh Malviya, Eugene Wu, Samuel Madden, Hari Balakrishnan, and Nickolai Zeldovich. 2011. Relational Cloud: A Database-as-a-Service for the Cloud. In CIDR ’11. 235–240.
[10] Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Steven Pelley, Peter Povinec, Greg Rahn, Spyridon Triantafyllis, and Philipp Unterbrunner. 2016. The Snowflake Elastic Data Warehouse. In Proceedings of the 2016 International Conference on Management of Data. 215–226. https://doi.org/10.1145/2882903.2903741
[11] Kubernetes Documentation. Retrieved 2022-5-31. Horizontal Pod Autoscaling. https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/.
[12] Ravi Teja Dodda, Chris Smith, and Aad van Moorsel. 2009. An Architecture for Cross-Cloud System Management. In IC3.
[13] Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N. Patt. 2012. Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multicore Memory Systems. ACM Trans. Comput. Syst. 30, 2 (2012). https://doi.org/10.1145/2166879.2166881
[14] Daniel Ford, François Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan. 2010. Availability in Globally Distributed Storage Systems. In OSDI ’10. 61–74.
[15] Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. 2016. Network Requirements for Resource Disaggregation. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation. 249–264.
[16] Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, and Ion Stoica. 2011. Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. In NSDI ’11. 323–336.
[17] Anurag Gupta, Deepak Agarwal, Derek Tan, Jakub Kulesza, Rahul Pathak, Stefano Stefani, and Vidhya Srinivasan. 2015. Amazon Redshift and the Case for Simpler Data Warehouses. In SIGMOD ’15. 1917–1923. https://doi.org/10.1145/2723372.2742795
[18] Jenkins. Retrieved 2022-5-31. Jenkins continuous integration and continuous delivery. https://www.jenkins.io.
[19] Eric Jonas, Johann Schleier-Smith, Vikram Sreekanti, Chia-Che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, Joao Menezes Carreira, Karl Krauth, Neeraja Yadwadkar, Joseph Gonzalez, Raluca Ada Popa, Ion Stoica, and David A. Patterson. 2019. Cloud Programming Simplified: A Berkeley View on Serverless Computing. Technical Report UCB/EECS-2019-3.
[20] Vangelis Koukis, Constantinos Venetsanopoulos, and Nectarios Koziris. 2013. okeanos: Building a Cloud, Cluster by Cluster. IEEE Internet Computing 17, 3 (2013), 67–71. https://doi.org/10.1109/MIC.2013.43
[21] Ze Li, Qian Cheng, Ken Hsieh, Yingnong Dang, Peng Huang, Pankaj Singh, Xinsheng Yang, Qingwei Lin, Youjiang Wu, Sebastien Levy, and Murali Chintalapati. 2020. Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Large-Scale Cloud Infrastructure. In NSDI ’20. 389–402.
[22] Tania Lorido-Botrán, Jose Miguel-Alonso, and Jose Lozano. 2014. A Review of Auto-scaling Techniques for Elastic Applications in Cloud Environments. Journal of Grid Computing 12 (2014). https://doi.org/10.1007/s10723-014-9314-7
[23] Jose Luis Lucas-Simarro, Rafael Moreno-Vozmediano, Rubén Montero, and Ignacio Llorente. 2013. Scheduling strategies for optimal service deployment across multiple clouds. Future Generation Computer Systems 29 (2013), 1431–1441. https://doi.org/10.1016/j.future.2012.01.007
[24] E. Michael Maximilien, Ajith Ranabahu, Roy Engehausen, and Laura C. Anderson. 2009. IBM altocumulus: a cross-cloud middleware and platform. In OOPSLA ’09 Companion.
[25] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. 2010. Dremel: Interactive Analysis of Web-Scale Datasets. Proc. VLDB Endow. 3, 1–2 (2010), 330–339. https://doi.org/10.14778/1920841.1920886
[26] Jeffrey C. Mogul, Rebecca Isaacs, and Brent Welch. 2017. Thinking about Availability in Large Service Infrastructures. In Proc. HotOS XVI.
[27] R. Moreno-Vozmediano, R. S. Montero, E. Huedo, and I. M. Llorente. 2018. Orchestrating the Deployment of High Availability Services on Multi-Zone and Multi-Cloud Scenarios. J. Grid Comput. 16, 1 (2018), 39–53.
[28] Rafael Moreno-Vozmediano, Rubén S. Montero, Eduardo Huedo, and Ignacio M. Llorente. 2019. Efficient Resource Provisioning for Elastic Cloud Services Based on Machine Learning Techniques. J. Cloud Comput. 8, 1 (2019). https://doi.org/10.1186/s13677-019-0128-9
[29] Rafael Moreno-Vozmediano, Ruben S. Montero, and Ignacio M. Llorente. 2009. Elastic Management of Cluster-Based Services in the Cloud. In Proceedings of the 1st Workshop on Automated Control for Datacenters and Clouds (ACDC ’09). 19–24. https://doi.org/10.1145/1555271.1555277
[30] Masoud Moshref, Minlan Yu, Abhishek Sharma, and Ramesh Govindan. 2012. vCRIB: Virtualized Rule Management in the Cloud. In HotCloud ’12.
[31] Mihir Nanavati, Jake Wires, and Andrew Warfield. 2017. Decibel: Isolation and Sharing in Disaggregated Rack-Scale Storage. In NSDI ’17. 17–33.
[32] Rene Peinl, Florian Holzschuher, and Florian Pfitzer. 2016. Docker Cluster Management for the Cloud - Survey Results and Own Solution. Journal of Grid Computing 14 (2016). https://doi.org/10.1007/s10723-016-9366-y
[33] Google Cloud Platform. Retrieved 2022-5-31. Rate-limiting strategies and techniques. https://cloud.google.com/architecture/rate-limiting-strategies-techniques.
[34] Barath Raghavan, Kashi Vishwanath, Sriram Ramabhadran, Kenneth Yocum, and Alex C. Snoeren. 2007. Cloud Control with Distributed Rate Limiting. SIGCOMM Comput. Commun. Rev. 37, 4 (2007), 337–348. https://doi.org/10.1145/1282427.1282419
[35] Rodrigo da Rosa Righi, Vinicius Facco Rodrigues, Cristiano André da Costa, Guilherme Galante, Luis Carlos Erpen de Bona, and Tiago Ferreto. 2016. AutoElastic: Automatic Resource Elasticity for High Performance Applications in the Cloud. IEEE Transactions on Cloud Computing 4, 1 (2016), 6–19. https://doi.org/10.1109/TCC.2015.2424876
[36] Kazunori Sato. Retrieved 2022-5-31. An Inside Look at Google BigQuery. https://cloud.google.com/files/BigQueryTechnicalWP.pdf.
[37] Amazon Web Services. Retrieved 2022-5-31. Amazon Athena Serverless Interactive Query Service. https://aws.amazon.com/athena.
[38] Amazon Web Services. Retrieved 2022-5-31. Amazon Aurora MySQL PostgreSQL Relational Database. https://aws.amazon.com/rds/aurora/.
[39] Amazon Web Services. Retrieved 2022-5-31. Throttle API requests for better throughput. https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-request-throttling.html.
[40] David Shue, Michael J. Freedman, and Anees Shaikh. 2013. Fairness and Isolation in Multi-Tenant Storage as Optimization Decomposition. SIGOPS Oper. Syst. Rev. 47, 1 (2013), 16–21. https://doi.org/10.1145/2433140.2433145
[41] Ioan Stefanovici, Eno Thereska, Greg O’Shea, Bianca Schroeder, Hitesh Ballani, Thomas Karagiannis, Antony Rowstron, and Tom Talpey. 2015. Software-Defined Caching: Managing Caches in Multi-Tenant Data Centers. In SoCC ’15. 174–181. https://doi.org/10.1145/2806777.2806933
[42] Astrid Undheim, Ameen Chilwan, and Poul Heegaard. 2011. Differentiated Availability in Cloud Computing SLAs. In 2011 IEEE/ACM 12th International Conference on Grid Computing. 129–136. https://doi.org/10.1109/Grid.2011.25
[43] Ben Vandiver, Shreya Prasad, Pratibha Rana, Eden Zik, Amin Saeidi, Pratyush Parimal, Styliani Pantela, and Jaimin Dave. 2018. Eon Mode: Bringing the Vertica Columnar Database to the Cloud. In SIGMOD ’18. 797–809. https://doi.org/10.1145/3183713.3196938
[44] Midhul Vuppalapati, Justin Miron, Rachit Agarwal, Dan Truong, Ashish Motivala, and Thierry Cruanes. 2020. Building An Elastic Query Engine on Disaggregated Storage. In NSDI ’20. 449–462.
[45] Carl A. Waldspurger. 2003. Memory Resource Management in VMware ESX Server. SIGOPS Oper. Syst. Rev. 36 (2003), 181–194. https://doi.org/10.1145/844128.844146
[46] Carl A. Waldspurger and William E. Weihl. 1994. Lottery Scheduling: Flexible Proportional-Share Resource Management. In OSDI ’94.
[47] C. A. Waldspurger and W. E. Weihl. 1995. Stride Scheduling: Deterministic Proportional-Share Resource Management. Technical Report.
[48] Huaimin Wang, Peichang Shi, and Yiming Zhang. 2017. JointCloud: A Cross-Cloud Cooperation Architecture for Integrated Internet Service Customization. In ICDCS ’17. 1846–1855. https://doi.org/10.1109/ICDCS.2017.237
[49] Ingo Weber, Hiroshi Wada, Alan Fekete, Anna Liu, and Len Bass. 2012. Automatic Undo for Cloud Management via AI Planning. In HotDep ’12. https://www.usenix.org/conference/hotdep12/workshop-program/presentation/Weber
[50] Xin Xie, Chentao Wu, Junqing Gu, Han Qiu, Jie Li, Minyi Guo, Xubin He, Yuanyuan Dong, and Yafei Zhao. 2019. AZ-Code: An Efficient Availability Zone Level Erasure Code to Provide High Fault Tolerance in Cloud Storage Systems. In MSST ’19. 230–243. https://doi.org/10.1109/MSST.2019.00004
[51] Jingyu Zhou, Meng Xu, Alexander Shraer, Bala Namasivayam, Alex Miller, Evan Tschannen, Steve Atherton, Andrew J. Beamon, Rusty Sears, John Leach, Dave Rosenthal, Xin Dong, Will Wilson, Ben Collins, David Scherer, Alec Grieser, Young Liu, Alvin Moore, Bhaskar Muppana, Xiaoge Su, and Vishesh Yadav. 2021. FoundationDB: A Distributed Unbundled Transactional Key Value Store. 2653–2666.