
Exploiting Cloud Object Storage for High-Performance Analytics

Dominik Durner Viktor Leis Thomas Neumann


Technische Universität München
ABSTRACT

Elasticity of compute and storage is crucial for analytical cloud database systems. All cloud vendors provide disaggregated object stores, which can be used as storage backend for analytical query engines. Until recently, local storage was unavoidable to process large tables efficiently due to the bandwidth limitations of the network infrastructure in public clouds. However, the gap between remote network and local NVMe bandwidth is closing, making cloud storage more attractive. This paper presents a blueprint for performing efficient analytics directly on cloud object stores. We derive cost- and performance-optimal retrieval configurations for cloud object stores with the first in-depth study of this foundational service in the context of analytical query processing. For achieving high retrieval performance, we present AnyBlob, a novel download manager for query engines that optimizes throughput while minimizing CPU usage. We discuss the integration of high-performance data retrieval in query engines and demonstrate it by incorporating AnyBlob in our database system Umbra. Our experiments show that even without caching, Umbra with integrated AnyBlob achieves similar performance to state-of-the-art cloud data warehouses that cache data on local SSDs while improving resource elasticity.

PVLDB Reference Format:
Dominik Durner, Viktor Leis, and Thomas Neumann. Exploiting Cloud Object Storage for High-Performance Analytics. PVLDB, 16(11): 2769 - 2782, 2023.
doi:10.14778/3611479.3611486

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/durner/AnyBlob.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 16, No. 11 ISSN 2150-8097. doi:10.14778/3611479.3611486

1 INTRODUCTION

Data warehousing moves to the cloud. Estimates show that the revenue of cloud database systems has reached that of on-premise systems in 2021 [1] – and by VLDB 2023, the cloud market share will presumably be significantly higher. A major part of this change is the shift of data warehousing and analytical query processing to the cloud. The main drivers behind that are elasticity and the flexibility to provision storage and compute separately and on demand.

Cloud object stores. Cloud object stores such as AWS S3, IBM COS, and GCP Storage enable separating compute from storage in a cost-effective (e.g., ∼23$/TiB per month) way [13]. They also provide strong durability guarantees (e.g., 11 9's per year for S3 [4]), practically unlimited capacity, and scalable access bandwidth. These properties make disaggregated cloud object storage a natural fit for analytical database systems. In future data centers, database systems may even run on hardware that separates memory and compute. There, disaggregated storage is crucial to provide durability [83, 91].

High-bandwidth networks. Until recently, the major issue of cloud object storage for analytics was the limited network bandwidth between instances and storage. In 2018, AWS introduced instances with 100 Gbit/s (≈12 GB/s) networking – resulting in a four-fold increase in per-instance bandwidth [22, 26]. In contrast to Infiniband, 100 Gbit/s Ethernet has not only become widely available but also affordable¹. This effectively closes the gap between remote network and local NVMe bandwidth² and makes relying on cloud storage more attractive for bandwidth-dominated workloads.

¹ Comparing the on-demand prices of c5n.18xlarge (100 Gbit/s) and c5.18xlarge (25 Gbit/s) while taking c5n's larger DRAM into account (∼30% more DRAM), we find that adding 100 Gbit/s networking increases cost by only 22%.
² Consider i3en.24xlarge, the AWS instance with the fastest local NVMe bandwidth. Its local read bandwidth is 16 GB/s, while its full-duplex network bandwidth is 12 GB/s.

Cloud storage analytics. Most cloud-native data warehouse systems, such as Snowflake [33, 82], Databricks [25], and AWS Redshift [19], use cloud object storage as their ground-truth data source. Although the bandwidth gap between local storage and network is closing, most research focuses on caching to avoid fetching data from remote storage [37, 46, 85, 89]. Early research investigates object storage for transactional database systems but limits its focus on OLTP [27]. Surprisingly, no empirical study for general-purpose analytics (OLAP) on cloud object stores has been conducted.

Challenge 1: Achieving instance bandwidth. Because the latency of each object request is high, saturating high-bandwidth networks requires many concurrently outstanding requests. Therefore, a careful network integration into the DBMS is crucial to achieve the complete bandwidth available on network-optimized instances.

Challenge 2: Network CPU overhead. In contrast to fetching data from local disks, network retrieval has higher CPU overhead. Query engines, however, also contend for computation resources to simultaneously analyze large sets of data. Consequently, reducing the CPU footprint of network retrieval is essential.

Challenge 3: Multi-cloud support. Many cloud database systems are able to run in different clouds – allowing the user to choose the vendor of their liking. In contrast to the desire for multi-cloud systems, each cloud vendor provides its own networking library. Thus, multiple libraries need to be integrated, which increases complexity.

Approach. In this paper, we present a blueprint for performing efficient analytics directly on data residing in disaggregated cloud object stores. We studied the cloud object stores of different vendors to derive cost- and performance-optimal retrieval configurations. To reduce resource utilization for network retrieval, we developed a download manager that is able to fetch data from multiple cloud vendors. We seamlessly integrate high-bandwidth object retrieval with the database engine's scan operator. Our DBMS Umbra, equipped with our download manager and caching disabled, achieves similar performance on a single instance as large configurations of state-of-the-art cloud database systems that cache data on local SSDs. Our fast and low overhead networking integration facilitates switching instances without performance cliffs, improving elasticity. As switching comes without performance cliffs, our approach is able to better utilize spot instances, available at huge discounts.
Contribution 1: Experimental study of cloud object stores. To achieve high-bandwidth data processing, we first study the properties of cloud object stores. In Section 2, we explain the design of disaggregated storage, discuss the cost structure, and then provide detailed experiments on the latency and throughput of different object stores. We define an optimal request size range that minimizes cost while maximizing throughput. Our concurrency analysis helps to schedule enough requests to meet the throughput goal (i.e., instance bandwidth). Our in-depth experimental study of this foundational cloud service helps to exploit disaggregated storage for analytical query processing.

Contribution 2: AnyBlob, a low overhead multi-cloud library. With the insights gained from our characterization of object stores, we developed AnyBlob, an open-source, multi-cloud download manager for object stores that is optimized for large data analytics [36]. AnyBlob, described in Section 3, achieves the same throughput as the libraries provided by the cloud vendors while reducing CPU resource consumption significantly. CPU resource utilization is vital to process data concurrently. In contrast to existing solutions, our approach does not need to start new threads for parallel requests because it uses io_uring [21], which facilitates asynchronous system calls. To saturate the network bandwidth, our analysis shows that hundreds of requests have to be outstanding simultaneously. Our solution helps to reduce thread scheduling overhead and allows seamless integration into database query engines.

Contribution 3: Blueprint for retrieval integration. Tight integration of the download manager into the database engine enables efficient analytics on disaggregated storage. We present a blueprint to incorporate AnyBlob into database engines in Section 4. By carefully designing the scan operator and developing an object retrieval scheduler, we can seamlessly interleave the downloading of objects with the analytical processing.

2 CLOUD STORAGE CHARACTERISTICS

Methodology. In order to design an efficient analytics engine based on cloud object storage, we need to understand its basic characteristics. We start with an analysis on the performance characteristics and cost of disaggregated object stores and compute instances. To gain insights into the storage architecture, we perform various micro-experiments on AWS S3 and two other cloud providers to understand latency and throughput limitations. A study on AWS shows that instances are able to achieve high network throughput to S3 [80]. With the best practices in mind [10], we conduct this in-depth experimental study that helps exploiting cloud storage for analytical query processing. Unless otherwise specified, we use our AnyBlob library as the retrieval manager, presented in Section 3.

2.1 Object Storage Architecture

Overview. All major cloud vendors provide disaggregated storage solutions such as AWS S3, Azure Blob, IBM COS, OCI Object Storage, and GCP Storage. Data is stored in immutable blocks called objects. These objects are distributed and replicated across several storage servers for availability and durability. After resolving the domain name of the cloud object store, the user requests an object from a storage server which then sends the data. All major cloud providers use a similar API that transfers data via HTTP (TCP).

Figure 1: Schematic architecture of AWS S3 (load balancers and API servers handle HTTP GET/PUT requests and access metadata storage and object storage).

Architecture of S3. The architecture of S3 is depicted in Figure 1 [32]. AWS S3 defines prefixes that are similar to unique file paths in operating systems. Objects are similar to files and all levels above objects are similar to directories. Data is stored in buckets that resemble hard drive partitions in our analogy. According to AWS, S3 partitions all prefixes to scale to thousands of requests per second [32]. A prefix can range from covering a bucket down to individual objects. With an update in 2020, S3 is now a strongly consistent system [23]. Other providers were already strongly consistent. S3 replicates the data to at least 3 different availability zones (AZs). A geographic region consists of AZs that are separated data centers for increasing availability and durability [75].

Bandwidth limits. Data access performance is characterized by the network connection of the instance, the network connection of the cloud storage, and the network itself. At AWS, general-purpose instances achieve 100 Gbit/s and more to the object stores [3, 5].

2.2 Object Storage Cost

Cost structure. All major cloud vendors structure their object storage pricing similarly. They categorize expenses as storage cost, data retrieval and data modification cost (API cost), and inter-region network transfer cost. Cloud providers operate object stores on the level of a region (e.g., eu-central-1). When accessing data within one region, only API costs are charged because intra-region traffic is free to the object store. On the other hand, AWS inter-region data transfer, for example, from the US east to Europe costs 0.02$/GB.

Table 1: Cloud storage cost by cloud vendor for zone-redundant replication (default; multiple AZs within region).

  Cloud Provider (cheapest region)   Storage ($ / TiB / month)   GET ($ / 1 M)   PUT ($ / 1 M)
  AWS (us-east-2) [13]               23.55                       0.40            5.00
  GCP (us-east-1) [39]               20.48                       0.40            5.00
  IBM (us-east) [45]                 23.55                       0.42            5.20
  Azure (East US 2) [60]             23.55                       0.40            6.25
  OCI (us-ashburn-1) [64]            26.11                       0.34            0.34

Size-independent retrieval cost. Table 1 shows that the pricing of cloud providers is similar for zone-redundant replication (default), which provides high durability and optimal retrieval performance. Surprisingly, retrieval cost in the same region depends only on the number of requests sent to the cloud object store, and does not depend on the object size. Downloading a 1 KiB object costs the same as a 1 TiB object, as long as only one HTTP request is issued.
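To make the size-independent pricing concrete, the following back-of-the-envelope sketch (not part of the paper's tooling) estimates the API cost of scanning one TiB with 16 MiB range requests, using the AWS prices from Table 1. The numbers are purely illustrative and prices change over time.

```cpp
#include <cstdio>

// Illustrative cost estimate for scanning data from object storage within
// one region (no inter-region transfer): only GET requests are billed,
// independent of the object size.
int main() {
    const double tib = 1024.0 * 1024.0 * 1024.0 * 1024.0;   // bytes per TiB
    const double requestSize = 16.0 * 1024 * 1024;          // 16 MiB range requests
    const double getPricePerMillion = 0.40;                 // $ per 1M GETs (AWS, Table 1)
    const double storagePerTiBMonth = 23.55;                // $ / TiB / month (AWS, Table 1)

    double requests = tib / requestSize;                    // 65,536 requests per TiB
    double scanCost = requests / 1e6 * getPricePerMillion;  // ~$0.026 per full TiB scan

    std::printf("GET requests per TiB: %.0f\n", requests);
    std::printf("API cost per full scan of 1 TiB: $%.4f\n", scanCost);
    std::printf("Storage cost per TiB and month: $%.2f\n", storagePerTiBMonth);
}
```

Under these assumptions, repeatedly scanning a table is dominated by instance cost and storage cost rather than by GET charges, as long as the request size stays in the multi-MiB range.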
Cloud storage alternatives. Other storage solutions are not as elastic as disaggregated storage and are often more expensive. For example, AWS Elastic Block Storage (EBS) (gp2 SSD) costs 102.4$/TiB compared to 23.2$/TiB per month. HDD storage pricing is comparable to S3, but bandwidth is very limited. Although EBS is elastic in its size, it can only be attached to a single node. Instance-based SSD storage is also expensive. For example, the price difference between c5.18xlarge and c5d.18xlarge is 0.396$/h and yields 1.8 TB of NVMe SSD. There, instance storage costs 158.4$/TB per month, which is 7× more expensive. Another example for instance-based storage is the largest HDD cluster instance d3en.12xlarge. This instance features 24 HDDs with 14 TB storage each at a price of 13.5$/TB per month. Although this seems cheaper initially, such an instance cannot provide S3's durability guarantees (11 9's). The parallelism of disaggregated storage enables higher throughput than local storage devices, which we will discuss in Section 2.8.

Finding 1: Cloud object storage provides the best durability guarantees while being the cheapest storage option.

2.3 Latency

Figure 2: First byte and total latency for different request sizes on hot and cold objects (AWS, eu-central-1, c5n.large).

Different request sizes. Disaggregated storage incurs higher latency than SSD-based storage solutions. We examine the latency distribution for different request sizes to understand storage latency. We distinguish between total duration and latency until the first byte is retrieved. The results of using only a single request at a time are depicted in Figure 2. We differentiate between the first and 20th consecutive iteration to simulate hot accesses. Our experiment shows that first byte latency often dominates the overall runtime for small sizes. First byte and total duration are similar for small request sizes. This highlights that round-trip latency limits the overall throughput. For sufficiently large requests, bandwidth is the limiting factor. From 8 to 16 MiB, we see minor improvements but the duration already rises by ∼1.9× while object size doubles. Increasing the size from 16 to 32 MiB results in doubling the retrieval duration. Thus, the bandwidth limit is reached, and further increasing the size does not benefit the retrieval performance. When data is hot, first byte and total latency are generally reduced.

Figure 3: S3 bandwidth over 8 weeks (AWS, eu-central-1). Panels: 8 weeks (Jul 4, 2022 – Aug 29, 2022) with ∼15% of points at ∼95 MiB/s, and 1 week with ∼10% of points at ∼95 MiB/s; y-axis: per-object bandwidth [MiB/s].

Noisy neighbors. Cloud-based storage solutions are shared between customers, resulting in less predictable latency. We continuously retrieve a single object from a set of objects with one request to analyze trends in access performance. We generate random 16 MiB objects since increasing the size does not lead to a lower latency per byte. Figure 3 shows the bandwidth for accessing an object (bytes divided by duration) over a period of 8 weeks. Object bandwidth has a high variance ranging from ∼25 to 95 MiB/s, with a considerable number of data points being at the maximum (15%). The median performance stabilizes at 55-60 MiB/s. Weekly patterns in the data show that the bandwidth is influenced by the day of the week. Especially at the weekends (first day of the week is Monday), the performance is higher – most likely due to lower demand from other customers. When we zoom into one week, clear daily patterns are visible. The performance fluctuations between day and night indicate variations in network utilization during different times of the day. Surprisingly, no outlier lies above the large cluster at ∼95 MiB/s even though millions of objects were downloaded. This suggests that the per-request bandwidth is limited within S3 or that server-side caching effects are intentionally not passed on to users.

Figure 4: Total latency distribution of different object stores (AWS, Cloud X, Cloud Y) over multiple runs on sparsely accessed data (12h interval).

Latency variations between cloud vendors. In addition to using AWS, we also examine latency characteristics of two other cloud providers. The experiment, plotted in Figure 4, accesses randomly generated 16 MiB objects. After each run, the bucket is not accessed until the next run. The interval between executions is (at least) 12 hours to reduce caching effects. AWS S3 has the highest overall latency for individual objects. The other two providers have similar average latencies, but Cloud Y has more variance. Latency between different executions is fairly stable across all cloud providers. As mentioned, S3 has a minimum latency with no outliers below it, which suggests a restricted per-request maximum bandwidth. In contrast to AWS, outliers in the low latency spectrum indicate that the other vendors do not hide caching effects.
2.4 Throughput

Importance of throughput. Aside from latency, we also show insights on the throughput of accessing object stores. For analytics, the most important factor is the combined throughput since OLAP requires large amounts of data to be processed. Thus, the first byte latency is less important for bandwidth-dominated workloads.

Figure 5: Throughput distribution of different object stores over multiple runs on sparsely accessed data (12h interval). Panels: AWS eu-central-1 and us-east-2 (c5n.18xlarge, limit 100 Gbit/s), Cloud X, and Cloud Y.

Cloud storage throughput similar to instance bandwidth. Similar to our previous latency experiment, we access randomly generated 16 MiB objects. One request retrieves one complete object. In this experiment, we maximize the throughput available on each cloud provider with a single instance. We schedule up to 256 simultaneous requests using many threads to maximize throughput. Further increasing requests did not lead to higher throughput. Section 2.8 discusses the optimal number of requests. We use instances that achieve up to 100 Gbit/s (or the cloud's maximum bandwidth) and have similar on-demand pricing. Figure 5 shows the throughput experiment with (at least) 12 hours between different runs to reduce caching effects. Each throughput data point is calculated as an aggregate of all downloaded objects over a 1-second window. The results show that we achieve a median bandwidth of at least 75 Gbit/s for AWS. Most runs have a median bandwidth between 80 and 90 Gbit/s in eu-central-1, close to the maximum instance bandwidth. At Cloud X, we observe a bandwidth limit of ∼40 Gbit/s and almost no fluctuations. Cloud Y achieves a median bandwidth of 50 Gbit/s to its object store, but we notice higher variance.

Different regions have slightly different performance. Throughput is similar for the two tested regions of AWS; however, one region performs slightly better. The difference between the two regions does not vary much between iterations.

Figure 6: Throughput comparison of cold and hot runs (AWS eu-central-1, Cloud X, Cloud Y).

High bandwidth is achievable for cold objects. In Figure 6, we investigate the throughput differences between the first and the 20th consecutive execution. The access frequency spike of the same objects does not result in vastly different execution times.

Figure 7: Instance burst bandwidth (AWS eu-central-1, c5n.2xlarge).

Small instances allow bursting. In the AWS instance specifications, the network bandwidth of smaller instances is often denoted with an up-to bandwidth limit. Instances achieve the baseline bandwidth (relative to the number of CPUs) in the steady state after utilizing all burst credits [14]. Figure 7 shows that the instance falls back to the baseline throughput after bursting for ∼45 min.

Finding 2: Object retrieval can reach network bandwidth.

2.5 Optimal Request Size

Request size implications. An important design decision is the size of requests. Requests can either be full objects or byte ranges within objects. The most crucial factors are performance and request cost. Since cloud providers charge by the number of requests, larger requests result in lower cost for the same overall data size. On the other hand, the size should be as small as possible so that small tables also benefit from parallel downloads. Our experiments in Section 2.3 demonstrate that performance does not improve beyond the bandwidth limit for a single request.
Figure 8: Cost vs. throughput of different request sizes (AWS, eu-central-1, c5n.18xlarge). Panels: on-demand ($3.88/hour) and spot ($1.28/hour) pricing; y-axis: processing costs [$ / TB] split into EC2 and S3 contributors; x-axis: request size [MiB]; achieved throughput (23–81 Gbit/s) annotated above the bars.

Cost-throughput optimal requests. In Figure 8, we show the cost of retrieving data from S3 with different request sizes. The achieved throughput with hundreds of simultaneous requests and many threads is denoted above the bars. Each request size class contains randomly generated objects. We distinguish between compute instance cost (c5n.18xlarge) and storage retrieval cost. Storage cost dominates the total cost for small objects. Computational cost is the most significant contributor to requests in the ∼10 MiB range. This applies to instances at on-demand prices and spot instances, which come at a huge discount (we calculate with 60%). Because the throughput plateaus in the same range of request sizes, we classify request sizes of 8 - 16 MiB as cost-throughput optimal for OLAP.

Finding 3: Sizes of 8 - 16 MiB are cost-throughput optimal.
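As an illustration of byte-range requests, the following sketch assembles a plain HTTP GET for one cost-throughput-optimal 16 MiB chunk of an object. The host and object names are placeholders, and a real S3 request additionally needs date and authorization headers (see Section 3.2).

```cpp
#include <cstdint>
#include <string>

// Sketch: build an HTTP GET for one 16 MiB byte range of an object.
// Range header bounds are inclusive; the placeholder host/object must be
// replaced, and authentication headers are omitted here.
std::string buildRangeGet(const std::string& host, const std::string& object,
                          uint64_t chunkId, uint64_t chunkSize = 16ull << 20) {
    uint64_t begin = chunkId * chunkSize;
    uint64_t end = begin + chunkSize - 1;
    std::string request;
    request += "GET /" + object + " HTTP/1.1\r\n";
    request += "Host: " + host + "\r\n";
    request += "Range: bytes=" + std::to_string(begin) + "-" + std::to_string(end) + "\r\n";
    request += "\r\n";
    return request;
}
```

Because each such range counts as one billed GET, choosing the chunk size directly trades API cost against download parallelism, as discussed above.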
2.6 Encryption

Figure 9: Impact of encryption on CPU usage (AWS, eu-central-1, c5n.18xlarge): HTTP, HTTPS, and HTTP with AES all reach 73–75 Gbit/s at different CPU usage.

CPU consumption of encryption. All experiments so far use an unsecured connection to S3 (HTTP), but S3 also supports encrypted connections through HTTPS. We measure the CPU overhead of different encryption strategies while reaching the same throughput in Figure 9. HTTPS requires more than 2× CPU resources of HTTP, but AES end-to-end encryption only increases CPU usage by ∼30%.

Encryption-at-rest superior to HTTPS. At AWS, all traffic between regions and even all traffic between AZs is automatically encrypted by the network infrastructure. Thus, all traffic leaving an AWS physical location is automatically secured [8]. Within a location, no other user is able to intercept traffic between an EC2 instance and the S3 gateway due to the isolation of VPCs, making HTTPS superfluous. However, encryption-at-rest is required to ensure full data encryption outside the instance (e.g., at S3).
2.7 Tail Latency & Request Hedging

Hedging against slow responses. Missing or slow responses from storage servers are a challenge for users of cloud object stores. In our latency experiments, we see requests that have a considerable tail latency. Some requests get lost without any notice. To mitigate these issues, cloud vendors suggest restarting unresponsive requests, known as request hedging [10, 34]. For example, the typical 16 MiB request duration is below 600 ms for AWS. However, less than 5% of objects are not downloaded after 600 ms. Missing responses can also be found by checking the first byte latency. Similarly to the duration, less than 5% have a first byte latency above 200 ms. Hedging these requests does not introduce significant cost overhead.
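A minimal sketch of how such hedging could look in a retrieval loop, using the thresholds reported above (200 ms to first byte, 600 ms total). The structure and function names are hypothetical and simplified, not AnyBlob's implementation.

```cpp
#include <chrono>

// Hypothetical per-request state; thresholds follow the measurements above.
struct RequestState {
    std::chrono::steady_clock::time_point start;
    bool firstByteReceived = false;
    bool finished = false;
    bool hedged = false;
};

// Called periodically by the retrieval loop; issues a duplicate request
// (the hedge) if the original looks like a straggler. issueDuplicate() is
// a placeholder for re-enqueuing the same byte range.
template <typename IssueDuplicateFn>
void maybeHedge(RequestState& r, IssueDuplicateFn issueDuplicate) {
    using namespace std::chrono;
    auto elapsed = duration_cast<milliseconds>(steady_clock::now() - r.start);
    bool slowFirstByte = !r.firstByteReceived && elapsed > milliseconds(200);
    bool slowTotal = !r.finished && elapsed > milliseconds(600);
    if (!r.hedged && (slowFirstByte || slowTotal)) {
        r.hedged = true;
        issueDuplicate();  // first response to complete wins; the other is discarded
    }
}
```

Since fewer than 5% of requests cross either threshold, the duplicate requests add only a small number of extra GETs and therefore negligible cost.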
2.8 Model for Cloud Storage Retrieval

Concurrency analysis. During our analysis, we saw that the bandwidth of individual requests is similar to accessing data on an HDD. To saturate network bandwidth, many simultaneous requests are required. Requests in the range of 8 - 16 MiB are cost-effective for analytical workloads. We design a model to predict the number of requests needed to reach a given throughput goal:

    requests = throughput · (baseLatency + size · dataLatency) / size

For sufficiently large request sizes at S3, we calculate the median base latency as ∼30 ms and the median data latency as ∼20 ms/MiB. The base latency is computed from the 1 KiB experiment in Figure 2; the average latency of 16 MiB minus the base latency defines the median data latency. Figure 4 shows that the median data latency of Cloud X and Cloud Y is lower (12–15 ms/MiB). For S3, the optimal request concurrency for saturating 100 Gbit/s instances is ∼200–250. Figure 10 evaluates the model with the previous data latency and the latency representing the 25th percentile (hot). The measurements are between both models until the bandwidth limit is reached.

Figure 10: Request modeling for reaching the throughput goal (measured bandwidth vs. model and model (hot) over the number of concurrent requests).

Storage medium. An access latency in the tens of ms and a per-object bandwidth of ∼50 MiB/s strongly suggest that cloud object stores are based on HDDs. This implies that reading from S3 with ∼80 Gbit/s is accessing on the order of 100 HDDs simultaneously.

Finding 4: Saturating high-bandwidth networks requires hundreds of outstanding requests to the cloud object store.
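Plugging the reported S3 constants (∼30 ms base latency, ∼20 ms/MiB data latency) into the model yields a target of a few hundred outstanding requests for a 100 Gbit/s instance, in line with Finding 4. The helper below is only an illustration of the formula, not the paper's evaluation code.

```cpp
#include <cstdio>

// requests = throughput * (baseLatency + size * dataLatency) / size
// Units: MiB for sizes/throughput, seconds for latencies.
double requiredRequests(double throughputMiBs, double sizeMiB,
                        double baseLatencyS = 0.030, double dataLatencySPerMiB = 0.020) {
    return throughputMiBs * (baseLatencyS + sizeMiB * dataLatencySPerMiB) / sizeMiB;
}

int main() {
    // 100 Gbit/s expressed in MiB/s (~11,921 MiB/s).
    double throughput = 100e9 / 8.0 / (1024.0 * 1024.0);
    std::printf("16 MiB requests: ~%.0f outstanding\n", requiredRequests(throughput, 16.0));
    std::printf(" 8 MiB requests: ~%.0f outstanding\n", requiredRequests(throughput, 8.0));
}
```

With the hot (25th percentile) latencies instead of the medians, the same formula produces smaller concurrency targets, which is why the measurements in Figure 10 lie between the two model curves.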
3 ANYBLOB

Unified interface with smaller CPU footprint. Different cloud providers have their own download libraries with different APIs and performance characteristics [7, 40, 44, 61, 65]. To offer a unified interface, we present a general-purpose and open-source object download manager called AnyBlob [36]. In addition to transparently supporting multiple clouds, our io_uring-based download manager requires fewer CPU resources than the cloud-vendor-provided ones. Resource usage is vital as our download threads run in parallel with the query engine working on the retrieved data. Existing download libraries start new threads for each parallel request. For example, the S3 download manager of the AWS SDK executes one request per thread using the open-source HTTP library curl. In contrast to spinning up threads for individual requests, AnyBlob uses asynchronous requests, which allows us to schedule fewer threads. Because hundreds of requests must be outstanding simultaneously in high-bandwidth networks, a one-to-one thread mapping would result in thread oversubscription. This results in many context switches, which negatively impacts performance and CPU utilization.

3.1 AnyBlob Design

Multiple requests per thread. AnyBlob uses io_uring to manage multiple connections per thread asynchronously [31]. With this model, the system does not have to oversubscribe threads which would incur additional scheduling cost. In the following, we discuss the three major components of AnyBlob. The components and their relationship are shown in Figure 11.

Figure 11: AnyBlob uses state-machine based message tasks that are asynchronously processed with the help of io_uring (HTTP Message Tasks, the uring, and the task-based send-receive scheduler within a send-receive group).

io_uring - low-overhead system call interface. io_uring (available since Linux kernel 5.1) provides a generic kernel interface for storage and network tasks. It builds on two lock-free ring buffers, the submission and completion queues, that are shared between user and kernel space. The user inserts new submission queue entries (SQE), such as receive (recv) and send operations, into the submission queue. Inserting into the queue does not require any syscalls. To notify the kernel of new entries in the submission queue, the io_uring_enter system call processes the entries on the kernel side in a non-blocking fashion until the request is transmitted to the network or storage device. This device uses interrupt requests (IRQs) to notify the kernel of finished operations. The request is then processed during the interrupt and placed on the completion queue. To check if a request was successful, the user periodically peeks for available completion queue entries (CQE). io_uring was found to be highly efficient for storage applications [35, 41, 52, 55, 68], but is less studied for networking tasks [28]. Didona et al. suggest to study io_uring for networking in more depth [35].
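For readers unfamiliar with the interface, the following is a minimal, self-contained sketch of the submit/complete cycle using liburing, the user-space helper library for io_uring. It is not AnyBlob's code; connection setup and error handling are omitted.

```cpp
#include <liburing.h>

// Minimal submit/complete cycle: stage a send and a recv on already-connected
// sockets, submit them in one batch, and later poll for completions without
// blocking. Error handling omitted for brevity.
struct PendingOp { void* messageTask; };  // identifies the originating request

void exampleLoop(int sendFd, int recvFd, const char* req, size_t reqLen,
                 char* buf, size_t bufLen, PendingOp* sendOp, PendingOp* recvOp) {
    io_uring ring;
    io_uring_queue_init(256, &ring, 0);

    // Stage both operations in the submission queue; no system call yet.
    io_uring_sqe* sqe = io_uring_get_sqe(&ring);
    io_uring_prep_send(sqe, sendFd, req, reqLen, 0);
    io_uring_sqe_set_data(sqe, sendOp);

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recv(sqe, recvFd, buf, bufLen, 0);
    io_uring_sqe_set_data(sqe, recvOp);

    io_uring_submit(&ring);  // a single io_uring_enter covers both entries

    // Poll for completions; a real scheduler interleaves this with advancing
    // the state machines of other message tasks.
    io_uring_cqe* cqe;
    while (io_uring_peek_cqe(&ring, &cqe) == 0) {
        auto* op = static_cast<PendingOp*>(io_uring_cqe_get_data(cqe));
        int bytes = cqe->res;   // bytes transferred, or -errno on failure
        (void)op; (void)bytes;  // here: advance the corresponding task's state machine
        io_uring_cqe_seen(&ring, cqe);
    }
    io_uring_queue_exit(&ring);
}
```

The key property exploited by AnyBlob is visible even in this sketch: many operations are staged and submitted in a batch, and completions are consumed without blocking, so one thread can drive hundreds of concurrent transfers.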
State-machine-based messages. AnyBlob uses a state machine for each request. The message information (address, port, and raw data) combined with a state machine is denoted as a Message Task. Optionally, a receive buffer can be attached that avoids additional data copies since the kernel transfers data directly from the network device to our desired location. Cloud object stores use HTTP messages to transfer data. We implement the different phases of an HTTP request within the state machine. On successfully completing a phase, we transition to the next phase until the object is fully fetched. The state machine enables asynchronous and multiplexed messages with a single thread. Several send and recv system calls are required during transfer until the object is downloaded. After each system call, we suspend the execution of this message until we are notified about the successful syscall. Afterward, we reevaluate the state machine until a final state is reached.

Asynchronous system calls. Our asynchronous handling of send and recv system calls in the Message Task is facilitated by io_uring. Instead of directly scheduling the system call and waiting for the result, we insert the operation into the submission queue of the uring. Each SQE has a user-defined member that allows passing information to later identify its origin Message Task. System calls are processed only when the submission queue is submitted to the kernel (1). This submission process is non-blocking, allowing the executing thread to work on other requests while the transfer is handled by the network device. The uring is periodically checked for available completion queue entries (CQE) (2). When a CQE is available, a system call has been processed. With the retrieved information, we can evaluate the next Message Task step.

Task-based send-receive scheduler. With io_uring-based sockets and Message Tasks, we develop a task-based send-receive scheduler. The task scheduler uses one thread that continuously executes (1)–(3) as an event loop. This event loop coordinates the execution of the steps of Message Tasks (3) and processes completion entries (2). Furthermore, new object requests are scheduled as new Message Tasks (4). To optimize single-threaded throughput, a task scheduler works concurrently on multiple Message Tasks. Multiple Message Tasks' send and recv system calls can be batched before submitting the submission queue to reduce system call overhead (1). In multi-threaded environments, it is beneficial to reduce system calls as parts of them are protected by kernel locks. When a Message Task is finished, it invokes a callback to notify the requester.

Send-receive groups. Although a single task-based send-receive scheduler has high throughput (multiple Gbit/s), it is not sufficient to satisfy network-optimized instances. Thus, multiple schedulers need to run simultaneously. For ease of use, a lock-free send-receive task group manages requests for multiple send-receive schedulers.
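The per-request state machine and its interaction with the event loop could be modeled roughly as below. The class and phase names are illustrative simplifications of the Message Task described above, not AnyBlob's actual types.

```cpp
#include <cstddef>

// Illustrative phases of one HTTP object request. A real Message Task
// additionally handles connection setup, chunked responses, resends,
// and timeouts.
enum class Phase { SendRequest, ReceiveHeader, ReceiveBody, Finished };

struct MessageTask {
    Phase phase = Phase::SendRequest;
    size_t requestLength = 0, sent = 0;      // outgoing HTTP GET
    size_t contentLength = 0, received = 0;  // incoming object data

    // Invoked by the send-receive scheduler whenever one of this task's
    // queued syscalls completed; result is the number of bytes transferred.
    // Returns true while another send/recv must be queued on the uring.
    bool step(size_t result) {
        switch (phase) {
            case Phase::SendRequest:
                sent += result;
                if (sent >= requestLength) phase = Phase::ReceiveHeader;
                return true;                     // queue next send or the first recv
            case Phase::ReceiveHeader:
                received += result;              // parse header, set contentLength
                phase = Phase::ReceiveBody;
                return true;                     // queue recv directly into the target buffer
            case Phase::ReceiveBody:
                received += result;
                if (received >= contentLength) { phase = Phase::Finished; return false; }
                return true;                     // object not fully fetched yet
            case Phase::Finished:
                return false;                    // callback() notifies the requester
        }
        return false;
    }
};
```

Because the scheduler only reevaluates a task when one of its completions arrives, a single thread can multiplex many such state machines without blocking on any individual transfer.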

3.2 Authentication & Security

Transparent authentication. Although all cloud providers use a similar API to access objects, some details of signing requests and the authentication are different. AnyBlob implements operations to upload and download objects from multiple cloud storage providers. We implement a custom signing process using the library openssl to maintain high throughput with as few cores as possible [20]. For users of AnyBlob, it is transparent which provider is chosen, as the interaction with the library remains unchanged. For AWS, we support the automatic short-term key metadata service [11].

AnyBlob enables encryption-at-rest. AnyBlob supports the user application to use encryption-at-rest by providing easy-to-use, in-place, and fast encryption and decryption functions for AES. Further, AnyBlob allows the usage of HTTPS for requests. However, we discourage this in controlled environments, such as AWS EC2 connected to AWS S3, due to high CPU overhead. HTTPS is useful for authentication if data is sent outside the controlled environment, e.g., from your computer to S3. In contrast to the high overhead for HTTPS, encryption-at-rest can be used with only moderate overhead. As shown in Section 2.6, this client-side encryption provides superior encryption against third parties, e.g., cloud providers.
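As an example of client-side encryption-at-rest, the following sketch encrypts a column chunk with AES-256-GCM via OpenSSL's EVP interface before upload. It is a minimal illustration that assumes externally managed keys and IVs; AnyBlob's own AES routines may differ.

```cpp
#include <openssl/evp.h>
#include <cstdint>
#include <vector>

// Encrypt a column chunk in memory before uploading it (encryption-at-rest).
// Returns ciphertext with the 16-byte GCM authentication tag appended.
// Key and IV handling (generation, rotation, storage) is out of scope here.
std::vector<uint8_t> encryptChunk(const std::vector<uint8_t>& plain,
                                  const uint8_t key[32], const uint8_t iv[12]) {
    std::vector<uint8_t> out(plain.size() + 16);
    EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();
    EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), nullptr, key, iv);
    int len = 0, total = 0;
    EVP_EncryptUpdate(ctx, out.data(), &len, plain.data(), static_cast<int>(plain.size()));
    total = len;
    EVP_EncryptFinal_ex(ctx, out.data() + total, &len);
    total += len;
    EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, out.data() + total);
    out.resize(total + 16);
    EVP_CIPHER_CTX_free(ctx);
    return out;
}
```

Since the ciphertext is produced on the compute instance, neither the network path nor the object store ever sees plaintext, which is the property Section 2.6 argues makes HTTPS unnecessary inside a controlled environment.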

3.3 Domain Name Resolver Strategies

Resolution overhead. In analytical scenarios, many requests are scheduled to the cloud object storage. Section 2.1 highlights that we can connect to different server endpoints. Resolving a domain name for each request adds considerable latency overhead due to additional round trips. Thus, it is essential to cache endpoint IPs.

Throughput-based resolver. Our default resolver stores statistics about requests to determine whether an endpoint is performing well. We cache multiple endpoint IPs and schedule requests to these cached IPs. If the throughput of an endpoint is worse than the performance of the other endpoints, we replace this endpoint. Thereby, we allow the load to balance across different endpoints.
MTU-based resolver. We found that the path maximum transmission unit (MTU) differs for S3 endpoints. In particular, the default MTU to hosts outside a VPC is typically 1500 bytes. Some S3 nodes, however, support Jumbo frames using an MTU of up to 9001 bytes [9]. Jumbo frames reduce CPU cost significantly because the per-packet kernel CPU overhead is amortized with larger packets.

MTU discovery. The S3 endpoints addressable with a higher path MTU use 8400 bytes as packet size. Our AWS resolver attempts to find hosts that provide good performance and use a higher path MTU. We ping the IP with a payload (> 1500 bytes) and set the DF (do not fragment) flag to determine if a higher path MTU is available.
age utilization, we experiment with different settings on AWS. We object storage. As discussed in Section 3, the AWS S3 SDK often
compare against two libraries provided by Amazon. They are both results in oversubscription, which has not only a negative impact
part of the official AWS C++ SDK (1.9.140). S3 is the traditional API on performance but also other undesirable effects on database sys-
that uses the library curl internally to retrieve objects. Similar con- tems. For example, a huge download task with hundreds of threads
cepts are applied by the download managers of other vendors’ SDKs. could make the DBMS unresponsive to newly arriving queries since
S3Crt is a newer alternative S3 library released by AWS that uses the DBMS has no control over the retrieval threads. Furthermore,
a custom C network implementation (C++ API). With AnyBlob’s the mix of downloading and processing threads is hard to balance,
design, S3 Select can be implemented, but it would only support especially with this vast number of concurrently active threads.
few types (JSON, CSV, Parquet) and no client-side encryption [15]. Approach. In this section, we show how to integrate efficient
Cost-throughput Pareto-optimal retrieval. Figure 12 shows object store retrieval into high-performance query engines. We
different settings for each tested download manager. Note that we rely on AnyBlob and the empirical results presented in Section 2 to
plot performance and CPU utilization such that the optimal settings saturate the available network bandwidth with low CPU resource
lie in the top-left corner of the Pareto curve. Within one download consumption. A key challenge is how to balance query processing
strategy, we highlight the points on their respective Pareto curve. and downloading. Without enough retrieval threads, the network
AnyBlob, with our throughput-based resolver, always dominates the bandwidth limit can not be reached. On the other hand, if we use too
AWS-provided download managers. We achieve the same maximum few worker threads for computation-intensive queries, we lose the
throughput using only 0.7× the CPU resources of the best competi- in-memory computation performance of our DBMS. We, therefore,
tor. Given a fixed CPU budget, we get up to 1.5× performance. Our propose a scheduling component to balance object store retrieval
specialized AWS resolver achieves the same throughput but reduces and query processing, allowing us to schedule threads effectively
CPU usage by an additional 10%. We validated AnyBlob on recently in terms of query performance and CPU usage. With this scheduler,
deployed Graviton instances (200 Gbit/s) [5] and observed greater we then develop an efficient table scan operator based on a cost-
CPU reduction while retrieving objects with up to 180 Gbit/s. effective columnar storage format.

2775
Table Scan 3 Object Seduler Processing Morsels: 1 2 3 4

{
{
{
{
4A Scheduler decides if worker
Γ T1 works on 63
Process thread processes or prepares T7 works on 64
4B Prepare block for processing T5 works on 82
Object Manager Buffer Manager T4 works on 81

1 2 T2
Requests list Receives the RO RL Preparing
list of the blocks
of currently and the headers if !getColumnInfos():
active blocks of these blocks 5B createColumnInfos()
Schedule for retrieval
if pointer is unfixed Pi Morsel if allColumnsFixed():
Retrieval addAsReady()
Fetch data Worker switch scheduler.decision(): else:
Object Store from cloud Processing getBufferSpace()
reads prepareRetrieval()
Preparing
Downloading T6 T3 T8
Figure 13: DBMS design overview for efficient analytics with Downloading
the flow of information between different components. concurrentRequests:
Retrieval eue 10 12 11 while requests > 0:
if requests != maxRequests:
pickFromeue()
4.1 Database Engine Design Ready Blocks 5 7 9
Finished?
Uring
Tasks and scheduling of worker threads. The overall design of
callback()
our cloud storage-optimized DBMS centers around the table scan
operator. Like most database systems, our system Umbra uses a
pool of worker threads to process queries in parallel. In our design, Figure 14: Table scan example with 8 threads.
worker threads do not only perform (i) regular query processing,
but can also (ii) prepare new object store requests or (iii) serve as
network threads. Our object scheduler, which we present in Sec- creates new requests that allow the retrieval threads to execute
tion 4.3, dynamically determines each worker’s job (i-iii) depending their event loop without interruption. The object manager holds
on network bandwidth saturation and processing progress. metadata of tables, blocks, and their column chunk data. The column
Task adaptivity. To overcome issues with long-running queries chunk data is managed by our variable-sized buffer manager. If the
that block resources, many database systems use tasks to process data is not in memory, we create a new request and schedule it for
queries. These tasks can either be suspended or run only for a small retrieval, shown in 5B . Finally, retrieval threads fetch the data.
4.2 Table Scan Operator

Scan design preliminaries. We carefully integrate AnyBlob into our RDBMS Umbra, which compiles SQL to machine-code and supports efficient memory and buffer management [38, 48, 63]. Umbra uses worker threads to parallelize operators, such as table scans, and schedules as many worker threads as there are hardware threads available on the instance. If there is only one active query, all workers are used to process that specific query. Umbra's tasks consist of morsels which describe a small chunk of data of the task [53]. Worker threads are assigned to tasks and process morsels until the task is finished or the thread is assigned to a different task.

Morsel picking. After Umbra initializes the table scan, the worker threads start calling the pickMorsel method. This function assigns chunks of the task's data to worker threads. This is repeated after each morsel completion as long as the thread continues to work on this table scan task. The only difference in our approach is that our workers do not only need to process data but also prepare new blocks or retrieve blocks from storage servers. Our object scheduler, which we explain in Section 4.3, decides the job of a worker thread based on past processing and retrieval statistics. Note that similar to our pickMorsel, every task-based system has a method that determines the next task of a worker thread.

Worker jobs. If a thread is assigned to process data, a morsel is picked from the currently active block in pickMorsel. In contrast to the processing job, the other jobs (preparation and retrieval) do not pick a morsel for scanning. Instead, these jobs start routines that are required to prepare or retrieve blocks. Regardless of the job, all workers return to pickMorsel to get a new job assigned after finishing their current work.
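A simplified view of this job dispatch, with hypothetical names: each worker repeatedly asks the object scheduler for a job, and only the processing job consumes a morsel.

```cpp
#include <optional>

enum class Job { Process, Prepare, Retrieve };

struct Morsel { /* tuple range within the currently active block */ };

// Simplified worker loop for the table scan task. Scheduler and Scan are
// template parameters so that no specific engine API is implied; the job
// decision is based on the object scheduler's retrieval/processing statistics.
template <typename Scheduler, typename Scan>
void workerLoop(Scheduler& scheduler, Scan& scan) {
    while (!scan.done()) {
        switch (scheduler.decision()) {
            case Job::Process: {
                std::optional<Morsel> m = scan.pickMorsel();  // from an already retrieved block
                if (m) scan.process(*m);                       // scan and filter the chunk
                break;
            }
            case Job::Prepare:
                scan.prepareNextBlock();  // register block, reserve buffer space, build HTTP requests
                break;
            case Job::Retrieve:
                scan.runRetrieverUntilQueueEmpty();  // act as an AnyBlob retrieval thread
                break;
        }
    }
}
```

Because each job runs only briefly before returning to the dispatch point, the engine can shift workers between processing, preparation, and retrieval within a single query.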
Figure 14: Table scan example with 8 threads (processing morsels, preparing blocks, and downloading from the retrieval queue into ready blocks).

Scan example overview. Figure 14 shows the full table scan operation with multiple (8) active threads working on different jobs. In the example table scan, 4 threads are dedicated to processing data, 3 for data retrieval, and 1 for preparing new blocks.

Processing job. After receiving a morsel for processing, the thread scans and filters the data according to the semantics of the table scan. When all morsels of an active block (global or thread-local with stealing) are taken, the thread picks the morsel from a new, already retrieved block. In the example, each block is divided into 4 non-overlapping morsels. Each thread works on its unique morsel range.

Preparation job. With the already retrieved table metadata, threads prepare new blocks and register unknown blocks in the object manager. If the data of all columns currently resides in physical memory, the preparing thread marks the block as ready. Otherwise, the preparing thread gets free space from the buffer manager for each unfixed column. With the block metadata (column type, offset, and size), HTTP messages for fetching columns from cloud storage are created. After that, the block is queued for retrieval, where the data is downloaded.

Retrieval job. In the example, three threads are scheduled to act as AnyBlob retrieval threads. After finishing the download of a block's column chunk, a callback is invoked and marks this column as ready. Only if all columns have been retrieved, we mark the block as ready. Note that different retrieval threads may download column chunks from the same block concurrently. The worker finishes when AnyBlob's request queue gets empty. Because threads always try to keep the queue at its maximum request length, unnecessary retrieval threads will eventually encounter an empty queue and stop downloading. These threads can then be reused to work on different jobs, such as processing or preparing new blocks. As long as enough requests are in the queue, the threads constantly retrieve data.
4.3 Object Scheduler

Balance of retrieval and processing performance. The main goal of the object scheduler is to strike a balance between processing and retrieval performance. It assigns different jobs to the available worker threads to achieve this balance. If the retrieval performance is lower than the scan performance, it increases the amount of retrieval and preparation threads. On the other hand, reducing the number of retrieval threads results in higher processing throughput. Note that the retrieval performance is limited by the network bandwidth, which the object scheduler considers.

Processing and retrieval estimations. The decision process requires performance statistics during retrieval and processing. Each processing thread tracks the execution time and the amount of data processed. The aggregated data allow us to compute the mean processing throughput per thread. For the network throughput, we aggregate the overall retrieved bytes during our current time epoch.

Balancing retrieval threads and requests. Sections 2.8 and 3.4 analyze how many concurrent requests are needed to achieve our throughput goal and the corresponding number of AnyBlob retrievers. We track the number of threads used for retrieval and limit it according to the instance bandwidth specification. By counting the number of outstanding requests (e.g., column chunks), we compute an upper bound on the outstanding network bandwidth. An outstanding request is a prepared HTTP request currently downloaded or awaiting retrieval. Because the number of threads and the outstanding requests limit the network bandwidth, our object scheduler always requires that the outstanding bandwidth is at least as high as the maximum bandwidth possible according to the current number of retrieval threads. Hence, it schedules enough preparation jobs to match the number of retrieval threads.

Performance adaptivity. The scheduler computes the global ratio between processing and retrieval to balance the retrieval and processing performance. This ratio is used to adapt the number of retrieval threads and the outstanding bandwidth. If processing is slower, fewer blocks are prepared, and fewer retrieval threads are scheduled. Some of the running retrieval threads will stop due to fewer outstanding requests. These threads are then scheduled as processing workers, increasing the global processing performance. Algorithm 1 shows these adaptivity computations.

Algorithm 1: Scheduler: Adaptivity Computation
1 retrieveSpeed = statistics[epoch].retrievedBytes / statistics[epoch].elapsed
2 processSpeed = (workerThreads - currentRetriever) * statistics[epoch].processedBytes / statistics[epoch].processedTime
3 ratio = processSpeed / retrieveSpeed
4 requiredBandwidth = min(bandwidth, bandwidth * ratio)
5 requiredRetrieverThreads = min(maxRetrievers * ratio, maxRetrievers)

Overpreparation. Because it is undesirable to stall retrieval threads due to unprepared columns, overpreparation is encouraged. Our scheduler ensures that up to 2× of the required bandwidth is outstanding and schedules preparation jobs accordingly.

Fast statistics aggregation. Lock-free atomic values for statistics and global counters provide fast object scheduler decisions. For every new scan request, we update the epoch to store representative statistics of the current workload.
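Algorithm 1 translates almost directly into code. The following C++ sketch assumes per-epoch statistics in bytes and seconds and mirrors the pseudocode names; it is an illustration, not Umbra's scheduler.

```cpp
#include <algorithm>

// Per-epoch statistics as used in Algorithm 1 (bytes and seconds).
struct EpochStats {
    double retrievedBytes, elapsed;        // network side
    double processedBytes, processedTime;  // per-thread processing side
};

struct Decision {
    double requiredBandwidth;
    double requiredRetrieverThreads;
};

// Adaptivity computation of the object scheduler: compare processing and
// retrieval speed and derive how much outstanding bandwidth and how many
// retrieval threads the scan currently needs.
Decision adapt(const EpochStats& s, unsigned workerThreads, unsigned currentRetrievers,
               double bandwidth, double maxRetrievers) {
    double retrieveSpeed = s.retrievedBytes / s.elapsed;
    double processSpeed =
        (workerThreads - currentRetrievers) * s.processedBytes / s.processedTime;
    double ratio = processSpeed / retrieveSpeed;

    Decision d;
    d.requiredBandwidth = std::min(bandwidth, bandwidth * ratio);
    d.requiredRetrieverThreads = std::min(maxRetrievers * ratio, maxRetrievers);
    return d;
}
```

When processing lags behind (ratio < 1), both outputs shrink, so fewer blocks are prepared and idle retrievers return to processing; when retrieval lags, the scheduler keeps the full bandwidth and retriever budget outstanding.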
4.4 Relation & Storage Format

Columnar format. To leverage the cost-throughput optimal download sizes, we require a column-major format that is chunked into different blocks. The database format is adapted from data blocks [51]. For each column chunk, we store min and max values in the metadata, enabling us to prune unnecessary blocks early. Our blocks use low-overhead byte-level encodings, e.g., frame-of-reference and dictionaries, to reduce storage requirements.

Tuple count in blocks. For cost-effective downloading, each column chunk of a block should have a desired size of 16 MiB. As query processing usually works on a block granularity, all columns within one block need to have the same number of tuples. However, this results in imperfect column chunk sizes due to different datatype sizes and our byte-level encoding scheme. The range per tuple in an encoded column is between 1 and 16 bytes, excluding the variable-sized columns. Because of this wide byte spread, we need to balance the sizes of the individual column chunks by optimizing the tuple count. During block construction, we adaptively compute mean tuple counts such that no encoded column falls below ∼2 MiB to limit retrieval cost. Some fixed-sized and variable-sized columns may exceed 16 MiB, which is undesirable for retrieval. To avoid large differences in download latency between columns, Umbra splits larger column chunks into multiple smaller range requests.
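A sketch of this tuple-count computation under assumed per-column byte widths: the count is chosen so that no encoded chunk falls below ∼2 MiB, and chunks that still exceed ∼16 MiB are later split into multiple range requests. The function is illustrative, not Umbra's implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Choose the number of tuples for the next block from the (estimated)
// encoded bytes per tuple of each fixed-size column. The hard constraint
// is that no encoded chunk falls below ~2 MiB; a chunk that then exceeds
// ~16 MiB is later downloaded as several smaller range requests.
uint64_t chooseTupleCount(const std::vector<double>& bytesPerTupleByColumn) {
    const double target = 16.0 * 1024 * 1024;   // desired chunk size
    const double minSize = 2.0 * 1024 * 1024;   // lower bound to limit request cost
    double maxWidth = *std::max_element(bytesPerTupleByColumn.begin(), bytesPerTupleByColumn.end());
    double minWidth = *std::min_element(bytesPerTupleByColumn.begin(), bytesPerTupleByColumn.end());

    uint64_t byLargest = static_cast<uint64_t>(target / maxWidth);    // cap widest column near 16 MiB
    uint64_t bySmallest = static_cast<uint64_t>(minSize / minWidth);  // keep narrowest column above 2 MiB
    return std::max(byLargest, bySmallest);
}
```

With per-tuple widths between 1 and 16 bytes, the lower bound usually dominates, which is exactly the situation in which some chunks grow beyond 16 MiB and get split into multiple range requests.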
splits larger column chunks into multiple smaller range requests. HList H #1 … H #m data_1 data_2 … data_n HList H #1
Zero user-space copies. Our implementation is tightly coupled
with the buffer manager to reduce copies of data. The blocks of data
are aligned to the page sizes of the buffer manager, but we reserve Figure 15: Object structure overview on S3 for TPC-H.
space for the HTTP header and the chunk size of the recv system
call. By using the result data offset, we avoid user-space data copies.
Transparent paging. We extend the buffer manager with anony-
5.1 Data Retrieval Performance
mous pages not backed by files to take advantage of the paging Comparison with in-memory cached data. In order to analyze
and in-memory buffer management features. If a new retrieval on the retrieval capabilities of Umbra, we perform self-tests against a
the same page is necessary, we check if the page is still available. If fully in-memory version of Umbra on the popular TPC-H bench-
not, we download the data again. With this unified and transparent mark. Although storing only the current query data is sufficient,
buffer manager, we avoid retrieval and buffer space trade-offs. we are restricted to scale factor 500 to fit all query data into the
Structure of metadata. Figure 15 shows the object structure in memory of our in-memory version. Table 2 shows the performance
the cloud object storage. Within the database prefix, we store the of the remote-only (no caching of buffer pages) and the in-memory
schema information that contains all the necessary information version of our database, the end-to-end bandwidth, and the cost of
to initialize the database. Each table has its own subprefix, which the remote-only version. As mentioned, our remote-only version
contains a list of headers, headers, and data blocks. Because header ignores buffered pages and retrieves all required data from remote
objects are also cost-throughput optimized, we store fewer header storage. The bandwidth is computed by a sum of the retrieved data
objects than blocks, as each header object contains multiple block divided by the total query runtime, which serves as a lower bound.
headers. The data is organized for append-only storage, which Processing at instance bandwidth. Queries can be separated
mimics most analytical engines. Because objects can be replaced into retrieval-heavy and computation-heavy ones. The bandwidth
atomically in cloud storage, updating the list of headers creates con- is a good indicator for categorizing the queries. For example,
sistent data snapshots. Versioning the metadata is common in cloud Queries 1, 6, and 19 are the strongest representatives of the retrieval-
DBMSs to provide consistent views of the data. Apache Iceberg [17] heavy group. Umbra achieves an end-to-end bandwidth of up to
and Data Lake [18] use an analogous technique. Iceberg’s manifest 78 Gbit/s which is close to the limit. However, the factor between
files are similar to our list of headers and header objects [43]. the in-memory and the remote execution time is large because
Scan optimizations. Our implementation checks a header's min/max values to avoid unnecessary downloads. A block is only scheduled for retrieval if all table scan restrictions match the min/max values within the block metadata. Before scanning the encoded data, the processing thread has to decode the data. We repeatedly fill a small chunk with decoded data and process it. Umbra can either decode the data entirely or only decode tuples that satisfy the restrictions. Both approaches leverage vectorized SIMD instructions.
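The pruning step can be sketched as follows; the metadata fields are an illustrative subset of what a block header carries.

    #include <cstdint>
    #include <vector>

    // Per-block metadata as kept in the header objects (illustrative subset).
    struct BlockHeader {
        int64_t min = 0;
        int64_t max = 0;
        uint64_t blockId = 0;
    };

    struct Restriction {  // e.g., WHERE col BETWEEN lo AND hi
        int64_t lo;
        int64_t hi;
    };

    // A block can only contain qualifying tuples if the ranges overlap.
    bool mayQualify(const BlockHeader& h, const Restriction& r) {
        return h.max >= r.lo && h.min <= r.hi;
    }

    // Schedule only blocks whose min/max range matches all restrictions;
    // everything else is never downloaded.
    std::vector<uint64_t> blocksToFetch(const std::vector<BlockHeader>& headers,
                                        const std::vector<Restriction>& restrictions) {
        std::vector<uint64_t> result;
        for (const auto& h : headers) {
            bool qualifies = true;
            for (const auto& r : restrictions)
                qualifies &= mayQualify(h, r);
            if (qualifies) result.push_back(h.blockId);
        }
        return result;
    }

Decoding then proceeds chunk by chunk over the fetched blocks, so that only a small amount of decoded data is materialized at a time.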
4.5 Encryption & Compression

Size reduction with strong compression. Although the bandwidth to external storage is high, modern engines might still wait on data arrival. With the encoding schemes presented in Section 4.4, the size of the columns is already reduced. Additional, stronger compression reduces them further. We use bit-packing for integer-encoded columns and apply LZ4 on the remaining ones.
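As a sketch of this two-step scheme (bit widths fixed and error handling omitted; the engine's actual encodings are richer), integer columns can be bit-packed before LZ4 handles the remaining ones:

    #include <cstdint>
    #include <vector>
    #include <lz4.h>

    // Pack 32-bit values into 'bits' bits each (assumes all values fit).
    std::vector<uint8_t> bitPack(const std::vector<uint32_t>& values, unsigned bits) {
        std::vector<uint8_t> out((values.size() * bits + 7) / 8, 0);
        size_t bitPos = 0;
        for (uint32_t v : values) {
            for (unsigned b = 0; b < bits; ++b, ++bitPos)
                if (v & (1u << b))
                    out[bitPos / 8] |= uint8_t(1u << (bitPos % 8));
        }
        return out;
    }

    // Generic LZ4 pass for the remaining (non-integer) column data.
    std::vector<char> lz4Compress(const std::vector<char>& input) {
        std::vector<char> out(LZ4_compressBound(static_cast<int>(input.size())));
        int written = LZ4_compress_default(input.data(), out.data(),
                                           static_cast<int>(input.size()),
                                           static_cast<int>(out.size()));
        out.resize(written > 0 ? written : 0);  // 0 signals a compression error
        return out;
    }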
Security due to encryption-at-rest. As already described in Section 2.6, encryption not only secures the traffic in transit but also stores the data at rest inaccessible to third parties. Before uploading a column, we use AnyBlob to encrypt the individual columns of a block. Although encryption with AES has a slight performance penalty, most real-world users prefer the gained security benefits.
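A minimal AES-256-GCM sketch using OpenSSL's EVP interface [20]; key and nonce management, error handling, and the integration into the upload path are omitted, and this is an illustration rather than AnyBlob's actual encryption code.

    #include <openssl/evp.h>
    #include <cstdint>
    #include <vector>

    // Encrypts one column buffer with AES-256-GCM; returns the ciphertext with
    // the 16-byte authentication tag appended at the end.
    std::vector<uint8_t> encryptColumn(const std::vector<uint8_t>& plain,
                                       const uint8_t key[32], const uint8_t iv[12]) {
        std::vector<uint8_t> cipher(plain.size() + 16);
        int len = 0, total = 0;

        EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();
        EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), nullptr, key, iv);
        EVP_EncryptUpdate(ctx, cipher.data(), &len, plain.data(),
                          static_cast<int>(plain.size()));
        total = len;
        EVP_EncryptFinal_ex(ctx, cipher.data() + total, &len);
        total += len;
        // Retrieve the GCM tag and store it behind the ciphertext.
        EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, cipher.data() + total);
        EVP_CIPHER_CTX_free(ctx);

        cipher.resize(total + 16);
        return cipher;
    }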
5 EXPERIMENTAL EVALUATION

Setup. We extended our high-performance database system Umbra to support efficient analytics on disaggregated cloud object stores. All experiments are conducted at AWS in region eu-central-1. Unless otherwise noted, we use a single c5n.18xlarge (72 vCPUs / 36 cores, 192 GiB main memory, 100 Gbit/s network) instance with Ubuntu.

5.1 Data Retrieval Performance

Comparison with in-memory cached data. In order to analyze the retrieval capabilities of Umbra, we perform self-tests against a fully in-memory version of Umbra on the popular TPC-H benchmark. Although storing only the current query data is sufficient, we are restricted to scale factor 500 to fit all query data into the memory of our in-memory version. Table 2 shows the performance of the remote-only (no caching of buffer pages) and the in-memory version of our database, the end-to-end bandwidth, and the cost of the remote-only version. As mentioned, our remote-only version ignores buffered pages and retrieves all required data from remote storage. The bandwidth is computed as the sum of the retrieved data divided by the total query runtime, which serves as a lower bound.
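For a concrete reading of this metric, take Query 1 from Table 2: with a runtime of 3.52 s at 75 Gbit/s, the retrieved volume is roughly 75 Gbit/s × 3.52 s = 264 Gbit ≈ 33 GB. Because the total runtime also includes phases that are not retrieval-bound, this end-to-end average understates the peak retrieval rate, which is why it serves as a lower bound.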
Processing at instance bandwidth. Queries can be separated into retrieval-heavy and computation-heavy ones. The bandwidth is a good indicator for categorizing the queries. For example, Queries 1, 6, and 19 are the strongest representatives of the retrieval-heavy group. Umbra achieves an end-to-end bandwidth of up to 78 Gbit/s, which is close to the limit. However, the factor between the in-memory and the remote execution time is large because Umbra could process more tuples than the network can provide.

No overhead for computationally-intensive queries. On the other hand, we observe only minor differences between the in-memory and remote-only versions for computationally intensive queries. For example, Queries 9 and 18 have a factor of ≤ 1.3×. Because the DBMS is at its processing limit due to intensive joins and aggregations, fetching of blocks is not very noticeable.

Effective scheduling. This shows the effectiveness of our scheduling algorithm. If the query is retrieval intensive, we saturate network bandwidth while continuing to process data. On the other hand, if Umbra is limited by computation, our scheduler does not waste CPU resources on idle downloading processes.

Spot instances. In the remote Umbra scenario, spot instances can be leveraged without any performance cliffs. However, additional safeguards need to be in place due to early instance termination. Queries affected by termination might require restarts, and commit persistence must be guaranteed.

5.2 Retrieval Manager Study

Different retrieval managers on chokepoint queries. To demonstrate the properties of our design and validate our AnyBlob results, we test different retrieval options within Umbra. We test our DBMS on EBS (gp3, no page cache) and on cloud object storage (no object cache). For retrieving data from S3, we implemented three different strategies. First, we use the worker threads to download their currently required object from remote storage with the AWS S3 library. The second strategy uses our asynchronous retrieval integration design, shown in Section 4, and combines it asynchronously with the AWS library. The last configuration leverages our integration and AnyBlob (Sections 3 and 4). To demonstrate our cloud storage performance, all remaining experiments force Umbra to ignore columns already available in the buffer manager. Umbra always fetches these columns from remote storage.
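To make the three configurations concrete, the sketch below contrasts a worker that blocks on its own GET with a design that hands requests to a shared download manager and keeps many requests outstanding; the interfaces are hypothetical illustrations of the integration described in Section 4, not Umbra's or AnyBlob's actual APIs.

    #include <cstddef>
    #include <future>
    #include <string>
    #include <utility>
    #include <vector>

    struct Block { std::vector<std::byte> data; };

    // Strategy 1: the scan worker performs a blocking GET itself (e.g., via the
    // vendor SDK) and cannot process other blocks while the request is in flight.
    Block fetchBlocking(std::string key) {
        (void)key;
        return {};  // placeholder: a real implementation issues one GET per call
    }

    // Strategies 2 and 3: requests are handed to dedicated download threads
    // (asynchronous AWS SDK or AnyBlob); scan workers keep many requests
    // outstanding and process blocks as they complete.
    struct DownloadManager {  // hypothetical interface
        std::future<Block> enqueue(std::string key) {
            // Placeholder: a real manager multiplexes many requests over shared
            // connections; here we simply run the blocking fetch on another thread.
            return std::async(std::launch::async, fetchBlocking, std::move(key));
        }
    };

    void scanPartition(DownloadManager& dm, const std::vector<std::string>& keys) {
        std::vector<std::future<Block>> inflight;
        inflight.reserve(keys.size());
        for (const auto& k : keys)
            inflight.push_back(dm.enqueue(k));  // more requests in flight than cores
        for (auto& f : inflight) {
            Block b = f.get();
            // ... decode b and evaluate the scan restrictions ...
            (void)b;
        }
    }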

Table 2: In-memory and remote-only Umbra comparison demonstrates small cloud retrieval overhead (SF 500, c5n.18xlarge).

Query GM Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21 Q22
In-Memory [s] 2.03 1.14 0.38 2.93 2.08 3.35 0.52 2.73 3.38 10.61 4.27 0.25 1.99 9.50 1.35 0.99 1.81 1.36 18.91 0.74 1.45 6.04 1.75
Remote [s] 4.94 3.52 1.97 5.87 4.18 5.77 2.47 6.41 6.86 13.34 7.68 1.14 4.74 12.47 4.15 3.97 2.42 4.63 22.20 3.82 5.06 12.24 2.54
Factor 2.42 3.08 5.16 2.01 2.01 1.72 4.78 2.35 2.03 1.26 1.80 4.58 2.39 1.31 3.07 4.01 1.34 3.41 1.17 5.15 3.50 2.03 1.45
Gbit/s 49.80 75.00 46.00 55.76 55.95 65.20 77.73 64.43 69.40 40.67 52.42 40.73 62.01 30.86 64.63 67.35 14.13 73.65 15.41 76.87 66.34 65.35 23.20
Cost S3 [¢] 0.15 0.29 0.04 0.21 0.15 0.20 0.17 0.23 0.24 0.31 0.27 0.02 0.23 0.28 0.17 0.17 0.02 0.21 0.22 0.25 0.21 0.43 0.03
Cost EC2 [¢] 0.53 0.38 0.21 0.63 0.45 0.62 0.27 0.69 0.74 1.44 0.83 0.12 0.51 1.34 0.45 0.43 0.26 0.50 2.39 0.41 0.55 1.32 0.27

Figure 17: CPU usage traces for different networking implementations collected with Linux perf (SF 1000, c5n.18xlarge). [Stacked average vCPU cores used per category (Network, Umbra, Memory, Others, Idle) for the S3, S3 ASync, and AnyBlob configurations on GM/Avg, Q1, Q9, and Q19.]

Figure 16: Internal comparison of Umbra on EBS, and on S3, + ASync (Sec. 4), + AnyBlob (Sec. 3) (SF 1000, 2 instance types). [Panels: m5zn.12xlarge and c5n.18xlarge; results per query group GM, Q1, Q9, Q19.]

Higher throughput while reducing CPU usage. In Figure 16, we test all TPC-H queries on two different machine types, both supporting 100 Gbit/s networking. EBS has the worst throughput due to the bandwidth limit of 1 GB/s. Asynchronous retrieval of more requests than cores is crucial for performance. By simply swapping the retrieval library from the asynchronous AWS SDK to AnyBlob, Umbra achieves up to a factor of 1.2× better geometric mean performance and an improvement of up to 40% on computationally expensive queries. Additionally, AnyBlob reduces the mean CPU usage by up to 25%. Recent trends indicate that the networking bandwidth increases faster than the number of CPU cores, making the resource usage of networking essential [5].
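AnyBlob builds on io_uring [35]; one way such a design lowers the networking CPU share is to batch socket operations so that a single system call submits and reaps many transfers. The liburing sketch below illustrates the pattern only; it is a simplified stand-in, not AnyBlob's actual code.

    #include <liburing.h>
    #include <array>
    #include <cstddef>
    #include <vector>

    // Submit receive operations for many sockets with one system call and reap
    // their completions; batching amortizes kernel transitions across requests.
    // Assumes the ring was initialized with io_uring_queue_init and has enough
    // submission-queue entries; TLS and HTTP parsing are omitted.
    void receiveBatch(io_uring& ring, const std::vector<int>& sockets,
                      std::vector<std::array<char, 1 << 16>>& buffers) {
        for (size_t i = 0; i < sockets.size(); ++i) {
            io_uring_sqe* sqe = io_uring_get_sqe(&ring);
            io_uring_prep_recv(sqe, sockets[i], buffers[i].data(), buffers[i].size(), 0);
            io_uring_sqe_set_data(sqe, reinterpret_cast<void*>(i));  // request id
        }
        io_uring_submit(&ring);  // one syscall submits all receives

        for (size_t done = 0; done < sockets.size(); ++done) {
            io_uring_cqe* cqe = nullptr;
            io_uring_wait_cqe(&ring, &cqe);  // completion of any outstanding request
            auto request = reinterpret_cast<size_t>(io_uring_cqe_get_data(cqe));
            int received = cqe->res;  // negative values signal errors
            // ... hand buffers[request] (first 'received' bytes) to the HTTP parser ...
            (void)request; (void)received;
            io_uring_cqe_seen(&ring, cqe);
        }
    }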
Retrieval requires significant CPU resources. Figure 17 breaks the per-query CPU utilization down into fine-grained tasks, such as network I/O, memory and buffer management, and processing (similar to [66]). We used perf to trace the resource utilization of different functions and aggregate the results. Umbra achieves an average CPU utilization of ∼75% with asynchronous networking. Networking uses a large share of CPU time that accounts for up to 25% of the total utilization, significantly reduced by AnyBlob.
5.3 Scaling Properties

Thread scaling on chokepoint queries. Since our approach is highly elastic, it is very interesting to see how Umbra scales on a varying number of cores and different instances. Figure 18 shows two chokepoint queries, which we already identified in Section 5.1. The results are measured on the same instance, but we artificially reduce the amount of parallelism within our DBMS (number of worker threads). We contrast this with the aforementioned in-memory version of our system. For retrieval-heavy queries (e.g., Query 1), we can see a plateau if enough cores are available to utilize the network completely. For the in-memory version, we measure a linear increase in performance until the hyper-threading boundary is reached. The performance of the computation-heavy queries (Query 9) increases as we add more cores. The remote-only Umbra version has almost the same throughput as the in-memory version.

Instance scaling. To demonstrate our scalability on different instances, we use smaller versions of the c5n.18xlarge. The c5n.9xlarge has a maximum bandwidth of 50 Gbit/s and 36 vCPUs; the c5n.4xlarge has 16 vCPUs and 25 Gbit/s bandwidth. The additional resources of larger instances improve the query runtime. Because our approach retains performance without warm caches, we can switch to larger instances as the workload increases.

5.4 End-To-End Study with Compression & AES

Workload & competitors. In this experiment, we compare the end-to-end performance on the TPC-H benchmark. To mimic a realistic OLAP scenario analyzing large amounts of data, we test scale factors (SF) of 100 (∼100 GiB) and 1,000 (∼1 TiB of data). Since we optimize the retrieval properties, Umbra does not cache any data to showcase our retrieval integration. We compare against Spark on a single c5n.18xlarge instance and a large warehouse of Snowflake. In 2019, Snowflake used c5d.2xlarge instances for xsmall warehouses, which was reported by a Snowflake error log [70]. Assuming this instance type for xsmall, a large warehouse would use an instance or cluster similar to our instance but with local SSDs (e.g., c5d.18xlarge or 8 × c5d.2xlarge). For Snowflake, we measure the throughput with warm cache (multiple TPC-H runs) and on another large configuration that is shut down after each query execution to enforce remote retrieval.

Fast processing from cloud storage. Figure 20 shows the performance results of different systems.

Figure 18: Scalability of queries (c5n.18xlarge). [Queries per minute for Q1 and Q9 over 0-64 threads.]

Figure 19: Scalability on different instances.

Figure 20: End-to-end system comparison on SF 100 and 1000. [Queries per minute for Umbra, Umbra+AES, Umbra+Comp., Umbra+Comp.+AES, Snowflake (Cached), Snowflake (Remote), and Spark.]

As discussed in Section 4.5, Umbra is able to encrypt data automatically and implements strong compression. Compression improves performance, but encryption has a slight overhead. In a real-world scenario, we recommend using both settings for higher security without performance degradation. Although Umbra always retrieves the data from cloud storage, the performance is similar to Snowflake, which uses data caching (e.g., local SSDs). As mentioned earlier, the actual hardware configuration of Snowflake is unknown. For example, the runtime of Query 6 suggests that the instance has higher disk bandwidth than both mentioned instance settings. Clearly, these end-to-end results are influenced by the database, its execution model, and the hardware.

6 RELATED WORK

Cloud DBMS. With the dominance of the cloud for scalable solutions, many software-as-a-service database management systems emerged. Often, specialized systems for either OLTP [16, 29, 30, 81] or OLAP [2, 25, 33, 58, 59, 67, 79, 87] were developed to cope with the new challenges in the cloud era [56]. Redshift [19] leverages AQUA, a computational caching layer, unaffected by resizing nodes [6]. Until recently, caching was unavoidable even for analytics dominated by bandwidth. However, the gap between network and NVMe bandwidth is closing, making cloud storage more attractive. AWS Athena, based on Presto [73], works directly on remote data. An experimental study contrasts the architecture of these systems [77].

Processing in the cloud. Brantner et al. [27] discuss challenges and opportunities of S3 for OLTP workloads. In 2010, an experimental study provided insights into the computation power of EC2 instances; in particular, it studies the CPU resources, memory, and disk operations [72]. Our experimental study on cloud storage provides an in-depth analysis with all the details needed for fast analytics on cloud storage. Leis and Kuschewski present a model for cost-optimal instance selection [54]. Although systems such as Hive and Spark can be self-hosted [78, 86], managed Hadoop is common [71].

Spot instances. Because spot instances come with huge discounts, mitigating the termination risk and hopping between instances has been researched [74, 76]. Our approach is a perfect fit for spot hopping since caching is not needed for good performance. Although our experiments run faster than the termination delay of AWS (2 min) [12], a migration to another instance can retain query state [84].

Serverless computing. Serverless functions are another short-term service, which allows users to deploy resources only for the duration of a request. Since a serverless function has little memory, few compute resources, and a time limit [42], many parallel function invocations are required to execute a single query. Starling [69] and Lambada [62] propose to run analytics on serverless functions. Although Lambada and Starling provide a small study on S3 in serverless environments, the characteristics are very different as these functions only have limited threads and networking (300 MiB/s), which does not require a careful retrieval design such as AnyBlob.

Cloud storage for DBMS. Cloud object stores attract attention as data warehouses due to their low costs. Two prominent storage solutions are Apache Iceberg and Delta Lake [17, 18]. Both systems use metadata stored on the cloud object stores to provide consistent snapshots. As our storage structure is similar, our fast processing on remote data can be adapted to these storage backends. Ephemeral storage systems, such as Pocket [49], and caching for cloud storage [37, 46, 85, 89] sparked a wide variety of research. Caching solutions extend from using semantic caching on a local node [37] to leveraging spot instances as a caching and offloading layer [89].

Memory disaggregation. Similar to disaggregating storage, future data centers may separate CPU from memory to improve resource flexibility. Most research finds that disaggregated memory is orthogonal to the current storage-separated design [50, 83, 90, 91].

Networking and kernel APIs. Following recent trends, future data centers will be equipped with fast Ethernet connections reaching Tbit/s [28]. OS and kernel research presents approaches to integrate these high-bandwidth network devices with low latency [28, 88]. RDMA is already explored in DBMSs for fast networks with low latency [24, 47, 57, 92]. A kernel storage API study found io_uring, used in AnyBlob, to be promising [35]. Especially for fast NVMe SSDs, it is already in widespread use [35, 41, 52, 55, 68].

7 CONCLUSION

This paper discusses the efficient and cost-effective usage of cloud object storage for analytics. Our first contribution is a detailed analysis of the characteristics of cloud object stores. With these insights, we developed AnyBlob, a modern object storage download manager based on io_uring. AnyBlob requires fewer CPU resources to achieve the same or higher throughput compared to libraries provided by cloud vendors. Finally, we demonstrated a blueprint for efficient analytics on disaggregated object stores in DBMSs. Our results show that even with disabled caching, Umbra with AnyBlob achieves performance similar to large configurations of state-of-the-art cloud database systems that cache data locally.

REFERENCES Photon: A Fast Query Engine for Lakehouse Systems. In SIGMOD Conference.
[1] Merv Adrian. 2022. DBMS Market Transformation 2021: The Big ACM, 2326–2339.
Picture. https://blogs.gartner.com/merv-adrian/2022/04/16/dbms-market- [26] Brendan Bouffler and Chris Liu. 2019. Deep-Dive Into 100G network-
transformation-2021-the-big-picture/. accessed: 2022-09-30. ing & Elastic Fabric Adapter on Amazon EC2. AWS re:Invent, https:
[2] Josep Aguilar-Saborit and Raghu Ramakrishnan. 2020. POLARIS: The Distributed //d1.awsstatic.com/events/reinvent/2019/REPEAT_2_Deep-dive_into_100G_
SQL Engine in Azure Synapse. Proc. VLDB Endow. 13, 12 (2020), 3204–3216. networking_&_Elastic_Fabric_Adapter_on_Amazon_EC2_CMP334-R2.pdf.
[3] Amazon. 2021. What’s the maximum transfer speed between Amazon EC2 and accessed: 2022-09-10.
Amazon S3? https://aws.amazon.com/premiumsupport/knowledge-center/s3- [27] Matthias Brantner, Daniela Florescu, David A. Graf, Donald Kossmann, and Tim
maximum-transfer-speed-ec2/. accessed: 2022-09-15. Kraska. 2008. Building a database on S3. In SIGMOD Conference. ACM, 251–264.
[4] Amazon. 2022. Amazon S3 Storage Classes. https://aws.amazon.com/s3/storage- [28] Qizhe Cai, Midhul Vuppalapati, Jaehyun Hwang, Christos Kozyrakis, and Rachit
classes/. accessed: 2022-10-05. Agarwal. 2022. Towards 𝜇 s tail latency and terabit ethernet: disaggregating the
[5] Amazon. 2022. Announcing Amazon EC2 C7gn instances (Preview). host network stack. In SIGCOMM. ACM, 767–779.
https://aws.amazon.com/about-aws/whats-new/2022/11/announcing-amazon- [29] Wei Cao, Feifei Li, Gui Huang, Jianghang Lou, Jianwei Zhao, Dengcheng He,
ec2-c7gn-instances-preview/. accessed: 2023-06-17. Mengshi Sun, Yingqiang Zhang, Sheng Wang, Xueqiang Wu, Han Liao, Zilin
[6] Amazon. 2022. AQUA (Advanced Query Accelerator) for Amazon Redshift. Chen, Xiaojian Fang, Mo Chen, Chenghui Liang, Yanxin Luo, Huanming Wang,
https://aws.amazon.com/redshift/features/aqua/. accessed: 2022-10-12. Songlei Wang, Zhanfeng Ma, Xinjun Yang, Xiang Peng, Yubin Ruan, Yuhui Wang,
[7] Amazon. 2022. AWS SDK for C++. https://github.com/aws/aws-sdk-cpp. ac- Jie Zhou, Jianying Wang, Qingda Hu, and Junbin Kang. 2022. PolarDB-X: An
cessed: 2022-10-05. Elastic Distributed Relational Database for Cloud-Native Applications. In ICDE.
[8] Amazon. 2022. Encryption in transit. https://docs.aws.amazon.com/AWSEC2/ IEEE, 2859–2872.
latest/UserGuide/data-protection.html#encryption-transit. accessed: 2022-09-30. [30] Wei Cao, Yingqiang Zhang, Xinjun Yang, Feifei Li, Sheng Wang, Qingda Hu,
[9] Amazon. 2022. Network maximum transmission unit (MTU) for your EC2 Xuntao Cheng, Zongzhi Chen, Zhenjun Liu, Jing Fang, Bo Wang, Yuhui Wang,
instance. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/network_ Haiqing Sun, Ze Yang, Zhushi Cheng, Sen Chen, Jian Wu, Wei Hu, Jianwei Zhao,
mtu.html. accessed: 2022-10-11. Yusong Gao, Songlu Cai, Yunyang Zhang, and Jiawang Tong. 2021. PolarDB
[10] Amazon. 2022. Performance Guidelines for Amazon S3. https://docs.aws.amazon. Serverless: A Cloud Native Database for Disaggregated Data Centers. In SIGMOD
com/AmazonS3/latest/userguide/optimizing-performance-guidelines.html. ac- Conference. ACM, 2477–2489.
cessed: 2022-10-11. [31] Jonathan Corbet. 2020. The rapid growth of io_uring. https://lwn.net/Articles/
[11] Amazon. 2022. Retrieve security credentials from instance metadata. 810414/. accessed: 2022-09-20.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon- [32] Craig Cotton, Henry Zhang, and Jamal Mazhar. 2019. New C5n Instances with
ec2.html#instance-metadata-security-credentials. accessed: 2022-10-15. 100 Gbps Networking. AWS re:Invent, https://www.youtube.com/watch?v=
[12] Amazon. 2022. Spot Instance interruption notices. https://docs.aws.amazon.com/ FJJxcwSfWYg. accessed: 2022-09-10.
AWSEC2/latest/UserGuide/spot-instance-termination-notices.html. accessed: [33] Benoît Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin
2022-10-08. Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel,
[13] Amazon. 2023. Amazon S3 pricing. https://aws.amazon.com/s3/pricing. accessed: Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Steven Pelley,
2023-06-17. Peter Povinec, Greg Rahn, Spyridon Triantafyllis, and Philipp Unterbrunner. 2016.
[14] Amazon. 2023. Compute optimizes instances: Network performance. The Snowflake Elastic Data Warehouse. In SIGMOD Conference. ACM, 215–226.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/compute-optimized- [34] Jeffrey Dean and Luiz André Barroso. 2013. The tail at scale. Commun. ACM 56,
instances.html. accessed: 2023-05-02. 2 (2013), 74–80.
[15] Amazon. 2023. Filtering and retrieving data using Amazon S3 Se- [35] Diego Didona, Jonas Pfefferle, Nikolas Ioannou, Bernard Metzler, and Animesh
lect. https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting- Trivedi. 2022. Understanding modern storage APIs: a systematic study of libaio,
content-from-objects.html. accessed: 2023-05-02. SPDK, and io_uring. In SYSTOR. ACM, 120–127.
[16] Panagiotis Antonopoulos, Alex Budovski, Cristian Diaconu, Alejandro Hernan- [36] Dominik Durner. 2022. AnyBlob. https://github.com/durner/AnyBlob/.
dez Saenz, Jack Hu, Hanuma Kodavalla, Donald Kossmann, Sandeep Lingam, [37] Dominik Durner, Badrish Chandramouli, and Yinan Li. 2021. Crystal: A Unified
Umar Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chai- Cache Storage System for Analytical Databases. Proc. VLDB Endow. 14, 11 (2021),
tanya Sreenivas Ravella, Krystyna Reisteter, Sheetal Shrotri, Dixin Tang, and 2432–2444.
Vikram Wakade. 2019. Socrates: The New SQL Server in the Cloud. In SIGMOD [38] Dominik Durner, Viktor Leis, and Thomas Neumann. 2019. On the Impact of
Conference. ACM, 1743–1756. Memory Allocation on High-Performance Query Processing. In DaMoN. ACM,
[17] Apache. 2022. Apache Iceberg. https://iceberg.apache.org/. accessed: 2022-09-10. 21:1–21:3.
[18] Michael Armbrust, Tathagata Das, Sameer Paranjpye, Reynold Xin, Shixiong [39] Google. 2023. Cloud Storage pricing. https://cloud.google.com/storage/pricing.
Zhu, Ali Ghodsi, Burak Yavuz, Mukul Murthy, Joseph Torres, Liwen Sun, Peter A. accessed: 2023-06-17.
Boncz, Mostafa Mokhtar, Herman Van Hovell, Adrian Ionescu, Alicja Luszczak, [40] Google. 2023. Google Cloud Platform C++ Client Libraries. https://github.com/
Michal Switakowski, Takuya Ueshin, Xiao Li, Michal Szafranski, Pieter Senster, googleapis/google-cloud-cpp. accessed: 2023-06-17.
and Matei Zaharia. 2020. Delta Lake: High-Performance ACID Table Storage [41] Gabriel Haas, Michael Haubenschild, and Viktor Leis. 2020. Exploiting Directly-
over Cloud Object Stores. Proc. VLDB Endow. 13, 12 (2020), 3411–3424. Attached NVMe Arrays in DBMS. In CIDR. www.cidrdb.org.
[19] Nikos Armenatzoglou, Sanuj Basu, Naga Bhanoori, Mengchu Cai, Naresh [42] Joseph M. Hellerstein, Jose M. Faleiro, Joseph Gonzalez, Johann Schleier-Smith,
Chainani, Kiran Chinta, Venkatraman Govindaraju, Todd J. Green, Monish Gupta, Vikram Sreekanti, Alexey Tumanov, and Chenggang Wu. 2019. Serverless Com-
Sebastian Hillig, Eric Hotinger, Yan Leshinksy, Jintian Liang, Michael McCreedy, puting: One Step Forward, Two Steps Back. In CIDR. www.cidrdb.org.
Fabian Nagel, Ippokratis Pandis, Panos Parchas, Rahul Pathak, Orestis Poly- [43] Jason Hughes. 2021. Apache Iceberg: An Architectural Look Under the Cov-
chroniou, Foyzur Rahman, Gaurav Saxena, Gokul Soundararajan, Sriram Sub- ers. https://www.dremio.com/resources/guides/apache-iceberg-an-architectural-
ramanian, and Doug Terry. 2022. Amazon Redshift Re-invented. In SIGMOD look-under-the-covers/. accessed: 2022-09-10.
Conference. ACM, 2205–2217. [44] IBM. 2023. About IBM COS SDKs. https://cloud.ibm.com/docs/cloud-object-
[20] OpenSSL Project Authors. 2022. OpenSSL - Cryptography and SSL/TLS Toolkit. storage?topic=cloud-object-storage-sdk-about. accessed: 2023-06-17.
https://www.openssl.org/. accessed: 2022-10-15. [45] IBM. 2023. Cloud Object Storage. https://cloud.ibm.com/objectstorage/create#
[21] Jens Axboe. 2019. Efficient IO with io_uring. https://kernel.dk/io_uring.pdf. pricing. accessed: 2023-06-17.
accessed: 2022-10-12. [46] Virajith Jalaparti, Chris Douglas, Mainak Ghosh, Ashvin Agrawal, Avrilia
[22] Jeff Barr. 2019. New C5n Instances with 100 Gbps Networking. https://aws. Floratou, Srikanth Kandula, Ishai Menache, Joseph (Seffi) Naor, and Sriram Rao.
amazon.com/blogs/aws/new-c5n-instances-with-100-gbps-networking/. ac- 2018. Netco: Cache and I/O Management for Analytics over Disaggregated Stores.
cessed: 2022-09-10. In SoCC. ACM, 186–198.
[23] Jeff Barr. 2020. Amazon S3 Update Strong Read-After-Write Consis- [47] Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2016. FaSST: Fast, Scalable
tency. https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after- and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs.
write-consistency/. accessed: 2022-10-05. In OSDI. USENIX Association, 185–201.
[24] Claude Barthels, Simon Loesing, Gustavo Alonso, and Donald Kossmann. 2015. [48] Timo Kersten, Viktor Leis, and Thomas Neumann. 2021. Tidy Tuples and Flying
Rack-Scale In-Memory Join Processing using RDMA. In SIGMOD Conference. Start: fast compilation and fast execution of relational queries in Umbra. VLDB
ACM, 1463–1475. J. 30, 5 (2021), 883–905.
[25] Alexander Behm, Shoumik Palkar, Utkarsh Agarwal, Timothy Armstrong, David [49] Ana Klimovic, Yawen Wang, Patrick Stuedi, Animesh Trivedi, Jonas Pfefferle,
Cashman, Ankur Dave, Todd Greenstein, Shant Hovsepian, Ryan Johnson, and Christos Kozyrakis. 2018. Pocket: Elastic Ephemeral Storage for Serverless
Arvind Sai Krishnan, Paul Leventis, Ala Luszczak, Prashanth Menon, Mostafa Analytics. In OSDI. USENIX Association, 427–444.
Mokhtar, Gene Pang, Sameer Paranjpye, Greg Rahn, Bart Samwel, Tom van Bus- [50] Dario Korolija, Dimitrios Koutsoukos, Kimberly Keeton, Konstantin Taranov,
sel, Herman Van Hovell, Maryann Xue, Reynold Xin, and Matei Zaharia. 2022. Dejan S. Milojicic, and Gustavo Alonso. 2022. Farview: Disaggregated Memory

with Operator Off-loading for Database Engines. In CIDR. www.cidrdb.org. [75] Matt Sidley and Sally Guo. 2021. Deep dive on Amazon S3. AWS
[51] Harald Lang, Tobias Mühlbauer, Florian Funke, Peter A. Boncz, Thomas Neu- re:Invent, https://www.slideshare.net/AmazonWebServices/stg301deep-dive-
mann, and Alfons Kemper. 2016. Data Blocks: Hybrid OLTP and OLAP on on-amazon-s3-and-glacier-architecture, https://www.youtube.com/watch?v=9_
Compressed Storage using both Vectorization and Compilation. In SIGMOD vScxbIQLY. accessed: 2022-09-10.
Conference. ACM, 311–326. [76] Supreeth Subramanya, Tian Guo, Prateek Sharma, David E. Irwin, and Prashant J.
[52] Viktor Leis, Adnan Alhomssi, Tobias Ziegler, Yannick Loeck, and Christian Diet- Shenoy. 2015. SpotOn: a batch computing service for the spot market. In SoCC.
rich. 2023. Virtual-Memory Assisted Buffer Management. In SIGMOD Conference. ACM, 329–341.
ACM. [77] Junjay Tan, Thanaa M. Ghanem, Matthew Perron, Xiangyao Yu, Michael Stone-
[53] Viktor Leis, Peter A. Boncz, Alfons Kemper, and Thomas Neumann. 2014. Morsel- braker, David J. DeWitt, Marco Serafini, Ashraf Aboulnaga, and Tim Kraska.
driven parallelism: a NUMA-aware query evaluation framework for the many- 2019. Choosing A Cloud DBMS: Architectures and Tradeoffs. Proc. VLDB Endow.
core age. In SIGMOD Conference. ACM, 743–754. 12, 12 (2019), 2170–2182.
[54] Viktor Leis and Maximilian Kuschewski. 2021. Towards Cost-Optimal Query [78] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka,
Processing in the Cloud. Proc. VLDB Endow. 14, 9 (2021), 1606–1612. Ning Zhang, Suresh Anthony, Hao Liu, and Raghotham Murthy. 2010. Hive - a
[55] Alberto Lerner and Philippe Bonnet. 2021. Not your Grandpa’s SSD: The Era of petabyte scale data warehouse using Hadoop. In ICDE. IEEE Computer Society,
Co-Designed Storage Devices. In SIGMOD Conference. ACM, 2852–2858. 996–1005.
[56] Feifei Li. 2019. Cloud native database systems at Alibaba: Opportunities and [79] Ben Vandiver, Shreya Prasad, Pratibha Rana, Eden Zik, Amin Saeidi, Pratyush
Challenges. Proc. VLDB Endow. 12, 12 (2019), 2263–2272. Parimal, Styliani Pantela, and Jaimin Dave. 2018. Eon Mode: Bringing the Vertica
[57] Feilong Liu, Lingyan Yin, and Spyros Blanas. 2017. Design and Evaluation of an Columnar Database to the Cloud. In SIGMOD Conference. ACM, 797–809.
RDMA-aware Data Shuffling Operator for Parallel Database Systems. In EuroSys. [80] Daniel Vassallo. 2023. Measure Amazon S3’s performance from any location.
ACM, 48–63. https://github.com/dvassallo/s3-benchmark. accessed: 2023-05-02.
[58] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shiv- [81] Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam,
akumar, Matt Tolton, and Theo Vassilakis. 2010. Dremel: Interactive Analysis of Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz
Web-Scale Datasets. Proc. VLDB Endow. 3, 1 (2010), 330–339. Kharatishvili, and Xiaofeng Bao. 2017. Amazon Aurora: Design Considerations
[59] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shiv- for High Throughput Cloud-Native Relational Databases. In SIGMOD Conference.
akumar, Matt Tolton, Theo Vassilakis, Hossein Ahmadi, Dan Delorey, Slava Min, ACM, 1041–1052.
Mosha Pasumansky, and Jeff Shute. 2020. Dremel: A Decade of Interactive SQL [82] Midhul Vuppalapati, Justin Miron, Rachit Agarwal, Dan Truong, Ashish Motivala,
Analysis at Web Scale. Proc. VLDB Endow. 13, 12 (2020), 3461–3472. and Thierry Cruanes. 2020. Building An Elastic Query Engine on Disaggregated
[60] Microsoft. 2023. Azure Blob Storage pricing. https://azure.microsoft.com/en- Storage. In NSDI. USENIX Association, 449–462.
us/pricing/details/storage/blobs/. accessed: 2023-06-17. [83] Ruihong Wang, Jianguo Wang, Stratos Idreos, M. Tamer Özsu, and Walid G. Aref.
[61] Microsoft. 2023. Azure SDK for C++. https://github.com/Azure/azure-sdk-for- 2022. The Case for Distributed Shared-Memory Databases with RDMA-Enabled
cpp/. accessed: 2023-06-17. Memory Disaggregation. Proc. VLDB Endow. 16, 1 (2022), 15–22.
[62] Ingo Müller, Renato Marroquín, and Gustavo Alonso. 2020. Lambada: Interactive [84] Christian Winter, Jana Giceva, Thomas Neumann, and Alfons Kemper. 2022.
Data Analytics on Cold Data Using Serverless Cloud Infrastructure. In SIGMOD On-Demand State Separation for Cloud Data Warehousing. Proc. VLDB Endow.
Conference. ACM, 115–130. 15, 11 (2022), 2966–2979.
[63] Thomas Neumann and Michael J. Freitag. 2020. Umbra: A Disk-Based System [85] Yifei Yang, Matt Youill, Matthew E. Woicik, Yizhou Liu, Xiangyao Yu, Marco
with In-Memory Performance. In CIDR. www.cidrdb.org. Serafini, Ashraf Aboulnaga, and Michael Stonebraker. 2021. FlexPushdownDB:
[64] Oracle. 2023. Cloud Storage Pricing. https://www.oracle.com/cloud/storage/ Hybrid Pushdown and Caching in a Cloud DBMS. Proc. VLDB Endow. 14, 11
pricing/. accessed: 2023-06-17. (2021), 2101–2113.
[65] Oracle. 2023. Software Development Kits. https://docs.oracle.com/en-us/iaas/ [86] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust,
Content/API/Concepts/sdks.htm. accessed: 2023-06-17. Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J.
[66] Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, and Byung-Gon Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. 2016.
Chun. 2015. Making Sense of Performance in Data Analytics Frameworks. In Apache Spark: a unified engine for big data processing. Commun. ACM 59,
NSDI. USENIX Association, 293–307. 11 (2016), 56–65.
[67] Ippokratis Pandis. 2021. The evolution of Amazon Redshift. Proc. VLDB Endow. [87] Chaoqun Zhan, Maomeng Su, Chuangxian Wei, Xiaoqiang Peng, Liang Lin,
14, 12 (2021), 3162–3163. Sheng Wang, Zhe Chen, Feifei Li, Yue Pan, Fang Zheng, and Chengliang Chai.
[68] Jong-Hyeok Park, Soyee Choi, Gihwan Oh, and Sang Won Lee. 2021. SaS: SSD 2019. AnalyticDB: Real-time OLAP Database System at Alibaba Cloud. Proc.
as SQL Database System. Proc. VLDB Endow. 14, 9 (2021), 1481–1488. VLDB Endow. 12, 12 (2019), 2059–2070.
[69] Matthew Perron, Raul Castro Fernandez, David J. DeWitt, and Samuel Mad- [88] Irene Zhang, Amanda Raybuck, Pratyush Patel, Kirk Olynyk, Jacob Nelson,
den. 2020. Starling: A Scalable Query Engine on Cloud Functions. In SIGMOD Omar S. Navarro Leija, Ashlie Martinez, Jing Liu, Anna Kornfeld Simpson, Sujay
Conference. ACM, 131–141. Jayakar, Pedro Henrique Penna, Max Demoulin, Piali Choudhury, and Anirudh
[70] Simeon Pilgrim. 2019. What are the specifications of a Snowflake server? Badam. 2021. The Demikernel Datapath OS Architecture for Microsecond-scale
https://stackoverflow.com/questions/58973007/what-are-the-specifications-of-a- Datacenter Systems. In SOSP. ACM, 195–211.
snowflake-server/58982398. accessed: 2023-05-02. [89] Qizhen Zhang, Philip A. Bernstein, Daniel S. Berger, Badrish Chandramouli,
[71] Nicolás Poggi, Josep Lluis Berral, Thomas Fenech, David Carrera, José A. Blakeley, Vincent Liu, and Boon Thau Loo. 2022. CompuCache: Remote Computable
Umar Farooq Minhas, and Nikola Vujic. 2016. The state of SQL-on-Hadoop in Caching using Spot VMs. In CIDR. www.cidrdb.org.
the cloud. In IEEE BigData. IEEE Computer Society, 1432–1443. [90] Qizhen Zhang, Yifan Cai, Sebastian Angel, Vincent Liu, Ang Chen, and
[72] Jörg Schad, Jens Dittrich, and Jorge-Arnulfo Quiané-Ruiz. 2010. Runtime Mea- Boon Thau Loo. 2020. Rethinking Data Management Systems for Disaggre-
surements in the Cloud: Observing, Analyzing, and Reducing Variance. Proc. gated Data Centers. In CIDR. www.cidrdb.org.
VLDB Endow. 3, 1 (2010), 460–471. [91] Yingqiang Zhang, Chaoyi Ruan, Cheng Li, Jimmy Yang, Wei Cao, Feifei Li, Bo
[73] Raghav Sethi, Martin Traverso, Dain Sundstrom, David Phillips, Wenlei Xie, Wang, Jing Fang, Yuhui Wang, Jingze Huo, and Chao Bi. 2021. Towards Cost-
Yutian Sun, Nezih Yegitbasi, Haozhun Jin, Eric Hwang, Nileema Shingte, and Effective and Elastic Cloud Database Deployment via Memory Disaggregation.
Christopher Berner. 2019. Presto: SQL on Everything. In ICDE. IEEE, 1802–1813. Proc. VLDB Endow. 14, 10 (2021), 1900–1912.
[74] Supreeth Shastri and David E. Irwin. 2017. HotSpot: automated server hopping [92] Tobias Ziegler, Carsten Binnig, and Viktor Leis. 2022. ScaleStore: A Fast and
in cloud spot markets. In SoCC. ACM, 493–505. Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA. In SIGMOD
Conference. ACM, 685–699.

