Check Out the Big Brain on BRAD: Simplifying Cloud Data Processing with Learned Automated Data Meshes

ABSTRACT

The last decade of database research has led to the prevalence of specialized systems for different workloads. Consequently, organizations often rely on a combination of specialized systems, organized in a Data Mesh. Data meshes present significant challenges for system administrators, including picking the right system for each workload, moving data between systems, maintaining consistency, and correctly configuring each system. Many non-expert end users (e.g., data analysts or app developers) either cannot solve their business problems, or suffer from sub-optimal performance or cost due to this complexity. We envision BRAD, a cloud system that automatically integrates and manages data and systems into an instance-optimized data mesh, allowing users to efficiently store and query data under a unified data model (i.e., relational tables) without knowledge of underlying system details. With machine learning, BRAD automatically deduces the strengths and weaknesses of each engine through a combination of offline training and online probing. Then, BRAD uses these insights to route queries to the most suitable (combination of) system(s) for efficient execution. Furthermore, BRAD automates configuration tuning, resource scaling, and data migration across component systems, and makes recommendations for more impactful decisions, such as adding or removing systems. As such, BRAD exemplifies a new class of systems that utilize machine learning and the cloud to make complex data processing more accessible to end users, raising numerous new problems in database systems, machine learning, and the cloud.

PVLDB Reference Format:
Tim Kraska, Tianyu Li, Samuel Madden, Markos Markakis, Amadou Ngom, Ziniu Wu, and Geoffrey X. Yu. Check Out the Big Brain on BRAD: Simplifying Cloud Data Processing with Learned Automated Data Meshes. PVLDB, 16(11): 3293 - 3301, 2023.
doi:10.14778/3611479.3611526

* All authors contributed equally to this paper.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 16, No. 11 ISSN 2150-8097.
doi:10.14778/3611479.3611526

1 INTRODUCTION

The last decade has seen an explosion of specialized database engines for both transactional and analytical workloads following the "one size does not fit all" mantra [71]. Today, Amazon Web Services (AWS) alone lists nearly 30 different services under its "Analytics" (e.g., Redshift, EMR, Athena) and "Database" (e.g., Aurora, DynamoDB, DocumentDB) categories. This is because no single system can provide adequate performance for all of an organization's data needs. For example, an S&P 500 corporation we are familiar with uses, among other services, a dozen Amazon Aurora databases for their website and ERP systems, MemoryDB for caching, S3 managed by AWS Lake Formation and queried by AWS EMR for their logs, over ten different Redshift clusters for dashboards and data science, and DocumentDB for content serving. Such Data Mesh architectures [20] are now common in organizations of all sizes.

Building and maintaining such a data mesh is challenging. Experts must pick the right combination of engines based on a deep understanding of the strengths and weaknesses of each engine, devise custom solutions to move data between engines, track data locations and formats, and actively evolve the mesh over time. This leads to highly complex systems that require large teams of skilled engineers to operate. Meanwhile, data mesh users (e.g., data scientists or app developers) often lack the expert knowledge to quickly identify which exact service(s) to use for their purposes, which leads to poor user experience and sub-optimal use of the data mesh. Furthermore, modern data infrastructure is often deployed on the public cloud [27] with fine-grained auto-scaling capabilities [37]. Cloud data mesh users must additionally optimize for cost-efficiency, besides performance. The ensuing complexity is quickly growing beyond human capabilities. Previous efforts have focused on automating individual systems (e.g., auto-scaling data warehouses) [9, 42, 56, 62] or optimizing for a single metric (e.g., knob tuning for performance) [39, 40, 76, 78, 87], whereas we call for a more holistic approach that navigates the complex trade-offs that arise when choosing which systems to use and which data to place on them to minimize costs and/or maximize performance.

In this paper, we argue that the way forward is to build highly autonomous, learning-powered Self-Organizing Data Meshes, and present our vision for the first such system, called BRAD. BRAD uses automation techniques, instead of human experts, to assemble,
optimize, and evolve data meshes in the cloud. Users largely interact with BRAD through a unified interface under the illusion of a single system with one copy of the data and one (SQL-based) API. Under the hood, BRAD uses ML models to extract insights about the strengths and weaknesses of available engines, discover workload patterns, smartly create and evolve the data mesh infrastructure, and optimally distribute the workload among the available engines. If needed, users can bypass the one-size-fits-all interface and directly intervene in some underlying systems while leaving others to BRAD. With BRAD, developers can enjoy increased productivity from a strong abstraction and simple interface, which hides away the management of various specialized systems; organizations can enjoy performance improvements and cost savings as our models uncover insights and adapt the data pipeline at a speed and frequency infeasible for human experts; and database internals developers can enjoy greater impact, as BRAD lowers the cost of innovation adoption by automating workload migration.

BRAD's vision presents several novel technical challenges. BRAD needs a query planner that can cleverly divide work between engines, supported by an accurate, learned model for query performance on different engines. Then, BRAD must leverage sophisticated strategies to navigate the complex trade-off space of data mesh design. To make BRAD practical, we must also develop novel learning techniques to adapt to unseen workloads and deployments and solve challenges around data synchronization and consistency. In this paper, we present the architecture of BRAD, outline our plan to address these challenges, and present promising initial results.

2 MOTIVATION AND BACKGROUND

We will first present examples of counter-intuitive optimizations on a simple data mesh: an OLTP engine and an OLAP engine.

2.1 Motivating Scenarios

OLAP systems are not always better at analytics. Conventional wisdom suggests that for the best performance within this simple data mesh one should execute transactional queries on the OLTP system (e.g., Aurora), analytical queries on the OLAP system (e.g., Redshift), and periodically synchronize between the two. This is not always true; as a counter-example, we run query 19b from the Join-Order Benchmark [46] (an analytical query) on Aurora and Redshift on the IMDB dataset, along with a modified version denoted 19e. Both queries have the same join template, but 19e omits some highly selective filter predicates (on title, cast_info, and name). As shown in Table 1, we observe that Aurora is 17× faster than Redshift on 19b, but 3.7× slower on 19e. This is because the selective filters in query 19b allow Aurora to leverage its indexes for the join. Redshift, lacking indexes, must resort to table scans for both queries. Had one chosen to process 19b on Redshift based on the query type, they would see an order-of-magnitude slowdown.

Table 1: Runtime of two queries on Aurora and Redshift.

The Best Execution Plan may be Federated. The best execution plan for a given query on a data mesh may need to combine the strengths of different engines. To illustrate, we manually split query 19e from Table 1 into two sub-queries: sub-query 1 can be optimally executed using index scans and joins, which only Aurora supports, so we route it to Aurora. We export the results as a CSV file, which we then import into Redshift. Sub-query 2 lacks filter predicates, so it is more efficient to execute on Redshift, a column store. Per Table 1, this joint execution plan is indeed faster than either Aurora or Redshift alone. The reported runtime includes 0.8 seconds to transfer the intermediate results, which could be further optimized.
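The plumbing of this hand-built federated plan is simple to express. The sketch below shows the shape of the split execution in Python; the connection endpoints, the staging bucket, and the write_csv helper are hypothetical, and the sub-query SQL is deliberately elided rather than reproduced:

```python
# Hand-built federated plan for 19e; endpoints, bucket, and write_csv are
# hypothetical, and the actual sub-query SQL is elided.
import psycopg2  # Aurora PostgreSQL and Redshift both speak the Postgres wire protocol

aurora = psycopg2.connect(host="aurora.example.internal", dbname="imdb")
redshift = psycopg2.connect(host="redshift.example.internal", dbname="imdb")

# Sub-query 1: selective predicates let Aurora use index scans and index joins.
with aurora.cursor() as cur:
    cur.execute("SELECT ...")                             # sub-query 1 of 19e
    write_csv("s3://staging/subq1.csv", cur.fetchall())   # hypothetical helper

# Stage the intermediate result in Redshift, then run the scan-heavy part there.
with redshift.cursor() as cur:
    cur.execute("CREATE TEMP TABLE subq1 (...)")          # columns elided
    cur.execute("COPY subq1 FROM 's3://staging/subq1.csv' IAM_ROLE '...' CSV")
    cur.execute("SELECT ...")            # sub-query 2 of 19e, joined against subq1
    result = cur.fetchall()
```

The 0.8 seconds of transfer time reported above corresponds to the export/COPY hop in the middle of this sketch.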
Knowing When to Scale is Non-Trivial. Operating cloud data infrastructure also invites cost optimization, as modern systems support fine-grained resource scaling. Ideally, one would only pay for the resources they need at any given time, but doing so is not easy. To illustrate this, we ran a simple e-commerce workload (consisting of sales transactions and periodic analytical reporting queries), representing a typical company's data needs as it grows, against two setups: a single instance deployment of Amazon RDS PostgreSQL, and a deployment of RDS PostgreSQL and Redshift along with an ETL pipeline that periodically copies the latest writes from RDS into Redshift. The former setup is simpler to maintain and more economical at small workload scales, but the latter setup may perform much better at larger scales. Figure 1 shows our results. The RDS-only deployment starts as the most economical setup, but at a large enough scale (scale factor 8) a combined RDS and Redshift setup becomes cheaper. Importantly, in real cloud deployments, such inflection points tend to be dynamic, subject to changing workloads, pricing models and offerings, etc. Human developers are unlikely to be able to always follow the optimal cost line.

Figure 1: The cost of two setups across workload scales. We label the changes made to maintain latency targets.
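To make the inflection point concrete, consider a toy cost model in the spirit of Figure 1. The dollar figures below are invented placeholders, not measured AWS prices:

```python
# Toy cost curves echoing Figure 1 (all dollar figures are invented):
# the RDS-only setup must keep growing a single instance with the workload,
# while the combined setup pays a fixed Redshift base plus slower RDS growth.
def monthly_cost_rds_only(scale: int) -> float:
    return 150.0 * scale

def monthly_cost_rds_plus_redshift(scale: int) -> float:
    return 420.0 + 90.0 * scale

crossover = next(s for s in range(1, 64)
                 if monthly_cost_rds_plus_redshift(s) < monthly_cost_rds_only(s))
print(f"Combined setup becomes cheaper at scale factor {crossover}")  # -> 8
```

The point of the exercise is that the crossover moves whenever the price or workload coefficients move, which is exactly why a system must track it continuously rather than decide once.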
2.2 The Case for a New Approach

As shown, automated solutions that manage a data mesh and decide how to execute user queries are needed. We argue that we should cast the challenge of hybrid workload processing as automated system composition. BRAD encompasses three novel directions:

• BRAD is a backward-compatible, incrementally-deployable solution on top of existing data meshes. Advanced users can directly access underlying systems where necessary, or use BRAD's programmable policy interface to restrict its interaction with the data mesh. This minimizes impact on legacy workloads and controls the pace of transition to autonomous operation.
Table 2: Differences between BRAD and related work.

System                          | Incremental Adoption | Multiple Data Model Support | Specialized Feat. Support | Autonomous Operation
AlloyDB (HTAP System) [30]      | No                   | No                          | No                        | No
DeltaLake (Lakehouse) [11]      | Some                 | Yes                         | Some                      | No
BigDAWG (Polystore) [25]        | No                   | Some                        | Yes                       | No
Oracle Autonomous Database [61] | No                   | No                          | No                        | Yes
BRAD                            | Yes                  | Yes                         | Some                      | Yes
on different query types. Such information, along with statistics collected by the engines, drives BRAD's cost model. We envision that performance insights are transferable across deployments and workloads on the same engine. It is therefore possible to obtain reasonable cost models through experiments in offline training deployments instead of exploring in production. By collecting large volumes of workload information and performance metrics in the cloud setting, BRAD can avoid relying on human-supplied information (e.g., that AWS Aurora is optimized for transactions) and instead discover such insights from real workloads and environments.

Beyond query execution, BRAD maintains and evolves the data mesh to match workload changes (e.g., business growth or demand spikes). BRAD first uses historical data for workload forecasting; the forecast is used by an intelligent policy engine to trigger necessary actions (e.g., increasing resources for an engine). Perhaps the most important policy decision regards data placement across engines. For example, if a user frequently runs analytics on transactionally hot tables, BRAD may need to replicate the tables in an analytical engine and trigger frequent batch export jobs to keep them in sync. The problems of query planning and mesh optimization constitute a joint optimization problem. For example, BRAD may decide to under-provision Redshift in a mesh and instead route burst workloads to a serverless engine such as Athena. Alternatively, BRAD may over-provision an OLTP system such as Aurora to handle some analytical workloads (e.g., to take advantage of indexes).
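A minimal sketch of this forecast-to-action loop follows; the Forecast fields and the action vocabulary are illustrative, not BRAD's actual interfaces:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Forecast:
    """Output of a workload-forecasting model (fields are illustrative)."""
    analytics_qps: float
    hot_tables_with_analytics: List[str] = field(default_factory=list)

def plan_actions(forecast: Forecast,
                 analytics_capacity_qps: float) -> List[Tuple[str, ...]]:
    """Map a forecast to candidate mesh actions for the policy engine to vet."""
    actions: List[Tuple[str, ...]] = []
    if forecast.analytics_qps > analytics_capacity_qps:
        actions.append(("scale_up", "redshift"))
    for table in forecast.hot_tables_with_analytics:
        # Replicate transactionally hot tables into the analytical engine
        # and schedule batch exports to keep the replicas in sync.
        actions.append(("replicate", table, "redshift"))
    return actions
```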
Lastly, BRAD must be practical: organizations already operate data meshes and want to avoid disruptions. BRAD is designed for compatibility and gradual adoption: one may deploy BRAD on an existing data mesh, and it can immediately start serving users with the single-interface experience after some initial bootstrapping. Meanwhile, legacy workloads can still interact with the underlying engines directly, bypassing BRAD. To aid gradual adoption, BRAD lets users apply policy filters. For example, a user may enforce that some table is always loaded into Redshift, or that Redshift is always provisioned with a minimum amount of resources. Policy filters can also be used to run BRAD in advisory mode, by intercepting migration decisions and asking users for permission. This addresses the corner cases where BRAD is faced with underlying systems with weaker semantics (e.g., DynamoDB) or non-SQL interfaces (e.g., Redis). For example, BRAD cannot unilaterally migrate data from a relational DBMS to a lightweight key-value store, as users may anticipate a future feature that needs strong transactional support.
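What such policy filters might look like as code is sketched below; the brad handle, its method names, and the ask_user helper are all hypothetical illustrations of the programmable policy interface mentioned in Section 2.2:

```python
# Hypothetical policy-filter API; none of these names are BRAD's real interface.
brad.policies.pin_table("orders", engine="redshift")   # table always loaded there
brad.policies.min_provisioning("redshift", nodes=2)    # resource floor

# Advisory mode: intercept migration decisions and ask for permission.
@brad.policies.on_migration
def require_approval(plan) -> bool:
    return ask_user(f"Apply this migration?\n{plan}")  # hypothetical prompt helper
```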
4 RESEARCH DIRECTIONS

4.1 Learned Query Planner

Central to BRAD is a learned query planner that maps queries to execution plans, considering factors like data availability, the strengths of each engine, and each engine's load.

4.1.1 Execution Time Cost Model. Arguably, the core component of a learned planner is a cost model that predicts the query execution time for each of the underlying engines, which the planner can use to route a query to the engine with the lowest predicted execution time. Developing this model poses new research challenges. For example, it is necessary to predict a query's performance on engines and hardware that may not yet have been tested for the user's current workload and dataset. Therefore, we must transcend existing approaches [47, 50, 52, 53, 55, 72], which use previously executed workloads on a specific engine as training data.
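The planner's basic decision rule is easy to state; a minimal sketch follows, in which the engine names and the per-engine model interface are illustrative:

```python
from typing import Callable, Dict

def route(query: str, cost_models: Dict[str, Callable[[str], float]]) -> str:
    """Pick the engine whose learned cost model predicts the lowest runtime."""
    predicted = {engine: model(query) for engine, model in cost_models.items()}
    return min(predicted, key=predicted.get)

# Usage: each model maps a SQL string to a predicted runtime in seconds, e.g.
#   route(sql, {"aurora": aurora_model, "redshift": redshift_model,
#               "athena": athena_model})
```

The hard part, of course, is obtaining cost models that remain accurate on engines, datasets, and hardware they were not trained on, which is the subject of the rest of this section.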
Recently, a dataset-agnostic cost model [33] was proposed to predict the runtime for unseen workloads. However, directly applying this model in BRAD is sub-optimal for three reasons: (i) the model requires the query execution plan as input, which may not be available, since an engine may not support the functionality that produces an execution plan (i.e., EXPLAIN) or may not contain the tables needed to produce one; (ii) the model allows dataset-specific information to leak into its features, hurting performance on unseen workloads; and (iii) the model is tailored to a single-node PostgreSQL engine with fixed hardware, which may not generalize.

To tackle problem (i), we designed a transferable cost model that takes a SQL query as input and outputs its estimated runtime in PostgreSQL. For problem (ii), we provide our cost model with the true cardinalities during the training phase to prevent the model from learning dataset-specific knowledge (e.g., cardinality). Our cost model only needs base-table and pair-wise join cardinalities, which are relatively easy to obtain either from the underlying engines directly or by using a learned cardinality estimator [58, 83].

These two ideas already provide better generalizability than Hilprecht's model [34]. As a preliminary experiment, we use the same datasets and analytical query workloads as in [34]. We train our cost model on 19 datasets and test it on 2,000 analytical queries on the unseen IMDB dataset. For this experiment, we provide our model with the true cardinalities for training and testing queries. In practice, at test time our cost model would not have access to the true cardinality. Therefore, we propose integrating a lightweight cardinality corrector from our recent research [59] that takes DBMS estimates as input and adaptively evolves when observing more queries. Comprehensive experiments [59] have been conducted to show the accuracy and practicality of this cardinality corrector. Our experiments in Figure 3 show the robustness of our cost model relative to Hilprecht's [33]: when trained on queries from 19 datasets with less than 15 s runtime (3(a)) or up to two join predicates (3(b)), our model generalizes to unseen IMDB queries with longer runtimes or more join predicates, respectively (Q-error is defined as max{predicted/true, true/predicted}; better estimates have a Q-error closer to 1).
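For reference, the metric just defined is a one-liner; the implementation below assumes strictly positive runtimes:

```python
def q_error(predicted: float, true: float) -> float:
    """max(predicted/true, true/predicted); 1.0 means a perfect estimate."""
    assert predicted > 0 and true > 0
    return max(predicted / true, true / predicted)
```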
For problem (iii), we use lightweight parameterized functions to predict query performance on different hardware. Figures 3(c) and 3(d) show our ability to accurately predict an example query's runtime on instances with unseen types (3(c)) or node counts (3(d)), given its runtime on current hardware. We are currently integrating all these components to derive an accurate and robust cost model.

Figure 3: Cost model performance overview: (a), (b) Our cost model generalizes to unseen queries with longer runtimes or more join predicates. (c), (d) We accurately predict query runtime under different instance types and/or number of nodes. Panels: (a) Query Runtime, (b) Number of Joins, (c) Instance Type, (d) Number of Nodes.

4.1.2 Cross-Engine Translation. Different component engines of BRAD may support different SQL dialects, data types, or specialized operators and therefore be mutually incompatible. To fully utilize the underlying engines, BRAD must be able to rewrite queries for different engines. Writing manual rules for translation between systems is challenging and error-prone. Recent work in automatic code understanding provides an alternative solution [18, 75]. Specifically, large language models (LLMs) trained on the documentation of each engine can translate special features and dialects between engines.
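A sketch of the shape such a translator could take is given below; complete stands in for any text-completion model (this is not a specific vendor API), and any rewritten query should be validated on the target engine (e.g., via EXPLAIN) before use:

```python
from typing import Callable

def translate_query(sql: str, source_dialect: str, target_dialect: str,
                    complete: Callable[[str], str]) -> str:
    """Ask a text-completion model to rewrite `sql` for another engine."""
    prompt = (
        f"Translate the following {source_dialect} SQL into {target_dialect} SQL, "
        f"preserving its semantics. Rewrite engine-specific functions, data types, "
        f"and operators as needed. Return only the rewritten query.\n\n{sql}"
    )
    return complete(prompt)
```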
of approach is to leverage BRAD's cloud-native setting and learn from both the large corpus of passive observations from client deployments and carefully curated shadow deployments that are able to experiment on what-if scenarios (see Section 4.2.2).

Figure 4 shows an example of BRAD optimizing a data mesh; we plot query latency (top) and monthly Redshift cost (bottom) over time. In this example, a user deploys BRAD on a mesh with Redshift running on one ra3.xlplus node and tells BRAD that their queries should finish within 10 seconds (shaded region in the figure). When BRAD's mesh optimizer runs, it predicts each query's latency across Redshift provisionings using a learned regression model; the model uses the query's measured latency on the current provisioning and the ratios between the hardware resources (vCPUs and amount of memory) across the two compared provisionings. BRAD correctly predicts (the dashed lines on the graph) that all three queries will run under 10 seconds on one dc2.large node—Redshift's most economical instance type. BRAD applies this change and reduces the mesh's monthly Redshift cost by 4× (bottom graph).
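A minimal sketch of such a resource-ratio predictor follows; the feature weights below are illustrative stand-ins for the learned regression parameters:

```python
def predict_latency(measured_s: float,
                    cur_vcpus: int, cur_mem_gib: float,
                    new_vcpus: int, new_mem_gib: float,
                    w_cpu: float = 0.6, w_mem: float = 0.4) -> float:
    """Scale a query's measured latency by a learned combination of the
    resource ratios between the current and candidate provisionings."""
    cpu_ratio = cur_vcpus / new_vcpus
    mem_ratio = cur_mem_gib / new_mem_gib
    return measured_s * (w_cpu * cpu_ratio + w_mem * mem_ratio)

# E.g., moving from one ra3.xlplus node (4 vCPUs, 32 GiB) to one dc2.large
# node (2 vCPUs, 15 GiB) roughly doubles the predicted latency here.
```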
4.2.2 System Exploration and Transfer Learning. BRAD must rely on automated, learning-based methods to explore each engine's strengths and weaknesses due to the sheer number of engines BRAD must support. However, learning requires exploring unseen configurations and execution plans, which may hurt performance in an online environment. We aim to leverage BRAD's cloud-native deployment to mitigate this. Cloud providers have access to traces of many client deployments and therefore large amounts of training data. More importantly, they can transparently capture workload traces and spin up "what-if" shadow deployments or experiments instead of exploring on live deployments. This approach has security and privacy implications, but we see these concerns as orthogonal to our system. The critical challenge is whether our model can efficiently transfer insights to unknown databases and deployments, which we have briefly addressed in Section 4.1.1. We envision that, in its complete form, BRAD is able to automatically incorporate a new engine into the mesh by first obtaining a rough performance model of it through offline deployments running standard benchmarks (e.g., TPC-C and TPC-H), and then fine-tuning using shadow deployments and real client performance data.

4.3 Data Synchronization and Consistency

A key challenge in a data mesh is to correctly synchronize data across component systems and maintain consistency where it matters, without incurring overhead elsewhere. BRAD's automated placement and migration decisions must address this challenge.

4.3.1 Session-Based Freshness Guarantees. Consistency is a natural concern in BRAD, as it encompasses multiple engines that cannot always be synchronized performantly. Since BRAD is externally a unified system, stale reads and distributed anomalies would violate its abstraction. To avoid them, we propose session-based freshness guarantees (similar to Daudjee et al. [19]), where clients issue queries within explicitly defined sessions. Within a session, a query runs against a consistent snapshot of the database, and future queries will run against the same, or a later, snapshot. For example, a session S issuing a large data lake query Q1 may use a snapshot on cloud storage, but if S then issues a transactional update Q2, it is promoted to the latest snapshot. A future analytical query Q3 will wait for the analytic engine to receive Q2's changes. Users can still avoid interleaving analytics/transactions within a session to minimize latency. These guarantees can be achieved through epoch-based logical snapshots [77] and tuple multi-versioning [14]. The challenge is to do so without modifying the underlying engines or introducing excessive runtime overhead.
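A toy model of the session guarantee under the epoch-based snapshot scheme cited above is sketched here; the mesh and engine methods are hypothetical:

```python
class Session:
    """Each session tracks the latest snapshot epoch it has observed; every
    subsequent query must run at that epoch or a later one."""

    def __init__(self, mesh):
        self.mesh = mesh    # hypothetical handle to BRAD's engines and router
        self.min_epoch = 0

    def query(self, sql: str):
        engine = self.mesh.route(sql)
        engine.wait_until_synced(self.min_epoch)  # block if the replica is stale
        result, epoch = engine.run_at_snapshot(sql, min_epoch=self.min_epoch)
        self.min_epoch = max(self.min_epoch, epoch)
        return result

    def update(self, sql: str):
        # Writes promote the session to the latest snapshot; later queries on
        # any engine wait until that epoch has been replicated there.
        self.min_epoch = self.mesh.transactional_engine.commit(sql)
```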
4.3.2 Auto-ETL. In addition to providing consistency guarantees across table replicas, BRAD needs to support more complex data dependency relationships between tables—typically handled by extract, transform, and load (ETL) jobs today. For example, ETL jobs may be used to transform the tables in a transactional DBMS before loading them into the data warehouse (e.g., to de-normalize the tables, re-arrange the tables in a star schema [60], or to compute aggregate statistics). Currently, users often rely on handcrafted transformation logic and ETL frameworks such as AWS Glue [10] or EMR [6]. This setup is both tedious for users and restrictive for BRAD, as users typically hard-code the source and destination systems of such transformations in black-box logic—preventing BRAD from freely placing tables. Instead, we envision that BRAD will support a higher-level declarative API for specifying table dependencies (e.g., table B is obtained by running the given SQL statements on table A), which allows BRAD to (i) change the locations of the inputs and outputs to a transformation (e.g., to migrate a table off of an engine), and (ii) select the system(s) on which to execute a transform (e.g., using spare capacity on Redshift or new features such as zero-ETL [4] instead of AWS Glue or EMR).
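A sketch of what such a declarative dependency might look like is shown below; the brad handle and argument names are hypothetical, and the point is that the user states only the transform, not where it runs:

```python
# Hypothetical declarative dependency: the user says *what* the derived table
# is; BRAD decides where the inputs live and which system runs the transform.
brad.define_table(
    name="orders_denormalized",        # "table B"
    inputs=["orders", "customers"],    # "table A" and friends
    transform="""
        SELECT o.*, c.name, c.segment
        FROM orders o JOIN customers c ON o.customer_id = c.id
    """,
    max_staleness="15 minutes",        # freshness target for the derived table
)
```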
5 CONCLUSION

BRAD shows a new way to assemble and operate data meshes in the cloud, relying on recent advances in automation techniques instead of human experts. For the vast majority of end users, BRAD significantly simplifies the operation of state-of-the-art data meshes and allows easier derivation of timely insights from vast amounts of data. For database researchers, BRAD lowers the barrier of adoption by providing room for automated and user-transparent migration to new engines where appropriate. This paper outlines our plan to build BRAD and presents preliminary results to show the promise of our approach. If successful, we expect BRAD to unlock the true potential of the last decade's research into specialized data systems and have a significant impact on the efficiency of modern enterprises.

ACKNOWLEDGMENTS

This research was supported by Amazon, Google, and Intel as part of the MIT Data Systems and AI Lab (DSAIL) at MIT, and by NSF IIS 1900933. Geoffrey X. Yu was partially supported by an NSERC PGS D. This research was also sponsored by the United States Air Force Research Laboratory and the Department of the Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Department of the Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
REFERENCES
[1] Michael Abebe, Horatiu Lazu, and Khuzaima Daudjee. 2022. Proteus: Autonomous Adaptive Storage for Mixed Workloads. In Proceedings of the 2022 International Conference on Management of Data (Philadelphia, PA, USA) (SIGMOD '22). Association for Computing Machinery, New York, NY, USA, 700–714. https://doi.org/10.1145/3514221.3517834
[2] Divy Agrawal, Sanjay Chawla, Bertty Contreras-Rojas, Ahmed Elmagarmid, Yasser Idris, Zoi Kaoudi, Sebastian Kruse, Ji Lucas, Essam Mansour, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Saravanan Thirumuruganathan, and Anis Troudi. 2018. RHEEM: Enabling Cross-Platform Data Processing: May the Big Data Be with You! Proceedings of the VLDB Endowment 11, 11 (July 2018), 1414–1427. https://doi.org/10.14778/3236187.3236195
[3] Rana Alotaibi, Damian Bursztyn, Alin Deutsch, Ioana Manolescu, and Stamatis Zampetakis. 2019. Towards Scalable Hybrid Stores: Constraint-Based Rewriting to the Rescue. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19). 1660–1677.
[4] Amazon Web Services. 2022. AWS announces Amazon Aurora zero-ETL integration with Amazon Redshift. https://aws.amazon.com/about-aws/whats-new/2022/11/amazon-aurora-zero-etl-integration-redshift/.
[5] Amazon Web Services. 2023. Amazon Athena. https://aws.amazon.com/athena/.
[6] Amazon Web Services. 2023. Amazon EMR. https://aws.amazon.com/emr/.
[7] Amazon Web Services. 2023. Amazon Redshift Serverless. https://aws.amazon.com/redshift/redshift-serverless/.
[8] Amazon Web Services. 2023. AWS Step Functions. https://aws.amazon.com/step-functions/.
[9] Amazon Web Services. 2023. Redshift Concurrency Scaling. https://docs.aws.amazon.com/redshift/latest/dg/concurrency-scaling.html.
[10] Amazon Web Services. 2023. What is AWS Glue? https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html.
[11] Michael Armbrust, Tathagata Das, Liwen Sun, Burak Yavuz, Shixiong Zhu, Mukul Murthy, Joseph Torres, Herman van Hovell, Adrian Ionescu, Alicja Łuszczak, Michał Świtakowski, Michał Szafrański, Xiao Li, Takuya Ueshin, Mostafa Mokhtar, Peter Boncz, Ali Ghodsi, Sameer Paranjpye, Pieter Senster, Reynold Xin, and Matei Zaharia. 2020. Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. Proceedings of the VLDB Endowment 13, 12 (2020), 3411–3424. https://doi.org/10.14778/3415478.3415560
[12] Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia. 2021. Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. In Proceedings of the 11th Annual Conference on Innovative Data Systems Research (CIDR '21).
[13] Graham Bent, Patrick Dantressangle, David Vyvyan, Abbe Mowshowitz, and Valia Mitsou. 2008. A Dynamic Distributed Federated Database. In Proc. 2nd Ann. Conf. International Technology Alliance (ACITA '08).
[14] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. 1987. Concurrency Control and Recovery in Database Systems. Addison-Wesley.
[15] Yuri Breitbart, Hector Garcia-Molina, and Abraham Silberschatz. 1992. Overview of Multidatabase Transaction Management. VLDB Journal 1 (10 1992), 181–239. https://doi.org/10.1145/1925805.1925811
[16] Yuri Breitbart and Avi Silberschatz. 1988. Multidatabase Update Issues. In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data (Chicago, Illinois, USA) (SIGMOD '88). Association for Computing Machinery, New York, NY, USA, 135–142. https://doi.org/10.1145/50202.50217
[17] Sebastian Burckhardt, Chris Gillum, David Justo, Konstantinos Kallas, Connor McMahon, and Christopher S. Meiklejohn. 2021. Durable Functions: Semantics for Stateful Serverless. Proc. ACM Program. Lang. 5, OOPSLA (2021), 1–27.
[18] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG]
[19] Khuzaima Daudjee and Kenneth Salem. 2006. Lazy Database Replication with Snapshot Isolation. Proceedings of the VLDB Endowment (VLDB '06).
[20] Z. Dehghani. 2022. Data Mesh. O'Reilly Media. https://books.google.com/books?id=jmZjEAAAQBAJ
[21] Amol Deshpande and Joseph M. Hellerstein. 2002. Decoupled Query Optimization for Federated Database Systems. In Proceedings of the 18th International Conference on Data Engineering (ICDE '02). IEEE, 716–727.
[22] Jialin Ding, Umar Farooq Minhas, Badrish Chandramouli, Chi Wang, Yinan Li, Ying Li, Donald Kossmann, Johannes Gehrke, and Tim Kraska. 2021. Instance-Optimized Data Layouts for Cloud Analytics Workloads. In Proceedings of the 2021 International Conference on Management of Data (Virtual Event, China) (SIGMOD '21). Association for Computing Machinery, New York, NY, USA, 418–431. https://doi.org/10.1145/3448016.3457270
[23] Jialin Ding, Vikram Nathan, Mohammad Alizadeh, and Tim Kraska. 2020. Tsunami: A Learned Multi-Dimensional Index for Correlated Data and Skewed Workloads. Proceedings of the VLDB Endowment 14, 2 (November 2020), 74–86. https://doi.org/10.14778/3425879.3425880
[24] Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, Magda Balazinska, Bill Howe, Jeremy Kepner, Sam Madden, David Maier, Tim Mattson, and Stan Zdonik. 2015. The BigDAWG Polystore System. SIGMOD Rec. 44, 2 (August 2015), 11–16. https://doi.org/10.1145/2814710.2814713
[25] Aaron J. Elmore, Jennie Duggan, Mike Stonebraker, Magdalena Balazinska, Ugur Çetintemel, Vijay Gadepally, Jeffrey Heer, Bill Howe, Jeremy Kepner, Tim Kraska, Samuel Madden, David Maier, Timothy G. Mattson, Stavros Papadopoulos, Jeff Parkhurst, Nesime Tatbul, Manasi Vartak, and Stan Zdonik. 2015. A Demonstration of the BigDAWG Polystore System. Proceedings of the VLDB Endowment 8, 12 (2015), 1908–1911. http://www.vldb.org/pvldb/vol8/p1908-Elmore.pdf
[26] Franz Färber, Norman May, Wolfgang Lehner, Philipp Große, Ingo Müller, Hannes Rauhe, and Jonathan Dees. 2012. The SAP HANA Database – An Architecture Overview. IEEE Data Eng. Bull. 35 (03 2012), 28–33.
[27] Gartner. 2022. DBMS Market Transformation 2021: The Big Picture. https://blogs.gartner.com/merv-adrian/2022/04/16/dbms-market-transformation-2021-the-big-picture/.
[28] Dimitrios Georgakopoulos, Marek Rusinkiewicz, and Amit P. Sheth. 1991. On Serializability of Multidatabase Transactions Through Forced Local Conflicts. In Proceedings of the Seventh International Conference on Data Engineering (ICDE '91). IEEE Computer Society, USA, 314–323.
[29] Victor Giannakouris and Immanuel Trummer. 2022. Building Learned Federated Query Optimizers. In CEUR Workshop Proceedings, Vol. 3186.
[30] Google, Inc. 2023. AlloyDB. https://cloud.google.com/alloydb.
[31] Laura Haas, Donald Kossmann, Edward Wimmers, and Jun Yang. 1997. Optimizing Queries Across Diverse Data Sources. In Proceedings of the VLDB Endowment (VLDB '97).
[32] Joachim Hammer, Hector Garcia-Molina, Kelly Ireland, Yannis Papakonstantinou, Jeffrey Ullman, and Jennifer Widom. 1995. Information Translation, Mediation, and Mosaic-Based Browsing in the TSIMMIS System. In Proceedings of the International Conference on Management of Data (SIGMOD '95).
[33] Benjamin Hilprecht and Carsten Binnig. 2022. Zero-Shot Cost Models for Out-of-the-box Learned Cost Prediction. arXiv preprint arXiv:2201.00561 (2022).
[34] Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, and Carsten Binnig. 2019. DeepDB: Learn from Data, not from Queries! arXiv preprint arXiv:1909.00607 (2019).
[35] Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang, Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang, Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian Zhang, Jianjun Li, Xuelian Wu, Lingyu Song, Ruoxi Sun, Shuaipeng Yu, Lei Zhao, Nicholas Cameron, Liquan Pei, and Xin Tang. 2020. TiDB: A Raft-Based HTAP Database. Proceedings of the VLDB Endowment 13, 12 (August 2020), 3072–3084. https://doi.org/10.14778/3415478.3415535
[36] S.-Y. Hwang, E.-P. Lim, H.-R. Yang, S. Musukula, K. Mediratta, M. Ganesh, D. Clements, J. Stenoien, and J. Srivastava. 1994. The MYRIAD Federated Database Prototype. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data (Minneapolis, Minnesota, USA) (SIGMOD '94). Association for Computing Machinery, New York, NY, USA, 518. https://doi.org/10.1145/191839.191986
[37] Eric Jonas, Johann Schleier-Smith, Vikram Sreekanti, Chia-che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, Joao Carreira, Karl Krauth, Neeraja Jayant Yadwadkar, Joseph Gonzalez, Raluca A. Popa, Ion Stoica, and David A. Patterson. 2019. Cloud Programming Simplified: A Berkeley View on Serverless Computing. arXiv abs/1902.03383 (2019).
[38] Vanja Josifovski, Peter Schwarz, Laura Haas, and Eileen Lin. 2002. Garlic: A New Flavor of Federated Query Processing for DB2. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (SIGMOD '02). 524–532.
[39] Konstantinos Kanellis, Ramnatthan Alagappan, and Shivaram Venkataraman. 2020. Too Many Knobs to Tune? Towards Faster Database Tuning by Pre-selecting Important Knobs. In 12th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage '20).
[40] Konstantinos Kanellis, Cong Ding, Brian Kroth, Andreas Müller, Carlo Curino, and Shivaram Venkataraman. 2022. LlamaTune: Sample-Efficient DBMS Configuration Tuning. Proceedings of the VLDB Endowment 15, 11 (2022), 2953–2965.
[41] Alfons Kemper and Thomas Neumann. 2011. HyPer: A Hybrid OLTP & OLAP Main Memory Database System Based on Virtual Memory Snapshots. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering (ICDE '11). IEEE Computer Society, USA, 195–206. https://doi.org/10.1109/ICDE.2011.5767867
[42] Tim Kraska, Mohammad Alizadeh, Alex Beutel, Ed H. Chi, Ani Kristo, Guillaume Leclerc, Samuel Madden, Hongzi Mao, and Vikram Nathan. 2019. SageDB: A Learned Database System. In 9th Biennial Conference on Innovative Data Systems Research (CIDR '19), Asilomar, CA, USA, January 13-16, 2019, Online Proceedings. www.cidrdb.org. http://cidrdb.org/cidr2019/papers/p117-kraska-cidr19.pdf
[43] Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. 2017. The Case for Learned Index Structures. CoRR abs/1712.01208 (2017). arXiv:1712.01208 http://arxiv.org/abs/1712.01208
[44] Sanjay Krishnan, Zongheng Yang, Ken Goldberg, Joseph Hellerstein, and Ion Stoica. 2018. Learning to Optimize Join Queries with Deep Reinforcement Learning. arXiv preprint arXiv:1808.03196 (2018).
[45] Tirthankar Lahiri, Shasank Chavan, Maria Colgan, Dinesh Das, Amit Ganesh, Mike Gleeson, Sanket Hase, Allison Holloway, Jesse Kamp, Teck-Hua Lee, Juan Loaiza, Neil Macnaughton, Vineet Marwah, Niloy Mukherjee, Atrayee Mullick, Sujatha Muthulingam, Vivekanandhan Raja, Marty Roth, Ekrem Soylemez, and Mohamed Zait. 2015. Oracle Database In-Memory: A Dual Format In-Memory Database. In 2015 IEEE 31st International Conference on Data Engineering (ICDE '15). 1253–1258. https://doi.org/10.1109/ICDE.2015.7113373
[46] Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How Good are Query Optimizers, Really? Proceedings of the VLDB Endowment 9, 3 (2015), 204–215.
[47] Jiexing Li, Arnd Christian König, Vivek Narasayya, and Surajit Chaudhuri. 2012. Robust Estimation of Resource Consumption for SQL Queries Using Statistical Techniques. Proceedings of the VLDB Endowment 5, 11 (2012).
[48] Ee-Peng Lim and Jaideep Srivastava. 1993. Query Optimization and Processing in Federated Database Systems. In Proceedings of the Second International Conference on Information and Knowledge Management (CIKM '93). 720–722.
[49] Wan Shen Lim, Matthew Butrovich, William Zhang, Andrew Crotty, Lin Ma, Peijing Xu, Johannes Gehrke, and Andrew Pavlo. 2023. Database Gyms. In Conference on Innovative Data Systems Research (CIDR '23).
[50] Lin Ma, Bailu Ding, Sudipto Das, and Adith Swaminathan. 2020. Active Learning for ML Enhanced Database Systems. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD '20). 175–191.
[51] Lin Ma, Dana Van Aken, Ahmed Hefny, Gustavo Mezerhane, Andrew Pavlo, and Geoffrey J. Gordon. 2018. Query-Based Workload Forecasting for Self-Driving Database Management Systems. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). 631–645.
[52] Ryan Marcus, Parimarjan Negi, Hongzi Mao, Nesime Tatbul, Mohammad Alizadeh, and Tim Kraska. 2022. Bao: Making Learned Query Optimization Practical. In Proceedings of the International Conference on Management of Data (SIGMOD '22).
[53] Ryan Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, and Nesime Tatbul. 2019. Neo: A Learned Query Optimizer. Proceedings of the VLDB Endowment 12, 11 (2019).
[54] Ryan Marcus and Olga Papaemmanouil. 2018. Deep Reinforcement Learning for Join Order Enumeration. In Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM '18).
[55] Ryan Marcus and Olga Papaemmanouil. 2019. Plan-Structured Deep Neural Network Models for Query Performance Prediction. Proceedings of the VLDB Endowment 12, 11 (2019).
[56] Microsoft Corporation. 2023. Serverless Compute Tier for Azure SQL Database. https://learn.microsoft.com/en-us/azure/azure-sql/database/serverless-tier-overview?view=azuresql&tabs=general-purpose.
[57] Vikram Nathan, Jialin Ding, Mohammad Alizadeh, and Tim Kraska. 2020. Learning Multi-Dimensional Indexes. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (Portland, OR, USA) (SIGMOD '20). Association for Computing Machinery, New York, NY, USA, 985–1000. https://doi.org/10.1145/3318464.3380579
[58] Parimarjan Negi, Ryan Marcus, Andreas Kipf, Hongzi Mao, Nesime Tatbul, Tim Kraska, and Mohammad Alizadeh. 2021. Flow-Loss: Learning Cardinality Estimates That Matter. Proceedings of the VLDB Endowment 14, 11 (2021).
[59] Parimarjan Negi, Ziniu Wu, Andreas Kipf, Nesime Tatbul, Ryan Marcus, Sam Madden, Tim Kraska, and Mohammad Alizadeh. 2023. Robust Query Driven Cardinality Estimation under Changing Workloads. Proceedings of the VLDB Endowment 16, 6 (2023), 1520–1533.
[60] Patrick O'Neil, Betty O'Neil, and Xuedong Chen. 2006. Star Schema Benchmark. Technical Report. University of Massachusetts Boston. https://www.cs.umb.edu/~poneil/StarSchemaB.PDF.
[61] Oracle. 2023. Oracle Autonomous Database. https://www.oracle.com/autonomous-database/.
[62] Andrew Pavlo, Gustavo Angulo, Joy Arulraj, Haibin Lin, Jiexi Lin, Lin Ma, Prashanth Menon, Todd Mowry, Matthew Perron, Ian Quah, Siddharth Santurkar, Anthony Tomasic, Skye Toor, Dana Van Aken, Ziqi Wang, Yingjun Wu, Ran Xian, and Tieying Zhang. 2017. Self-Driving Database Management Systems. In Conference on Innovative Data Systems Research (CIDR '17). https://db.cs.cmu.edu/papers/2017/p42-pavlo-cidr17.pdf
[63] Andrew Pavlo, Matthew Butrovich, Ananya Joshi, Lin Ma, Prashanth Menon, Dana Van Aken, Lisa Lee, and Ruslan Salakhutdinov. 2019. External vs. Internal: An Essay on Machine Learning Agents for Autonomous Database Management Systems. IEEE Data Engineering Bulletin (June 2019), 32–46. https://db.cs.cmu.edu/papers/2019/pavlo-icde-bulletin2019.pdf
[64] Andrew Pavlo, Matthew Butrovich, Lin Ma, Wan Shen Lim, Prashanth Menon, Dana Van Aken, and William Zhang. 2021. Make Your Database System Dream of Electric Sheep: Towards Self-Driving Operation. Proceedings of the VLDB Endowment 14, 12 (2021), 3211–3221. https://db.cs.cmu.edu/papers/2021/p3211-pavlo.pdf
[65] Maksim Podkorytov and Michael Gubanov. 2019. Hybrid.Poly: A Consolidated Interactive Analytical Polystore System. In 2019 IEEE 35th International Conference on Data Engineering (ICDE '19). 1996–1999. https://doi.org/10.1109/ICDE.2019.00223
[66] Calton Pu. 1988. Superdatabases for Composition of Heterogeneous Databases. In Proceedings of the Fourth International Conference on Data Engineering. IEEE Computer Society, USA, 548–555.
[67] Mary Tork Roth, Laura M. Haas, and Fatma Ozcan. 1999. Cost Models Do Matter: Providing Cost Information for Diverse Data Sources in a Federated System. IBM Thomas J. Watson Research Division.
[68] P. Griffiths Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. 1979. Access Path Selection in a Relational Database Management System. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data (Boston, Massachusetts) (SIGMOD '79). Association for Computing Machinery, New York, NY, USA, 23–34. https://doi.org/10.1145/582095.582099
[69] Amit P. Sheth and James A. Larson. 1990. Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Computing Surveys (CSUR) 22, 3 (1990), 183–236.
[70] Vishal Sikka, Franz Färber, Wolfgang Lehner, Sang Kyun Cha, Thomas Peh, and Christof Bornhövd. 2012. Efficient Transaction Processing in SAP HANA Database: The End of a Column Store Myth. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (Scottsdale, Arizona, USA) (SIGMOD '12). Association for Computing Machinery, New York, NY, USA, 731–742. https://doi.org/10.1145/2213836.2213946
[71] Michael Stonebraker and Ugur Cetintemel. 2005. "One Size Fits All": An Idea Whose Time Has Come and Gone. In Proceedings of the 21st International Conference on Data Engineering (ICDE '05). IEEE Computer Society, USA, 2–11. https://doi.org/10.1109/ICDE.2005.1
[72] Ji Sun and Guoliang Li. 2019. An End-to-End Learning-based Cost Estimator. Proceedings of the VLDB Endowment 13, 3 (2019).
[73] Rebecca Taft, Nosayba El-Sayed, Marco Serafini, Yu Lu, Ashraf Aboulnaga, Michael Stonebraker, Ricardo Mayerhofer, and Francisco Andrade. 2018. P-Store: An Elastic Database System with Predictive Provisioning. In Proceedings of the 2018 International Conference on Management of Data (Houston, TX, USA) (SIGMOD '18). Association for Computing Machinery, New York, NY, USA, 205–219. https://doi.org/10.1145/3183713.3190650
[74] Anthony Tomasic, Remy Amouroux, Philippe Bonnet, Olga Kapitskaia, Hubert Naacke, and Louiqa Raschid. 1997. The Distributed Information Search Component (Disco) and the World Wide Web. ACM SIGMOD Record 26, 2 (1997), 546–548.
[75] Immanuel Trummer. 2022. CodexDB: Generating Code for Processing SQL Queries using GPT-3 Codex. arXiv:2204.08941 [cs.DB]
[76] Immanuel Trummer. 2022. DB-BERT: A Database Tuning Tool That "Reads the Manual". In Proceedings of the 2022 International Conference on Management of Data (Philadelphia, PA, USA) (SIGMOD '22). Association for Computing Machinery, New York, NY, USA, 190–203. https://doi.org/10.1145/3514221.3517843
[77] Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Madden. 2013. Speedy Transactions in Multicore In-Memory Databases. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). 18–32.
[78] Dana Van Aken, Andrew Pavlo, Geoffrey J. Gordon, and Bohan Zhang. 2017. Automatic Database Management System Tuning Through Large-Scale Machine Learning. In Proceedings of the 2017 ACM International Conference on Management of Data (Chicago, Illinois, USA) (SIGMOD '17). Association for Computing Machinery, New York, NY, USA, 1009–1024. https://doi.org/10.1145/3035918.3064029
[79] Marco Vogt, Alexander Stiemer, and Heiko Schuldt. 2018. Polypheny-DB: Towards a Distributed and Self-Adaptive Polystore. In 2018 IEEE International Conference on Big Data (Big Data). IEEE, 3364–3373.
[80] Jingjing Wang, Tobin Baker, Magdalena Balazinska, Daniel Halperin, Brandon Haynes, Bill Howe, Dylan Hutchison, Shrainik Jain, Ryan Maas, Parmita Mehta, Dominik Moritz, Brandon Myers, Jennifer Ortiz, Dan Suciu, Andrew Whitaker, and Shengliang Xu. 2017. The Myria Big Data Management and Analytics System and Cloud Services. In Proceedings of the Conference on Innovative Data Systems Research (CIDR '17).
[81] Christopher J. C. H. Watkins and Peter Dayan. 1992. Q-learning. Machine Learning 8, 3 (1992), 279–292. https://doi.org/10.1007/BF00992698
[82] Ronald J. Williams. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Mach. Learn. 8, 3–4 (May 1992), 229–256. https://doi.org/10.1007/BF00992696
[83] Ziniu Wu, Parimarjan Negi, Mohammad Alizadeh, Tim Kraska, and Samuel Madden. 2023. FactorJoin: A New Cardinality Estimation Framework for Join Queries. Proc. ACM Manag. Data 1, 1, Article 41 (May 2023), 27 pages. https://doi.org/10.1145/3588721
[84] Geoffrey X. Yu, Markos Markakis, Andreas Kipf, Per-Åke Larson, Umar Farooq Minhas, and Tim Kraska. 2022. TreeLine: An Update-In-Place Key-Value Store for Modern Storage. Proceedings of the VLDB Endowment 16, 1 (2022), 99–112.
[85] Xiang Yu, Guoliang Li, Chengliang Chai, and Nan Tang. 2020. Reinforcement Learning with Tree-LSTM for Join Order Selection. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, 1297–1308.
[86] Jianqiu Zhang, Kaisong Huang, Tianzheng Wang, and King Lv. 2022. Skeena: Efficient and Consistent Cross-Engine Transactions. In Proceedings of the 2022 International Conference on Management of Data (Philadelphia, PA, USA) (SIGMOD '22). Association for Computing Machinery, New York, NY, USA, 34–48. https://doi.org/10.1145/3514221.3526171
[87] Ji Zhang, Yu Liu, Ke Zhou, Guoliang Li, Zhili Xiao, Bin Cheng, Jiashu Xing, Yangtao Wang, Tianheng Cheng, Li Liu, Minwei Ran, and Zekang Li. 2019. An End-to-End Automatic Cloud Database Tuning System Using Deep Reinforcement Learning. In Proceedings of the 2019 International Conference on Management of Data (Amsterdam, Netherlands) (SIGMOD '19). Association for Computing Machinery, New York, NY, USA, 415–432. https://doi.org/10.1145/3299869.3300085
[88] Xiuwen Zheng, Subhasis Dasgupta, Arun Kumar, and Amarnath Gupta. 2022. AWESOME: Empowering Scalable Data Science on Social Media Data with an Optimized Tri-Store Data System. arXiv:2112.00833 [cs.DB]