Check Out the Big Brain on BRAD: Simplifying Cloud Data Processing with Learned Automated Data Meshes

ABSTRACT

The last decade of database research has led to the prevalence of specialized systems for different workloads. Consequently, organizations often rely on a combination of specialized systems, organized in a Data Mesh. Data meshes present significant challenges for system administrators, including picking the right system for each workload, moving data between systems, maintaining consistency, and correctly configuring each system. Many non-expert end users (e.g., data analysts or app developers) either cannot solve their business problems, or suffer from sub-optimal performance or cost due to this complexity. We envision BRAD, a cloud system that automatically integrates and manages data and systems into an instance-optimized data mesh, allowing users to efficiently store and query data under a unified data model (i.e., relational tables) without knowledge of underlying system details. With machine learning, BRAD automatically deduces the strengths and weaknesses of each engine through a combination of offline training and online probing. Then, BRAD uses these insights to route queries to the most suitable (combination of) system(s) for efficient execution. Furthermore, BRAD automates configuration tuning, resource scaling, and data migration across component systems, and makes recommendations for more impactful decisions, such as adding or removing systems. As such, BRAD exemplifies a new class of systems that utilize machine learning and the cloud to make complex data processing more accessible to end users, raising numerous new problems in database systems, machine learning, and the cloud.

PVLDB Reference Format:
Tim Kraska, Tianyu Li, Samuel Madden, Markos Markakis, Amadou Ngom, Ziniu Wu, and Geoffrey X. Yu. Check Out the Big Brain on BRAD: Simplifying Cloud Data Processing with Learned Automated Data Meshes. PVLDB, 16(11): 3293 - 3301, 2023.
doi:10.14778/3611479.3611526

* All authors contributed equally to this paper.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 16, No. 11 ISSN 2150-8097.
doi:10.14778/3611479.3611526

1 INTRODUCTION

The last decade has seen an explosion of specialized database engines for both transactional and analytical workloads following the "one size does not fit all" mantra [71]. Today, Amazon Web Services (AWS) alone lists nearly 30 different services under its "Analytics" (e.g., Redshift, EMR, Athena) and "Database" (e.g., Aurora, DynamoDB, DocumentDB) categories. This is because no single system can provide adequate performance for all of an organization's data needs. For example, an S&P 500 corporation we are familiar with uses, among other services, a dozen Amazon Aurora databases for their website and ERP systems, MemoryDB for caching, S3 managed by AWS Lake Formation and queried by AWS EMR for their logs, over ten different Redshift clusters for dashboards and data science, and DocumentDB for content serving. Such Data Mesh architectures [20] are now common in organizations of all sizes.

Building and maintaining such a data mesh is challenging. Experts must pick the right combination of engines based on a deep understanding of the strengths and weaknesses of each engine, devise custom solutions to move data between engines, track data locations and formats, and actively evolve the mesh over time. This leads to highly complex systems that require large teams of skilled engineers to operate. Meanwhile, data mesh users (e.g., data scientists or app developers) often lack the expert knowledge to quickly identify which exact service(s) to use for their purposes, which leads to poor user experience and sub-optimal use of the data mesh. Furthermore, modern data infrastructure is often deployed on the public cloud [27] with fine-grained auto-scaling capabilities [37]. Cloud data mesh users must additionally optimize for cost-efficiency, besides performance. The ensuing complexity is quickly growing beyond human capabilities. Previous efforts have focused on automating individual systems (e.g., auto-scaling data warehouses) [9, 42, 56, 62] or optimizing for a single metric (e.g., knob tuning for performance) [39, 40, 76, 78, 87], whereas we call for a more holistic approach that navigates the complex trade-offs that arise when choosing which systems to use and which data to place on them to minimize costs and/or maximize performance.

In this paper, we argue that the way forward is to build highly autonomous, learning-powered Self-Organizing Data Meshes, and present our vision for the first such system, called BRAD. BRAD uses automation techniques, instead of human experts, to assemble,
optimize, and evolve data meshes in the cloud. Users largely interact with BRAD through a unified interface under the illusion of a single system with one copy of the data and one (SQL-based) API. Under the hood, BRAD uses ML models to extract insights about the strengths and weaknesses of available engines, discover workload patterns, smartly create and evolve the data mesh infrastructure, and optimally distribute the workload among the available engines. If needed, users can bypass the one-size-fits-all interface and directly intervene in some underlying systems while leaving others to BRAD. With BRAD, developers can enjoy increased productivity from a strong abstraction and simple interface, which hides away the management of various specialized systems; organizations can enjoy performance improvements and cost savings as our models uncover insights and adapt the data pipeline at a speed and frequency infeasible for human experts; and database internals developers can enjoy greater impact, as BRAD lowers the cost of innovation adoption by automating workload migration.

BRAD's vision presents several novel technical challenges. BRAD needs a query planner that can cleverly divide work between engines, supported by an accurate, learned model for query performance on different engines. Then, BRAD must leverage sophisticated strategies to navigate the complex trade-off space of data mesh design. To make BRAD practical, we must also develop novel learning techniques to adapt to unseen workloads and deployments and solve challenges around data synchronization and consistency. In this paper, we present the architecture of BRAD, outline our plan to address these challenges, and present promising initial results.

2 MOTIVATION AND BACKGROUND

We will first present examples of counter-intuitive optimizations on a simple data mesh: an OLTP engine and an OLAP engine.

2.1 Motivating Scenarios

OLAP systems are not always better at analytics. Conventional wisdom suggests that for the best performance within this simple data mesh one should execute transactional queries on the OLTP system (e.g., Aurora), analytical queries on the OLAP system (e.g., Redshift), and periodically synchronize between the two. This is not always true; as a counter-example, we run query 19b from the Join-Order Benchmark [46] (an analytical query) on Aurora and Redshift on the IMDB dataset, along with a modified version denoted 19e. Both queries have the same join template, but 19e omits some highly selective filter predicates (on title, cast_info, and name). As shown in Table 1, we observe that Aurora is 17× faster than Redshift on 19b, but 3.7× slower on 19e. This is because the selective filters in query 19b allow Aurora to leverage its indexes for the join. Redshift, lacking indexes, must resort to table scans for both queries. Had one chosen to process 19b on Redshift based on the query type, they would see an order-of-magnitude slowdown.

Table 1: Runtime of two queries on Aurora and Redshift.

The Best Execution Plan may be Federated. The best execution plan for a given query on a data mesh may need to combine the strengths of different engines. To illustrate, we manually split query 19e from Table 1 into two sub-queries: sub-query 1 can be optimally executed using index scans and joins, which only Aurora supports, so we route it to Aurora. We export the results as a CSV file, which we then import into Redshift. Sub-query 2 lacks filter predicates, so it is more efficient to execute on Redshift, a column store. Per Table 1, this joint execution plan is indeed faster than either Aurora or Redshift alone. The reported runtime includes 0.8 seconds to transfer the intermediate results, which could be further optimized.
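The plumbing of this hand-built federated plan is simple to express. The sketch below shows the shape of the split execution in Python; the connection endpoints, the staging bucket, and the write_csv helper are hypothetical, and the sub-query SQL is deliberately elided rather than reproduced:

```python
# Hand-built federated plan for 19e; endpoints, bucket, and write_csv are
# hypothetical, and the actual sub-query SQL is elided.
import psycopg2  # Aurora PostgreSQL and Redshift both speak the Postgres wire protocol

aurora = psycopg2.connect(host="aurora.example.internal", dbname="imdb")
redshift = psycopg2.connect(host="redshift.example.internal", dbname="imdb")

# Sub-query 1: selective predicates let Aurora use index scans and index joins.
with aurora.cursor() as cur:
    cur.execute("SELECT ...")                             # sub-query 1 of 19e
    write_csv("s3://staging/subq1.csv", cur.fetchall())   # hypothetical helper

# Stage the intermediate result in Redshift, then run the scan-heavy part there.
with redshift.cursor() as cur:
    cur.execute("CREATE TEMP TABLE subq1 (...)")          # columns elided
    cur.execute("COPY subq1 FROM 's3://staging/subq1.csv' IAM_ROLE '...' CSV")
    cur.execute("SELECT ...")            # sub-query 2 of 19e, joined against subq1
    result = cur.fetchall()
```

The 0.8 seconds of transfer time reported above corresponds to the export/COPY hop in the middle of this sketch.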
Knowing When to Scale is Non-Trivial. Operating cloud data infrastructure also invites cost optimization, as modern systems support fine-grained resource scaling. Ideally, one would only pay for the resources they need at any given time, but doing so is not easy. To illustrate this, we ran a simple e-commerce workload (consisting of sales transactions and periodic analytical reporting queries), representing a typical company's data needs as it grows, against two setups: a single instance deployment of Amazon RDS PostgreSQL, and a deployment of RDS PostgreSQL and Redshift along with an ETL pipeline that periodically copies the latest writes from RDS into Redshift. The former setup is simpler to maintain and more economical at small workload scales, but the latter setup may perform much better at larger scales. Figure 1 shows our results. The RDS-only deployment starts as the most economical setup, but at a large enough scale (scale factor 8) a combined RDS and Redshift setup becomes cheaper. Importantly, in real cloud deployments, such inflection points tend to be dynamic, subject to changing workloads, pricing models and offerings, etc. Human developers are unlikely to be able to always follow the optimal cost line.

Figure 1: The cost of two setups across workload scales. We label the changes made to maintain latency targets.
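To make the inflection point concrete, consider a toy cost model in the spirit of Figure 1. The dollar figures below are invented placeholders, not measured AWS prices:

```python
# Toy cost curves echoing Figure 1 (all dollar figures are invented):
# the RDS-only setup must keep growing a single instance with the workload,
# while the combined setup pays a fixed Redshift base plus slower RDS growth.
def monthly_cost_rds_only(scale: int) -> float:
    return 150.0 * scale

def monthly_cost_rds_plus_redshift(scale: int) -> float:
    return 420.0 + 90.0 * scale

crossover = next(s for s in range(1, 64)
                 if monthly_cost_rds_plus_redshift(s) < monthly_cost_rds_only(s))
print(f"Combined setup becomes cheaper at scale factor {crossover}")  # -> 8
```

The point of the exercise is that the crossover moves whenever the price or workload coefficients move, which is exactly why a system must track it continuously rather than decide once.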
2.2 The Case for a New Approach

As shown, automated solutions that manage a data mesh and decide how to execute user queries are needed. We argue that we should cast the challenge of hybrid workload processing as automated system composition. BRAD encompasses three novel directions:

• BRAD is a backward-compatible, incrementally-deployable solution on top of existing data meshes. Advanced users can directly access underlying systems where necessary, or use BRAD's programmable policy interface to restrict its interaction with the data mesh. This minimizes impact on legacy workloads and controls the pace of transition to autonomous operation.
Table 2: Differences between BRAD and related work.

System                          | Incremental Adoption | Multiple Data Model Support | Specialized Feat. Support | Autonomous Operation
AlloyDB (HTAP System) [30]      | No                   | No                          | No                        | No
DeltaLake (Lakehouse) [11]      | Some                 | Yes                         | Some                      | No
BigDAWG (Polystore) [25]        | No                   | Some                        | Yes                       | No
Oracle Autonomous Database [61] | No                   | No                          | No                        | Yes
BRAD                            | Yes                  | Yes                         | Some                      | Yes
on different query types. Such information, along with statistics collected by the engines, drives BRAD's cost model. We envision that performance insights are transferable across deployments and workloads on the same engine. It is therefore possible to obtain reasonable cost models through experiments in offline training deployments instead of exploring in production. By collecting large volumes of workload information and performance metrics in the cloud setting, BRAD can avoid relying on human-supplied information (e.g., that AWS Aurora is optimized for transactions) and instead discover such insights from real workloads and environments.

Beyond query execution, BRAD maintains and evolves the data mesh to match workload changes (e.g., business growth or demand spikes). BRAD first uses historical data for workload forecasting; the forecast is used by an intelligent policy engine to trigger necessary actions (e.g., increasing resources for an engine). Perhaps the most important policy decision regards data placement across engines. For example, if a user frequently runs analytics on transactionally hot tables, BRAD may need to replicate the tables in an analytical engine and trigger frequent batch export jobs to keep them in sync. The problems of query planning and mesh optimization constitute a joint optimization problem. For example, BRAD may decide to under-provision Redshift in a mesh and instead route burst workloads to a serverless engine such as Athena. Alternatively, BRAD may over-provision an OLTP system such as Aurora to handle some analytical workloads (e.g., to take advantage of indexes).
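A minimal sketch of this forecast-to-action loop follows; the Forecast fields and the action vocabulary are illustrative, not BRAD's actual interfaces:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Forecast:
    """Output of a workload-forecasting model (fields are illustrative)."""
    analytics_qps: float
    hot_tables_with_analytics: List[str] = field(default_factory=list)

def plan_actions(forecast: Forecast,
                 analytics_capacity_qps: float) -> List[Tuple[str, ...]]:
    """Map a forecast to candidate mesh actions for the policy engine to vet."""
    actions: List[Tuple[str, ...]] = []
    if forecast.analytics_qps > analytics_capacity_qps:
        actions.append(("scale_up", "redshift"))
    for table in forecast.hot_tables_with_analytics:
        # Replicate transactionally hot tables into the analytical engine
        # and schedule batch exports to keep the replicas in sync.
        actions.append(("replicate", table, "redshift"))
    return actions
```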
Lastly, BRAD must be practical: organizations already operate data meshes and want to avoid disruptions. BRAD is designed for compatibility and gradual adoption: one may deploy BRAD on an existing data mesh, and it can immediately start serving users with the single-interface experience after some initial bootstrapping. Meanwhile, legacy workloads can still interact with the underlying engines directly, bypassing BRAD. To aid gradual adoption, BRAD lets users apply policy filters. For example, a user may enforce that some table is always loaded into Redshift, or that Redshift is always provisioned with a minimum amount of resources. Policy filters can also be used to run BRAD in advisory mode, by intercepting migration decisions and asking users for permission. This addresses the corner cases where BRAD is faced with underlying systems with weaker semantics (e.g., DynamoDB) or non-SQL interfaces (e.g., Redis). For example, BRAD cannot unilaterally migrate data from a relational DBMS to a lightweight key-value store, as users may anticipate a future feature that needs strong transactional support.
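What such policy filters might look like as code is sketched below; the brad handle, its method names, and the ask_user helper are all hypothetical illustrations of the programmable policy interface mentioned in Section 2.2:

```python
# Hypothetical policy-filter API; none of these names are BRAD's real interface.
brad.policies.pin_table("orders", engine="redshift")   # table always loaded there
brad.policies.min_provisioning("redshift", nodes=2)    # resource floor

# Advisory mode: intercept migration decisions and ask for permission.
@brad.policies.on_migration
def require_approval(plan) -> bool:
    return ask_user(f"Apply this migration?\n{plan}")  # hypothetical prompt helper
```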
4 RESEARCH DIRECTIONS

4.1 Learned Query Planner

Central to BRAD is a learned query planner that maps queries to execution plans, considering factors like data availability, the strengths of each engine, and each engine's load.

4.1.1 Execution Time Cost Model. Arguably, the core component of a learned planner is a cost model that predicts the query execution time for each of the underlying engines, which the planner can use to route a query to the engine with the lowest predicted execution time. Developing this model poses new research challenges. For example, it is necessary to predict a query's performance on engines and hardware that may not yet have been tested for the user's current workload and dataset. Therefore, we must transcend existing approaches [47, 50, 52, 53, 55, 72], which use previously executed workloads on a specific engine as training data.
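The planner's basic decision rule is easy to state; a minimal sketch follows, in which the engine names and the per-engine model interface are illustrative:

```python
from typing import Callable, Dict

def route(query: str, cost_models: Dict[str, Callable[[str], float]]) -> str:
    """Pick the engine whose learned cost model predicts the lowest runtime."""
    predicted = {engine: model(query) for engine, model in cost_models.items()}
    return min(predicted, key=predicted.get)

# Usage: each model maps a SQL string to a predicted runtime in seconds, e.g.
#   route(sql, {"aurora": aurora_model, "redshift": redshift_model,
#               "athena": athena_model})
```

The hard part, of course, is obtaining cost models that remain accurate on engines, datasets, and hardware they were not trained on, which is the subject of the rest of this section.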
Recently, a dataset-agnostic cost model [33] was proposed to predict the runtime for unseen workloads. However, directly applying this model in BRAD is sub-optimal for three reasons: (i) the model requires the query execution plan as input, which may not be available, since an engine may not support the functionality that produces an execution plan (i.e., EXPLAIN) or may not contain the tables needed to produce one; (ii) the model allows dataset-specific information to leak into its features, hurting performance on unseen workloads; and (iii) the model is tailored to a single-node PostgreSQL engine with fixed hardware, which may not generalize.

To tackle problem (i), we designed a transferable cost model that takes a SQL query as input and outputs its estimated runtime in PostgreSQL. For problem (ii), we provide our cost model with the true cardinalities during the training phase to prevent the model from learning dataset-specific knowledge (e.g., cardinality). Our cost model only needs base-table and pair-wise join cardinalities, which are relatively easy to obtain either from the underlying engines directly or by using a learned cardinality estimator [58, 83].

These two ideas already provide better generalizability than Hilprecht's model [34]. As a preliminary experiment, we use the same datasets and analytical query workloads as in [34]. We train our cost model on 19 datasets and test it on 2,000 analytical queries on the unseen IMDB dataset. For this experiment, we provide our model with the true cardinalities for training and testing queries. In practice, at test time our cost model would not have access to the true cardinality. Therefore, we propose integrating a lightweight cardinality corrector from our recent research [59] that takes DBMS estimates as input and adaptively evolves when observing more queries. Comprehensive experiments [59] have been conducted to show the accuracy and practicality of this cardinality corrector. Our experiments in Figure 3 show the robustness of our cost model relative to Hilprecht's [33]: when trained on queries from 19 datasets with less than 15 s runtime (3(a)) or up to two join predicates (3(b)), our model generalizes to unseen IMDB queries with longer runtimes or more join predicates, respectively (Q-error is defined as max{predicted/true, true/predicted}; better estimates have a Q-error closer to 1).
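For reference, the metric just defined is a one-liner; the implementation below assumes strictly positive runtimes:

```python
def q_error(predicted: float, true: float) -> float:
    """max(predicted/true, true/predicted); 1.0 means a perfect estimate."""
    assert predicted > 0 and true > 0
    return max(predicted / true, true / predicted)
```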
For problem (iii), we use lightweight parameterized functions to predict query performance on different hardware. Figures 3(c) and 3(d) show our ability to accurately predict an example query's runtime on instances with unseen types (3(c)) or node counts (3(d)), given its runtime on current hardware. We are currently integrating all these components to derive an accurate and robust cost model.

Figure 3: Cost model performance overview: (a), (b) Our cost model generalizes to unseen queries with longer runtimes or more join predicates. (c), (d) We accurately predict query runtime under different instance types and/or number of nodes. Panels: (a) Query Runtime, (b) Number of Joins, (c) Instance Type, (d) Number of Nodes.

4.1.2 Cross-Engine Translation. Different component engines of BRAD may support different SQL dialects, data types, or specialized operators and therefore be mutually incompatible. To fully utilize the underlying engines, BRAD must be able to rewrite queries for different engines. Writing manual rules for translation between systems is challenging and error-prone. Recent work in automatic code understanding provides an alternative solution [18, 75]. Specifically, large language models (LLMs) trained on the documentation of each engine can translate special features and dialects between engines.
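A sketch of the shape such a translator could take is given below; complete stands in for any text-completion model (this is not a specific vendor API), and any rewritten query should be validated on the target engine (e.g., via EXPLAIN) before use:

```python
from typing import Callable

def translate_query(sql: str, source_dialect: str, target_dialect: str,
                    complete: Callable[[str], str]) -> str:
    """Ask a text-completion model to rewrite `sql` for another engine."""
    prompt = (
        f"Translate the following {source_dialect} SQL into {target_dialect} SQL, "
        f"preserving its semantics. Rewrite engine-specific functions, data types, "
        f"and operators as needed. Return only the rewritten query.\n\n{sql}"
    )
    return complete(prompt)
```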
of approach is to leverage BRAD's cloud-native setting and learn from both the large corpus of passive observations from client deployments and carefully curated shadow deployments that are able to experiment on what-if scenarios (see Section 4.2.2).

Figure 4 shows an example of BRAD optimizing a data mesh; we plot query latency (top) and monthly Redshift cost (bottom) over time. In this example, a user deploys BRAD on a mesh with Redshift running on one ra3.xlplus node and tells BRAD that their queries should finish within 10 seconds (shaded region in the figure). When BRAD's mesh optimizer runs, it predicts each query's latency across Redshift provisionings using a learned regression model; the model uses the query's measured latency on the current provisioning and the ratios between the hardware resources (vCPUs and amount of memory) across the two compared provisionings. BRAD correctly predicts (the dashed lines on the graph) that all three queries will run under 10 seconds on one dc2.large node—Redshift's most economical instance type. BRAD applies this change and reduces the mesh's monthly Redshift cost by 4× (bottom graph).
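A minimal sketch of such a resource-ratio predictor follows; the feature weights below are illustrative stand-ins for the learned regression parameters:

```python
def predict_latency(measured_s: float,
                    cur_vcpus: int, cur_mem_gib: float,
                    new_vcpus: int, new_mem_gib: float,
                    w_cpu: float = 0.6, w_mem: float = 0.4) -> float:
    """Scale a query's measured latency by a learned combination of the
    resource ratios between the current and candidate provisionings."""
    cpu_ratio = cur_vcpus / new_vcpus
    mem_ratio = cur_mem_gib / new_mem_gib
    return measured_s * (w_cpu * cpu_ratio + w_mem * mem_ratio)

# E.g., moving from one ra3.xlplus node (4 vCPUs, 32 GiB) to one dc2.large
# node (2 vCPUs, 15 GiB) roughly doubles the predicted latency here.
```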
4.2.2 System Exploration and Transfer Learning. BRAD must rely on automated, learning-based methods to explore each engine's strengths and weaknesses due to the sheer number of engines BRAD must support. However, learning requires exploring unseen configurations and execution plans, which may hurt performance in an online environment. We aim to leverage BRAD's cloud-native deployment to mitigate this. Cloud providers have access to traces of many client deployments and therefore large amounts of training data. More importantly, they can transparently capture workload traces and spin up "what-if" shadow deployments or experiments instead of exploring on live deployments. This approach has security and privacy implications, but we see these concerns as orthogonal to our system. The critical challenge is whether our model can efficiently transfer insights to unknown databases and deployments, which we have briefly addressed in Section 4.1.1. We envision that, in its complete form, BRAD is able to automatically incorporate a new engine into the mesh by first obtaining a rough performance model of it through offline deployments running standard benchmarks (e.g., TPC-C and TPC-H), and then fine-tuning using shadow deployments and real client performance data.

4.3 Data Synchronization and Consistency

A key challenge in a data mesh is to correctly synchronize data across component systems and maintain consistency where it matters, without incurring overhead elsewhere. BRAD's automated placement and migration decisions must address this challenge.

4.3.1 Session-Based Freshness Guarantees. Consistency is a natural concern in BRAD, as it encompasses multiple engines that cannot always be synchronized performantly. Since BRAD is externally a unified system, stale reads and distributed anomalies would violate its abstraction. To avoid them, we propose session-based freshness guarantees (similar to Daudjee et al. [19]), where clients issue queries within explicitly defined sessions. Within a session, a query runs against a consistent snapshot of the database, and future queries will run against the same, or a later, snapshot. For example, a session S issuing a large data lake query Q1 may use a snapshot on cloud storage, but if S then issues a transactional update Q2, it is promoted to the latest snapshot. A future analytical query Q3 will wait for the analytic engine to receive Q2's changes. Users can still avoid interleaving analytics/transactions within a session to minimize latency. These guarantees can be achieved through epoch-based logical snapshots [77] and tuple multi-versioning [14]. The challenge is to do so without modifying the underlying engines or introducing excessive runtime overhead.
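A toy model of the session guarantee under the epoch-based snapshot scheme cited above is sketched here; the mesh and engine methods are hypothetical:

```python
class Session:
    """Each session tracks the latest snapshot epoch it has observed; every
    subsequent query must run at that epoch or a later one."""

    def __init__(self, mesh):
        self.mesh = mesh    # hypothetical handle to BRAD's engines and router
        self.min_epoch = 0

    def query(self, sql: str):
        engine = self.mesh.route(sql)
        engine.wait_until_synced(self.min_epoch)  # block if the replica is stale
        result, epoch = engine.run_at_snapshot(sql, min_epoch=self.min_epoch)
        self.min_epoch = max(self.min_epoch, epoch)
        return result

    def update(self, sql: str):
        # Writes promote the session to the latest snapshot; later queries on
        # any engine wait until that epoch has been replicated there.
        self.min_epoch = self.mesh.transactional_engine.commit(sql)
```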
4.3.2 Auto-ETL. In addition to providing consistency guarantees across table replicas, BRAD needs to support more complex data dependency relationships between tables—typically handled by extract, transform, and load (ETL) jobs today. For example, ETL jobs may be used to transform the tables in a transactional DBMS before loading them into the data warehouse (e.g., to de-normalize the tables, re-arrange the tables in a star schema [60], or to compute aggregate statistics). Currently, users often rely on handcrafted transformation logic and ETL frameworks such as AWS Glue [10] or EMR [6]. This setup is both tedious for users and restrictive for BRAD, as users typically hard-code the source and destination systems of such transformations in black-box logic—preventing BRAD from freely placing tables. Instead, we envision that BRAD will support a higher-level declarative API for specifying table dependencies (e.g., table B is obtained by running the given SQL statements on table A), which allows BRAD to (i) change the locations of the inputs and outputs to a transformation (e.g., to migrate a table off of an engine), and (ii) select the system(s) on which to execute a transform (e.g., using spare capacity on Redshift or new features such as zero-ETL [4] instead of AWS Glue or EMR).
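A sketch of what such a declarative dependency might look like is shown below; the brad handle and argument names are hypothetical, and the point is that the user states only the transform, not where it runs:

```python
# Hypothetical declarative dependency: the user says *what* the derived table
# is; BRAD decides where the inputs live and which system runs the transform.
brad.define_table(
    name="orders_denormalized",        # "table B"
    inputs=["orders", "customers"],    # "table A" and friends
    transform="""
        SELECT o.*, c.name, c.segment
        FROM orders o JOIN customers c ON o.customer_id = c.id
    """,
    max_staleness="15 minutes",        # freshness target for the derived table
)
```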
5 CONCLUSION

BRAD shows a new way to assemble and operate data meshes in the cloud, relying on recent advances in automation techniques instead of human experts. For the vast majority of end users, BRAD significantly simplifies the operation of state-of-the-art data meshes and allows easier derivation of timely insights from vast amounts of data. For database researchers, BRAD lowers the barrier of adoption by providing room for automated and user-transparent migration to new engines where appropriate. This paper outlines our plan to build BRAD and presents preliminary results to show the promise of our approach. If successful, we expect BRAD to unlock the true potential of the last decade's research into specialized data systems and have a significant impact on the efficiency of modern enterprises.

ACKNOWLEDGMENTS

This research was supported by Amazon, Google, and Intel as part of the MIT Data Systems and AI Lab (DSAIL) at MIT, and by NSF IIS 1900933. Geoffrey X. Yu was partially supported by an NSERC PGS D. This research was also sponsored by the United States Air Force Research Laboratory and the Department of the Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Department of the Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
REFERENCES
[1] Michael Abebe, Horatiu Lazu, and Khuzaima Daudjee. 2022. Proteus: Autonomous Adaptive Storage for Mixed Workloads. In Proceedings of the 2022 International Conference on Management of Data (Philadelphia, PA, USA) (SIGMOD '22). Association for Computing Machinery, New York, NY, USA, 700–714. https://doi.org/10.1145/3514221.3517834
[2] Divy Agrawal, Sanjay Chawla, Bertty Contreras-Rojas, Ahmed Elmagarmid, Yasser Idris, Zoi Kaoudi, Sebastian Kruse, Ji Lucas, Essam Mansour, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Saravanan Thirumuruganathan, and Anis Troudi. 2018. RHEEM: Enabling Cross-Platform Data Processing: May the Big Data Be with You! Proceedings of the VLDB Endowment 11, 11 (July 2018), 1414–1427. https://doi.org/10.14778/3236187.3236195
[3] Rana Alotaibi, Damian Bursztyn, Alin Deutsch, Ioana Manolescu, and Stamatis Zampetakis. 2019. Towards Scalable Hybrid Stores: Constraint-Based Rewriting to the Rescue. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19). 1660–1677.
[4] Amazon Web Services. 2022. AWS announces Amazon Aurora zero-ETL integration with Amazon Redshift. https://aws.amazon.com/about-aws/whats-new/2022/11/amazon-aurora-zero-etl-integration-redshift/.
[5] Amazon Web Services. 2023. Amazon Athena. https://aws.amazon.com/athena/.
[6] Amazon Web Services. 2023. Amazon EMR. https://aws.amazon.com/emr/.
[7] Amazon Web Services. 2023. Amazon Redshift Serverless. https://aws.amazon.com/redshift/redshift-serverless/.
[8] Amazon Web Services. 2023. AWS Step Functions. https://aws.amazon.com/step-functions/.
[9] Amazon Web Services. 2023. Redshift Concurrency Scaling. https://docs.aws.amazon.com/redshift/latest/dg/concurrency-scaling.html.
[10] Amazon Web Services. 2023. What is AWS Glue? https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html.
[11] Michael Armbrust, Tathagata Das, Liwen Sun, Burak Yavuz, Shixiong Zhu, Mukul Murthy, Joseph Torres, Herman van Hovell, Adrian Ionescu, Alicja Łuszczak, Michał Świtakowski, Michał Szafrański, Xiao Li, Takuya Ueshin, Mostafa Mokhtar, Peter Boncz, Ali Ghodsi, Sameer Paranjpye, Pieter Senster, Reynold Xin, and Matei Zaharia. 2020. Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. Proceedings of the VLDB Endowment 13, 12 (2020), 3411–3424. https://doi.org/10.14778/3415478.3415560
[12] Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia. 2021. Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. In Proceedings of the 11th Annual Conference on Innovative Data Systems Research (CIDR '21).
[13] Graham Bent, Patrick Dantressangle, David Vyvyan, Abbe Mowshowitz, and Valia Mitsou. 2008. A Dynamic Distributed Federated Database. In Proc. 2nd Ann. Conf. International Technology Alliance (ACITA '08).
[14] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. 1987. Concurrency Control and Recovery in Database Systems. Addison-Wesley.
[15] Yuri Breitbart, Hector Garcia-Molina, and Abraham Silberschatz. 1992. Overview of Multidatabase Transaction Management. VLDB Journal 1 (10 1992), 181–239. https://doi.org/10.1145/1925805.1925811
[16] Yuri Breitbart and Avi Silberschatz. 1988. Multidatabase Update Issues. In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data (Chicago, Illinois, USA) (SIGMOD '88). Association for Computing Machinery, New York, NY, USA, 135–142. https://doi.org/10.1145/50202.50217
[17] Sebastian Burckhardt, Chris Gillum, David Justo, Konstantinos Kallas, Connor McMahon, and Christopher S. Meiklejohn. 2021. Durable Functions: Semantics for Stateful Serverless. Proc. ACM Program. Lang. 5, OOPSLA (2021), 1–27.
[18] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG]
[19] Khuzaima Daudjee and Kenneth Salem. 2006. Lazy Database Replication with Snapshot Isolation. Proceedings of the VLDB Endowment (VLDB '06).
[20] Z. Dehghani. 2022. Data Mesh. O'Reilly Media. https://books.google.com/books?id=jmZjEAAAQBAJ
[21] Amol Deshpande and Joseph M. Hellerstein. 2002. Decoupled Query Optimization for Federated Database Systems. In Proceedings of the 18th International Conference on Data Engineering (ICDE '02). IEEE, 716–727.
[22] Jialin Ding, Umar Farooq Minhas, Badrish Chandramouli, Chi Wang, Yinan Li, Ying Li, Donald Kossmann, Johannes Gehrke, and Tim Kraska. 2021. Instance-Optimized Data Layouts for Cloud Analytics Workloads. In Proceedings of the 2021 International Conference on Management of Data (Virtual Event, China) (SIGMOD '21). Association for Computing Machinery, New York, NY, USA, 418–431. https://doi.org/10.1145/3448016.3457270
[23] Jialin Ding, Vikram Nathan, Mohammad Alizadeh, and Tim Kraska. 2020. Tsunami: A Learned Multi-Dimensional Index for Correlated Data and Skewed Workloads. Proceedings of the VLDB Endowment 14, 2 (November 2020), 74–86. https://doi.org/10.14778/3425879.3425880
[24] Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, Magda Balazinska, Bill Howe, Jeremy Kepner, Sam Madden, David Maier, Tim Mattson, and Stan Zdonik. 2015. The BigDAWG Polystore System. SIGMOD Rec. 44, 2 (August 2015), 11–16. https://doi.org/10.1145/2814710.2814713
[25] Aaron J. Elmore, Jennie Duggan, Mike Stonebraker, Magdalena Balazinska, Ugur Çetintemel, Vijay Gadepally, Jeffrey Heer, Bill Howe, Jeremy Kepner, Tim Kraska, Samuel Madden, David Maier, Timothy G. Mattson, Stavros Papadopoulos, Jeff Parkhurst, Nesime Tatbul, Manasi Vartak, and Stan Zdonik. 2015. A Demonstration of the BigDAWG Polystore System. Proceedings of the VLDB Endowment 8, 12 (2015), 1908–1911. http://www.vldb.org/pvldb/vol8/p1908-Elmore.pdf
[26] Franz Färber, Norman May, Wolfgang Lehner, Philipp Große, Ingo Müller, Hannes Rauhe, and Jonathan Dees. 2012. The SAP HANA Database – An Architecture Overview. IEEE Data Eng. Bull. 35 (03 2012), 28–33.
[27] Gartner. 2022. DBMS Market Transformation 2021: The Big Picture. https://blogs.gartner.com/merv-adrian/2022/04/16/dbms-market-transformation-2021-the-big-picture/.
[28] Dimitrios Georgakopoulos, Marek Rusinkiewicz, and Amit P. Sheth. 1991. On Serializability of Multidatabase Transactions Through Forced Local Conflicts. In Proceedings of the Seventh International Conference on Data Engineering (ICDE '91). IEEE Computer Society, USA, 314–323.
[29] Victor Giannakouris and Immanuel Trummer. 2022. Building Learned Federated Query Optimizers. In CEUR Workshop Proceedings, Vol. 3186.
[30] Google, Inc. 2023. AlloyDB. https://cloud.google.com/alloydb.
[31] Laura Haas, Donald Kossmann, Edward Wimmers, and Jun Yang. 1997. Optimizing Queries Across Diverse Data Sources. In Proceedings of the VLDB Endowment (VLDB '97).
[32] Joachim Hammer, Hector Garcia-Molina, Kelly Ireland, Yannis Papakonstantinou, Jeffrey Ullman, and Jennifer Widom. 1995. Information Translation, Mediation, and Mosaic-Based Browsing in the TSIMMIS System. In Proceedings of the International Conference on Management of Data (SIGMOD '95).
[33] Benjamin Hilprecht and Carsten Binnig. 2022. Zero-Shot Cost Models for Out-of-the-box Learned Cost Prediction. arXiv preprint arXiv:2201.00561 (2022).
[34] Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, and Carsten Binnig. 2019. DeepDB: Learn from Data, not from Queries! arXiv preprint arXiv:1909.00607 (2019).
[35] Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang, Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang, Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian Zhang, Jianjun Li, Xuelian Wu, Lingyu Song, Ruoxi Sun, Shuaipeng Yu, Lei Zhao, Nicholas Cameron, Liquan Pei, and Xin Tang. 2020. TiDB: A Raft-Based HTAP Database. Proceedings of the VLDB Endowment 13, 12 (August 2020), 3072–3084. https://doi.org/10.14778/3415478.3415535
[36] S.-Y. Hwang, E.-P. Lim, H.-R. Yang, S. Musukula, K. Mediratta, M. Ganesh, D. Clements, J. Stenoien, and J. Srivastava. 1994. The MYRIAD Federated Database Prototype. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data (Minneapolis, Minnesota, USA) (SIGMOD '94). Association for Computing Machinery, New York, NY, USA, 518. https://doi.org/10.1145/191839.191986
[37] Eric Jonas, Johann Schleier-Smith, Vikram Sreekanti, Chia-che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, Joao Carreira, Karl Krauth, Neeraja Jayant Yadwadkar, Joseph Gonzalez, Raluca A. Popa, Ion Stoica, and David A. Patterson. 2019. Cloud Programming Simplified: A Berkeley View on Serverless Computing. arXiv abs/1902.03383 (2019).
[38] Vanja Josifovski, Peter Schwarz, Laura Haas, and Eileen Lin. 2002. Garlic: A New Flavor of Federated Query Processing for DB2. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (SIGMOD '02). 524–532.
[39] Konstantinos Kanellis, Ramnatthan Alagappan, and Shivaram Venkataraman. 2020. Too Many Knobs to Tune? Towards Faster Database Tuning by Pre-selecting Important Knobs. In 12th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage '20).
[40] Konstantinos Kanellis, Cong Ding, Brian Kroth, Andreas Müller, Carlo Curino, and Shivaram Venkataraman. 2022. LlamaTune: Sample-Efficient DBMS Configuration Tuning. Proceedings of the VLDB Endowment 15, 11 (2022), 2953–2965.
[41] Alfons Kemper and Thomas Neumann. 2011. HyPer: A Hybrid OLTP & OLAP Main Memory Database System Based on Virtual Memory Snapshots. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering (ICDE '11). IEEE Computer Society, USA, 195–206. https://doi.org/10.1109/ICDE.2011.5767867
[42] Tim Kraska, Mohammad Alizadeh, Alex Beutel, Ed H. Chi, Ani Kristo, Guillaume Leclerc, Samuel Madden, Hongzi Mao, and Vikram Nathan. 2019. SageDB: A Learned Database System. In 9th Biennial Conference on Innovative Data Systems Research (CIDR '19), Asilomar, CA, USA, January 13-16, 2019, Online Proceedings. www.cidrdb.org. http://cidrdb.org/cidr2019/papers/p117-kraska-cidr19.pdf
[43] Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. 2017. The Case for Learned Index Structures. CoRR abs/1712.01208 (2017). arXiv:1712.01208 http://arxiv.org/abs/1712.01208
[44] Sanjay Krishnan, Zongheng Yang, Ken Goldberg, Joseph Hellerstein, and Ion Stoica. 2018. Learning to Optimize Join Queries with Deep Reinforcement Learning. arXiv preprint arXiv:1808.03196 (2018).
[45] Tirthankar Lahiri, Shasank Chavan, Maria Colgan, Dinesh Das, Amit Ganesh, Mike Gleeson, Sanket Hase, Allison Holloway, Jesse Kamp, Teck-Hua Lee, Juan Loaiza, Neil Macnaughton, Vineet Marwah, Niloy Mukherjee, Atrayee Mullick, Sujatha Muthulingam, Vivekanandhan Raja, Marty Roth, Ekrem Soylemez, and Mohamed Zait. 2015. Oracle Database In-Memory: A Dual Format In-Memory Database. In 2015 IEEE 31st International Conference on Data Engineering (ICDE '15). 1253–1258. https://doi.org/10.1109/ICDE.2015.7113373
[46] Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How Good are Query Optimizers, Really? Proceedings of the VLDB Endowment 9, 3 (2015), 204–215.
[47] Jiexing Li, Arnd Christian König, Vivek Narasayya, and Surajit Chaudhuri. 2012. Robust Estimation of Resource Consumption for SQL Queries Using Statistical Techniques. Proceedings of the VLDB Endowment 5, 11 (2012).
[48] Ee-Peng Lim and Jaideep Srivastava. 1993. Query Optimization and Processing in Federated Database Systems. In Proceedings of the Second International Conference on Information and Knowledge Management (CIKM '93). 720–722.
[49] Wan Shen Lim, Matthew Butrovich, William Zhang, Andrew Crotty, Lin Ma, Peijing Xu, Johannes Gehrke, and Andrew Pavlo. 2023. Database Gyms. In Conference on Innovative Data Systems Research (CIDR '23).
[50] Lin Ma, Bailu Ding, Sudipto Das, and Adith Swaminathan. 2020. Active Learning for ML Enhanced Database Systems. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD '20). 175–191.
[51] Lin Ma, Dana Van Aken, Ahmed Hefny, Gustavo Mezerhane, Andrew Pavlo, and Geoffrey J. Gordon. 2018. Query-Based Workload Forecasting for Self-Driving Database Management Systems. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). 631–645.
[52] Ryan Marcus, Parimarjan Negi, Hongzi Mao, Nesime Tatbul, Mohammad Alizadeh, and Tim Kraska. 2022. Bao: Making Learned Query Optimization Practical. In Proceedings of the International Conference on Management of Data (SIGMOD '22).
[53] Ryan Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, and Nesime Tatbul. 2019. Neo: A Learned Query Optimizer. Proceedings of the VLDB Endowment 12, 11 (2019).
[54] Ryan Marcus and Olga Papaemmanouil. 2018. Deep Reinforcement Learning for Join Order Enumeration. In Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM '18).
[55] Ryan Marcus and Olga Papaemmanouil. 2019. Plan-Structured Deep Neural Network Models for Query Performance Prediction. Proceedings of the VLDB Endowment 12, 11 (2019).
[56] Microsoft Corporation. 2023. Serverless Compute Tier for Azure SQL Database. https://learn.microsoft.com/en-us/azure/azure-sql/database/serverless-tier-overview?view=azuresql&tabs=general-purpose.
[57] Vikram Nathan, Jialin Ding, Mohammad Alizadeh, and Tim Kraska. 2020. Learning Multi-Dimensional Indexes. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (Portland, OR, USA) (SIGMOD '20). Association for Computing Machinery, New York, NY, USA, 985–1000. https://doi.org/10.1145/3318464.3380579
[58] Parimarjan Negi, Ryan Marcus, Andreas Kipf, Hongzi Mao, Nesime Tatbul, Tim Kraska, and Mohammad Alizadeh. 2021. Flow-Loss: Learning Cardinality Estimates That Matter. Proceedings of the VLDB Endowment 14, 11 (2021).
[59] Parimarjan Negi, Ziniu Wu, Andreas Kipf, Nesime Tatbul, Ryan Marcus, Sam Madden, Tim Kraska, and Mohammad Alizadeh. 2023. Robust Query Driven Cardinality Estimation under Changing Workloads. Proceedings of the VLDB Endowment 16, 6 (2023), 1520–1533.
[60] Patrick O'Neil, Betty O'Neil, and Xuedong Chen. 2006. Star Schema Benchmark. Technical Report. University of Massachusetts Boston. https://www.cs.umb.edu/~poneil/StarSchemaB.PDF.
[61] Oracle. 2023. Oracle Autonomous Database. https://www.oracle.com/autonomous-database/.
[62] Andrew Pavlo, Gustavo Angulo, Joy Arulraj, Haibin Lin, Jiexi Lin, Lin Ma, Prashanth Menon, Todd Mowry, Matthew Perron, Ian Quah, Siddharth Santurkar, Anthony Tomasic, Skye Toor, Dana Van Aken, Ziqi Wang, Yingjun Wu, Ran Xian, and Tieying Zhang. 2017. Self-Driving Database Management Systems. In Conference on Innovative Data Systems Research (CIDR '17). https://db.cs.cmu.edu/papers/2017/p42-pavlo-cidr17.pdf
[63] Andrew Pavlo, Matthew Butrovich, Ananya Joshi, Lin Ma, Prashanth Menon, Dana Van Aken, Lisa Lee, and Ruslan Salakhutdinov. 2019. External vs. Internal: An Essay on Machine Learning Agents for Autonomous Database Management Systems. IEEE Data Engineering Bulletin (June 2019), 32–46. https://db.cs.cmu.edu/papers/2019/pavlo-icde-bulletin2019.pdf
[64] Andrew Pavlo, Matthew Butrovich, Lin Ma, Wan Shen Lim, Prashanth Menon, Dana Van Aken, and William Zhang. 2021. Make Your Database System Dream of Electric Sheep: Towards Self-Driving Operation. Proceedings of the VLDB Endowment 14, 12 (2021), 3211–3221. https://db.cs.cmu.edu/papers/2021/p3211-pavlo.pdf
[65] Maksim Podkorytov and Michael Gubanov. 2019. Hybrid.Poly: A Consolidated Interactive Analytical Polystore System. In 2019 IEEE 35th International Conference on Data Engineering (ICDE '19). 1996–1999. https://doi.org/10.1109/ICDE.2019.00223
[66] Calton Pu. 1988. Superdatabases for Composition of Heterogeneous Databases. In Proceedings of the Fourth International Conference on Data Engineering. IEEE Computer Society, USA, 548–555.
[67] Mary Tork Roth, Laura M. Haas, and Fatma Ozcan. 1999. Cost Models Do Matter: Providing Cost Information for Diverse Data Sources in a Federated System. IBM Thomas J. Watson Research Division.
[68] P. Griffiths Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. 1979. Access Path Selection in a Relational Database Management System. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data (Boston, Massachusetts) (SIGMOD '79). Association for Computing Machinery, New York, NY, USA, 23–34. https://doi.org/10.1145/582095.582099
[69] Amit P. Sheth and James A. Larson. 1990. Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Computing Surveys (CSUR) 22, 3 (1990), 183–236.
[70] Vishal Sikka, Franz Färber, Wolfgang Lehner, Sang Kyun Cha, Thomas Peh, and Christof Bornhövd. 2012. Efficient Transaction Processing in SAP HANA Database: The End of a Column Store Myth. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (Scottsdale, Arizona, USA) (SIGMOD '12). Association for Computing Machinery, New York, NY, USA, 731–742. https://doi.org/10.1145/2213836.2213946
[71] Michael Stonebraker and Ugur Cetintemel. 2005. "One Size Fits All": An Idea Whose Time Has Come and Gone. In Proceedings of the 21st International Conference on Data Engineering (ICDE '05). IEEE Computer Society, USA, 2–11. https://doi.org/10.1109/ICDE.2005.1
[72] Ji Sun and Guoliang Li. 2019. An End-to-End Learning-based Cost Estimator. Proceedings of the VLDB Endowment 13, 3 (2019).
[73] Rebecca Taft, Nosayba El-Sayed, Marco Serafini, Yu Lu, Ashraf Aboulnaga, Michael Stonebraker, Ricardo Mayerhofer, and Francisco Andrade. 2018. P-Store: An Elastic Database System with Predictive Provisioning. In Proceedings of the 2018 International Conference on Management of Data (Houston, TX, USA) (SIGMOD '18). Association for Computing Machinery, New York, NY, USA, 205–219. https://doi.org/10.1145/3183713.3190650
[74] Anthony Tomasic, Remy Amouroux, Philippe Bonnet, Olga Kapitskaia, Hubert Naacke, and Louiqa Raschid. 1997. The Distributed Information Search Component (Disco) and the World Wide Web. ACM SIGMOD Record 26, 2 (1997), 546–548.
[75] Immanuel Trummer. 2022. CodexDB: Generating Code for Processing SQL Queries using GPT-3 Codex. arXiv:2204.08941 [cs.DB]
[76] Immanuel Trummer. 2022. DB-BERT: A Database Tuning Tool That "Reads the Manual". In Proceedings of the 2022 International Conference on Management of Data (Philadelphia, PA, USA) (SIGMOD '22). Association for Computing Machinery, New York, NY, USA, 190–203. https://doi.org/10.1145/3514221.3517843
[77] Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Madden. 2013. Speedy Transactions in Multicore In-Memory Databases. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). 18–32.
[78] Dana Van Aken, Andrew Pavlo, Geoffrey J. Gordon, and Bohan Zhang. 2017. Automatic Database Management System Tuning Through Large-Scale Machine Learning. In Proceedings of the 2017 ACM International Conference on Management of Data (Chicago, Illinois, USA) (SIGMOD '17). Association for Computing Machinery, New York, NY, USA, 1009–1024. https://doi.org/10.1145/3035918.3064029
[79] Marco Vogt, Alexander Stiemer, and Heiko Schuldt. 2018. Polypheny-DB: Towards a Distributed and Self-Adaptive Polystore. In 2018 IEEE International Conference on Big Data (Big Data). IEEE, 3364–3373.
[80] Jingjing Wang, Tobin Baker, Magdalena Balazinska, Daniel Halperin, Brandon Haynes, Bill Howe, Dylan Hutchison, Shrainik Jain, Ryan Maas, Parmita Mehta, Dominik Moritz, Brandon Myers, Jennifer Ortiz, Dan Suciu, Andrew Whitaker, and Shengliang Xu. 2017. The Myria Big Data Management and Analytics System and Cloud Services. In Proceedings of the Conference on Innovative Data Systems Research (CIDR '17).
[81] Christopher J. C. H. Watkins and Peter Dayan. 1992. Q-learning. Machine Learning 8, 3 (1992), 279–292. https://doi.org/10.1007/BF00992698
[82] Ronald J. Williams. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Mach. Learn. 8, 3–4 (May 1992), 229–256. https://doi.org/10.1007/BF00992696
[83] Ziniu Wu, Parimarjan Negi, Mohammad Alizadeh, Tim Kraska, and Samuel Madden. 2023. FactorJoin: A New Cardinality Estimation Framework for Join Queries. Proc. ACM Manag. Data 1, 1, Article 41 (May 2023), 27 pages. https://doi.org/10.1145/3588721
[84] Geoffrey X. Yu, Markos Markakis, Andreas Kipf, Per-Åke Larson, Umar Farooq Minhas, and Tim Kraska. 2022. TreeLine: An Update-In-Place Key-Value Store for Modern Storage. Proceedings of the VLDB Endowment 16, 1 (2022), 99–112.
[85] Xiang Yu, Guoliang Li, Chengliang Chai, and Nan Tang. 2020. Reinforcement Learning with Tree-LSTM for Join Order Selection. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, 1297–1308.
[86] Jianqiu Zhang, Kaisong Huang, Tianzheng Wang, and King Lv. 2022. Skeena: Efficient and Consistent Cross-Engine Transactions. In Proceedings of the 2022 International Conference on Management of Data (Philadelphia, PA, USA) (SIGMOD '22). Association for Computing Machinery, New York, NY, USA, 34–48. https://doi.org/10.1145/3514221.3526171
[87] Ji Zhang, Yu Liu, Ke Zhou, Guoliang Li, Zhili Xiao, Bin Cheng, Jiashu Xing, Yangtao Wang, Tianheng Cheng, Li Liu, Minwei Ran, and Zekang Li. 2019. An End-to-End Automatic Cloud Database Tuning System Using Deep Reinforcement Learning. In Proceedings of the 2019 International Conference on Management of Data (Amsterdam, Netherlands) (SIGMOD '19). Association for Computing Machinery, New York, NY, USA, 415–432. https://doi.org/10.1145/3299869.3300085
[88] Xiuwen Zheng, Subhasis Dasgupta, Arun Kumar, and Amarnath Gupta. 2022. AWESOME: Empowering Scalable Data Science on Social Media Data with an Optimized Tri-Store Data System. arXiv:2112.00833 [cs.DB]