Optimizing Distributed Protocols with Query Rewrites
Distributed protocols such as 2PC and Paxos lie at the core of many systems in the cloud, but standard
implementations do not scale. New scalable distributed protocols are developed through careful analysis and
rewrites, but this process is ad hoc and error-prone. This paper presents an approach for scaling any distributed
protocol by applying rule-driven rewrites, borrowing from query optimization. Distributed protocol rewrites
entail a new burden: reasoning about spatiotemporal correctness. We leverage order-insensitivity and data
dependency analysis to systematically identify correct coordination-free scaling opportunities. We apply
this analysis to create preconditions and mechanisms for coordination-free decoupling and partitioning, two
fundamental vertical and horizontal scaling techniques. Manual rule-driven applications of decoupling and
partitioning improve the throughput of 2PC by 5× and Paxos by 3×, and match state-of-the-art throughput in
recent work. These results point the way toward automated optimizers for distributed protocols based on
correct-by-construction rewrite rules.
Additional Key Words and Phrases: Distributed Systems, Query Optimization, Paxos, 2PC, Relational Algebra,
Datalog, Partitioning, Dataflow, Monotonicity
Authors’ addresses: David C. Y. Chu, University of California, Berkeley, USA, [email protected]; Rithvik
Panchapakesan, University of California, Berkeley, USA, [email protected]; Shadaj Laddad, University of California,
Berkeley, USA, [email protected]; Lucky E. Katahanas, Sutter Hill Ventures, USA, [email protected]; Chris Liu, University
of California, Berkeley, USA, [email protected]; Kaushik Shivakumar, University of California, Berkeley, USA,
[email protected]; Natacha Crooks, University of California, Berkeley, USA, [email protected]; Joseph M.
Hellerstein, University of California, Berkeley, USA and Sutter Hill Ventures, USA, [email protected]; Heidi Howard,
Azure Research, Microsoft, UK, [email protected].
This work is licensed under a Creative Commons Attribution International 4.0 License.
David C. Y. Chu, Rithvik Panchapakesan, Shadaj Laddad, Lucky E. Katahanas, Chris Liu, Kaushik Shivakumar,
Natacha Crooks, Joseph M. Hellerstein, and Heidi Howard. 2024. Optimizing Distributed Protocols with Query
Rewrites. Proc. ACM Manag. Data 2, N1 (SIGMOD), Article 2 (February 2024), 25 pages. https://doi.org/10.1145/3639257
1 INTRODUCTION
Promises of better cost and scalability have driven the migration of database systems to the cloud.
Yet, the distributed protocols at the core of these systems, such as 2PC [46] or Paxos [43], are not
designed to scale: when the number of machines grows, overheads often increase and throughput
drops. As such, there has been a wealth of research on developing new, scalable distributed protocols.
Unfortunately, each new design requires careful examination of prior work and new correctness
proofs; the process is ad hoc and often error-prone [2, 35, 51, 53, 57, 62]. Moreover, due to the
heterogeneity of proposed approaches, each new insight is localized to its particular protocol and
cannot easily be composed with other efforts.
This paper offers an alternative approach. Instead of creating new distributed protocols from scratch,
we formalize scalability optimizations into rule-driven rewrites that are correct by construction and
can be applied to any distributed protocol.
To rewrite distributed protocols, we take a page from traditional SQL query optimizations. Prior
work has shown that distributed protocols can be expressed declaratively as sets of queries in
a SQL-like language such as Dedalus [6], which we adopt here. Applying query optimization to
these protocols thus seems like an appealing way forward. Doing so correctly, however, requires
care, as the domain of distributed protocols requires optimizer transformations whose correctness
is subtler than classical matters like the associativity and commutativity of join. In particular,
transformations to scale across machines must reason about program equivalence in the face of
changes to spatiotemporal semantics like the order of data arrivals and the location of state.
We focus on applying two fundamental scaling optimizations in this paper: decoupling and par-
titioning, which correspond to vertical and horizontal scaling. We target these two techniques
because (1) they can be generalized across protocols and (2) they were recently shown by Whittaker
et al. [63] to achieve state-of-the-art throughput on complex distributed protocols such as Paxos.
While Whittaker’s rewrites are handcrafted specifically for Paxos, our goal is to rigorously define
the general preconditions and mechanics for decoupling and partitioning, so they can be used to
correctly rewrite any distributed protocol.
Decoupling improves scalability by spreading logic across machines to take advantage of additional
physical resources and pipeline parallel computation. Decoupling rewrites data dependencies on
a single node into messages that are sent via asynchronous channels between nodes. Without
coordination, the original timing and ordering of messages cannot be guaranteed once these
channels are introduced. To preserve correctness without introducing coordination, we decouple
sub-components that produce the same responses regardless of message ordering or timing: these
sub-components are order-insensitive. Order-insensitivity is easy to systematically identify in
Dedalus thanks to its relational model: Dedalus programs are an (unordered) set of queries over
(unordered) relations, so the logic for ordering—time, causality, log sequence numbers—is the
exception, not the norm, and easy to identify. By leaving the logic that explicitly relies on order
in place, we can decouple the remaining order-insensitive sub-components without coordination.
Partitioning improves scalability by spreading state across machines and parallelizing compute, a
technique widely used in query processing [22, 25]. Textbook discussions focus on partitioning data
to satisfy a single query operator like join or group-by. If the next operator downstream requires a
different partitioning, then data must be forwarded or “shuffled” across the network. We would
like to partition data in such a way that entire sub-programs can compute on local data without
reshuffling. We leverage relational techniques like functional dependency analysis to find data
partitioning schemes that can allow as much code as possible to work on local partitions without
reshuffling between operators. This is a benefit of choosing to express distributed protocols in the
relational model: functional dependencies are far easier to identify in a relational language than a
procedural language.
We demonstrate the generality of our optimizations by methodically applying rewrites to three
seminal distributed protocols: voting, 2PC, and Paxos. We specifically target Paxos [59] as it is a
protocol with many distributed invariants and it is challenging to verify [31, 66, 67]. The throughput
of the optimized voting, 2PC, and Paxos protocols scales by 2×, 5×, and 3× respectively, a scale-up
factor that matches the performance of ad hoc rewrites [63] once the underlying language of each
implementation is accounted for, and achieves state-of-the-art performance for Paxos.
Our correctness arguments focus on the equivalence of localized, “peephole” optimizations of
dataflow graphs. Traditional protocol optimizations often make wholesale modifications to protocol
logic and therefore require holistic reasoning to prove correctness. We take a different approach.
Our rewrite rules modify existing programs with small local changes, each of which is proven to
preserve semantics. As a result, each rewritten subprogram is provably indistinguishable to an
observer (or client) from the original. We do not need to prove that holistic protocol invariants are
preserved—they must be. Moreover, because rewrites are local and preserve semantics, they can be
composed to produce protocols with multiple optimizations, as we demonstrate in Section 5.2.
Our local-first approach naturally has a potential cost: because it treats the initial implementation
as “law”, it cannot distinguish between true protocol invariants and implementation artifacts, which
limits the space of potential optimizations. Nonetheless,
we find that, when applying our results to seminal distributed system algorithms, we easily match
the results of their (manually proven) optimized implementations.
In summary, we make the following contributions:
(1) We present the preconditions and mechanisms for applying multiple correct-by-construction
rewrites of two fundamental transformations: decoupling and partitioning.
(2) We demonstrate the application of these rule-driven rewrites by manually applying them to
complex distributed protocols such as Paxos.
(3) We evaluate our optimized programs and observe 2–5× improvements in throughput across
protocols, with state-of-the-art throughput for Paxos, validating the role of correct-by-construction
rewrites for distributed protocols.
Due to a lack of space, the full precondition, mechanism, and proof of correctness of each rewrite
in this paper can be found in the technical report [16].
2 BACKGROUND
Our contributions begin with the program rewriting rules in Section 3. Naturally, the correctness of
those rules depends on the details of the language we are rewriting, Dedalus. Hence in this section
Fig. 1. Dataflow diagram for a verifiably-replicated KVS. Edges are labeled with corresponding line numbers;
dashed edges represent asynchronous channels. Each gray bounding box represents a node; select nodes’
dataflows are presented.
we pause to review the syntax and semantics of Dedalus, as well as additional terminology we will
use in subsequent discussion.
Dedalus is a spatiotemporal logic for distributed systems [6]. As we will see in Section 2.3, Dedalus
captures specifications for the state, computation and messages of a set of distributed nodes over
time. Each node (a.k.a. machine, thread) has its own explicit “clock” that marks out local time
sequentially. Dedalus (and hence our work here) assumes a standard asynchronous model in which
messages between correct nodes can be arbitrarily delayed and reordered, but must eventually be
delivered, possibly after an unbounded delay [24].
Dedalus is a dialect of Datalog¬ , which is itself a SQL-like declarative logic language that supports
familiar constructs like joins, selection, and projection, with additional support for recursion,
aggregation (akin to GROUP BY in SQL), and negation (NOT IN). Unlike SQL, Datalog¬ has set
semantics.
2.2 Datalog¬
We now introduce the necessary Datalog¬ terminology, copying code snippets from Listings 1
and 2 to introduce key concepts.
A Datalog¬ program is a set of rules in no particular order. A rule 𝜑 is like a view definition
in SQL, defining a virtual relation via a query over other relations. A literal in a rule is either a
relation, a negated relation, or a boolean expression. A rule consists of a deduction operator :−
defining a single left-hand-side relation (the head of the rule) via a list of right-hand-side literals
(the body).
Consider Line 3 of Listing 2, which computes hash collisions:
3 collisions(val2,hashed,l,t) :− toStorage(val1,leaderSig,l,t), hash(val1,hashed),
hashset(hashed,val2,l,t)
In this example, the head literal is collisions, and the body literals are toStorage, hash, and
hashset. Each body literal can be a (possibly negated) relation 𝑟 consisting of multiple attributes
𝐴, or a boolean expression; the head literal must be a relation. For example, hashset is a relation
with four attributes representing the hash, message value, location, and time in that order. Each
attribute must be bound to a constant or variable; attributes in the head literal can also be bound
to aggregation functions. In the example above, the attribute representing the message value in
hashset is bound to the variable val2. Positive literals in the body of the rule are joined together;
negative literals are anti-joined (SQL’s NOT IN). Attributes bound to the same variable form an
equality predicate—in the rule above, the first attribute of toStorage must be equal to the first
attribute of hash since they are both bound to val1; this specifies an equijoin of those two relations.
Two positive literals in the same body that share no common variables form a cross-product.
Multiple rules may have the same head relation; the head relation is defined as the disjunction
(SQL UNION) of the rule bodies.
Note how library functions like hash are simply modeled as infinite relations of the form
(input, output). Because these are infinite relations, they can only be used in a rule body if
the input variables are bound to another attribute—this corresponds to “lazily evaluating” the
function only for that attribute’s finite set of values. For example, the relation hash contains the
fact (x, y) if and only if hash(x) equals y.
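To make these join semantics and the lazy treatment of function relations concrete, the following sketch is a purely illustrative Python stand-in for Dedalus evaluation (the example facts and the choice of SHA-256 as the hash function are assumptions, not part of the paper's implementation). It evaluates the rule from Line 3 over small in-memory relations, computing hash only for values actually bound in toStorage:

```python
import hashlib

def hash_fn(v):
    # Stand-in for the infinite (input, output) relation hash: it is evaluated
    # lazily, only on values bound by other literals in the rule body.
    return hashlib.sha256(v.encode()).hexdigest()

# Hypothetical facts following the schemas used in the rule on Line 3.
toStorage = {("hi", "sigL", "b.b.us:5678", 9)}           # (val, leaderSig, l, t)
hashset   = {(hash_fn("hi"), "bye", "b.b.us:5678", 9)}   # (hashed, val, l, t); a simulated collision

# collisions(val2,hashed,l,t) :- toStorage(val1,leaderSig,l,t),
#                                hash(val1,hashed), hashset(hashed,val2,l,t)
collisions = {
    (val2, hashed, l, t)
    for (val1, leaderSig, l, t) in toStorage
    for hashed in [hash_fn(val1)]               # lazily evaluate hash(val1) = hashed
    for (h2, val2, l2, t2) in hashset
    if h2 == hashed and l2 == l and t2 == t     # shared variables become equijoin predicates
}
print(collisions)   # one fact: ('bye', <hash of 'hi'>, 'b.b.us:5678', 9)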
Relations 𝑟 are populated with facts 𝑓 , which are tuples of values, one for each attribute of 𝑟 . We
will use the syntax 𝜋𝐴 (𝑓 ) to project 𝑓 to the value of attribute 𝐴. Relations with facts stored prior
to execution are traditionally called extensional relations, and the set of extensional relations is
called the EDB. Derived relations, defined in the heads of rules, are traditionally called intensional
relations, and the set of them is called the IDB. Boolean operators and library functions like hash
have pre-defined content, hence they are (infinite) EDB relations.
Datalog¬ also supports negation and aggregations. An example of aggregation is seen in Listing 2
Line 4, which counts the number of hash collisions with the count aggregation:
4 numCollisions(count<val>,hashed,l,t) :− collisions(val,hashed,l,t)
In this syntax, attributes that appear outside of aggregate functions form the GROUP BY list; attributes
inside the functions are aggregated. In order to compute aggregation in any rule 𝜑, we must first
compute the full content of all relations 𝑟 in the body of 𝜑. Negation works similarly: if we have a
literal !r(x) in the body, we can only conclude that no matching fact exists in r once we are sure we have
computed the full contents of r. We refer the reader to [1, 48] for further reading on aggregation and negation.
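Read operationally, the rule on Line 4 groups collisions by the attributes outside the aggregate and counts the rest. The following minimal Python sketch mirrors that reading (illustrative only; the facts are invented):

```python
from collections import Counter

# numCollisions(count<val>,hashed,l,t) :- collisions(val,hashed,l,t)
# (hashed, l, t) form the GROUP BY list; val is counted. The counts are only
# final once the full contents of collisions have been computed.
collisions = {
    ("bye",   "h1", "b.b.us:5678", 9),
    ("hello", "h1", "b.b.us:5678", 9),
    ("yo",    "h2", "b.b.us:5678", 9),
}
groups = Counter((hashed, l, t) for (_val, hashed, l, t) in collisions)
numCollisions = {(cnt, hashed, l, t) for (hashed, l, t), cnt in groups.items()}
print(sorted(numCollisions))   # [(1, 'h2', ...), (2, 'h1', ...)]
```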
2.3 Dedalus
Dedalus programs are legal Datalog¬ programs, constrained to adhere to three additional rules on
the syntax.
(1) Space and Time in Schema: All IDB relations must contain two attributes at their far right:
location 𝐿 and time 𝑇 . Together, these attributes model where and when a fact exists in the system.
For example, in the rule on Line 3 discussed above, a toStorage message 𝑚 and signature 𝑠𝑖𝑔 that
arrives at time 𝑡 at a node with location 𝑎𝑑𝑑𝑟 is represented as a fact toStorage(𝑚, 𝑠𝑖𝑔, 𝑎𝑑𝑑𝑟, 𝑡 ).
(2) Matching Space-Time Variables in Body: The location and time attributes in all body literals
must be bound to the same variables 𝑙 and 𝑡, respectively. This models the physical property that
two facts can be joined only if they exist at the same time and location. In Line 3, a toStorage fact
that appears on node 𝑙 at time 𝑡 can only match with hashset facts that are also on 𝑙 at time 𝑡.
We model library functions like hash as relations that are known (replicated) across all nodes 𝑛 and
unchanging across all timesteps 𝑡. Hence we elide 𝐿 and 𝑇 from function and expression literals as
a matter of syntax sugar, and assume they can join with other literals at all locations and times.
(3) Space and Time Constraints in Head: The location and time variables in the head of rules
must obey certain syntactic constraints, which ensure that the “derived” locations and times
correspond to physical reality. These constraints differ across three types of rules. Synchronous
(“deductive” [6]) rules are captured by having the same time variable in the head literal as in the
body literals. Having these derivations assigned to the same timestep 𝑡 is only physically possible
on a single node, so the location in the head of a synchronous rule must match the body as well.
Sequential (“inductive” [6]) rules are captured by having the head literal’s time be the successor
(t+1) of the body literals’ times t. Again, sequentiality can only be guaranteed physically on a single
node in an asynchronous system, so the location of the head in a sequential rule must match the
body. Asynchronous rules capture message passing between nodes, by having different time and
location variables in the head than the body. In an asynchronous system, messages are delivered at
an arbitrary time in the future. We discuss how this is modeled next.
In an asynchronous rule 𝜑, the location attribute of the head and body relations in 𝜑 are bound to
different variables; a different location in the head of 𝜑 indicates the arrival of the fact on a new
node. Asynchronous rules are constrained to capture non-deterministic delay by including a body
literal for the built-in delay relation (a.k.a. choose [6], chosen [4]), a non-deterministic function
that independently maps each head fact to an arrival time. The logical formalism of the delay
function is discussed in [4]; for our purposes it is sufficient to know that delay is constrained to
reflect Lamport’s “happens-before” relation for each fact. That is, a fact sent at time 𝑡 on 𝑙 arrives
at time 𝑡 ′ on 𝑙 ′ , where 𝑡 < 𝑡 ′ . We focus on Listing 2, Line 5 from our running example.
5 fromStorage(l,sig,val,collCnt,l',t') :− toStorage(val,leaderSig,l,t),
hash(val,hashed), numCollisions(collCnt,hashed,l,t), sign(val,sig),
leader(l'), delay((sig,val,collCnt,l,t,l'),t')
This is an asynchronous rule where a storage node 𝑙 sends the count of hash collisions for each
distinct storage request back to the leader 𝑙 ′ . Note the l' and t' in the head literal: they are derived
from the body literals leader (an EDB relation storing the leader address) and the built-in delay.
Note also how the first attribute of delay (the function “input”) is a tuple of variables that, together,
distinguish each individual head fact. This allows delay to choose a different t' for every head
fact [4]. The l in the head literal represents the storage node’s address and is used by the leader to
count the number of votes; it is unrelated to asynchrony.
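One way to picture delay is as an independent, nondeterministic choice of arrival time for each head fact, constrained only by happens-before. The sketch below is an illustrative Python stand-in (the keyed facts and the delay range are invented), not the formal semantics of [4]:

```python
import random

def delay(fact_key, send_time):
    # Stand-in for the built-in delay relation: choose an arrival time for this
    # particular fact, constrained only by happens-before (t' > t).
    return send_time + random.randint(1, 10)

# Two messages sent at the same timestep get independent arrival times, so they
# may arrive at the destination in either order.
sent = [(("sig1", "hi", 0, "storage1", 5), 5),
        (("sig2", "yo", 1, "storage1", 5), 5)]
arrivals = [(key, delay(key, t)) for key, t in sent]
print(arrivals)
```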
So far, we have only talked about facts that exist at a point in time 𝑡. State change in Dedalus is
modeled through the existence or non-existence of facts across time. Persistence rules like the
one below from Line 2 of Listing 2 ensure, inductively, that facts in hashset that exist at time 𝑡
exist at time 𝑡 + 1. Relations with persistence rules—like hashset—are persisted.
2 hashset(hashed,val,l,t') :− hashset(hashed,val,l,t), t'=t+1
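Operationally, a persistence rule simply re-derives every current fact at the next timestep, and deletion is modeled by withholding a fact from persistence. A minimal illustrative sketch follows (the facts and the explicit deletions argument are inventions for exposition):

```python
# State as facts indexed by timestep; the time attribute is factored into the dict key.
hashset_at = {9: {("h1", "hi", "b.b.us:5678")}}

def persist(facts, deletions=frozenset()):
    # Re-derive every fact at the next timestep, except those being "deleted".
    return {f for f in facts if f not in deletions}

hashset_at[10] = persist(hashset_at[9])
hashset_at[11] = persist(hashset_at[10], deletions={("h1", "hi", "b.b.us:5678")})
print(hashset_at)   # the fact exists at t=9 and t=10, but not at t=11
```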
2.5 Correctness
This paper transforms single-node Dedalus components into “equivalent” multi-component, multi-
node Dedalus programs; the transformations can be composed to scale entire distributed protocols.
For equivalence, we want a definition that satisfies any client (or observer) of the input/output
channels of the original program. To this end we employ equivalence of concurrent histories as
defined for linearizability [33], the gold standard in distributed systems.
We assume that a history 𝐻 can be constructed from any run of a given Dedalus program 𝑃.
Linearizability traditionally expects every program to include a specification that defines what
histories are "legal". We make no such assumption; instead, we take the set of all histories generated
by the unoptimized program 𝑃 to define the specification. As such, the optimized program 𝑃 ′ is
linearizable if any run of 𝑃 ′ generates the same output facts with the same timestamps as some run
of 𝑃.
Our rewrites are safe over protocols that assume the following fault model: an asynchronous
network (messages between correct nodes will eventually be delivered) where up to 𝑓 nodes can
suffer from general omission failures [52] (they may fail to send or receive some messages). After
optimizing, one original node 𝑛 may be replaced by multiple nodes 𝑛 1, 𝑛 2, . . .; the failure of any of
the nodes 𝑛𝑖 corresponds to a partial failure of the original node 𝑛, which is equivalent to the failure of
𝑛 under general omission.
Due to a lack of space, we omit the proofs of correctness of the rewrites described in Sections 3
and 4. Full proofs, preconditions, and rewrite mechanisms can be found in the appendix of our
technical report [16].
3 DECOUPLING
Decoupling partitions code; it takes a Dedalus component running on a single node, and breaks it
into multiple components that can run in parallel across many nodes. Decoupling can be used to
alleviate single-node bottlenecks by scaling up available resources. Decoupling can also introduce
pipeline parallelism: if one rule produces facts in its head that another rule consumes in its body,
decoupling those rules across two components can allow the producer and consumer to run in
parallel.
Decoupling introduces two challenges: how to get the right data to the right nodes (space), and how to ensure that introducing
asynchronous messaging between nodes does not affect correctness (time).
In this section we step through a progression of decoupling scenarios, and introduce analyses and
rewrites that provably address our concerns regarding space and time. Throughout, our goal is to
avoid introducing any coordination—i.e. extra messages beyond the data passed between rules in
the original program.
General Construction for Decoupling: In all our scenarios we will consider a component 𝐶 at
network location addr, consisting of a set of rules 𝜑. We will, without loss of generality, decouple 𝐶
into two components: 𝐶 1 = 𝜑 1 , which stays at location addr, and 𝐶 2 = 𝜑 2 which is placed at a new
location addr2. The rulesets of the two new components partition the original ruleset: 𝜑 1 ∩ 𝜑 2 = ∅
and 𝜑 1 ∪ 𝜑 2 ⊇ 𝜑. Note that we may add new rules during decoupling to achieve equivalence.
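As a toy illustration of this construction, the following Python sketch represents each rule only by the relations its body reads (the rule summaries loosely follow the running example and are otherwise assumptions), splits the ruleset between C1 and C2, and reports which relations must become asynchronous channels:

```python
# Each head relation mapped to the relations its rule body reads (assumed).
reads = {
    "signed":    {"in", "sign"},        # rule kept on C1 at addr
    "toStorage": {"signed", "leader"},  # rule moved to C2 at addr2
}
phi1 = {"signed"}      # heads assigned to C1
phi2 = {"toStorage"}   # heads assigned to C2
assert phi1.isdisjoint(phi2)

# A relation derived on C1 but read by a rule on C2 can no longer be joined
# locally: C1 must forward it to addr2 over an asynchronous channel.
channels = {body for head in phi2 for body in reads[head] if body in phi1}
print(channels)   # {'signed'}
```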
For example, redirection rewrites the asynchronous rule from Line 5 so that its messages are routed through a forward relation to the new location l'':
5 fromStorage(l,sig,val,collCnt,l'',t') :− toStorage(val,leaderSig,l,t),
hash(val,hashed), numCollisions(collCnt,hashed,l,t), sign(val,sig),
leader(l'), forward(l',l''), delay((l,sig,val,collCnt,l,t,l''),t')
The alert reader may notice performance concerns. First, 𝐶 1 may redundantly resend persistently-
derived facts to 𝐶 2 each tick, even though 𝐶 2 is persistently storing them anyway via the rewrite.
Second, 𝐶 2 is required to persist facts indefinitely, potentially long after they are needed. Solutions
to these problems were explored in prior work [17] and can be incorporated here as well without
affecting semantics.
Consider again Lines 1 and 2 in Listing 1. Note that Line 1 works like a function on one input: each
fact from in results in an independent signed fact in signed. Hence we can decouple further, placing
Line 1 on one node and Line 2 on another, forwarding signed values to toStorage. Intuitively, this
decoupling does not change program semantics because Line 2 simply sends messages, regardless
of which messages have come before: it behaves like a pure function.
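The sketch below illustrates this functional decoupling in Python (purely illustrative; SHA-256 stands in for sign and the message values are invented): a stateless signing stage feeds a forwarding stage over an asynchronous channel, and the two stages overlap in pipeline fashion.

```python
import hashlib
import queue
import threading

channel = queue.Queue()   # the asynchronous channel introduced by decoupling

def signer(requests):
    # C1: acts as a pure function of each input, independent of message order
    # or of any accumulated state.
    for val in requests:
        sig = hashlib.sha256(val.encode()).hexdigest()   # stand-in for sign()
        channel.put((val, sig))
    channel.put(None)   # end-of-stream marker for this sketch

def forwarder():
    # C2: relays each signed fact onward (printed here); it runs concurrently
    # with C1, giving pipeline parallelism.
    while (msg := channel.get()) is not None:
        print("toStorage", msg)

t1 = threading.Thread(target=signer, args=(["hi", "yo", "bye"],))
t2 = threading.Thread(target=forwarder)
t1.start(); t2.start()
t1.join(); t2.join()
```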
As a side note, recall that persisted relations in Dedalus are by definition IDB relations. Hence
Precondition (2) prevents 𝐶 2 from joining current inputs (an IDB relation) with previous persisted
data (another IDB relation)! In effect, persistence rules are irrelevant to the output of a functional
component, rendering functional components effectively “stateless”.
4 PARTITIONING
Decoupling is the distribution of logic across nodes; partitioning (or “sharding”) is the distribution
of data. By using a relational language like Dedalus, we can scale protocols using a variety of
techniques that query optimizers use to maximize partitioning without excessive “repartitioning”
(a.k.a. “shuffling”) of data at runtime.
Unlike decoupling, which introduces new components, partitioning introduces additional nodes
on which to run instances of each component. Therefore, each fact may be rerouted to any of the
many nodes, depending on the partitioning scheme. Because each rule still executes locally on each
node, we must reason about changing the location of facts.
We first need to define partitioning schemes, and what it means for a partitioning to be correct
for a set of rules. Much of this can be borrowed from recent theoretical literature [7, 27, 28, 55].
A partitioning scheme is described by a distribution policy 𝐷 (𝑓 ) that outputs some node address
addr_i for any fact 𝑓 . A partitioning preserves the semantics of the rules in a component if it is
parallel disjoint correct [55]. Intuitively, this property says that the body facts that need to be
colocated remain colocated after partitioning. We adapt the parallel disjoint correctness definitions
to the context of Dedalus as follows:
Definition 4.1. A distribution policy 𝐷 over component 𝐶 is parallel disjoint correct if for any fact
𝑓 of 𝐶, for any two facts 𝑓1, 𝑓2 in the proof tree of 𝑓 , 𝐷 (𝑓1 ) = 𝐷 (𝑓2 ).
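Concretely, Definition 4.1 can be checked against a candidate policy given the proof trees of a component's facts. The following Python sketch is illustrative only (the node list, the hash-based policy, and the example proof tree are assumptions):

```python
def parallel_disjoint_correct(proof_trees, D):
    # Every proof tree must be "local": all facts needed to derive an output
    # are routed to the same node by the distribution policy D.
    return all(len({D(f) for f in tree}) <= 1 for tree in proof_trees)

nodes = ["addr1", "addr2", "addr3"]

def D(fact):
    # Partition on the first attribute of every fact.
    return nodes[hash(fact[0]) % len(nodes)]

# Hypothetical proof tree: the two facts joined to derive one collisions fact.
tree = [("hi", "toStorage"), ("hi", "hashset")]
print(parallel_disjoint_correct([tree], D))   # True: both facts agree on the join key
```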
Ideally we can find a single distribution policy that is parallel disjoint correct over the component
in question. To do so, we need to partition each relation based on the set of attributes used for
joining or grouping the relation in the component’s rules. Such distribution policies are said to
satisfy the co-hashing constraint (Section 4.1). Unfortunately, it is common for a single relation to
be referenced in two rules with different join or grouping attributes. In some cases, dependency
analysis can still find a distribution policy that will be correct (Section 4.2). If no parallel disjoint
correct distribution policy can be found, we can resort to partial partitioning (Section 4.3), which
replicates facts across multiple nodes.
To discuss partitioning rewrites on generic Dedalus programs, we consider without loss of generality
a component 𝐶 with a set of rules 𝜑 at network location addr. We will partition the data at addr
across a set of new locations addr1, addr2, etc, each executing the same rules 𝜑.
4.1 Co-hashing
We begin with co-hashing [28, 55], a well studied constraint that avoids repartitioning data. Our
goal is to co-locate facts that need to be combined because they (a) share a join key, (b) share a
group key, or (c) share an antijoin key.
Consider two relations 𝑟 1 and 𝑟 2 that appear in the body of a rule 𝜑, with matching variables bound
to attributes 𝐴 in 𝑟 1 and corresponding attributes 𝐵 in 𝑟 2 . Henceforth we will say that 𝑟 1 and 𝑟 2
“share keys” on attributes 𝐴 and 𝐵. Co-hashing states that if 𝑟 1 and 𝑟 2 share keys on attributes 𝐴
and 𝐵, then all facts from 𝑟 1 and 𝑟 2 with the same values for 𝐴 and 𝐵 must be routed to the same
partition.
Note that even if co-hashing is satisfied for each individual rule, a relation 𝑟 might need to be repartitioned
between rules, because 𝑟 might share keys with another relation on attributes 𝐴 in one
rule and 𝐴′ in another. To avoid repartitioning, we would like the distribution policy to partition
consistently with co-hashing in every rule of a component.
Consider Line 8 of Listing 1, assuming it has already been decoupled. Inconsistencies between ACKs
are detected on a per-value basis and can be partitioned over the attribute bound to the variable
val; this is evidenced by the fact that the relation acks is always joined with other IDB relations
using the same attribute (bound to val). Line 2 and Listing 2 Line 5 are similarly partitionable by
value, as seen in Figure 5.
Formally, a distribution policy 𝐷 partitions relation 𝑟 by attribute 𝐴 if for any pair of facts 𝑓1, 𝑓2
in 𝑟 , 𝜋𝐴 (𝑓1 ) = 𝜋𝐴 (𝑓2 ) implies 𝐷 (𝑓1 ) = 𝐷 (𝑓2 ). Facts are distributed according to their partitioning
attributes.
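A hash-based policy keyed on a single attribute is the simplest way to satisfy this definition. The sketch below (illustrative; the acks schema, the attribute position, and the addresses are assumptions) routes facts by the attribute bound to val, so facts agreeing on val always co-locate:

```python
import hashlib

addrs = ["addr1", "addr2", "addr3"]

def D(fact, key_attr):
    # Partition a relation by the attribute at position key_attr: facts that
    # agree on that attribute are always sent to the same address.
    digest = hashlib.md5(str(fact[key_attr]).encode()).hexdigest()
    return addrs[int(digest, 16) % len(addrs)]

# Two acks facts (assumed schema: val, sender, time) with the same val.
f1 = ("hi", "storage1", 9)
f2 = ("hi", "storage2", 12)
assert D(f1, 0) == D(f2, 0)   # same val implies same partition
print(D(f1, 0))
```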
𝐷 partitions consistently with co-hashing if, for any pair of referenced relations 𝑟 1, 𝑟 2 in a rule
𝜑 of 𝐶 that share keys on attribute lists 𝐴1 and 𝐴2 respectively, for any pair of
facts 𝑓1 ∈ 𝑟 1, 𝑓2 ∈ 𝑟 2 , 𝜋𝐴1 (𝑓1 ) = 𝜋𝐴2 (𝑓2 ) implies 𝐷 (𝑓1 ) = 𝐷 (𝑓2 ). Facts will then be successfully joined,
aggregated, or negated after partitioning because they are sent to the same locations.
Precondition: There exists a distribution policy 𝐷 for relations referenced by component 𝐶 that
partitions consistently with co-hashing.
We can discover candidate distribution policies through a static analysis of the join and grouping
attributes in every rule 𝜑 in 𝐶.
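One possible shape for such a static analysis, sketched in Python (the per-rule summaries and relation names are invented for illustration): record, for each relation, the attribute position carrying a shared join or grouping variable in each rule, and propose a partitioning attribute only when every rule agrees.

```python
from collections import defaultdict

# For each rule, the attribute position of each referenced relation that is
# bound to a shared (join/group) variable.
rule_join_keys = [
    {"acks": 0, "pending": 0},    # e.g. both joined on the attribute bound to val
    {"acks": 0, "replies": 0},
]

candidates = defaultdict(set)
for rule in rule_join_keys:
    for rel, pos in rule.items():
        candidates[rel].add(pos)

for rel, positions in sorted(candidates.items()):
    if len(positions) == 1:
        print(f"{rel}: candidate partitioning attribute {positions.pop()}")
    else:
        print(f"{rel}: no single attribute satisfies co-hashing")
```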
Rewrite: Redirection With Partitioning. We are given a distribution policy 𝐷 from the precon-
dition. For any rules in 𝐶 ′ whose head is referenced in 𝐶, we modify the “redirection” relation such
that messages 𝑓 sent to 𝐶 at addr are instead sent to the appropriate node of 𝐶 at 𝐷 (𝑓 ).
4.2 Dependencies
By analyzing Dedalus rules, we can identify dependencies between attributes that (1) strengthen
partitioning by showing that partitioning on one attribute can imply partitioning on another, and
(2) loosen the co-hashing constraint.
For example, consider a relation 𝑟 that contains both an original string attribute Str and its
uppercased value in attribute UpStr. The functional dependency (FD) Str → UpStr strengthens
partitioning: partitioning on UpStr implies partitioning on Str. Formally, relation 𝑟 has a functional
dependency 𝑔 : 𝐴 → 𝐵 on attribute lists 𝐴, 𝐵 if for all facts 𝑓 ∈ 𝑟 , 𝜋𝐵 (𝑓 ) = 𝑔(𝜋𝐴 (𝑓 )) for some
function 𝑔. That is, the values 𝐴 in the domain of 𝑔 determine the values in the range, 𝐵. This
reasoning allows us to satisfy multiple co-hashing constraints simultaneously.
Now consider the following joins in the body of a rule: p(str), r(str, upStr), q(upStr). Co-
hashing would not allow partitioning, because 𝑝 and 𝑞 do not share keys over their attributes.
However, if we know the functional dependency Str → UpStr over 𝑟 , then we can partition 𝑝, 𝑞, 𝑟
on the uppercase values of the strings and still avoid reshuffling. This co-partition dependency
(CD) between the attributes of 𝑝 and 𝑞 loosens the co-hashing constraint beyond sharing keys.
Formally, relations 𝑟 1 and 𝑟 2 have a co-partition dependency 𝑔 : 𝐴 ↩→ 𝐵 on attribute lists 𝐴, 𝐵 if for
all proof trees containing facts 𝑓1 ∈ 𝑟 1 , 𝑓2 ∈ 𝑟 2 , we have 𝜋𝐵 (𝑓1 ) = 𝑔(𝜋𝐴 (𝑓2 )) for some function 𝑔. If
we partition by 𝐵 (the range of 𝑔) we also successfully partition by 𝐴 (the domain of 𝑔).
We return to the running example to see how CDs and FDs can be combined to enable coordination-
free partitioning where co-hashing forbade it. Listing 2 cannot be partitioned with co-hashing
because toStorage does not share keys with hashset in Line 3. No distribution policy can satisfy
the co-hashing constraint if there exist two relations in the same rule that do not share keys.
However, we know that the hash is a function of the value; there is an FD hash.1 → hash.2.
Hence partitioning on hash.2 implies partitioning on hash.1. The first attributes of toStorage and
hashset are joined through the attributes of the hash relation in all rules, forming a CD. Let the first
attributes of toStorage and hashset—representing a value and a hash—be 𝑉 and 𝐻 respectively:
a fact 𝑓𝑣 in toStorage can only join with a fact 𝑓ℎ in hashset if hash (𝜋𝑉 (𝑓𝑣 )) equals 𝜋𝐻 (𝑓ℎ ). This
reasoning can be repeatedly applied to partition all relations by the attributes corresponding to the
repeated variable hashed, as seen in Figure 6.
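The following Python sketch (illustrative; SHA-256 and the two-node address list are assumptions) shows how routing both relations through the hash value exploits this FD and CD: every joinable toStorage/hashset pair lands on the same node without reshuffling.

```python
import hashlib

addrs = ["addr1", "addr2"]

def hash_fn(v):                     # the FD: hash.1 determines hash.2
    return hashlib.sha256(v.encode()).hexdigest()

def route_by_hash(hashed):          # the shared partitioning key is the hash value
    return addrs[int(hashed, 16) % len(addrs)]

def D(fact, relation):
    if relation == "toStorage":     # (val, leaderSig, l, t): route by hash_fn(val)
        return route_by_hash(hash_fn(fact[0]))
    if relation == "hashset":       # (hashed, val, l, t): route by its own hash attribute
        return route_by_hash(fact[0])

val = "hi"
f_store = (val, "sigL", "addr", 9)
f_hash  = (hash_fn(val), "bye", "addr", 9)
# The CD guarantees these are the only facts that could ever join, and they
# are sent to the same partition.
assert D(f_store, "toStorage") == D(f_hash, "hashset")
print(D(f_store, "toStorage"))
```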
Precondition: There exists a distribution policy 𝐷 for relations 𝑟 referenced in 𝐶 that partitions
consistently with the CDs of 𝑟 .
Assume we know all CDs 𝑔 over attribute sets 𝐴1, 𝐴2 of relations 𝑟 1, 𝑟 2 . A distribution policy
partitions consistently with CDs if, for any pair of facts 𝑓1, 𝑓2 of referenced relations 𝑟 1, 𝑟 2 in a
rule 𝜑 of 𝐶 such that 𝜋𝐴1 (𝑓1 ) = 𝑔(𝜋𝐴2 (𝑓2 )) for each such CD, 𝐷 (𝑓1 ) = 𝐷 (𝑓2 ).
We describe the mechanism for systematically finding FDs and CDs in the technical report.
Rewrite: Identical to Redirection with Partitioning.
Fig. 7. Throughput/latency comparison between distributed protocols before and after rule-driven rewrites.
We define “state machines” and the rewrites for partial partitioning in the technical report.
5 EVALUATION
We will refer to our approach of manually modifying distributed protocols with the mechanisms
described in this paper as rule-driven rewrites, and the traditional approach of modifying distributed
protocols and proving the correctness of the optimized protocol as ad hoc rewrites.
In this section we address the following questions:
(1) How can rule-driven rewrites be applied to foundational distributed protocols, and how well do
the optimized protocols scale? (Section 5.2)
(2) Which of the ad hoc rewrites can be reproduced via the application of (one or more) rules, and
which cannot? (Section 5.3)
(3) What is the effect of the individual rule-driven rewrites on throughput? (Section 5.4)
Decoupling contributes to the throughput differences between the unoptimized implementation and the 1-partition configuration.
Partitioning contributes to the differences between the 1, 3, and 5 partition configurations.
These experimental configurations demonstrate the scalability of the rewritten protocols. They do
not represent the most cost-effective configurations, nor the configurations that maximize through-
put. We manually applied rewrites on the critical path, selecting rewrites with low overhead, where
we suspect the protocols may be bottlenecked. Across the protocols we tested, these bottlenecks
often occurred where the protocol (1) broadcasts messages, (2) collects messages, and (3) logs to
disk. These bottlenecks can usually be decoupled from the original node, and because messages are
often independent of one another, the decoupled nodes can then be partitioned such that each node
handles a subset of messages. The process of identifying bottlenecks, applying suitable rewrites,
and finding optimal configurations may eventually be automated.
Voting. Client payloads arrive at the leader, which broadcasts payloads to the participants, collects
votes from the participants, and responds to the client once all participants have voted. Multiple
rounds of voting can occur concurrently. BaseVoting is implemented with 4 machines, 1 leader
and 3 participants, achieving a maximum throughput of 100,000 commands/s, bottlenecking at the
leader.
2PC (with Presumed Abort). The coordinator receives client payloads and broadcasts voteReq to
participants. Participants log and flush to disk, then reply with votes. The coordinator collects votes,
logs and flushes to disk, then broadcasts commit to participants. Participants log and flush to disk,
then reply with acks. The coordinator then logs and replies to the client. Multiple rounds of 2PC can
occur concurrently. Base2PC is implemented with 4 machines, 1 coordinator and 3 participants,
achieving a maximum throughput of 30,000 commands/s, bottlenecking at the coordinator.
Paxos. Paxos solves consensus while tolerating up to 𝑓 failures. Paxos consists of 𝑓 + 1 proposers
and 2𝑓 + 1 acceptors. Each proposer has a unique, dynamic ballot number; the proposer with the
highest ballot number is the leader. The leader receives client payloads, assigns each payload a
sequence number, and broadcasts a p2a message containing the payload, sequence number, and
its ballot to the acceptors. Each acceptor stores the highest ballot it has received and rejects or
accepts payloads into its log based on whether its local ballot is less than or equal to the leader’s.
The acceptor then replies to the leader via a p2b message that includes the acceptor’s highest ballot.
If this ballot is higher than the leader’s ballot, the leader is preempted. Otherwise, the acceptor has
accepted the payload, and when 𝑓 + 1 acceptors accept, the payload is committed. The leader relays
committed payloads to the replicas, which execute the payload command and notify the clients.
BasePaxos is implemented with 8 machines—2 proposers, 3 acceptors, and 3 replicas (matching
BasePaxos in Section 5.3)—tolerating 𝑓 = 1 failures, achieving a maximum throughput of 50,000
commands/s, bottlenecking at the proposer.
Across the protocols, the additional latency overhead from decoupling is negligible.
Together, these experiments demonstrate that rule-driven rewrites can be applied to scale a variety
of distributed protocols, and that performance wins can be found fairly easily via choosing the
rules to apply manually. A natural next step is to develop cost models for our context, and integrate
into a search algorithm in order to create an automatic optimizer for distributed systems. Standard
techniques may be useful here, but we also expect new challenges in modeling dynamic load and
contention. It seems likely that adaptive query optimization and learning could prove relevant here
to enable autoscaling [20, 58].
2 Asymmetric decoupling is defined in the technical report. It applies when we decouple 𝐶 into 𝐶 1 and 𝐶 2 , where 𝐶 2 is
monotonic, but 𝐶 2 is independent of 𝐶 1 .
3 Partitioning with sealing is defined in the technical report. It applies when a partitioned component originally sent a
batched set of messages that must be recombined across partitions after partitioning.
Fig. 8. The common path taken by CompPaxos and ScalablePaxos, assuming 𝑓 = 1 and any partitionable
component has 2 partitions. The acceptors outlined in red represent possible quorums for leader election.
Compartmentalized Paxos [63] is a state-of-the-art implementation of Paxos based, among other optimizations, on manually applying decoupling and partitioning.
To best understand the merits of scalability, we choose not to batch client requests, as batching
often obscures the benefits of individual scalability rewrites.
5.3.1 Throughput comparison. Whittaker et al. created Scala implementations of Paxos (BasePaxos)
and Compartmentalized Paxos (CompPaxos). Since our implementations are in Dedalus, we first
compare throughputs of the Paxos implementations between the two languages to establish a
baseline. Following the nomenclature from Section 5.2, implementations in Dedalus are prepended
with , and implementations in Scala by Whittaker et al. are not.
BasePaxos was reported to peak at 25,000 commands/s with 𝑓 = 1 and 3 replicas on AWS in
2021 [63]. As seen in Figure 9, we verified this result in GCP using the same code and experimental
setup. Our Dedalus implementation of Paxos— BasePaxos—in contrast, peaks at a higher 50,000
commands/s with the same configuration as BasePaxos. We suspect this performance difference is
due to the underlying implementations of BasePaxos in Scala and BasePaxos in Dedalus, compiled
to Hydroflow atop Rust. Indeed, our deployment of CompPaxos peaked at 130,000 commands/s, and
our reimplementation of Compartmentalized Paxos in Dedalus ( CompPaxos) peaked at a higher
160,000 commands/s, a throughput improvement comparable to the 25,000 commands/s throughput
gap between BasePaxos and BasePaxos.
Note that technically, CompPaxos was reported to peak at 150,000 commands/s, not 130,000.
We deployed the Scala code provided by Whittaker et al. with identical hardware, network, and
configuration, but could not replicate their exact result.
We now have enough context to compare the throughput between CompPaxos and ScalablePaxos;
their respective architectures are shown in Figure 8. CompPaxos achieves maximum throughput
with 20 machines: 2 proposers, 10 proxy leaders, 4 acceptors (in a 2 × 2 grid), and 4 replicas. We
compare CompPaxos and ScalablePaxos using the same number of machines, fixing the number
of proposers (for fault tolerance) and replicas (which we do not decouple or partition). Restricted
to 20 machines, ScalablePaxos achieves the maximum throughput with 2 proposers, 2 p2a proxy
leaders, 3 coordinators, 3 acceptors, 6 p2b proxy leaders, and 4 replicas. All components are kept
at minimum configuration—with only 1 partition—except for the p2b proxy leaders, which are
the throughput bottleneck. ScalablePaxos then scales to 130,000 commands/s, a 2.5× throughput
improvement over BasePaxos. Although CompPaxos reports a 6× throughput improvement over
BasePaxos (from 25,000 to 150,000 commands/s) in Scala, the Dedalus reimplementation yields a 3×
throughput improvement between CompPaxos and BasePaxos, similar to the 2.5× throughput
improvement between ScalablePaxos and BasePaxos. Therefore we conclude that the throughput
improvements of rule-driven rewrites and ad hoc rewrites are comparable when applied to Paxos.
We emphasize that our framework cannot realize every ad hoc rewrite in CompPaxos (Figure 8).
We describe the differences between CompPaxos and ScalablePaxos next.
5.3.2 Proxy leaders. Figure 8 shows that CompPaxos has a single component called “proxy leader”
that serves the roles of two components in ScalablePaxos: p2a and p2b proxy leaders. Unlike p2a
and p2b proxy leaders, proxy leaders in CompPaxos can be shared across proposers. Since only 1
proposer will be the leader at any time, CompPaxos ensures that work is evenly distributed across
proxy leaders. Our rewrites focus on scaling out and do not consider sharing physical resources
between logical components. Moreover, there is an additional optimization in the proxy leader
of CompPaxos. CompPaxos avoids relaying p2bs from proxy leaders to proposers by introducing
nack messages from acceptors that are sent instead. This optimization is neither decoupling nor
partitioning and hence is not included in ScalablePaxos.
5.3.3 Acceptors. CompPaxos partitions acceptors without introducing coordination, allowing each
partition to hold an independent ballot. In contrast, ScalablePaxos can only partially partition
acceptors and must introduce coordinators to synchronize ballots between partitions, because our
formalism states that the partitions’ ballots together must correspond to the original acceptor’s
ballot. Crucially, CompPaxos allows the highest ballot held at each partition to diverge while
ScalablePaxos does not, because this divergence can introduce non-linearizable executions that
remain safe for Paxos, but are too specific to generalize. We elaborate more on this execution in the
technical report.
Despite its additional overhead, ScalablePaxos does not suffer from increased latency because the
overhead is not on the critical path. Assuming a stable leader, p2b proxy leaders do not need to
forward p2bs to proposers, and acceptors do not need to coordinate between partitions.
5.3.4 Additional differences. CompPaxos additionally includes classical Paxos optimizations such
as batching, thriftiness [47], and flexible quorums [36], which are outside the scope of this paper as
they are not instances of decoupling or partitioning. These optimizations, combined with the more
efficient use of proxy leaders, explain the remaining throughput difference between CompPaxos
and ScalablePaxos.
When decoupling, the program must always decrypt the message from the client and encrypt its output. When partitioning,
the program must always encrypt its output. When decoupling, we always separate one node
into two. When partitioning, we always create two partitions out of one. Thus the maximum scale
factor of each rewrite is 2×. To determine the scaling factors, we increased the number of clients by
increments of two for decoupling and three for partitioning, stopping when we reached saturation
for each protocol.
Briefly, we study each of the individual rewrites using the following artificial protocols:
• Mutually Independent Decoupling: A replicated set where the leader decrypts a client request,
broadcasts payloads to replicas, collects acknowledgements, and replies to the client (encrypting
the response), similar to the voting protocol. We denote this base protocol as R-set. We decouple
the broadcast and collection rules.
• Monotonic Decoupling: An R-set where the leader also keeps track of a ballot that is potentially
updated by each client message. The leader attaches the value of the ballot at the time each
client request is received to the matching response.
• Functional Decoupling: The same R-set protocol, but with zero replicas. The leader attaches the
highest ballot it has seen so far to each response. It still decrypts client requests and encrypts
replies as before.
• Partitioning With Dependencies: An R-set where each replica records the number of hash collisions,
similar to our running example.
• Partial Partitioning: An R-set where the leader and replicas each track an integer. The leader’s
integer is periodically incremented and sent to the replicas, similar to Paxos. The replicas attach
their latest integers to each response.
The impact on throughput varies between rewrites due to both the overhead introduced and the
underlying protocol. Note that of our 6 experiments, the first two are the only ones that add a
network hop to the critical path of the protocol and rely on pipelined parallelism. The combination
of networking overhead and the potential for imperfect pipelined parallelism likely explains why
they achieve only about 1.7× performance improvement. In contrast, the speedups for mutually
independent decoupling and the different variants of partitioning are closer to the expected 2×.
Nevertheless, each rewrite improves throughput in isolation as shown in Figure 10.
6 RELATED WORK
Our results build on rich traditions in distributed protocol design and parallel query processing. The
intent of this paper was not to innovate in either of those domains per se, but rather to take parallel
query processing ideas and use them to discover and evaluate rewrites for distributed protocols.
Parallel Datalog goes back to the early 1990s (e.g. [26]). A recent survey covers the state of the
art in modern Datalog engines [41], including dedicated parallel Datalog systems and Datalog
implementations over Big Data engines. The partitioning strategies we use in Section 4 are discussed
in the survey; a deeper treatment can be found in the literature cited in Section 4 [7, 27, 28, 55].
Declarative languages like Dedalus have been used extensively in networking. Loo et al. surveyed
work as of 2009 including the Datalog variants NDlog and Overlog [45]. As networking DSLs,
these languages take a relaxed “soft state” view of topics like persistence and consistency. Dedalus
and Bloom [5, 18] were developed with the express goal of formally addressing persistence and
consistency in ways that we rely upon here. More recent languages for software-defined networks
(SDNs) include NetKAT [9] and P4 [14], but these focus on centralized SDN controllers, not
distributed systems.
Further afield, DAG-based dataflow programming is explored in parallel computing (e.g., [12, 13]).
While that work is not directly relevant to the transformations we study here, their efforts to
schedule DAGs in parallel environments may inform future work.
7 CONCLUSION
This is the first paper to present general scaling optimizations that can be safely applied to any
distributed protocol, taking inspiration from traditional SQL query optimizers. This opens the door
to the creation of automatic optimizers for distributed protocols.
Our work builds on the ideas of Compartmentalized Paxos [63], which “unpacks” atomic components
to increase throughput. In addition to our work on generalizing decoupling and partitioning via
automation, there are additional interesting follow-on questions that we have not addressed here.
The first challenge follows from the separation of an atomic component into multiple smaller
components: when one of the smaller components fails, others may continue responding to client
requests. While this is not a concern for protocols that assume omission failures, additional checks
and/or rewriting may be necessary to extend our work to weaker failure models. The second
challenge is the potential for liveness issues introduced by the additional latency from our rewrites and
our assumption of an asynchronous network. Protocols that calibrate timeouts assuming a partially
synchronous network with some maximum message delay may need their timeouts recalibrated.
This can likely be addressed in practice using typical pragmatic calibration techniques.
ACKNOWLEDGEMENTS
This work was supported by gifts from AMD, Anyscale, Google, IBM, Intel, Microsoft, Mohamed
Bin Zayed University of Artificial Intelligence, Samsung SDS, Uber, and VMware.
REFERENCES
[1] Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of Databases. Addison-Wesley. http://webdam.inria.fr/Alice/pdfs/all.pdf
[2] Ittai Abraham, Guy Gueta, Dahlia Malkhi, Lorenzo Alvisi, Ramakrishna Kotla, and Jean-Philippe Martin. 2017. Revisiting
Fast Practical Byzantine Fault Tolerance. CoRR abs/1712.01367 (2017). arXiv:1712.01367 http://arxiv.org/abs/1712.01367
[3] Ailidani Ailijiang, Aleksey Charapko, Murat Demirbas, and Tevfik Kosar. 2020. WPaxos: Wide Area Network Flexible
Consensus. IEEE Transactions on Parallel and Distributed Systems 31, 1 (2020), 211–223. https://doi.org/10.1109/TPDS.2019.2929793
[4] Peter Alvaro, Tom J Ameloot, Joseph M Hellerstein, William Marczak, and Jan Van den Bussche. 2011. A declarative
semantics for Dedalus. UC Berkeley EECS Technical Report 120 (2011), 2011.
[5] Peter Alvaro, Neil Conway, Joseph M. Hellerstein, and William R. Marczak. 2011. Consistency Analysis in Bloom: a
CALM and Collected Approach. In Fifth Biennial Conference on Innovative Data Systems Research, CIDR 2011, Asilomar,
CA, USA, January 9-12, 2011, Online Proceedings. 249–260. http://cidrdb.org/cidr2011/Papers/CIDR11_Paper35.pdf
[6] Peter Alvaro, William R. Marczak, Neil Conway, Joseph M. Hellerstein, David Maier, and Russell Sears. 2011. Dedalus:
Datalog in Time and Space. In Datalog Reloaded, Oege de Moor, Georg Gottlob, Tim Furche, and Andrew Sellers (Eds.).
Springer Berlin Heidelberg, Berlin, Heidelberg, 262–281.
[7] Tom J. Ameloot, Gaetano Geck, Bas Ketsman, Frank Neven, and Thomas Schwentick. 2017. Parallel-Correctness and
Transferability for Conjunctive Queries. Journal of the ACM 64, 5 (Oct. 2017), 1–38. https://doi.org/10.1145/3106412
[8] Mohammad Javad Amiri, Chenyuan Wu, Divyakant Agrawal, Amr El Abbadi, Boon Thau Loo, and Mohammad Sadoghi.
2022. The bedrock of bft: A unified platform for bft protocol design and implementation. arXiv preprint arXiv:2205.04534
(2022).
[9] Carolyn Jane Anderson, Nate Foster, Arjun Guha, Jean-Baptiste Jeannin, Dexter Kozen, Cole Schlesinger, and David
Walker. 2014. NetKAT: Semantic foundations for networks. Acm sigplan notices 49, 1 (2014), 113–126.
[10] Mahesh Balakrishnan, Chen Shen, Ahmed Jafri, Suyog Mapara, David Geraghty, Jason Flinn, Vidhya Venkat, Ivailo
Nedelchev, Santosh Ghosh, Mihir Dharamshi, Jingming Liu, Filip Gruszczynski, Jun Li, Rounak Tibrewal, Ali Zaveri,
Rajeev Nagar, Ahmed Yossef, Francois Richard, and Yee Jiun Song. 2021. Log-Structured Protocols in Delos. In
Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (Virtual Event, Germany) (SOSP ’21).
Association for Computing Machinery, New York, NY, USA, 538–552. https://doi.org/10.1145/3477132.3483544
[11] Christian Berger and Hans P Reiser. 2018. Scaling byzantine consensus: A broad analysis. In Proceedings of the 2nd
workshop on scalable and resilient infrastructures for distributed ledgers. 13–18.
[12] Robert D Blumofe, Christopher F Joerg, Bradley C Kuszmaul, Charles E Leiserson, Keith H Randall, and Yuli Zhou.
1995. Cilk: An efficient multithreaded runtime system. ACM SigPlan Notices 30, 8 (1995), 207–216.
[13] George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Herault, Pierre Lemarinier, and Jack Dongarra. 2012.
DAGuE: A generic distributed DAG engine for high performance computing. Parallel Comput. 38, 1-2 (2012), 37–51.
[14] Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, Cole Schlesinger, Dan Talayco,
Amin Vahdat, George Varghese, et al. 2014. P4: Programming protocol-independent packet processors. ACM SIGCOMM
Computer Communication Review 44, 3 (2014), 87–95.
[15] Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. 2015. Apache
flink: Stream and batch processing in a single engine. The Bulletin of the Technical Committee on Data Engineering 38, 4
(2015).
[16] David C. Y. Chu, Rithvik Panchapakesan, Shadaj Laddad, Lucky E. Katahanas, Chris Liu, Kaushik Shivakumar, Natacha
Crooks, Joseph M. Hellerstein, and Heidi Howard. 2024. Optimizing Distributed Protocols with Query Rewrites
[Technical Report]. https://github.com/rithvikp/autocomp.
[17] Neil Conway, Peter Alvaro, Emily Andrews, and Joseph M Hellerstein. 2014. Edelweiss: Automatic storage reclamation
for distributed programming. Proceedings of the VLDB Endowment 7, 6 (2014), 481–492.
[18] Neil Conway, William R Marczak, Peter Alvaro, Joseph M Hellerstein, and David Maier. 2012. Logic and lattices for
distributed programming. In Proceedings of the Third ACM Symposium on Cloud Computing. 1–14.
[19] Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings
of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6 (San Francisco, CA)
(OSDI’04). USENIX Association, USA, 10.
[20] Amol Deshpande, Zachary Ives, Vijayshankar Raman, et al. 2007. Adaptive query processing. Foundations and Trends®
in Databases 1, 1 (2007), 1–140.
[21] David DeWitt and Jim Gray. 1992. Parallel database systems. Commun. ACM 35, 6 (June 1992), 85–98. https://doi.org/10.1145/129888.129894
[22] David J. DeWitt, Robert H. Gerber, Goetz Graefe, Michael L. Heytens, Krishna B. Kumar, and M. Muralikrishna. 1986.
GAMMA - A High Performance Dataflow Database Machine. In VLDB. 228–237.
[23] Cong Ding, David Chu, Evan Zhao, Xiang Li, Lorenzo Alvisi, and Robbert Van Renesse. 2020. Scalog: Seamless
Reconfiguration and Total Order in a Scalable Shared Log. In 17th USENIX Symposium on Networked Systems Design
and Implementation (NSDI 20). USENIX Association, Santa Clara, CA, 325–338. https://www.usenix.org/conference/nsdi20/presentation/ding
[24] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. 1988. Consensus in the presence of partial synchrony. Journal
of the ACM (JACM) 35, 2 (1988), 288–323.
[25] Shinya Fushimi, Masaru Kitsuregawa, and Hidehiko Tanaka. 1986. An Overview of The System Software of A Parallel
Relational Database Machine GRACE. In VLDB, Vol. 86. 209–219.
[26] Sumit Ganguly, Avi Silberschatz, and Shalom Tsur. 1990. A framework for the parallel processing of datalog queries.
ACM SIGMOD Record 19, 2 (1990), 143–152.
[27] Gaetano Geck, Bas Ketsman, Frank Neven, and Thomas Schwentick. 2019. Parallel-Correctness and Containment
for Conjunctive Queries with Union and Negation. ACM Transactions on Computational Logic 20, 3 (July 2019), 1–24.
https://doi.org/10.1145/3329120
[28] Gaetano Geck, Frank Neven, and Thomas Schwentick. 2020. Distribution Constraints: The Chase for Distributed Data.
Schloss Dagstuhl - Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPICS.ICDT.2020.13
[29] Rachid Guerraoui, Nikola Knežević, Vivien Quéma, and Marko Vukolić. 2010. The next 700 BFT protocols. In Proceedings
of the 5th European Conference on Computer Systems. 363–376.
[30] Suyash Gupta, Mohammad Javad Amiri, and Mohammad Sadoghi. 2023. Chemistry behind Agreement. In Conference
on Innovative Data Systems Research (CIDR) (2023).
[31] Chris Hawblitzel, Jon Howell, Manos Kapritsos, Jacob R Lorch, Bryan Parno, Michael L Roberts, Srinath Setty, and
Brian Zill. 2015. IronFleet: proving practical distributed systems correct. In Proceedings of the 25th Symposium on
Operating Systems Principles. 1–17.
[32] Joseph M. Hellerstein and Peter Alvaro. 2020. Keeping CALM: When Distributed Consistency is Easy. Commun. ACM
63, 9 (Aug. 2020), 72–81. https://doi.org/10.1145/3369736
[33] Maurice P Herlihy and Jeannette M Wing. 1990. Linearizability: A correctness condition for concurrent objects. ACM
Transactions on Programming Languages and Systems (TOPLAS) 12, 3 (1990), 463–492.
[34] Martin Hirzel, Robert Soulé, Buğra Gedik, and Scott Schneider. 2018. Stream Query Optimization. Springer International
Publishing, 1–9.
[35] Heidi Howard and Ittai Abraham. 2020. Raft does not Guarantee Liveness in the face of Network Faults. https://decentralizedthoughts.github.io/2020-12-12-raft-liveness-full-omission/.
[36] Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman. 2016. Flexible Paxos: Quorum intersection revisited. arXiv
preprint arXiv:1608.06696 (2016).
[37] Heidi Howard and Richard Mortier. 2020. Paxos vs Raft. In Proceedings of the 7th Workshop on Principles and Practice of
Consistency for Distributed Data. ACM. https://doi.org/10.1145/3380787.3393681
[38] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: distributed data-parallel
programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on
Computer Systems 2007. 59–72.
[39] Mohammad M Jalalzai, Costas Busch, and Golden G Richard. 2019. Proteus: A scalable BFT consensus protocol for
blockchains. In 2019 IEEE International Conference on Blockchain (Blockchain). IEEE, 308–313.
[40] Bas Ketsman and Christoph Koch. 2020. Datalog with Negation and Monotonicity. In 23rd International Conference
on Database Theory (ICDT 2020) (Leibniz International Proceedings in Informatics (LIPIcs), Vol. 155), Carsten Lutz
and Jean Christoph Jung (Eds.). Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 19:1–19:18.
https://doi.org/10.4230/LIPIcs.ICDT.2020.19
[41] Bas Ketsman, Paraschos Koutris, et al. 2022. Modern Datalog Engines. Foundations and Trends® in Databases 12, 1
(2022), 1–68.
[42] Igor Konnov, Jure Kukovec, and Thanh-Hai Tran. 2019. TLA+ model checking made symbolic. Proceedings of the ACM
on Programming Languages 3, OOPSLA (2019), 1–30.
[43] Leslie Lamport. 1998. The Part-Time Parliament. ACM Trans. Comput. Syst. 16, 2 (May 1998), 133–169. https://doi.org/10.1145/279227.279229
[44] Leslie Lamport. 2002. Specifying systems: the TLA+ language and tools for hardware and software engineers. (2002).
[45] Boon Thau Loo, Tyson Condie, Minos Garofalakis, David E Gay, Joseph M Hellerstein, Petros Maniatis, Raghu
Ramakrishnan, Timothy Roscoe, and Ion Stoica. 2009. Declarative networking. Commun. ACM 52, 11 (2009), 87–95.
[46] C Mohan, Bruce Lindsay, and Ron Obermarck. 1986. Transaction management in the R* distributed database manage-
ment system. ACM Transactions on Database Systems (TODS) 11, 4 (1986), 378–396.
[47] Iulian Moraru, David G. Andersen, and Michael Kaminsky. 2013. There is more consensus in Egalitarian parliaments.
In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM. https://doi.org/10.1145/2517349.2517350
[48] Inderpal Singh Mumick and Oded Shmueli. 1995. How expressive is stratified aggregation? Annals of Mathematics and
Artificial Intelligence 15 (1995), 407–435.
[49] Ray Neiheiser, Miguel Matos, and Luís Rodrigues. 2021. Kauri: Scalable BFT consensus with pipelined tree-based
dissemination and aggregation. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles.
35–48.
[50] Chris Newcombe, Tim Rath, Fan Zhang, Bogdan Munteanu, Marc Brooker, and Michael Deardeuff. 2015. How Amazon
web services uses formal methods. Commun. ACM 58, 4 (2015), 66–73.
[51] Diego Ongaro. 2014. Consensus: Bridging theory and practice. Ph.D. Dissertation. Stanford University.
[52] Kenneth J. Perry and Sam Toueg. 1986. Distributed agreement in the presence of processor and communication faults.
IEEE Transactions on Software Engineering SE-12, 3 (1986), 477–482. https://doi.org/10.1109/TSE.1986.6312888
[53] George Pirlea. 2023. Errors found in distributed protocols. https://github.com/dranov/protocol-bugs-list.
[54] Mingwei Samuel, Joseph M Hellerstein, and Alvin Cheung. 2021. Hydroflow: A Model and Runtime for Distributed
Systems Programming. (2021).
[55] Bruhathi Sundarmurthy, Paraschos Koutris, and Jeffrey Naughton. 2021. Locality-Aware Distribution Schemes. Schloss
Dagstuhl - Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPICS.ICDT.2021.22
[56] Florian Suri-Payer, Matthew Burke, Zheng Wang, Yunhao Zhang, Lorenzo Alvisi, and Natacha Crooks. 2021. Basil:
Breaking up BFT with ACID (transactions). In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems
Principles. ACM. https://doi.org/10.1145/3477132.3483552
[57] Pierre Sutra. 2020. On the correctness of Egalitarian Paxos. Inform. Process. Lett. 156 (2020), 105901. https://doi.org/10.1016/j.ipl.2019.105901
[58] Immanuel Trummer, Samuel Moseley, Deepak Maram, Saehan Jo, and Joseph Antonakakis. 2018. SkinnerDB: regret-
bounded query evaluation via reinforcement learning. Proceedings of the VLDB Endowment 11, 12 (2018), 2074–2077.
[59] Robbert Van Renesse and Deniz Altinbuken. 2015. Paxos Made Moderately Complex. ACM Comput. Surv. 47, 3, Article
42 (Feb. 2015), 36 pages. https://doi.org/10.1145/2673577
[60] Robbert van Renesse, Nicolas Schiper, and Fred B. Schneider. 2015. Vive La Différence: Paxos vs. Viewstamped
Replication vs. Zab. IEEE Transactions on Dependable and Secure Computing 12, 4 (July 2015), 472–484. https://doi.org/10.1109/tdsc.2014.2355848
[61] Zhaoguo Wang, Changgeng Zhao, Shuai Mu, Haibo Chen, and Jinyang Li. 2019. On the Parallels between Paxos and
Raft, and how to Port Optimizations. In Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing.
ACM. https://doi.org/10.1145/3293611.3331595
[62] Michael Whittaker. 2020. mwhittaker/craq_bug. https://github.com/mwhittaker/craq_bug.
[63] Michael Whittaker, Ailidani Ailijiang, Aleksey Charapko, Murat Demirbas, Neil Giridharan, Joseph M. Hellerstein,
Heidi Howard, Ion Stoica, and Adriana Szekeres. 2021. Scaling Replicated State Machines with Compartmentalization.
Proc. VLDB Endow. 14, 11 (July 2021), 2203–2215. https://doi.org/10.14778/3476249.3476273
[64] Michael Whittaker, Ailidani Ailijiang, Aleksey Charapko, Murat Demirbas, Neil Giridharan, Joseph M. Hellerstein,
Heidi Howard, Ion Stoica, and Adriana Szekeres. 2021. Scaling Replicated State Machines with Compartmentalization
[Technical Report]. arXiv:2012.15762 [cs.DC]
[65] Michael Whittaker, Neil Giridharan, Adriana Szekeres, Joseph Hellerstein, and Ion Stoica. 2021. SoK: A Generalized
Multi-Leader State Machine Replication Tutorial. Journal of Systems Research 1, 1 (2021).
[66] James R Wilcox, Doug Woos, Pavel Panchekha, Zachary Tatlock, Xi Wang, Michael D Ernst, and Thomas Anderson.
2015. Verdi: a framework for implementing and formally verifying distributed systems. In Proceedings of the 36th ACM
SIGPLAN Conference on Programming Language Design and Implementation. 357–368.
[67] Jianan Yao, Runzhou Tao, Ronghui Gu, Jason Nieh, Suman Jana, and Gabriel Ryan. 2021. DistAI: Data-Driven Automated
Invariant Learning for Distributed Protocols. In OSDI. 405–421.
[68] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing
with Working Sets. In 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 10). USENIX Association,
Boston, MA. https://www.usenix.org/conference/hotcloud-10/spark-cluster-computing-working-sets
[69] Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. 2013. Discretized streams:
Fault-tolerant streaming computation at scale. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems
Principles. 423–438.
[70] Jingren Zhou, Per-Ake Larson, and Ronnie Chaiken. 2010. Incorporating partitioning and parallel plans into the SCOPE
optimizer. In 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010). IEEE, 1060–1071.