
Scalable Distributed Stream Processing

Mitch Cherniack*, Hari Balakrishnan†, Magdalena Balazinska†,


Don Carney‡, Uğur Çetintemel‡, Ying Xing‡, and Stan Zdonik‡

* Brandeis University, ‡ Brown University, † M.I.T.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment. Proceedings of the 2003 CIDR Conference.

Abstract

Stream processing fits a large class of new applications for which conventional DBMSs fall short. Because many stream-oriented systems are inherently geographically distributed and because distribution offers scalable load management and higher availability, future stream processing systems will operate in a distributed fashion. They will run across the Internet on computers typically owned by multiple cooperating administrative domains. This paper describes the architectural challenges facing the design of large-scale distributed stream processing systems, and discusses novel approaches for addressing load management, high availability, and federated operation issues. We describe two stream processing systems, Aurora* and Medusa, which are being designed to explore complementary solutions to these challenges.

1 Introduction

There is a large class of emerging applications in which data, generated in some external environment, is pushed asynchronously to servers that process this information. Some example applications include sensor networks, location-tracking services, fabrication line management, and network management. These applications are characterized by the need to process high-volume data streams in a timely and responsive fashion. Hereafter, we refer to such applications as stream-based applications.

The architecture of current database management systems assumes a pull-based model of data access: when a user (the active party) wants data, she submits a query to the system (the passive party) and an answer is returned. In contrast, in stream-based applications data is pushed to a system that must evaluate queries in response to detected events. Query answers are then pushed to a waiting user or application. Therefore, the stream-based model inverts the traditional data management model by assuming users to be passive and the data management system to be active.

Many stream-based applications are naturally distributed. Applications are often embedded in an environment with numerous connected computing devices with heterogeneous capabilities. As data travels from its point of origin (e.g., sensors) downstream to applications, it passes through many computing devices, each of which is a potential target of computation. Furthermore, to cope with time-varying load spikes and changing demand, many servers would be brought to bear on the problem. In both cases, distributed computation is the norm.

This paper discusses the architectural issues facing the design of large-scale distributed stream processing systems. We begin in Section 2 with a brief description of our centralized stream processing system, Aurora [4]. We then discuss two complementary efforts to extend Aurora to a distributed environment: Aurora* and Medusa. Aurora* assumes an environment in which all nodes fall under a single administrative domain. Medusa provides the infrastructure to support federated operation of nodes across administrative boundaries. After describing the architectures of these two systems in Section 3, we consider three design challenges common to both: infrastructures and protocols supporting communication amongst nodes (Section 4), load sharing in response to variable network conditions (Section 5), and high availability in the presence of failures (Section 6). We also discuss high-level policy specifications employed by the two systems in Section 7. For all of these issues, we believe that the push-based nature of stream-based applications not only raises new challenges but also offers the possibility of new domain-specific solutions.

2 Aurora: A Centralized Stream Processor

2.1 System Model

In Aurora, data is assumed to come from a variety of sources such as computer programs that generate values at regular or irregular intervals or hardware sensors. We will use the term data source for either case. A data stream is a potentially unbounded collection of tuples generated by a data source. Unlike the tuples of the relational database model, stream tuples are generated in real-time and are typically not available in their entirety at any given point in time.

Aurora processes tuples from incoming streams according to a specification made by an application administrator. Aurora is fundamentally a data-flow system and uses the popular boxes and arrows paradigm found in most process flow and workflow systems. Here, tuples flow through a loop-free, directed graph of processing operators (i.e., boxes), as shown in Figure 1. Ultimately, output streams are presented to applications, which must be constructed to handle the asynchronously arriving tuples in an output stream.

Figure 1: Basic Aurora System Model. (Input data streams flow through a network of operator boxes; output streams are delivered to applications; connection points backed by historical storage support both continuous and ad hoc queries.)

Every Aurora application must be associated with a query that defines its processing requirements, and a Quality-of-Service (QoS) specification that specifies its performance requirements (see Section 7.1).
2.2 Query Model

Queries are built from a standard set of well-defined operators (boxes). Each operator accepts input streams (in arrows), transforms them in some way, and produces one or more output streams (out arrows). By default, queries are continuous [5] in that they can potentially run forever over push-based inputs. Ad hoc queries can also be defined and attached to connection points: predetermined arcs in the flow graph where historical data is stored.

Aurora queries are constructed using a box-and-arrow based graphical user interface. It would also be possible to allow users to specify declarative queries in a language such as SQL (modified to specify continuous queries), and then compile these queries into our box and arrow representation.

Here, we informally describe a subset of the Aurora operators that are relevant to this paper; a complete description of the operators can be found in [2, 4]. This subset consists of a simple unary operator (Filter), a binary merge operator (Union), a time-bounded windowed sort (WSort), and an aggregation operator (Tumble). Aurora also includes a mapping operator (Map), two additional aggregate operators (XSection and Slide), a join operator (Join), and an extrapolation operator (Resample), none of which are discussed in detail here.

Given some predicate, p, Filter(p) produces an output stream consisting of all tuples in its input stream that satisfy p. Optionally, Filter can also produce a second output stream consisting of those tuples which did not satisfy p. Union produces an output stream consisting of all tuples on its n input streams. Given a set of sort attributes, A1, A2, …, An and a timeout, WSort buffers all incoming tuples and emits tuples in its buffer in ascending order of its sort attributes, with at least one tuple emitted per timeout period.*

Tumble takes an input aggregate function and a set of input groupby attributes.† The aggregate function is applied to disjoint “windows” (i.e., tuple subsequences) over the input stream. The groupby attributes are used to map tuples to the windows they belong to. For example, consider the stream of tuples shown in Figure 2. Suppose that a Tumble box is defined with an aggregate function that computes the average value of B, and has A as its groupby attribute. This box would emit two tuples and have another tuple computation in progress as a result of processing the seven tuples shown. The first emitted tuple, (A = 1, Result = 2.5), which averages the two tuples with A = 1, would be emitted upon the arrival of tuple #3: the first tuple to arrive with a value of A not equal to 1. Similarly, a second tuple, (A = 2, Result = 3.0), would be emitted upon the arrival of tuple #6. A third tuple with A = 4 would not get emitted until a later tuple arrives with A not equal to 4.

Figure 2: A Sample Tuple Stream
1. (A = 1, B = 2)
2. (A = 1, B = 3)
3. (A = 2, B = 2)
4. (A = 2, B = 1)
5. (A = 2, B = 6)
6. (A = 4, B = 5)
7. (A = 4, B = 2)

* Note that WSort is potentially lossy because it must discard any tuples that arrive after some tuple that follows it in sort order has already been emitted.
† Aurora’s aggregate operators have two additional parameters that specify when tuples get emitted and when an aggregate times out. For the purposes of this discussion, we assume that these parameters have been set to output a tuple whenever a window is full (i.e., never as a result of a timeout).
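To make the Tumble walk-through concrete, the following Python sketch models the behavior just described: an aggregate applied per groupby value, with a group's result emitted when the first tuple carrying a different groupby value arrives (timeouts are ignored, as assumed above). This is only an illustration of the described semantics, not Aurora's implementation; the function and attribute names are ours.

```python
from typing import Callable, Iterable, Iterator, Tuple

def tumble(tuples: Iterable[dict], agg: Callable[[list], float],
           groupby: str, value: str) -> Iterator[Tuple[object, float]]:
    """Apply `agg` to consecutive runs of tuples sharing the same `groupby`
    attribute; emit a result when the run ends (no timeouts modeled)."""
    current_key, window = None, []
    for t in tuples:
        key = t[groupby]
        if current_key is not None and key != current_key:
            # The first tuple with a different groupby value closes the window.
            yield current_key, agg([x[value] for x in window])
            window = []
        current_key = key
        window.append(t)
    # The last window stays open: its computation is still "in progress".

stream = [{"A": 1, "B": 2}, {"A": 1, "B": 3}, {"A": 2, "B": 2},
          {"A": 2, "B": 1}, {"A": 2, "B": 6}, {"A": 4, "B": 5},
          {"A": 4, "B": 2}]

avg = lambda xs: sum(xs) / len(xs)
print(list(tumble(stream, avg, groupby="A", value="B")))
# [(1, 2.5), (2, 3.0)] -- the two tuples emitted in the text
```

Running the sketch on the Figure 2 stream reproduces the two emitted tuples, (A = 1, Result = 2.5) and (A = 2, Result = 3.0), with the A = 4 window still open.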
2.3 Run-time Operation

The single-node Aurora run-time architecture is shown in Figure 3. The heart of the system is the scheduler that determines which box to run. It also determines how many of the tuples that might be waiting in front of a given box to process and how far to push them toward the output. We call this latter determination train scheduling [4]. Aurora also has a Storage Manager that is used to buffer queues when main memory runs out. This is particularly important for queues at connection points since they can grow quite long.

Aurora must constantly monitor the QoS of output tuples (QoS Monitor in Figure 3). This information is important since it drives the Scheduler in its decision-making, and it also informs the Load Shedder when and where it is appropriate to discard tuples in order to shed load. Load shedding is but one technique employed by Aurora to improve the QoS delivered to applications. When load shedding is not working, Aurora will try to re-optimize the network using standard query optimization techniques (such as those that rely on operator commutativities). This tactic requires a more global view of the network and thus is used more sparingly. It does have the advantage that in transforming the original network, it might uncover new opportunities for load shedding. The final tactic is to retune the scheduler by gathering new statistics or switching scheduler disciplines.
Figure 3: Aurora Run-time Architecture. (Input queues Q1 ... Qn feed a Router; the Storage Manager, with its buffer manager and persistent store, buffers queues; the Scheduler drives the Box Processors using the Catalogs, while the Load Shedder and QoS Monitor observe the outputs.)

3 Distributed System Architecture

Building a large-scale distributed version of a stream processing system such as Aurora raises several important architectural issues. In general, we envision a distributed federation of participating nodes in different administrative domains. Together, these nodes provide a stream processing service for multiple concurrent stream-based applications. Collaboration between distinct administrative domains is fundamentally important for several reasons, including:

1. A federation in which each participating organization contributes a modest amount of computing, communication, and storage resources allows for a high degree of resource multiplexing and sharing, enabling large time-varying load spikes to be handled. It also helps improve fault-tolerance and resilience against denial-of-service attacks.
2. Many streaming services, such as weather forecasting, traffic management, and market analysis, inherently process data from different autonomous domains and compose them; distribution across administrative boundaries is a fundamental constraint in these situations.

We envision that programs will continue to be written in much the same way that they are with single-node Aurora, except that they will now run in a distributed fashion. The partitioning of the query plan on to the participating nodes in response to changing demand, system load, and failures, is a challenging problem, and intractable as an optimization problem in a large network. Additionally, inter-domain collaborations are not a straightforward extension of intra-domain distribution. For instance, some applications may not want their data or computation running within arbitrary domains, and some organizations may not have the incentive to process streams unless they derive tangible benefits from such processing.

Our architecture splits the general problem into intra-participant distribution (a relatively small-scale distribution all within one administrative domain, handled by Aurora*) and inter-participant distribution (a large-scale distribution across administrative boundaries, handled by Medusa). This method of splitting allows the general problem to become tractable, enabling the implementation of different policies and algorithms for load sharing. This decomposition allows three pieces to be shared between Aurora* and Medusa: (i) Aurora, (ii) an overlay network for communication, and (iii) algorithms for high-availability that take advantage of the streaming nature of our problem domain.

3.1 Aurora*: Intra-participant Distribution

Aurora* consists of multiple single-node Aurora servers that belong to the same administrative domain and cooperate to run the Aurora query network on the input streams. In general there are no operational restrictions regarding the nodes where sub-queries can run; boxes can be placed on and executed at arbitrary nodes as deemed appropriate.

When an Aurora query network is first deployed, the Aurora* system will create a crude partitioning of boxes across a network of available nodes, perhaps as simple as running everything on one node. Each Aurora node supporting the running system will continuously monitor its local operation, its workload, and available resources (e.g., CPU, memory, bandwidth, etc.). If a machine finds itself short of resources, it will consider offloading boxes to another appropriate Aurora node. All dynamic reconfiguration will take place in such a decentralized fashion, involving only local, pair-wise interactions between Aurora nodes. We discuss the pertinent load distribution mechanisms and policies in more detail in Section 5.
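As a rough illustration of this decentralized, pair-wise reconfiguration, the sketch below has a node compare its measured CPU load against a threshold and offer its cheapest boxes to a single neighbor. The threshold, the Neighbor interface, and the cheapest-first selection rule are assumptions made for the example, not Aurora*'s actual policy; Section 5 discusses the real mechanisms and policies.

```python
from dataclasses import dataclass, field

@dataclass
class Box:
    name: str
    cpu_cost: float      # estimated CPU fraction consumed by this box

@dataclass
class Neighbor:
    name: str
    spare_cpu: float     # advertised spare capacity (hypothetical interface)

    def accept(self, box: Box) -> bool:
        # A real node would also check memory and link bandwidth.
        if box.cpu_cost <= self.spare_cpu:
            self.spare_cpu -= box.cpu_cost
            return True
        return False

@dataclass
class AuroraNode:
    boxes: list = field(default_factory=list)
    cpu_load: float = 0.0
    high_water: float = 0.8   # start offloading above 80% CPU (assumed)

    def rebalance(self, neighbor: Neighbor) -> None:
        # Offload the cheapest boxes first until load drops below the mark.
        for box in sorted(self.boxes, key=lambda b: b.cpu_cost):
            if self.cpu_load <= self.high_water:
                break
            if neighbor.accept(box):
                self.boxes.remove(box)
                self.cpu_load -= box.cpu_cost

node = AuroraNode(boxes=[Box("filter", 0.2), Box("tumble", 0.4)], cpu_load=0.95)
node.rebalance(Neighbor("n2", spare_cpu=0.5))
print(round(node.cpu_load, 2), [b.name for b in node.boxes])   # 0.75 ['tumble']
```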
3.2 Medusa: Inter-participant Federated Operation

Medusa is a distributed infrastructure that provides service delivery among autonomous participants. A Medusa participant is a collection of computing devices administered by a single entity. Hence, participants range in scale from collections of stream processing nodes capable of running Aurora and providing part of the global service, to PCs or PDAs that allow user access to the system (e.g., to specify queries), to networks of sensors and their proxies that provide input streams.
Medusa is an agoric system [16], using economic principles to regulate participant collaborations and solve the hard problems concerning load management and sharing. Participants provide services to each other by establishing contracts that determine the appropriate compensation for each service. Medusa uses a market mechanism with an underlying currency (“dollars”) that backs these contracts. Each contract exists between two participants and covers a message stream that flows between them. One of the contracting participants is the sending participant; the other is the receiving participant. Medusa models each message stream as having positive value, with a well-defined value per message; the model therefore is that the receiving participant always pays the sender for a stream. In turn, the receiver performs query-processing services on the message stream that presumably increase its value, at some cost. The receiver can then sell the resulting stream for a higher price than it paid and make money.

Some Medusa participants are purely stream sources (e.g., sensor networks and their proxies), and are paid for their data, while other participants (e.g., end-users) are strictly stream sinks, and must pay for these streams. However, most Medusa participants are “interior” nodes (acting both as sources and sinks). They are assumed to operate as profit-making entities; i.e., their contracts have to make money or they will cease operation. Our hope is that such contracts (mostly bilateral) will allow the system to anneal to a state where the economy is stable, and help derive a practical solution to the computationally intractable general partitioning problem of placing query operators on to nodes. The details of the contracting process are discussed in Section 7.2.

4 Scalable Communications Infrastructure

Both Aurora* and Medusa require a scalable communication infrastructure. This infrastructure must (1) include a naming scheme for participants and query operators and a method for discovering where any portion of a query plan is currently running and what operators are currently in place, (2) route messages between participants and nodes, (3) multiplex messages on to transport-layer streams between participants and nodes, and (4) enable stream processing to be distributed and moved across nodes. The communications infrastructure is an overlay network, layered on top of the underlying Internet substrate.

4.1 Naming and Discovery

There is a single global namespace for participants, and each participant has a unique global name. When a participant defines a new operator, schema, or stream, it does so within its own namespace. Hence, each entity’s name begins with the name of the participant who defined it, and each object can be uniquely named by the tuple: (participant, entity-name). Additionally, a stream that crosses participant boundaries is named separately within each participant.

To find the definition of an entity given its name, or the location where a data stream is available or a piece of a query is executing, we define two types of catalogs in our distributed infrastructure: intra-participant and inter-participant catalogs. Within a participant, the catalog contains definitions of operators, schemas, streams, queries, and contracts. For streams, the catalog also holds (possibly stale) information on the physical locations where events are being made available. Indeed, streams may be partitioned across several nodes for load balancing. For queries, the catalog holds information on the content and location of each running piece of the query. The catalog may be centralized or distributed. All nodes owned by a participant have access to the complete intra-participant catalog.

For participants to collaborate and offer services that cross their boundaries, some information must be made globally available. This information is stored in an inter-participant catalog and includes the list, description, and current location of pieces of queries running at each participant.

Each participant that provides query capabilities holds a part of the shared catalog. We propose to implement such a distributed catalog using a distributed hash table (DHT) with entity names as unique keys. Several algorithms exist for this purpose (e.g., DHTs based on consistent hashing [6, 14] and LH* [19]). These algorithms differ in the way they distribute load among participants, handle failures, and perform lookups. However, they all efficiently locate nodes for any key-value binding, and scale with the number of nodes and the number of objects in the table.
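One way to picture the (participant, entity-name) naming scheme together with the DHT-based shared catalog is a consistent-hashing ring over the participating catalog nodes, in the spirit of [6, 14]. The sketch below is illustrative only; the node names, the use of SHA-1, and the single-replica placement are assumptions, not Medusa's catalog design.

```python
import bisect
import hashlib

def h(key: str) -> int:
    """Map a string to a point on the hash ring."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class CatalogRing:
    """Consistent-hashing ring: each (participant, entity-name) key is
    stored on the first node clockwise from the key's hash point."""
    def __init__(self, nodes):
        self.ring = sorted((h(n), n) for n in nodes)

    def node_for(self, participant: str, entity_name: str) -> str:
        key = h(f"{participant}/{entity_name}")
        points = [p for p, _ in self.ring]
        i = bisect.bisect_right(points, key) % len(self.ring)
        return self.ring[i][1]

ring = CatalogRing(["node-a.example", "node-b.example", "node-c.example"])
# A stream defined by a participant "mit" is globally named (mit, traffic-feed);
# the ring tells us which catalog node holds its (possibly stale) location entry.
print(ring.node_for("mit", "traffic-feed"))
```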
4.2 Routing

Before producing events, a data source, or an administrator acting on its behalf, registers a new schema definition and a new stream name with the system, which in turn assigns a default location for events of the new type. Load sharing between nodes may later move or partition the data. However, the location information is always propagated to the intra-participant catalog.

When a data source produces events, it labels them with a stream name and sends them to one of the nodes in the overlay network. Upon receiving these events, the node consults the intra-participant catalog and forwards events to the appropriate locations.

Each Aurora network deployed in the system explicitly binds its inputs and outputs to a list of streams, by enumerating their names. When an input is bound, the intra-participant catalog is consulted to determine where the streams of interest are currently located. Events from these streams are then continually routed to the location where the query executes.

Query plans only bind themselves to streams defined within a participant. Explicit connections are opened for streams to cross participant boundaries. These streams are then defined separately within each domain.
4.3 Message Transport

When a node transfers streams of messages to another node in the overlay, those streams will in general belong to different applications and have different characteristics. In many situations, especially in the wide-area, we expect the network to be the stream bottleneck. The transport mechanism between nodes must therefore be carefully designed.

One approach would be to set up individual TCP connections, one per message stream, between the node pair. This approach, although simple to implement, has several problems. First, as the number of message streams grows, the overhead of running several TCP connections becomes prohibitive on the nodes. Second, independent TCP connections do not share bandwidth well and in fact adversely interact with each other in the network [11]. Third, both within one participant as well as between participants, we would like the bandwidth between the nodes to be shared amongst the different streams according to a prescribed set of weights that depend on either QoS specifications or contractual obligations.

Our transport approach is to multiplex all the message streams on to a single TCP connection and have a message scheduler that determines which message stream gets to use the connection at any time. This scheduler implements a weighted connection sharing policy based on QoS or contract specification, and keeps track of the rates allocated to the different messages in time.

There are some message streaming applications where the in-order reliable transport abstraction of TCP is not needed, and some message loss is tolerable. We plan to investigate if a UDP-based multiplexing protocol is also required in addition to TCP. Doing this would require a congestion control protocol to be implemented [12].
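To illustrate the multiplexing idea, the sketch below shows one way such a message scheduler could allocate sends in proportion to per-stream weights derived from QoS or contract terms. The credit-based accounting and the example weights are assumptions made for the illustration; the paper does not fix a particular scheduling algorithm.

```python
from collections import deque

class WeightedMultiplexer:
    """Share one connection among several message streams in proportion
    to per-stream weights (e.g., from QoS specifications or contracts)."""
    def __init__(self, weights: dict):
        self.weights = weights                    # stream name -> weight
        self.queues = {s: deque() for s in weights}
        self.credit = {s: 0.0 for s in weights}   # accumulated send credit

    def enqueue(self, stream: str, message: bytes) -> None:
        self.queues[stream].append(message)

    def next_message(self):
        """Return (stream, message) for the backlogged stream with the most
        accumulated credit, or None if every queue is empty."""
        backlogged = [s for s, q in self.queues.items() if q]
        if not backlogged:
            return None
        for s in backlogged:                      # grant credit by weight
            self.credit[s] += self.weights[s]
        s = max(backlogged, key=lambda x: self.credit[x])
        self.credit[s] = 0.0
        return s, self.queues[s].popleft()

mux = WeightedMultiplexer({"contract-42": 3.0, "best-effort": 1.0})
for i in range(4):
    mux.enqueue("contract-42", b"c%d" % i)
    mux.enqueue("best-effort", b"b%d" % i)
while (item := mux.next_message()) is not None:
    print(*item)   # while both are backlogged, roughly 3 contract sends per best-effort send
```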
4.4 Remote Definition

To share load dynamically between nodes within a participant, or across participants, parts of Aurora networks must be able to change the location where they execute at run-time. However, process migration raises many intractable compatibility and security issues, especially if the movement crosses participant boundaries. Therefore, we propose a different approach, which we call remote definition. With this approach, a participant instantiates and composes operators from a pre-defined set offered by another participant to mimic box sliding. For example, instead of moving a WSort box, a participant remotely defines the WSort box at another participant and binds it to the appropriate streams within the new domain. Load sharing and box sliding are discussed in more detail in the following sections.

In addition to facilitating box sliding, remote definition also helps content customization. For example, a participant might offer streams of events indicating stock quotes. A receiving participant interested only in knowing when a specific stock passes above a certain threshold would normally have to receive the complete stream and would have to apply the filter itself. With remote definition, it can instead remotely define the filter, and receive directly the customized content.

5 Load Management

To adequately address the performance needs of stream-based applications under time-varying, unpredictable input rates, a multi-node data stream processing system must be able to dynamically adjust the allocation of processing among the participant nodes. This decision will primarily consider the loads and available resources (e.g., processor cycles, bandwidth, memory).

Both Aurora* and Medusa address such load management issues by means of a set of algorithms that provide efficient load sharing among nodes. Because Aurora* assumes that the participants are all under a common administrative control, lightly-loaded nodes will freely share load with their over-burdened peers. Medusa will make use of the Aurora* mechanisms where appropriate, but it must also worry about issues of how to cross administrative boundaries in an economically viable way without violating contractual constraints.

In the rest of this section, we first discuss the basic mechanisms used for partitioning and distributing Aurora operator networks across multiple nodes. We then discuss several key questions that need to be addressed by any repartitioning policy.

5.1 Mechanisms: Repartitioning Aurora Networks

On every node that runs a piece of the Aurora network, a query optimizer/load share daemon will run periodically in the background. The main task of this daemon will be to adjust the load of its host node in order to optimize the overall performance of the system. It will achieve this by either off-loading computation or accepting additional computation. Load redistribution is thus a process of moving pieces of the Aurora network from one machine to another.

Load sharing must occur while the network is operating. Therefore, it must first stabilize the network at the point of the transformation. Network transformations are only considered between connection points. Consider a sub-network S that is bounded on the input side by an arc, Cin, and on the output side by an arc, Cout. The connection point at Cin is first choked off by simply collecting any subsequent input arriving at the connection point at Cin. Any tuples that are queued within S are allowed to drain off. When S is empty, the network is manipulated, parts of it are moved to other machines, and the flow of messages at Cin is turned back on.
It should be noted that the reconfiguration of the Aurora network will not always be a local decision. For example, an upstream node might be required to signal a downstream node that it does not have sufficient bandwidth to handle its output (this would happen if an upstream node notices a backup on its output link). In this case, the upstream node might want to signal the neighboring downstream node to move one or more boxes upstream to reduce the communication across that link.

We now discuss two basic load sharing mechanisms, box sliding and box splitting, which are used to repartition the Aurora network in a pair-wise fashion.

Box Sliding. This technique takes a box on the edge of a sub-network on one machine and shifts it to its neighbor. Beyond the obvious repositioning of processing, shifting a box upstream is often useful if the box has a low selectivity (reduces the amount of data) and the bandwidth of the connection is limited. Shifting a box downstream can be useful if the selectivity of the box is greater than one (produces more data than the input, e.g., a join) and the bandwidth of the connection is again limited. We call this kind of remapping horizontal load sharing or box sliding. Figure 4 illustrates upstream box sliding.

Figure 4: Box Sliding. (Before the slide, boxes B1, B2, and B3 are distributed across machine 1 and machine 2; after the slide, the machine boundary has shifted so that a box formerly on machine 2 runs on machine 1.)

It should be noted that the machine to which a box is sent must have the capability to execute the given operation. In a sensor network, some of the nodes can be very weak. Often the sensor itself is capable of computation, but this capability is limited. Thus, it might be possible to slide a simple Filter box to a sensor node, whereas the sensor might not support a Tumble box.

It should also be noted that box sliding could also move boxes vertically. That is, a box that is assigned to machine A can be moved to machine B as long as the input and output arcs are rerouted accordingly.

Box Splitting. A heavier form of load sharing involves splitting Aurora boxes. A split creates a copy of a box that is intended to run on a second machine. This mechanism can be used to offload from an overloaded machine; one or more boxes on this machine get split, and some of the load then gets diverted to the box copies resulting from the split (and situated on other machines). Every box-split must be preceded by a Filter box with a predicate that partitions input tuples (routing them to one box or the other). For splits to be transparent (i.e., to ensure that a split box returns the same result as an unsplit box), one or more boxes must be added to the network that merge the box outputs back into a single stream.

Figure 5: Splitting a Filter Box. (Filter(p) splits into a Filter(q) that routes tuples to two copies of Filter(p), whose outputs are recombined by a Union.)

The boxes required to merge results depend on the box that is split. Figure 5 and Figure 6 show two examples. The first split is of Filter and simply requires a Union box to accomplish the merge. The second split is of Tumble, which requires a more sophisticated merge, consisting of Union followed by WSort and then another Tumble. It also requires that the aggregate function argument to Tumble, agg, have a corresponding combination function, combine, such that for any set of tuples, {x1, x2, …, xn}, and k ≤ n:

agg({x1, x2, …, xn}) = combine(agg({x1, x2, …, xk}), agg({xk+1, xk+2, …, xn}))

For example, if agg is cnt (count), combine is sum, and if agg is max, then combine is max also. In Figure 6, agg is cnt and combine is sum.

Figure 6: Splitting a Tumble Box. (Tumble(cnt, Groupby A) splits into a Filter(p) that routes tuples to two copies of Tumble(cnt, Groupby A); the merge consists of Union, followed by WSort(A), followed by Tumble(sum, Groupby A).)

To illustrate the split shown in Figure 6, consider a Tumble applied to the stream that was shown in Figure 2 with the aggregate function cnt and groupby attribute A. Observe that without splitting, Tumble would emit the following tuples while processing the seven tuples shown in Figure 2:

(A = 1, result = 2)
(A = 2, result = 3)
Suppose that a split of the Tumble box takes place after tuple #3 arrives, and that the Filter box used for routing tuples after the split uses the predicate B < 3 to decide where to send any tuple arriving in the future (i.e., if B < 3 then send the tuple to the machine containing the original Tumble box (machine #1), and otherwise send the tuple to machine #2). In this case, machine #1 will see tuples 1, 2, 3, 4 and 7; and machine #2 will see tuples 5 and 6. After machine #1 processes tuple #7, its Tumble box will have emitted tuples:

(A = 1, result = 2)
(A = 2, result = 2)

and after machine #2 processes tuple #6, its Tumble box will have emitted the tuple:

(A = 2, result = 1)

Assuming a large enough timeout argument, WSort rearranges the union of these tuples, emitting them in order of their values for A. The Tumble box that follows then adds the values of result for tuples with like values of A. This results in the emission of tuples:

(A = 1, result = 2)
(A = 2, result = 3)

which is identical to that of the unsplit Tumble box.

Once split has replicated a part of the network, the parallel branches can be mapped to different machines. In fact, an overloaded machine may perform a split and then ask a neighbor if it can accept some additional load. If the neighbor is willing, the network might get remapped as in Figure 7.

Figure 7: Remapping after a Split. (A Filter routes tuples satisfying p to the copy of box B on one machine and tuples satisfying ¬p to the copy of B on the other machine; a Merge recombines their outputs.)
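The worked example can be checked mechanically. The sketch below, which is illustrative rather than Aurora code, runs the Figure 2 stream through an unsplit Tumble(cnt, groupby A) and through the split configuration (route on B < 3 after tuple #3, count on each machine, then sum the partial counts per value of A) and confirms that both produce (A = 1, result = 2) and (A = 2, result = 3).

```python
from collections import defaultdict

stream = [(1, 2), (1, 3), (2, 2), (2, 1), (2, 6), (4, 5), (4, 2)]  # (A, B), Figure 2

def tumble_cnt(tuples):
    """Count tuples per value of A, emitting a group when a tuple with a
    different A closes it (the trailing group stays open, as in the text)."""
    out, key, count = [], None, 0
    for a, _ in tuples:
        if key is not None and a != key:
            out.append((key, count))
            count = 0
        key, count = a, count + 1
    return out                       # the open group for the last key is withheld

# Unsplit: one Tumble(cnt, groupby A) over the whole stream.
unsplit = tumble_cnt(stream)

# Split after tuple #3: the routing Filter sends B < 3 to machine 1 and the
# rest to machine 2; tuples 1-3 were already processed on machine 1.
machine1 = stream[:3] + [t for t in stream[3:] if t[1] < 3]
machine2 = [t for t in stream[3:] if t[1] >= 3]

# Merge: Union of the partial counts, ordered by A (WSort), then
# Tumble(sum, groupby A); summing per value of A suffices for this finite check.
partial = tumble_cnt(machine1) + tumble_cnt(machine2)
merged = defaultdict(int)
for a, c in sorted(partial):
    merged[a] += c

print(unsplit)                   # [(1, 2), (2, 3)]
print(sorted(merged.items()))    # [(1, 2), (2, 3)] -- identical to the unsplit result
```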
5.2 Key Repartitioning Challenges

We now provide an outline of several fundamental policy decisions regarding when and how to use the load sharing mechanisms described in the previous subsection. Particular solutions will be guided and constrained by the high-level policy specifications and guidelines, QoS and economic contracts, used by Aurora* and Medusa, respectively (see Section 7).

Initiation of Load Sharing. Because network topologies and loads will be changing frequently, load sharing will need to be performed fairly frequently as well. However, shifting boxes around too frequently could lead to instability as the system tries to adjust to load fluctuations. Determining the proper granularity for this operation is an important consideration for a successful system.

Choosing What to Offload. Both box sliding and box splitting require moving boxes and their input and output arcs across machine boundaries. Even though a neighboring machine may have available compute cycles and memory, it may not be able to handle the additional bandwidth of the new arcs. Thus, the decision of which Aurora network pieces to move must consider bandwidth availability as well.

Choosing Filter Predicates for Box Splitting. Every box split results in a new sub-network rooted by a Filter box. The Filter box acts as a semantic router for the tuples arriving at the box that has been split. The filter predicate, p, defines the redistributed load. The choice of p is crucial to the effectiveness of this strategy. Predicate p could depend on the stream content. For example, we might want to separate streams based on where they were generated, as in all streams generated in Cambridge. On the other hand, the partitioning criterion could depend on some metadata or statistics about the streams, as in the top 10 streams by arrival rate. Alternatively, p could be based on a simple statistic, as in half of the available streams. Moreover, the choice of p could vary with time. In other words, as the network characteristics change, a simple adjustment to p could be enough to rebalance the load.

Choosing What to Split. Choosing the right sub-network to split is also an important optimization problem. The trick is to pick a set of boxes that will move “just enough” processing. In a large Aurora network, this could be quite difficult. Moreover, it is important to move load in a way that will not require us to move it again in the near future. Thus, finding candidate sub-networks that have durable effect is important.

Handling Connection Points. Naively, splitting a connection point could involve copying a lot of data. Depending on the expected usage, this might be a good investment. In particular, if it is expected that many users will attach ad hoc queries to this connection point, then splitting it and moving a replica to a different machine may be a sensible load sharing strategy. On the other hand, it might make sense to leave the connection point intact and to split the boxes on either side of it. This would mean that the load introduced by the processing would be moved, while the data access to the second box would be remote.
6 High Availability

A key goal in the design of any data stream processing system is to achieve robust operation in volatile and dynamic environments, where availability may suffer due to (1) server and communication failures, (2) sustained congestion levels, and (3) software failures. In order to improve overall system availability, Aurora* and Medusa rely on a common stream-oriented data back-up and recovery approach, which we describe below.

6.1 Overview and Key Features

Our high-availability approach has two unique advantages, both due to the streaming data-flow nature of our target systems. First, it is possible to reliably back up data and provide safety without incurring the overhead to explicitly copy them to special back-up servers (as in the case of traditional process-pair models [10]). In our model, each server can effectively act as a back-up for its downstream servers. Tuples get processed and flow naturally in the network (precisely as in the case of regular operation). Unlike in regular operation, however, processed tuples are discarded lazily, only when it is determined that their effects are safely recorded elsewhere, and, thus, can be effectively recovered in case of a failure.

Second, the proposed approach enables a tradeoff between the recovery time and the volume of checkpoint messages required to provide safety. This flexibility allows us to emulate a wide spectrum of recovery models, ranging from a high-volume checkpoints/fast-recovery approach (e.g., Tandem [1]) to a low-volume checkpoints/slow-recovery approach (e.g., log-based recovery in traditional databases).

6.2 Regular Operation

We say that a distributed stream processing system is k-safe if the failure of any k servers does not result in any message losses. The value of k should be set based on the availability requirements of applications, and the reliability and load characteristics of the target environments. We provide k-safety by maintaining the copies of the tuples that are in transit at each server s, at k other servers that are upstream from s. An upstream back-up server simply holds on to a tuple it has processed until its primary server tells it to discard the tuple. Figure 8 illustrates the basic mechanism for k = 1. Server s1 acts as a back-up of server s2. A tuple t sent from s1 to s2 is simply kept at s1’s output queue until it is guaranteed that all tuples that depended on t (i.e., the tuples whose values got determined directly or indirectly based on t) made it to s3.

Figure 8: Primary and back-up servers. (Back-up server s1 keeps a back-up of the tuples in transit to its primary s2 until their effects have reached s3.)

In order to correctly truncate output queues, we need to keep track of the order in which tuples are transmitted between the servers. When an upstream server sends a message (containing tuples) to a successor, it also includes a monotonically increasing sequence number. It is sufficient to include only the base sequence number, as the corresponding numbers for all tuples can be automatically generated at the receiving server by simply incrementing the base. We now describe two remote queue truncation techniques that use tuple sequence numbers.

Our first technique involves the use of special flow messages. Periodically, each data source creates and sends flow messages into the system. A box processes a flow message by first recording the sequence number of the earliest tuple that it currently depends on‡, and then passing it onward. Note that there might be multiple earliest sequence numbers, one for each upstream server at the extreme case. When the flow message reaches a server boundary, these sequence values are recorded and the message continues in the next server. Hence, each server records the identifiers of the earliest upstream tuples that it depends on. These values serve as checkpoints; they are communicated through a back channel to the upstream servers, which can appropriately truncate the tuples they hold. Clearly, the flow message can also be piggybacked on other control or data messages (such as heartbeat messages, DHT lookup messages, or regular tuple messages).

‡ If the box has state (e.g., consider an aggregate box), then the recorded tuple is the one that presently contributes to the state of the box and that has the lowest sequence number (for each upstream server). If the box is stateless (e.g., a filter box), then the recorded tuple is the one that has been processed most recently.
The above scheme will operate correctly only for straight-line networks. When there are branches and recombinations, special care must be taken. Also, when messages from one server go to multiple subsequent servers, additional extensions are required. Whenever a message is split and sent to two destinations, then the flow message is similarly split. If a box gets input from two arcs, it must save the first flow message until it receives one on the other arc. If the two flow messages come from different servers, then both are sent onward. If they come from the same server, then the minimum is computed as before and a single message sent onward. In this way, the correct minimum is received at the output.

An output can receive flow messages from multiple upstream servers. It must merely respond to the correct one with a back channel message. Similarly, when an upstream server has multiple successor servers, it must wait for a back channel message from each one, and then only truncate the queue to the maximum of the minimum values.

An alternate technique to special flow messages is to install an array of sequence numbers on each server, one for each upstream server. On each box’s activation, the box records in this array the earliest tuples on which it depends. The upstream servers can then query this array periodically and truncate their queues accordingly. This approach has the advantage that the upstream server can truncate at its convenience, and not just when it receives a back channel message. However, the array approach makes the implementation of individual boxes somewhat more complex.
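For a straight-line network, the upstream back-up and queue-truncation scheme reduces to a sequence-numbered output queue plus a back-channel checkpoint naming the earliest sequence number still needed downstream. The sketch below illustrates only this single-successor case; branching streams require the minimum/maximum rules just described, and the class and method names are ours, not the system's.

```python
from collections import deque

class UpstreamServer:
    """Acts as the back-up for its single downstream successor by keeping
    sent tuples until they are known to be safely reflected downstream."""
    def __init__(self):
        self.next_seq = 0
        self.output_queue = deque()          # (seq, tuple) pairs still held

    def send(self, tup):
        msg = (self.next_seq, tup)
        self.output_queue.append(msg)
        self.next_seq += 1
        return msg                           # would go over the network

    def on_back_channel(self, earliest_needed: int):
        """Checkpoint from downstream: every tuple with a sequence number
        below `earliest_needed` may be discarded."""
        while self.output_queue and self.output_queue[0][0] < earliest_needed:
            self.output_queue.popleft()

    def replay_from(self, seq: int):
        """On downstream failure, the back-up re-emits the tuples it still holds."""
        return [t for s, t in self.output_queue if s >= seq]

up = UpstreamServer()
for v in range(5):
    up.send({"value": v})                    # seq 0..4 are now held upstream
up.on_back_channel(earliest_needed=3)        # downstream still depends on seq >= 3
print(len(up.output_queue))                  # 2 tuples (seq 3 and 4) remain
print(up.replay_from(3))                     # what recovery would reprocess
```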
6.3 Failure Detection and Recovery

Each server sends periodic heartbeat messages to its upstream neighbors. If a server does not hear from its downstream neighbor for some predetermined time period, it considers that its neighbor failed, and it initiates a recovery procedure. In the recovery phase, the back-up server itself immediately starts processing the tuples in its output log, emulating the processing of the failed server for the tuples that were still being processed at the failed server. Subsequently, load-sharing techniques can be used to offload work from the back-up server to other available servers. Alternatively, the back-up server can move its output log to another server, which then takes over the processing of the failed server. This approach might be worthwhile if the back-up server is already heavily loaded and/or migration of the output log is expected to be inexpensive.

6.4 Recovery Time vs. Back-up Granularity

The above scheme does not interfere with the natural flow of tuples in the network, providing high availability with only a minimum of extra messages. In contrast, a process-pair approach requires checkpointing a computation to its backup on a regular basis. To achieve high availability with a process-pair model would require a checkpoint message every time a box processed a message. This is overwhelmingly more expensive than the approach we presented. However, the cost of our scheme is the possibly considerable amount of computation required during recovery. In contrast, a process-pair scheme will redo only those box calculations that were in process at the time of the failure. Hence, the proposed approach saves many run-time messages, at the expense of having to perform additional work at failover time.

The basic approach can be extended to support faster recovery, but at higher run-time cost. Consider establishing a collection of K virtual machines on top of the Aurora network running on a single physical server. Now, utilize the approach described above for each virtual machine. Hence, there will be queues at each virtual machine boundary, which will be truncated when possible. Since each queue is on the same physical hardware as its downstream boxes, high availability is not provided on machine failures with the algorithms described so far.

To achieve high availability, the queue has to be replicated to a physical backup machine. At a cost of one message per entry in the queue, each of the K virtual machines can resume processing from its queue, and finer granularity restart is supported. The ultimate extreme is to have one virtual machine per box. In this case, a message must be sent to a backup server each time a box processes a message. However, only the processing of the in-transit boxes will be lost. This will be very similar to the process-pair approach. Hence, by adding virtual machines to the high-availability algorithms, we can tune the algorithms to any desired tradeoff between recovery time and run-time overhead.

7 Policy Specifications and Guidelines

We now describe the high-level policy specifications employed by Aurora* and Medusa to guide all pertinent resource, load, and availability management decisions. We first describe application-specific QoS specifications used by Aurora*, and then overlay the Medusa approach for establishing (economic) contracts between different domains.

7.1 QoS Based Control in Aurora*

Along with a query, every Aurora application must also specify its QoS expectations [4]. A QoS specification is a function of some performance, result precision, or reliability related characteristic of an output stream that produces a utility (or happiness) value to the corresponding application. The operational goal of Aurora is to maximize the perceived aggregate QoS delivered to the client applications. As a result, all Aurora resource allocation decisions, such as scheduling and load shedding, are driven by QoS-aware algorithms [4]. We now discuss some interesting issues that arise as we extend the basic single-node QoS model to a distributed multi-node model to be used by Aurora*.

One key QoS issue that needs to be dealt with in Aurora* involves inferring QoS for the outputs of arbitrary Aurora* nodes. In order to be consistent with the basic Aurora model and to minimize the coordination among the individual Aurora nodes, it is desirable for each node in an Aurora* configuration to run its own local Aurora server. This requires the presence of QoS specifications at the outputs of internal nodes (i.e., those that are not directly connected to output applications).
Because QoS expectations are defined only at the output nodes, the corresponding specifications for the internal nodes must be properly inferred. This inference is illustrated in Figure 9, where a given application’s query result is returned by node S3, but additional computation is done at the internal nodes S1 and S2. The QoS specified at the output node S3 needs to be pushed inside the network, to the outputs of S1 and S2, so that these internal nodes can make local resource management decisions.

Figure 9: Inferring QoS at intermediate nodes. (The QoS specification at the output of node S3, QoSoutput, is pushed upstream to yield inferred specifications, QoSinferred, at the outputs of the internal nodes S1 and S2.)

While, in general, inferring accurate QoS requirements in the middle of an Aurora network is not going to be possible, we believe that inferring good approximations to some of the QoS specifications (such as the latency-based QoS specification, which is a primary driver for many resource control issues) is achievable given the availability of operational system statistics. To do this, we assume that the system has access to the average processing cost and the selectivity of each box. These statistics can be monitored and maintained in an approximate fashion over a running network.

A QoS specification at the output of some box, B, is a function of time t and can be written as Qo(t). Assume that box B takes, on average, TB units of time for a tuple arriving at its input to be processed completely. TB can be measured and recorded by each box and would implicitly include any queuing time. The QoS specification Qi(t) at box B’s input would be Qo(t + TB). This simple technique can be applied across an arbitrary number of Aurora boxes to compute an estimated latency graph for any arc in the system.
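The latency-based inference above is simply a time shift: if the output QoS is Qo(t) and a box adds an average delay TB, then the inferred input QoS is Qi(t) = Qo(t + TB), and the shifts add up along a chain of boxes. A minimal sketch, with a made-up QoS curve and made-up box delays:

```python
def shift_qos(q_out, delay):
    """Given an output QoS function q_out(t) and a box's average per-tuple
    delay, return the inferred QoS at the box input: Qi(t) = Qo(t + delay)."""
    return lambda t: q_out(t + delay)

def infer_along_path(q_out, box_delays):
    """Compose the shift across a chain of boxes (output side first)."""
    q = q_out
    for delay in reversed(box_delays):
        q = shift_qos(q, delay)
    return q

# Hypothetical output QoS: full utility below 100 ms latency, decaying to 0 at 500 ms.
q_output = lambda t: max(0.0, min(1.0, (500 - t) / 400))

# Average measured delays (ms) of three boxes between an internal arc and the output.
q_internal = infer_along_path(q_output, box_delays=[20, 50, 30])

print(q_output(100))      # 1.0  -- utility of a tuple that is 100 ms old at the output
print(q_internal(100))    # 0.75 -- same tuple at the internal arc, with 100 ms of work ahead
```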
Another important issue relates to the precision (i.e., accuracy) of query results. Precise answers to queries are sometimes unachievable or undesirable, both of which potentially lead to dropped tuples. A precise query answer is what would be returned if no data was ever dropped, and query execution could complete regardless of the time it required. A precise query answer might be unachievable (from an Aurora system’s perspective) if high load on an Aurora server necessitated dropping tuples. A precise query answer might be undesirable (from an Aurora application’s perspective) if a query depended upon data arriving on an extremely slow stream, and an approximate but fast query answer was preferable to one that was precise but slow. QoS specifications describe, from an application’s perspective, what measures it prefers Aurora to take under such circumstances. For example, if tuples must be dropped, QoS specifications can be used to determine which and how many.

Because imprecise query answers are sometimes unavoidable or even preferable to precise query answers, precision is the wrong standard for Aurora systems to strive for. In general, there will be a continuum of acceptable answers to a query, each of which has some measurable deviation from the perfect answer. The degree of tolerable approximation is application specific; QoS specifications serve to define what is acceptable.

7.2 Economic Contract Based Control in Medusa

As discussed in previous sections, Medusa regulates interactions between participants using an agoric model with three basic types of contracts: (a) content contracts, (b) suggested contracts, and (c) movement contracts. We discuss each type of contract in turn.

Content contracts cover the payment by a receiving participant for the stream to be sent by a sending participant. The form of a content contract is:

  For stream_name
  For time period
  With availability guarantee
  Pay payment

Here, stream_name is a stream known to the sender, which the receiver must map to a local stream name. The time period is the amount of time that the sender will make the stream available to the receiver, and payment is the amount of money remitted. Payment can either be a fixed dollar amount (subscription) or it can be a per-message amount. An optional availability clause can be added to specify the amount of outage that can be tolerated, as a guarantee on the fraction of uptime.
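A content contract, as described, is essentially a small record plus a payment rule. The sketch below is one possible encoding; the field names, the uptime representation, and the example figures are assumptions, not Medusa's actual contract format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ContentContract:
    """The receiver pays the sender for `stream_name` over `period_hours`,
    either as a flat subscription or per message, with an optional
    availability (uptime-fraction) guarantee."""
    stream_name: str                    # name known to the sender
    period_hours: float
    subscription_fee: float = 0.0       # fixed dollar amount, or ...
    per_message_fee: float = 0.0        # ... a per-message amount
    min_uptime: Optional[float] = None  # e.g. 0.99 means 1% outage tolerated

    def amount_due(self, messages_delivered: int) -> float:
        return self.subscription_fee + self.per_message_fee * messages_delivered

c = ContentContract(stream_name="stock-quotes", period_hours=24,
                    per_message_fee=0.0001, min_uptime=0.99)
print(round(c.amount_due(messages_delivered=250_000), 2))   # 25.0 dollars for the day
```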
With content contracts, Medusa participants can perform services for each other. Additionally, if participants authorize each other to do remote definitions, then buying participants can easily customize the content that they buy by defining a query plan at the selling participant. These two types of interactions form the basis of our system.

Additional contracts are needed to manage load among participants and optimize queries. For instance, participant P can use remote definition and content contracts to partition a query plan Q over a set of other participants {P1, …, Pk} in an arbitrary manner. P needs to have remote definition authorization at each of P1 through Pk, but the latter do not need to have contracts with each other. Unfortunately, this form of collaboration will require that query plans be “star shaped” with P in the middle, since P1 through Pk don’t have contractual relationships with each other.

To facilitate more efficient plans and inter-participant load management, we need the ability to modify the way queries are partitioned across participants at run time.
More precisely, we need the ability to slide boxes across participants as well as the ability to add or remove a participant from a query-processing path. For instance, we would like to remove P from the star-shaped query defined above.

Adding a participant to a query plan is straightforward with remote definition and content contracts. Removing a participant requires that the leaving participant ask other participants to establish new content contracts with each other. The mechanism for this is suggested contracts: a participant P suggests to downstream participants an alternate location (participant and stream name) from where they should buy content currently provided by P. Receiving participants may ignore suggested contracts.

The last form of contract facilitates load balancing via a form of box sliding, and is called a movement contract. Using remote definition, a participant P1 defines a query plan at another participant, P2. Using a content contract, this remote query plan can be activated. To facilitate load balancing, P1 can define not one, but a set of L remote query plans. Paired with locally running queries (upstream or downstream), these plans provide equivalent functionality, but distribute load differently across P1 and P2. Hence, a movement contract between two participants contains a set of distributed query plans and corresponding inactive content contracts. There is a separate movement contract for each query crossing the boundary between two participants. An oracle on each side determines at runtime whether a query plan and corresponding content contracts from one of the movement contracts is preferred to any of the currently active query plans and content contracts. If so, it communicates with the counterpart oracle to suggest a substitution; i.e., to make the alternate query plan (and its corresponding content contracts) active instead of the current query plan and contracts. If the second oracle agrees, then the switch is made. In this way, two oracles can agree to switch query plans from time to time.

A movement contract can be cancelled at any time by either of the participants. If a contract is cancelled and the two oracles do not agree on a replacement, then co-operation between the two participants reverts to the existing content contract (if one is in place). Hence movement contracts can be used for dynamic load balancing purposes. Of course, oracles must carefully monitor local load conditions, and be aware of the economic model that drives contracting decisions at the participant. Additionally, in the same manner as content contracts, movement contracts can also be transferred using suggested contracts.

8 Related Work

We now briefly discuss previous related research, focusing primarily on load sharing, high availability, and distributed mechanisms for federated operations.

Load sharing has been extensively studied in a variety of settings, including distributed operating systems (e.g., [9, 20]) and databases (e.g., [3, 7]). In a distributed system, the load typically consists of multiple independent tasks (or processes), which are the smallest logical units of processing. In Aurora, the corresponding smallest processing units are individual operators that exhibit input-output dependencies, complicating their physical distribution.

Several distributed systems [8, 17] investigated on-the-fly task migration and cloning as means for dynamic load sharing. Our Slide and Split operations not only facilitate similar (but finer-grained) load sharing, but also take into account the operator dependencies mentioned above, properly splitting and merging the input and resulting data streams as necessary.

Parallel database systems [3, 7] typically share load by using operator splitting and data partitioning. Since Aurora operators are stream-based, the details of how we split the load and merge results are different. More importantly, existing parallel database query execution models are relatively static compared to Aurora* and Medusa: they do not address continuous query execution and, as a result, do not consider adaptation issues.

Because our load sharing techniques involve dynamically transforming query plans, systems that employ dynamic query optimization are also relevant (e.g., see [13] for a survey). These systems change query plans on the fly in order to minimize query execution cost, reduce query response time, or maximize output rates, whereas our motivation is to enable dynamic cross-machine load distribution. Furthermore, most dynamic query optimization research addressed only centralized query processing. The ones that addressed distributed/parallel execution relied on centralized query optimization and load sharing models. Our mechanisms and policies, on the other hand, implement dynamic query re-configuration and load sharing in a truly decentralized way in order to achieve high scalability.

While we have compared our back-up and recovery approach with the generic process-pair model in Section 6, a variation of this model [15] provides different levels of availability for workflow management systems. Instead of backing up process states, the system logs changes to the workflow components, which store inter-process messages. This approach is similar to ours in that system state can be recovered by reprocessing the component back-ups. Unlike our approach, however, this approach does not take advantage of the data-flow nature of processing, and therefore has to explicitly back up the components at remote servers.
Market-based approaches rely on economic principles to value available resources and match supply and demand. Mariposa [18] is a distributed database system that uses economic principles to guide data management decisions. While Mariposa’s resource pricing and trade are on a query-by-query basis, the trade in Medusa is based on service subscriptions. Medusa contracts enable participants to collaborate and share load with reasonably low overhead, unlike Mariposa’s per-query bidding process.

9 Conclusions

This paper discusses architectural issues encountered in the design of two complementary large-scale distributed stream processing systems. Aurora* is a distributed version of the stream processing system Aurora, which assumes nodes belonging to a common administrative domain. Medusa is an infrastructure supporting the federated operation of several Aurora nodes across administrative boundaries. We discussed three design goals in particular: a scalable communication infrastructure, adaptive load management, and high availability. For each we discussed mechanisms for achieving these goals, as well as policies for employing these mechanisms in both the single-domain (Aurora*) and federated (Medusa) environments. In so doing, we identified the key challenges that must be addressed and opportunities that can be exploited in building scalable, highly available distributed stream processing systems.

Acknowledgments

We are grateful to Mike Stonebraker for his extensive contributions to the work described in this paper. Funding for this work at Brandeis and Brown comes from NSF under grant number IIS00-86057, and at MIT comes from NSF ITR ANI-0205445.

References

[1] Tandem Database Group. Non-Stop SQL: A Distributed, High Performance, High-Reliability Implementation of SQL. In Proceedings of the Workshop on High Performance Transaction Systems, Asilomar, CA, 1987.
[2] D. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: A New Model and Architecture for Data Stream Management. Brown Computer Science CS-02-10, August 2002.
[3] L. Bouganim, D. Florescu, and P. Valduriez. Dynamic Load Balancing in Hierarchical Parallel Database Systems. In Proceedings of the International Conference on Very Large Data Bases, Bombay, India, 1996.
[4] D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring Streams: A New Class of Data Management Applications. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB'02), Hong Kong, China, 2002.
[5] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A Scalable Continuous Query System for Internet Databases. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, 2000.
[6] D. Karger, E. Lehman, T. Leighton, M. Levine, D. Lewin, and R. Panigrahy. Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. In 29th Annual ACM Symposium on Theory of Computing, El Paso, TX, 1997.
[7] D. DeWitt and J. Gray. Parallel Database Systems: The Future of High Performance Database Systems. Communications of the ACM, 35(6):85-98, 1992.
[8] F. Douglis and J. Ousterhout. Process Migration in the Sprite Operating System. In Proceedings of the 7th International IEEE Conference on Distributed Computing Systems, Berlin, Germany, 1987.
[9] D. Eager, E. Lazowska, and J. Zahorjan. Adaptive Load Sharing in Homogeneous Distributed Systems. IEEE Transactions on Software Engineering, 12(5):662-675, 1986.
[10] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.
[11] H. Balakrishnan, H. Rahul, and S. Seshan. An Integrated Congestion Management Architecture for Internet Hosts. In ACM SIGCOMM, Cambridge, MA, 1999.
[12] H. Balakrishnan and S. Seshan. The Congestion Manager. RFC 3124.
[13] J. M. Hellerstein, M. J. Franklin, S. Chandrasekaran, A. Deshpande, K. Hildrum, S. Madden, V. Raman, and M. Shah. Adaptive Query Processing: Technology in Evolution. IEEE Data Engineering Bulletin, 23(2):7-18, 2000.
[14] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. In ACM SIGCOMM, San Diego, CA, 2001.
[15] M. Kamath, G. Alonso, R. Guenthor, and C. Mohan. Providing High Availability in Very Large Workflow Management Systems. In Proceedings of the 5th International Conference on Extending Database Technology, Avignon, France, 1996.
[16] M. S. Miller and K. E. Drexler. Markets and Computation: Agoric Open Systems. In The Ecology of Computation, B. A. Huberman, Ed. North-Holland, 1988.
[17] D. S. Milojicic, F. Douglis, Y. Paindaveine, R. Wheeler, and S. Zhou. Process Migration. ACM Computing Surveys, 32(3):241-299, 2000.
[18] M. Stonebraker, P. M. Aoki, W. Litwin, A. Pfeffer, A. Sah, J. Sidell, C. Staelin, and A. Yu. Mariposa: A Wide-Area Distributed Database System. VLDB Journal, 5(1):48-63, 1996.
[19] W. Litwin, M.-A. Neimat, and D. A. Schneider. LH* - A Scalable Distributed Data Structure. ACM Transactions on Database Systems, 21(4):480-525, 1996.
[20] W. Zhu, C. Steketee, and B. Muilwijk. Load Balancing and Workstation Autonomy on Amoeba. Australian Computer Science Communications, 17(1):588-597, 1995.
