bandwidth to handle its output (this would happen if an upstream node notices a backup on its output link). In this case, the upstream node might want to signal the neighboring downstream node to move one or more boxes upstream to reduce the communication across that link. We now discuss two basic load sharing mechanisms, box sliding and box splitting, which are used to repartition the Aurora network in a pair-wise fashion.

Box Sliding. This technique takes a box on the edge of a sub-network on one machine and shifts it to its neighbor. Beyond the obvious repositioning of processing, shifting a box upstream is often useful if the box has a low selectivity (it reduces the amount of data) and the bandwidth of the connection is limited. Shifting a box downstream can be useful if the selectivity of the box is greater than one (it produces more data than it consumes, e.g., a join) and the bandwidth of the connection is again limited. We call this kind of remapping horizontal load sharing, or box sliding. Figure 4 illustrates upstream box sliding.

It should be noted that the machine to which a box is sent must have the capability to execute the given operation. In a sensor network, some of the nodes can be very weak. Often the sensor itself is capable of computation, but this capability is limited. Thus, it might be possible to slide a simple Filter box to a sensor node, whereas the sensor might not support a Tumble box. It should also be noted that box sliding can move boxes vertically: a box that is assigned to machine A can be moved to machine B as long as its input and output arcs are rerouted accordingly.

Box Splitting. A heavier form of load sharing involves splitting Aurora boxes. A split creates a copy of a box that is intended to run on a second machine. To ensure that a split box returns the same result as an unsplit box, one or more boxes must be added to the network that merge the box outputs back into a single stream.

The boxes required to merge results depend on the box that is split. Figure 5 and Figure 6 show two examples. The first split is of Filter and simply requires a Union box to accomplish the merge. The second split is of Tumble, which requires a more sophisticated merge, consisting of Union followed by WSort and then another Tumble. It also requires that the aggregate function argument to Tumble, agg, have a corresponding combination function, combine, such that for any set of tuples {x1, x2, …, xn} and any k ≤ n:

    agg({x1, x2, …, xn}) = combine(agg({x1, x2, …, xk}), agg({xk+1, xk+2, …, xn}))

For example, if agg is cnt (count), then combine is sum; if agg is max, then combine is max as well. In Figure 6, agg is cnt and combine is sum.

To illustrate the split shown in Figure 6, consider a Tumble applied to the stream shown in Figure 2, with aggregate function cnt and groupby attribute A. Observe that, without splitting, Tumble would emit the following tuples while processing the seven tuples shown in Figure 2: …
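To make the requirement on agg and combine concrete, the following small sketch (illustrative only; the helper check_split and the sample values are ours, not part of Aurora) checks the property for the cnt/sum and max/max pairs mentioned above:

    # Illustrative check of the agg/combine property needed to split an
    # aggregate box: agg(S) == combine(agg(S[:k]), agg(S[k:])) for any k.
    def check_split(agg, combine, values, k):
        whole = agg(values)
        merged = combine(agg(values[:k]), agg(values[k:]))
        return whole == merged

    values = [3, 1, 4, 1, 5, 9, 2]   # stand-in for the seven tuples of Figure 2

    # cnt pairs with sum as its combine function ...
    assert check_split(len, lambda a, b: a + b, values, k=4)
    # ... while max combines with max itself.
    assert check_split(max, max, values, k=4)

Any aggregate lacking such a combine function cannot be split in this way without changing the query's result.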
6 High Availability
A key goal in the design of any data stream processing system is to achieve robust operation in volatile and dynamic environments, where availability may suffer due to (1) server and communication failures, (2) sustained congestion levels, and (3) software failures. In order to improve overall system availability, Aurora* and Medusa rely on a common stream-oriented data back-up and recovery approach, which we describe below.

6.1 Overview and Key Features

Our high-availability approach has two unique advantages, both due to the streaming data-flow nature of our target systems. First, it is possible to reliably back up data and provide safety without incurring the overhead of explicitly copying them to special back-up servers (as in the case of traditional process-pair models [10]). In our model, each server can effectively act as a back-up for its downstream servers. Tuples get processed and flow naturally in the network (precisely as in the case of regular operation). Unlike in regular operation, however, processed tuples are discarded lazily, only when it is determined that their effects are safely recorded elsewhere and can thus be recovered in case of a failure.

Second, the proposed approach enables a tradeoff between the recovery time and the volume of checkpoint messages required to provide safety. This flexibility allows us to emulate a wide spectrum of recovery models, ranging from a high-volume-checkpoint/fast-recovery approach (e.g., Tandem [1]) to a low-volume-checkpoint/slow-recovery approach (e.g., log-based recovery in traditional databases).

6.2 Regular Operation

We say that a distributed stream processing system is k-safe if the failure of any k servers does not result in any message losses. The value of k should be set based on the availability requirements of applications and the reliability and load characteristics of the target environments. We provide k-safety by maintaining copies of the tuples that are in transit at each server s at k other servers that are upstream from s. An upstream back-up server simply holds on to a tuple it has processed until its primary server tells it to discard the tuple. Figure 8 illustrates the basic mechanism for k = 1. Server s1 acts as a back-up of server s2. A tuple t sent from s1 to s2 is simply kept in s1's output queue until it is guaranteed that all tuples that depended on t (i.e., the tuples whose values were determined directly or indirectly based on t) made it to s3.

Figure 8: Primary and back-up servers (back-up of the tuples in transit).

In order to correctly truncate output queues, we need to keep track of the order in which tuples are transmitted between the servers. When an upstream server sends a message (containing tuples) to a successor, it also includes a monotonically increasing sequence number. It is sufficient to include only the base sequence number, as the corresponding numbers for all tuples can be automatically generated at the receiving server by simply incrementing the base. We now describe two remote queue truncation techniques that use tuple sequence numbers.

Our first technique involves the use of special flow messages. Periodically, each data source creates and sends flow messages into the system. A box processes a flow message by first recording the sequence number of the earliest tuple that it currently depends on‡, and then passing it onward. Note that there might be multiple earliest sequence numbers, one for each upstream server in the extreme case. When the flow message reaches a server boundary, these sequence values are recorded and the message continues into the next server. Hence, each server records the identifiers of the earliest upstream tuples that it depends on. These values serve as checkpoints; they are communicated through a back channel to the upstream servers, which can then appropriately truncate the tuples they hold. Clearly, the flow message can also be piggybacked on other control or data messages (such as heartbeat messages, DHT lookup messages, or regular tuple messages).

‡ If the box has state (e.g., consider an aggregate box), then the recorded tuple is the one that presently contributes to the state of the box and that has the lowest sequence number (for each upstream server). If the box is stateless (e.g., a filter box), then the recorded tuple is the one that has been processed most recently.

The above scheme will operate correctly only for straight-line networks. When there are branches and recombinations, special care must be taken. Also, when messages from one server go to multiple subsequent servers, additional extensions are required. Whenever a message is split and sent to two destinations, the flow message is similarly split. If a box gets input from two arcs, it must save the first flow message until it receives one on the other arc. If the two flow messages come from different servers, then both are sent onward. If they come from the same server, then the
minimum is computed as before and a single message is sent onward. In this way, the correct minimum is received at the output.

An output can receive flow messages from multiple upstream servers. It simply responds to each with a back channel message to the corresponding upstream server. Similarly, when an upstream server has multiple successor servers, it must wait for a back channel message from each one, and can then truncate its queue only up to the smallest (i.e., the minimum) of the reported values.
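As an illustration of how an upstream server might apply these back-channel checkpoints (a simplified sketch; the OutputQueue class and its in-memory representation are our own assumptions, not Aurora*'s implementation):

    # Hypothetical output-queue truncation driven by back-channel checkpoints.
    # Each successor reports the sequence number of the earliest tuple it still
    # depends on; the upstream server may discard only tuples older than the
    # smallest value reported across all of its successors.
    class OutputQueue:
        def __init__(self, successors):
            self.queue = []                               # (seq, tuple) pairs, oldest first
            self.earliest_needed = {s: 0 for s in successors}

        def send(self, seq, tup):
            self.queue.append((seq, tup))                 # keep a copy until it is safe to discard

        def on_back_channel(self, successor, earliest_seq):
            self.earliest_needed[successor] = earliest_seq
            safe_up_to = min(self.earliest_needed.values())
            self.queue = [(s, t) for (s, t) in self.queue if s >= safe_up_to]

A tuple with sequence number s is dropped only once every successor has reported an earliest-needed value greater than s.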
An alternative to special flow messages is to install an array of sequence numbers on each server, one entry for each upstream server. On each box activation, the box records in this array the earliest tuples on which it depends. The upstream servers can then query this array periodically and truncate their queues accordingly. This approach has the advantage that an upstream server can truncate at its convenience, and not just when it receives a back channel message. However, the array approach makes the implementation of individual boxes somewhat more complex.
6.3 Failure Detection and Recovery

Each server sends periodic heartbeat messages to its upstream neighbors. If a server does not hear from its downstream neighbor for some predetermined time period, it considers that neighbor failed and initiates a recovery procedure. In the recovery phase, the back-up server itself immediately starts processing the tuples in its output log, emulating the processing of the failed server for the tuples that were still being processed at the failed server. Subsequently, load-sharing techniques can be used to offload work from the back-up server to other available servers. Alternatively, the back-up server can move its output log to another server, which then takes over the processing of the failed server. This approach might be worthwhile if the back-up server is already heavily loaded and/or migration of the output log is expected to be inexpensive.
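A minimal sketch of the detection step only (the class name, clock source, and timeout value below are illustrative assumptions, not prescribed by Aurora* or Medusa):

    import time

    # Hypothetical heartbeat monitor: a downstream neighbor is declared failed
    # when no heartbeat has been received from it within `timeout` seconds.
    class HeartbeatMonitor:
        def __init__(self, neighbors, timeout=5.0):
            self.timeout = timeout
            self.last_seen = {n: time.monotonic() for n in neighbors}

        def on_heartbeat(self, neighbor):
            self.last_seen[neighbor] = time.monotonic()

        def failed_neighbors(self):
            now = time.monotonic()
            return [n for n, t in self.last_seen.items() if now - t > self.timeout]

The recovery procedure described above (replaying the output log) would be started for every neighbor returned by failed_neighbors().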
6.4 Recovery Time vs. Back-up Granularity

The above scheme does not interfere with the natural flow of tuples in the network, providing high availability with only a minimum of extra messages. In contrast, a process-pair approach requires checkpointing a computation to its backup on a regular basis. To achieve high availability with a process-pair model would require a checkpoint message every time a box processed a message. This is overwhelmingly more expensive than the approach we presented. However, the cost of our scheme is the possibly considerable amount of computation required during recovery. In contrast, a process-pair scheme will redo only those box calculations that were in process at the time of the failure. Hence, the proposed approach saves many run-time messages, at the expense of having to perform additional work at failover time.

The basic approach can be extended to support faster recovery, but at higher run-time cost. Consider establishing a collection of K virtual machines on top of the Aurora network running on a single physical server. Now, utilize the approach described above for each virtual machine. Hence, there will be queues at each virtual machine boundary, which will be truncated when possible. Since each queue is on the same physical hardware as its downstream boxes, high availability is not provided on machine failures with the algorithms described so far.

To achieve high availability, the queue has to be replicated to a physical backup machine. At a cost of one message per entry in the queue, each of the K virtual machines can resume processing from its queue, and finer-granularity restart is supported. The ultimate extreme is to have one virtual machine per box. In this case, a message must be sent to a backup server each time a box processes a message, but only the processing of the in-transit boxes will be lost. This is very similar to the process-pair approach. Hence, by adding virtual machines to the high-availability algorithms, we can tune the algorithms to any desired tradeoff between recovery time and run-time overhead.

7 Policy Specifications and Guidelines

We now describe the high-level policy specifications employed by Aurora* and Medusa to guide all pertinent resource, load, and availability management decisions. We first describe the application-specific QoS specifications used by Aurora*, and then overlay the Medusa approach for establishing (economic) contracts between different domains.
to perform additional work at failover time.
specifications at the outputs of internal nodes (i.e., those
that are not directly connected to output applications).
Because QoS expectations are defined only at the output nodes, the corresponding specifications for the internal nodes must be properly inferred. This inference is illustrated in Figure 9, where a given application's query result is returned by node S3, but additional computation is done at the internal nodes S1 and S2. The QoS specified at the output node S3 needs to be pushed inside the network, to the outputs of S1 and S2, so that these internal nodes can make local resource management decisions.

Figure 9: Inferring QoS at intermediate nodes.

While, in general, inferring accurate QoS requirements in the middle of an Aurora network is not going to be possible, we believe that inferring good approximations to some of the QoS specifications (such as the latency-based QoS specification, which is a primary driver for many resource control issues) is achievable given the availability of operational system statistics. To do this, we assume that the system has access to the average processing cost and the selectivity of each box. These statistics can be monitored and maintained in an approximate fashion over a running network.

A QoS specification at the output of some box B is a function of time t and can be written as Qo(t). Assume that box B takes, on average, TB units of time for a tuple arriving at its input to be processed completely. TB can be measured and recorded by each box and would implicitly include any queuing time. The QoS specification Qi(t) at box B's input would then be Qo(t + TB). This simple technique can be applied across an arbitrary number of Aurora boxes to compute an estimated latency graph for any arc in the system.
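As a concrete, simplified reading of this rule (our own illustration, not Aurora* code), the inferred latency-based QoS at a box's input is the output QoS shifted by the box's average processing time, and the shift accumulates across boxes:

    # Illustrative propagation of a latency-based QoS function upstream.
    # q_out maps latency to utility; t_b is box B's average per-tuple time
    # (including queuing). The inferred input QoS is Qi(t) = Qo(t + TB).
    def infer_input_qos(q_out, t_b):
        return lambda t: q_out(t + t_b)

    def q_s3(t):
        # Example output QoS: full utility up to a latency of 5, linear decay to 0 at 8.
        return max(0.0, min(1.0, (8.0 - t) / 3.0))

    # Pushing the specification back through two boxes with average costs 1.0 and 0.5:
    q_s2 = infer_input_qos(q_s3, 1.0)    # equivalent to Qo(t + 1.0)
    q_s1 = infer_input_qos(q_s2, 0.5)    # equivalent to Qo(t + 1.5)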
Another important issue relates to the precision (i.e., accuracy) of query results. Precise answers to queries are sometimes unachievable or undesirable, both of which potentially lead to dropped tuples. A precise query answer is what would be returned if no data were ever dropped and query execution could complete regardless of the time it required. A precise query answer might be unachievable (from an Aurora system's perspective) if high load on an Aurora server necessitated dropping tuples. A precise query answer might be undesirable (from an Aurora application's perspective) if a query depended upon data arriving on an extremely slow stream, and an approximate but fast query answer was preferable to one that was precise but slow. QoS specifications describe, from an application's perspective, what measures it prefers Aurora to take under such circumstances. For example, if tuples must be dropped, QoS specifications can be used to determine which ones and how many.

Because imprecise query answers are sometimes unavoidable or even preferable to precise query answers, precision is the wrong standard for Aurora systems to strive for. In general, there will be a continuum of acceptable answers to a query, each of which has some measurable deviation from the perfect answer. The degree of tolerable approximation is application specific; QoS specifications serve to define what is acceptable.

7.2 Economic Contract Based Control in Medusa

As discussed in previous sections, Medusa regulates interactions between participants using an agoric model with three basic types of contracts: (a) content contracts, (b) suggested contracts, and (c) movement contracts. We discuss each type of contract in turn.

Content contracts cover the payment by a receiving participant for the stream to be sent by a sending participant. The form of a content contract is:

    For stream_name
    For time period
    With availability guarantee
    Pay payment

Here, stream_name is a stream known to the sender, which the receiver must map to a local stream name. The time period is the amount of time that the sender will make the stream available to the receiver, and payment is the amount of money remitted. Payment can either be a fixed dollar amount (a subscription) or a per-message amount. An optional availability clause can be added to specify the amount of outage that can be tolerated, as a guarantee on the fraction of uptime.
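For instance, a purely hypothetical instance of this form (all names and amounts invented for illustration) might read:

    For traffic_sensor_feed
    For 30 days
    With availability 99.5%
    Pay $0.001 per message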
With content contracts, Medusa participants can perform services for each other. Additionally, if participants authorize each other to do remote definitions, then buying participants can easily customize the content that they buy by defining a query plan at the selling participant. These two types of interactions form the basis of our system.

Additional contracts are needed to manage load among participants and optimize queries. For instance, participant P can use remote definition and content contracts to partition a query plan Q over a set of other participants {P1, …, Pk} in an arbitrary manner. P needs to have remote definition authorization at each of P1 through Pk, but the latter do not need to have contracts with each other. Unfortunately, this form of collaboration will require that query plans be "star shaped" with P in the middle, since P1 through Pk do not have contractual relationships with each other.

To facilitate more efficient plans and inter-participant load management, we need the ability to modify the way queries are partitioned across participants at run time.
More precisely, we need the ability to slide boxes across participants as well as the ability to add or remove a participant from a query-processing path. For instance, we would like to be able to remove P from the star-shaped query defined above.

Adding a participant to a query plan is straightforward with remote definition and content contracts. Removing a participant requires that the leaving participant ask other participants to establish new content contracts with each other. The mechanism for this is suggested contracts: a participant P suggests to downstream participants an alternate location (a participant and stream name) from which they should buy content currently provided by P. Receiving participants may ignore suggested contracts.

The last form of contract facilitates load balancing via a form of box sliding, and is called a movement contract. Using remote definition, a participant P1 defines a query plan at another participant, P2. Using a content contract, this remote query plan can be activated. To facilitate load balancing, P1 can define not one but a set of L remote query plans. Paired with locally running queries (upstream or downstream), these plans provide equivalent functionality but distribute load differently across P1 and P2. Hence, a movement contract between two participants contains a set of distributed query plans and corresponding inactive content contracts. There is a separate movement contract for each query crossing the boundary between two participants. An oracle on each side determines at runtime whether a query plan and corresponding content contracts from one of the movement contracts is preferable to any of the currently active query plans and content contracts. If so, it communicates with the counterpart oracle to suggest a substitution, i.e., to make the alternate query plan (and its corresponding content contracts) active instead of the current query plan and contracts. If the second oracle agrees, then the switch is made. In this way, two oracles can agree to switch query plans from time to time.

A movement contract can be cancelled at any time by either of the participants. If a contract is cancelled and the two oracles do not agree on a replacement, then co-operation between the two participants reverts to the existing content contract (if one is in place). Hence, movement contracts can be used for dynamic load balancing purposes. Of course, oracles must carefully monitor local load conditions and be aware of the economic model that drives contracting decisions at the participant. Additionally, in the same manner as content contracts, movement contracts can also be transferred using suggested contracts.
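A minimal sketch of this runtime decision (entirely illustrative; the cost estimate and the two-step agreement below are simplifying assumptions, not the Medusa protocol):

    # Hypothetical oracle logic over a movement contract's alternative plans.
    def pick_plan(active_plan, alternative_plans, estimate_local_cost):
        # Prefer the plan with the lowest estimated local cost; ties keep the current plan.
        best = min([active_plan] + alternative_plans, key=estimate_local_cost)
        return None if best is active_plan else best

    def maybe_switch(proposed_plan, counterpart_agrees):
        # The switch (and activation of the associated content contracts)
        # happens only when both oracles prefer the alternative plan.
        return proposed_plan is not None and counterpart_agrees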
8 Related Work

We now briefly discuss previous related research, focusing primarily on load sharing, high availability, and distributed mechanisms for federated operation.

Load sharing has been extensively studied in a variety of settings, including distributed operating systems (e.g., [9, 20]) and databases (e.g., [3, 7]). In a distributed system, the load typically consists of multiple independent tasks (or processes), which are the smallest logical units of processing. In Aurora, the corresponding smallest processing units are individual operators that exhibit input-output dependencies, complicating their physical distribution.

Several distributed systems [8, 17] investigated on-the-fly task migration and cloning as means for dynamic load sharing. Our Slide and Split operations not only facilitate similar (but finer-grained) load sharing, but also take into account the operator dependencies mentioned above, properly splitting and merging the input and resulting data streams as necessary.

Parallel database systems [3, 7] typically share load by using operator splitting and data partitioning. Since Aurora operators are stream-based, the details of how we split the load and merge results are different. More importantly, existing parallel database query execution models are relatively static compared to Aurora* and Medusa: they do not address continuous query execution and, as a result, do not consider adaptation issues.

Because our load sharing techniques involve dynamically transforming query plans, systems that employ dynamic query optimization are also relevant (e.g., see [13] for a survey). These systems change query plans on the fly in order to minimize query execution cost, reduce query response time, or maximize output rates, whereas our motivation is to enable dynamic cross-machine load distribution. Furthermore, most dynamic query optimization research addressed only centralized query processing; the work that addressed distributed or parallel execution relied on centralized query optimization and load sharing models. Our mechanisms and policies, on the other hand, implement dynamic query re-configuration and load sharing in a truly decentralized way in order to achieve high scalability.

While we have compared our back-up and recovery approach with the generic process-pair model in Section 6, a variation of this model [15] provides different levels of availability for workflow management systems. Instead of backing up process states, the system logs changes to the workflow components, which store inter-process messages. This approach is similar to ours in that system state can be recovered by reprocessing the component back-ups. Unlike our approach, however, it does not take advantage of the data-flow nature of processing, and therefore has to explicitly back up the components at remote servers.

Market-based approaches rely on economic principles to value available resources and match supply and demand. Mariposa [18] is a distributed database system that uses economic principles to guide data management decisions. While Mariposa's resource pricing and trade
are on a query-by-query basis, the trade in Medusa is based on service subscriptions. Medusa contracts enable participants to collaborate and share load with reasonably low overhead, unlike Mariposa's per-query bidding process.

9 Conclusions

This paper discusses architectural issues encountered in the design of two complementary large-scale distributed stream processing systems. Aurora* is a distributed version of the stream processing system Aurora, which assumes nodes belonging to a common administrative domain. Medusa is an infrastructure supporting the federated operation of several Aurora nodes across administrative boundaries. We discussed three design goals in particular: a scalable communication infrastructure, adaptive load management, and high availability. For each, we discussed mechanisms for achieving these goals, as well as policies for employing these mechanisms in both the single-domain (Aurora*) and federated (Medusa) environments. In so doing, we identified the key challenges that must be addressed and the opportunities that can be exploited in building scalable, highly available distributed stream processing systems.

Acknowledgments

We are grateful to Mike Stonebraker for his extensive contributions to the work described in this paper. Funding for this work at Brandeis and Brown comes from NSF under grant number IIS00-86057, and at MIT from NSF ITR ANI-0205445.
References

[1] Tandem Database Group. Non-Stop SQL: A Distributed, High-Performance, High-Reliability Implementation of SQL. In Proceedings of the Workshop on High Performance Transaction Systems, Asilomar, CA, 1987.
[2] D. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: A New Model and Architecture for Data Stream Management. Brown Computer Science Technical Report CS-02-10, August 2002.
[3] L. Bouganim, D. Florescu, and P. Valduriez. Dynamic Load Balancing in Hierarchical Parallel Database Systems. In Proceedings of the International Conference on Very Large Data Bases, Bombay, India, 1996.
[4] D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring Streams: A New Class of Data Management Applications. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB'02), Hong Kong, China, 2002.
[5] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A Scalable Continuous Query System for Internet Databases. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, 2000.
[6] D. Karger, E. Lehman, T. Leighton, M. Levine, D. Lewin, and R. Panigrahy. Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing, El Paso, TX, 1997.
[7] D. DeWitt and J. Gray. Parallel Database Systems: The Future of High Performance Database Systems. Communications of the ACM, 35(6):85-98, 1992.
[8] F. Douglis and J. Ousterhout. Process Migration in the Sprite Operating System. In Proceedings of the 7th International IEEE Conference on Distributed Computing Systems, Berlin, Germany, 1987.
[9] D. Eager, E. Lazowska, and J. Zahorjan. Adaptive Load Sharing in Homogeneous Distributed Systems. IEEE Transactions on Software Engineering, 12(5):662-675, 1986.
[10] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.
[11] H. Balakrishnan, H. Rahul, and S. Seshan. An Integrated Congestion Management Architecture for Internet Hosts. In ACM SIGCOMM, Cambridge, MA, 1999.
[12] H. Balakrishnan and S. Seshan. The Congestion Manager. RFC 3124, 2001.
[13] J. M. Hellerstein, M. J. Franklin, S. Chandrasekaran, A. Deshpande, K. Hildrum, S. Madden, V. Raman, and M. Shah. Adaptive Query Processing: Technology in Evolution. IEEE Data Engineering Bulletin, 23(2):7-18, 2000.
[14] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. In ACM SIGCOMM, San Diego, CA, 2001.
[15] M. Kamath, G. Alonso, R. Guenthor, and C. Mohan. Providing High Availability in Very Large Workflow Management Systems. In Proceedings of the 5th International Conference on Extending Database Technology, Avignon, France, 1996.
[16] M. S. Miller and K. E. Drexler. Markets and Computation: Agoric Open Systems. In B. A. Huberman, editor, The Ecology of Computation. North-Holland, 1988.
[17] D. S. Milojicic, F. Douglis, Y. Paindaveine, R. Wheeler, and S. Zhou. Process Migration. ACM Computing Surveys, 32(3):241-299, 2000.
[18] M. Stonebraker, P. M. Aoki, W. Litwin, A. Pfeffer, A. Sah, J. Sidell, C. Staelin, and A. Yu. Mariposa: A Wide-Area Distributed Database System. VLDB Journal, 5(1):48-63, 1996.
[19] W. Litwin, M.-A. Neimat, and D. A. Schneider. LH* - A Scalable Distributed Data Structure. ACM Transactions on Database Systems, 21(4):480-525, 1996.
[20] W. Zhu, C. Steketee, and B. Muilwijk. Load Balancing and Workstation Autonomy on Amoeba. Australian Computer Science Communications, 17(1):588-597, 1995.