Parallel Processing With Autonomous Databases in A Cluster System
ABSTRACT. We consider the use of a cluster system for Application Service Providers (ASP). In
the ASP context, hosted applications and databases can be update-intensive and must remain
autonomous. In this paper, we propose a new solution for parallel processing with
autonomous databases, using a replicated database organization. The main idea is to allow
the system administrator to control the tradeoff between database consistency and
application performance. Application requirements are captured through execution rules
stored in a shared directory. They are used (at run time) to allocate cluster nodes to user
requests in a way that optimizes load balancing while satisfying application consistency
requirements. We also propose a new preventive replication method and a transaction load
balancing architecture which can trade off consistency for performance using execution
rules. Finally, we discuss the ongoing implementation at LIP6, using a Linux cluster running
Oracle 8i.
1. Introduction
Our approach allows the system administrator to control the database consistency/performance
tradeoff when placing applications and databases onto cluster nodes. Databases and
applications can be replicated at multiple nodes to obtain good load balancing. Application
requirements are captured (at compile time) through execution rules stored in a
shared directory, which is used (at run time) to allocate cluster nodes to user requests.
Depending on the users’ requirements, we can control database consistency at the
cluster level. For instance, if an application is read-only or the required consistency
is weak, then it is easy to execute multiple requests in parallel at different nodes. If,
instead, an application is update-intensive and requires strong consistency (e.g.
satisfaction of integrity constraints), then an extreme solution is to run it at a single
node and trade performance for consistency. Or, if we want both consistency and
replication (e.g. for high availability), another extreme solution is synchronous
replication with two-phase commit (2PC) [9] for refreshing replicas. However, 2PC is
both costly in terms of messages and blocking: if the coordinator fails, the
participants cannot terminate the transaction independently.
There are cases where copy consistency can be relaxed. With optimistic
replication [12], transactions are locally committed and different replicas may get
different values. Replica divergence remains until reconciliation. Meanwhile, the
divergence must be controlled for at least two reasons. First, since synchronization
consists in producing a single history from several diverging ones, the higher the
divergence, the more difficult the reconciliation. Second, read-only applications do
not always require perfectly consistent data and may tolerate some inconsistency.
In this case, inconsistency reflects a divergence between the value actually read and
the value that would have been read in ACID mode. Non-isolated queries are also
useful in non-replicated environments (e.g. the ANSI isolation levels and their
critique [2]). The specification of inconsistency for queries has been widely studied
in the literature, and may be divided into two
dimensions, temporal and spatial [18]. An example of temporal dimension is found
in quasi-copies [1], where a cached (image) copy may be read-accessed according to
temporal conditions, such as an allowable delay between the last update of the copy
and the last update of the master copy. The spatial dimension consists of allowing a
given "quantity of changes" between the values read-accessed and the effective
values stored at the same time. This quantity of changes, referred to as import-limit
in epsilon transactions [23], may be for instance the number of data items changed,
the number of updates performed or the absolute value of the update. In the
continuous consistency model [24], both temporal dimension (staleness) and spatial
dimension (numerical error and order error) are controlled. Each node propagates its
writes by either pull or push access to other nodes, so that each node maintains a
predefined level of consistency for each dimension. Then each query can be sent to
a node having a satisfying level of consistency (w.r.t. the query) in order to optimize
load balancing.
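As an illustration of the spatial dimension, the following Java sketch (hypothetical names, not taken from [23] or [24]) tracks, per node, a "quantity of changes" not yet propagated to it, and routes a query to the least loaded node whose divergence stays within the query's tolerance.

import java.util.ArrayList;
import java.util.List;

// Sketch (hypothetical names) of a spatial divergence bound: each node tracks the
// "quantity of changes" not yet propagated to it, and a query is routed to the
// least loaded node whose divergence stays within the query's tolerance.
class DivergenceRouter {
    static class Node {
        final String name;
        int pendingChanges;   // updates committed elsewhere and not yet applied here
        double load;          // current load, lower is better
        Node(String name) { this.name = name; }
    }

    private final List<Node> nodes = new ArrayList<>();

    void addNode(Node n) { nodes.add(n); }

    // Called when an update commits at another node without being propagated yet.
    void recordPendingChange(Node n, int changedItems) { n.pendingChanges += changedItems; }

    // Returns the least loaded node whose divergence is acceptable for the query,
    // or null if the query must wait for the next refresh.
    Node route(int maxTolerableChanges) {
        Node best = null;
        for (Node n : nodes) {
            if (n.pendingChanges <= maxTolerableChanges && (best == null || n.load < best.load)) {
                best = n;
            }
        }
        return best;
    }
}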
In this paper, we strive to capitalize on the work on relaxing database
consistency for higher performance, and we apply it in the context of cluster
systems. We make the following contributions:
- a replicated database architecture for cluster systems that does not hurt
application and database autonomy, using non-intrusive database techniques, i.e.
techniques that work independently of any DBMS;
- a new preventive replication method that provides strong consistency without the
overhead of synchronous replication, by exploiting the cluster's high-speed
network;
- a transaction load balancing architecture which can trade off consistency for
performance using optimistic replication and execution rules;
- a conflict manager architecture which exploits the database logs and execution
rules to perform replica reconciliation among heterogeneous databases.
This paper is organized as follows. Section 2 introduces our cluster system
architecture with database replication. Section 3 presents our replication model with
both preventive and optimistic replication. Section 4 describes the way we can
capture and exploit execution rules about applications. Section 5 describes our
execution model which uses these rules to perform load balancing and manage
global consistency. Section 6 briefly describes our on-going implementation.
Section 7 compares our approach with related work. Section 8 concludes.
2. Cluster architecture
Figure: Cluster architecture — clients access the cluster through the Internet; a preventive replication manager and a conflict manager run on top of the cluster's DBMS nodes, each node holding a replicated database (DB).
3. Replication Model
With lazy replication, a transaction can commit after updating a replica at some
node. After the transaction commits, the updates are propagated towards the other
replicas, which are then updated in separate transactions. Unlike synchronous
replication (with two-phase commit), updating transactions need not wait for mutual
copy consistency to be enforced. Thus lazy replication does not block and scales up
much better than the synchronous approach. This performance advantage
has made lazy replication widely accepted in practice, e.g. in data warehousing and
collaborative applications on the Web [12].
Following [13], we characterize a lazy replication scheme by the following parameters:
ownership, configuration, transaction model, propagation, and refreshment. The ownership
parameter defines the permissions for updating replicas. If a replica R is updateable,
it is called a primary copy; otherwise it is called a secondary copy, denoted r. A node
M is said to be a master node if it only stores primary copies. A node S is said to be
a slave node if it only stores secondary copies. In addition, if a replica copy R is
updateable by several master nodes then it is said to be a multi-owner copy. A node
MO is said to be a multi-owner master node if it stores only multi-owner copies. For
cluster computing we only consider master, slave and multi-owner master nodes. A
master node M or a multi-owner node MO is said to be a master of a slave node S iff
there exists in S a secondary copy r of a primary copy R stored at M or MO. We also say
that S is a slave of M or MO.
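The ownership vocabulary above can be summarized by a small data model; the following Java sketch uses hypothetical types and is only meant to restate these definitions.

import java.util.ArrayList;
import java.util.List;

// Sketch (hypothetical types) restating the ownership vocabulary: primary,
// secondary and multi-owner copies, and master, slave and multi-owner nodes.
enum CopyKind { PRIMARY, SECONDARY, MULTI_OWNER }

class Replica {
    final String relation;   // e.g. "Stock"
    final CopyKind kind;
    Replica(String relation, CopyKind kind) { this.relation = relation; this.kind = kind; }
}

class ClusterNode {
    final List<Replica> replicas = new ArrayList<>();

    boolean isMasterNode()     { return replicas.stream().allMatch(r -> r.kind == CopyKind.PRIMARY); }
    boolean isSlaveNode()      { return replicas.stream().allMatch(r -> r.kind == CopyKind.SECONDARY); }
    boolean isMultiOwnerNode() { return replicas.stream().allMatch(r -> r.kind == CopyKind.MULTI_OWNER); }
}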
The transaction model defines the properties of the transactions that access
replicas at each node. Moreover, we assume that, once a transaction is submitted for
execution to a local transaction manager at a node, all conflicts are handled by the
local concurrency control protocol. In our framework, we fix the properties of the
transactions. We focus on four types of transactions that read or write replicas:
update transactions, multi-owner transactions, refresh transactions and queries. An
update transaction T updates a set of primary copies. A refresh transaction, RT, is
associated with an update transaction T, and is made of the sequence of write
operations performed by T, used to refresh secondary copies. We use the term multi-
owner transaction, denoted MOT, to refer to a transaction that updates a multi-owner
copy. Finally, a query Q consists of a sequence of read operations on primary or
secondary copies.
The propagation parameter defines when the updates to a primary copy or multi-
owner copy R must be multicast towards the slaves of R or all owners of R. The
multicast protocol is assumed to be reliable and preserve the global FIFO order [16].
We focus on deferred update propagation: the sequence of operations of each
refresh transaction associated with an update transaction T is multicast to the
appropriate nodes within a single message M, after the commitment of T.
The refreshment parameter defines when an MOT or RT should be triggered and
the commit order of these transactions. We consider the deferred triggering mode.
With a deferred-immediate strategy, a refresh transaction RT or multi-owner
transaction MOT is submitted for execution as soon as the corresponding message
M is received by the node.
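A minimal Java sketch of deferred propagation with deferred-immediate triggering, under the assumption of a reliable FIFO multicast [16]; the interfaces and names are hypothetical, not the system's actual API.

import java.util.List;

// Sketch (hypothetical interfaces) of deferred propagation with deferred-immediate
// triggering: the write sequence of a committed transaction T is multicast in a
// single message and replayed as a refresh transaction as soon as it is received.
class DeferredImmediatePropagator {
    interface Multicast { void send(RefreshMessage m); }            // assumed reliable and FIFO [16]
    interface LocalTxManager { void submit(List<String> writes); }  // runs the writes as one transaction

    static class RefreshMessage {
        final String originTx;
        final List<String> writeSequence;   // write operations performed by T
        RefreshMessage(String originTx, List<String> writeSequence) {
            this.originTx = originTx; this.writeSequence = writeSequence;
        }
    }

    // Origin node: called after T commits locally.
    static void afterCommit(String txId, List<String> writes, Multicast net) {
        net.send(new RefreshMessage(txId, writes));
    }

    // Target node: called as soon as the message is received (deferred-immediate).
    static void onReceive(RefreshMessage m, LocalTxManager tm) {
        tm.submit(m.writeSequence);
    }
}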
3.2. Managing replica consistency
Figure: Example replicated database configurations — a) bowtie: master nodes hold primary copies R and S, slave nodes hold secondary copies (r1, s1) and (r2, s2); b) multi-master: each node holds multi-owner copies of R and S (S1/R1 ... S4/R4).
For all configurations, the problem is to manage data consistency. That is, any
node that holds a replica should always see the same sequence of updates to this
replica. Consistency management for lazy master has been addressed in [13]. The
problem is more difficult with multi-master where independent transactions can
update the same replica at different master nodes. A conflict arises whenever two or
more transactions update the same object. The main solution used by replication
products [19] is to tolerate and resolve conflicts. After the commitment of a
transaction, a conflict detection mechanism checks for conflicts which are resolved
by undoing and redoing transactions using a log history. During the time interval
between the commitment of a transaction and conflict resolution, users may read
and write inconsistent data. This solution is optimistic and works best with few
conflicts. However, it may introduce inconsistencies.
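As a rough illustration of such a posteriori conflict detection, the following Java sketch (hypothetical types; the actual mechanisms of the products in [19] are product-specific) flags pairs of transactions committed at different nodes whose logged write sets intersect.

import java.util.*;

// Sketch (hypothetical types) of log-based conflict detection for the optimistic
// scheme: two transactions committed at different nodes conflict if their logged
// write sets intersect; such conflicts are then resolved by undo/redo.
class ConflictDetector {
    static class LogRecord {
        final String txId;
        final String node;
        final Set<String> writtenKeys;   // e.g. primary keys of updated Stock tuples
        LogRecord(String txId, String node, Set<String> writtenKeys) {
            this.txId = txId; this.node = node; this.writtenKeys = writtenKeys;
        }
    }

    // Returns the pairs of transactions, committed at different nodes, that
    // updated at least one common object and must be reconciled.
    static List<String[]> findConflicts(List<LogRecord> committed) {
        List<String[]> conflicts = new ArrayList<>();
        for (int i = 0; i < committed.size(); i++) {
            for (int j = i + 1; j < committed.size(); j++) {
                LogRecord a = committed.get(i), b = committed.get(j);
                if (!a.node.equals(b.node) && !Collections.disjoint(a.writtenKeys, b.writtenKeys)) {
                    conflicts.add(new String[] { a.txId, b.txId });
                }
            }
        }
        return conflicts;
    }
}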
We propose an alternative, new solution which prevents conflicts and thus
avoids inconsistency. A detailed presentation of the preventive replication scheme
and its algorithms is in [14]. With this preventive solution, each transaction T is
associated with a chronological timestamp value, and a delay d is introduced before
each transaction submission. This delay corresponds to the maximum amount of
time to propagate a message between any two nodes. During this delay, all
transactions received are ordered by their timestamp value. After the delay has
expired, all transactions with a smaller timestamp than T are guaranteed to have been
received. Therefore, transactions at each node are executed in the same timestamp
order and consistency is assured.
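The following Java sketch illustrates this ordering principle (hypothetical names; the actual algorithms are given in [14]): incoming transactions are held for the delay d and then executed in timestamp order.

import java.util.PriorityQueue;

// Sketch of preventive ordering: hold each transaction for d milliseconds
// (the maximum message propagation delay), then run transactions in global
// timestamp order. Hypothetical types, not the algorithm of [14] verbatim.
class PreventiveScheduler {
    static class Tx implements Comparable<Tx> {
        final long timestamp;          // chronological timestamp assigned at the origin node
        final Runnable body;
        Tx(long timestamp, Runnable body) { this.timestamp = timestamp; this.body = body; }
        public int compareTo(Tx o) { return Long.compare(timestamp, o.timestamp); }
    }

    private final long delayMs;                          // the delay d
    private final PriorityQueue<Tx> pending = new PriorityQueue<>();

    PreventiveScheduler(long delayMs) { this.delayMs = delayMs; }

    synchronized void receive(Tx t) { pending.offer(t); }

    // Periodically called: execute every transaction whose delay d has expired;
    // by then, no transaction with a smaller timestamp can still be in transit.
    synchronized void runExpired(long nowMs) {
        while (!pending.isEmpty() && nowMs - pending.peek().timestamp >= delayMs) {
            pending.poll().body.run();
        }
    }
}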
This preventive approach imposes waiting for a specific delay d before the
execution of multi-owner and refresh transactions. Our cluster computing context is
characterized by short-distance, high-performance inter-process communication
where error rates are typically low. Thus, d is small and strong consistency can be
attained at negligible cost. On the other hand, the optimistic approach avoids the waiting time d
but must deal with inconsistency management. However, there are many
applications that tolerate reading inconsistent data. Therefore, we decided to support
both replication schemes to provide a continuum from strong consistency with
preventive replication to weaker consistency with optimistic replication.
Figure: Replication manager components at a node — Replica Interface (receiving queries and update transactions), Log Monitor reading the owner log into an R-Log, Propagator, Receiver, input log, and Refresher, connected through the network.
Example query Q on the Stock relation (from the running example of Section 4.1):
SELECT item FROM Stock WHERE quantity < threshold
5. Execution model
In this section, we present the execution model for our cluster system. The
objective is to improve load balancing based on execution rules. The problem can be
stated as follows: given the cluster's state (node loads, running transactions, etc.),
the cluster's data placement, and a transaction T with a number of consistency
requirements, choose an optimal node and execute T at that node. Choosing this
node requires first choosing the replication mode, and then the candidate nodes where
T can be executed with that mode. This yields a set of transaction execution plans,
TEPs (one TEP per candidate node), among which the best one is selected
based on a cost function. In the rest of this section, we present the algorithm that
produces candidate TEPs and the way the best TEP is selected and executed. Finally,
we illustrate the transaction routing process on our running example.
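A minimal Java sketch of this routing step, with hypothetical types: one TEP is built per candidate node and the cheapest one according to the cost function is kept.

import java.util.*;

// Sketch (hypothetical types) of the routing step described above: build one
// transaction execution plan (TEP) per candidate node and keep the cheapest
// one according to the cost function.
class TransactionRouter {
    static class Tep {
        final String node;
        final double cost;
        Tep(String node, double cost) { this.node = node; this.cost = cost; }
    }

    // Cost of executing transaction tx at a given node (e.g. based on node load).
    interface CostModel { double cost(String node, String tx); }

    // candidateNodes: nodes holding the required replicas under the chosen replication mode.
    Tep chooseBest(String tx, List<String> candidateNodes, CostModel model) {
        Tep best = null;
        for (String node : candidateNodes) {
            Tep tep = new Tep(node, model.cost(node, tx));
            if (best == null || tep.cost < best.cost) best = tep;
        }
        return best;   // null if no candidate node satisfies the requirements
    }
}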
Let us now illustrate the previous algorithms on the transactions of the example
of Section 4.1 and show how TEPs are produced from the TPs sent by the policy
manager. We assume that the TPs are received in order (T1, T2, Q), that data at
nodes N1 and N2 is accessed in optimistic mode and that no other transaction is
running and conflicting with T1, T2 or Q. We first consider a case where integrity
constraint C is not taken into account. Then we show how C influences transaction
routing.
Case 1: no integrity constraint
Upon receiving TP (T1, type = trans., priority = null, compatible = (), update-mode = no-compensate, IC = (C), max-change = {(Stock.quantity, 15), (Stock, 1)}), the transaction router does the following:
- computes the set of candidate nodes {N1, N2};
- detects that T1 is not conflicting with any running transaction, thus the candidate nodes are {N1, N2};
- sends T1 to the least loaded node (say N1) with T1 as synchronization, which means that N1 must send T1 to the other node as a synchronizing transaction;
- infers Imax(Stock, N1) = 1, which means that at most one tuple can be modified at N1 before synchronization.
Upon receiving TP (T2, type = trans., priority = null, compatible = ((T1, commut.)), update-mode = no-compensate, IC = (C), max-change = {(Stock.quantity, 10), (Stock, 1)}), the transaction router does the following:
- computes the set of candidate nodes {N1, N2};
- detects that T2 is conflicting with T1 but commutes with it;
- sends T2 to the least loaded node (assume N2) with T2 as synchronization. As T1 and T2 are commutable, the order in which they will be executed at N1 (resp. N2) does not matter;
- infers Imax(Stock, N2) = 1.
Upon receiving TP (Q, type = query, priority = null, compatible = ((T1, no-commut.), (T2, no-commut.)), query-mode = ((imprecision = 5 units), (time-bound = no), (priority = time))), the transaction router does the following:
- computes the set of candidate nodes {N1, N2};
- detects that Q is conflicting with both T1 and T2;
- from the current values of Imax(Stock, N1) and Imax(Stock, N2), computes that executing Q at either N1 or N2 would yield a result with an imprecision of at most one unit. As the query mode allows an imprecision of at most 5 units, Q is sent to the least loaded node (say N1).
Had the query mode of Q not allowed any imprecision, the router
would have waited for the next synchronization of N1 and N2 before sending Q.
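The imprecision test used for Q can be sketched as follows in Java (hypothetical names): a node is acceptable if its Imax value does not exceed the imprecision allowed by the query mode, and the least loaded acceptable node is chosen.

import java.util.*;

// Sketch (hypothetical names) of the imprecision test applied to Q above:
// Q is executed now on the least loaded node whose Imax stays within the
// imprecision allowed by Q's query-mode; otherwise the router waits for the
// next synchronization.
class QueryAdmission {
    static class Candidate {
        String node;     // e.g. "N1"
        int imax;        // e.g. Imax(Stock, N1)
        double load;     // current node load
    }

    static Optional<String> route(List<Candidate> candidates, int allowedImprecision) {
        return candidates.stream()
                .filter(c -> c.imax <= allowedImprecision)                    // e.g. 1 <= 5
                .min(Comparator.comparingDouble((Candidate c) -> c.load))     // least loaded acceptable node
                .map(c -> c.node);                                            // empty => wait for next sync
    }
}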
Case 2: with integrity constraint
The transaction router would detect that both T1 and T2 are likely to violate C
and are not compensatable. Sending T1 and T2 to different nodes could lead to the
situation where C is not violated at either N1 or N2, but is violated during
synchronization. Since T1 and T2 are not compensatable, this situation is not
acceptable and T2 must be sent to the same node as T1. Then we have Imax(Stock,
N1) = 0 and Imax(Stock, N2) = 2. Upon receiving Q, the transaction router may still
choose the least loaded node to execute it. Since the query mode gives priority to
time, the least loaded node is chosen: N2. Had the priority been given to
precision, N1 would have been selected by the transaction router.
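The routing rule applied in this case can be sketched as follows (hypothetical names): two conflicting, non-compensatable transactions that may violate an integrity constraint are sent to the same node.

// Sketch (hypothetical names) of the routing rule applied in Case 2: two
// conflicting transactions that may violate an integrity constraint and cannot
// be compensated are sent to the same node.
class ConstraintRule {
    static String routeSecond(String nodeOfFirstTx, String leastLoadedNode,
                              boolean mayViolateConstraint, boolean compensatable) {
        if (mayViolateConstraint && !compensatable) {
            return nodeOfFirstTx;    // e.g. T2 follows T1 to the same node
        }
        return leastLoadedNode;      // otherwise balance the load
    }
}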
6. Implementation
6.2. Conflict manager
6.3. Directory
All information used for load balancing (execution rules, data placement,
replication mode, cluster load) is stored in an LDAP-compliant directory. The
directory is accessed through the Java Naming and Directory Interface (JNDI), which
provides an LDAP client implementation. Dynamic parameters measuring the
cluster activity (load, resource usage) are stored in the directory and used for
transaction routing; their values are updated periodically at each node. To measure
DBMS node activity, we take advantage of the dynamic views maintained by Oracle.
For instance, the following queries collect the CPU usage and the I/O made by all
running transactions at a node:
select username, se.sid, se.value cpu_usage
from v$session ss, v$sesstat se, v$statname sn
where se.statistic# = sn.statistic#
and name like '%CPU used by this session%'
and se.sid = ss.sid
order by value desc

select username, osuser, process pid, ses.sid,
physical_reads, block_gets, consistent_gets,
block_changes, consistent_changes
from v$session ses, v$sess_io sio
where ses.sid = sio.sid
order by physical_reads, ses.username
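For illustration, reading such a load parameter from the directory through JNDI could look as follows; the entry name (cn=node1,ou=cluster), the provider URL and the attribute name (cpuLoad) are assumptions, since the actual directory schema is not shown here.

import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import java.util.Hashtable;

// Sketch of how the router could read a node's load from the LDAP directory
// through JNDI. Entry name (e.g. "cn=node1,ou=cluster") and attribute name
// ("cpuLoad") are assumptions, not the system's actual schema.
class DirectoryLoadReader {
    static double readCpuLoad(String nodeDn) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://localhost:389/o=asp");
        DirContext ctx = new InitialDirContext(env);
        try {
            Attributes attrs = ctx.getAttributes(nodeDn, new String[] { "cpuLoad" });
            return Double.parseDouble((String) attrs.get("cpuLoad").get());
        } finally {
            ctx.close();
        }
    }
}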
8. Conclusion
References
[1] R. Alonso, D. Barbará, H. Garcia-Molina. Data Caching Issues in an
Information Retrieval System. ACM Transactions on Database Systems (TODS),
15(3), 1990.
[2] H. Berenson, P. Bernstein, J. Gray, J. Melton, E. O'Neil, P. O'Neil. A
Critique of ANSI SQL Isolation Levels. In ACM SIGMOD Int. Conf. on
Management of Data, 1995.
[3] A. Doucet, S. Gançarski, C. León, M. Rukoz. Checking Integrity
Constraints in Multidatabase Systems with Nested Transactions. In Int. Conf. on
Cooperative Information Systems (CoopIS), 2001.
[4] S. Gançarski, H. Naacke, P. Valduriez. Load Balancing of Autonomous
Applications and Databases in a Cluster System. In 4th Workshop on Distributed
Data and Structures (WDAS), 2002.
[5] T. Grabs, K. Böhm, H.-J. Schek. Scalable Distributed Query and Update
Service Implementations for XML Document Elements. In IEEE RIDE Int.
Workshop on Document Management for Data Intensive Business and Scientific
Applications, 2001.
[6] M. Hayden. The Ensemble System. Technical Report TR-98-1662, Department of
Computer Science, Cornell University, 1998.
[7] B. Kemme, G. Alonso. Don't be lazy, be consistent: Postgres-R, a new
way to implement Database Replication. In Int. Conf. on Very Large Databases
(VLDB), 2000.
[8] C. Olston, J. Widom. Offering a Precision-Performance Tradeoff for
Aggregation Queries over Replicated Data. In Int. Conf. on Very Large
Databases (VLDB), 2000.
[9] T. Özsu, P. Valduriez. Principles of Distributed Database Systems. Prentice
Hall, 2nd edition, 1999.
[10] T. Özsu, P. Valduriez. Distributed and Parallel Database Systems -
Technology and current state-of-the-art. ACM Computing Surveys, 28(1), 1996.
[11] E. Pacitti. Improving Data Freshness in Replicated Databases. PhD Thesis,
INRIA-RR 3617, 1999.
[12] E. Pacitti, O. Dedieu. Algorithms for Optimistic Replication on the Web.
Journal of the Brazilian Computing Society, 2002, to appear.
[13] E. Pacitti, P. Minet, E. Simon. Replica Consistency in Lazy Master
Replicated Databases. Distributed and Parallel Databases, 9(3), 2001.
[14] E. Pacitti. Preventive Lazy Replication in Cluster Systems. Technical
Report RR-2002-01, CRIP5, University Paris 5, 2002.
[15] M. Patiño-Martínez, R. Jiménez-Peris, B. Kemme, G. Alonso. Scalable
Replication in Database Clusters. In 14th Int. Conf. on Distributed Computing
(DISC), 2000.
[16] D. Powell et al. Group Communication (special issue). Communications of
the ACM, 39(4), 1996.
[17] U. Röhm, K. Böhm, H.-J. Schek. Cache-Aware Query Routing in a Cluster
of Databases. Int. Conf. on Data Engineering (ICDE), 2001.
[18] A. Sheth, M. Rusinkiewicz. Management of Interdependent Data:
Specifying Dependency and Consistency Requirements. Workshop on the
Management of Replicated Data, 1990.
[19] D. Stacey. Replication: DB2, Oracle, or Sybase. Database Programming &
Design. 7(12), 1994.
[20] P. Valduriez. Parallel Database Systems: open problems and new issues.
Int. Journal on Distributed and Parallel Databases, 1(2), 1993.
[21] G. Voelker et al. Implementing Cooperative Prefetching and Caching in a
Global Memory System. In ACM SIGMETRICS Conf. on Performance
Measurement, Modeling, and Evaluation, 1998.
[22] G. Weikum. Principles and Realization Strategies of Multilevel Transaction
Management. ACM Transactions on Database Systems (TODS), 16(1), 1991.
[23] K.-L. Wu, P. S. Yu, C. Pu. Divergence Control for Epsilon-Serializability.
In 8th Int. Conf. on Data Engineering (ICDE), 1992.
[24] H. Yu, A. Vahdat. Efficient Numerical Error Bounding for Replicated
Network Services. In Int. Conf. on Very Large Databases (VLDB), 2000.