Using A Rule Engine For Distributed Systems Management: An Exploration Using Data Replication
Quan Pham
Table of contents
1 Introduction ............................................................................................................................. 2
2 Background ............................................................................................................................. 3
2.1 Rule engines .................................................................................................................... 3
2.1.1 Autonomic computing ................................................................................................ 4
2.1.2 Drools Rule Engine ..................................................................................................... 5
2.2 Related work ................................................................................................................... 6
2.2.1 Autonomic computing ................................................................................................ 6
2.2.2 Autonomic Toolkit ...................................................................................................... 6
2.2.3 ABLE toolkit ............................................................................................................... 7
2.2.4 Kinesthetics eXtreme (KX)......................................................................................... 8
2.2.5 Challenges ................................................................................................................... 8
2.3 Data replication ............................................................................................................... 9
2.3.1 The Replica Location Service (RLS) .......................................................................... 9
2.3.2 Lightweight Data Replicator (LDR) ......................................................................... 10
2.3.3 Data Replication Service (DRS) ............................................................................... 11
2.3.4 Autonomous systems with data replication .............................................................. 12
3 System design & implementation ......................................................................................... 13
3.1 System design ............................................................................................................... 14
3.1.1 Control module ......................................................................................................... 14
3.1.2 Tool module .............................................................................................................. 14
3.1.3 Web service module .................................................................................................. 15
3.2 Implementation ............................................................................................................. 16
3.2.1 Control module ......................................................................................................... 16
3.2.2 Tool module .............................................................................................................. 16
3.2.3 Web service module .................................................................................................. 17
3.2.4 Data replication system rules .................................................................................... 17
4 Experiment and Discussion................................................................................................... 19
4.1 Implementation complexity .......................................................................................... 19
4.2 Execution performance ................................................................................................. 19
4.2.1 No failure during transfer, no replication site failure ............................................... 21
4.2.2 Some failure during transfer, no replication site failure ........................................... 23
4.2.3 Some transfer failures, one replication site failure ................................................... 24
5 Conclusions .......................................................................................................................... 25
6 Appendix: Rule complexity .................................................................................................. 26
7 References ............................................................................................................................. 26
1 INTRODUCTION
Dynamic changes in distributed systems are common, due to their many components and the fact
that different components are frequently subject to different policies. These changes can make it
difficult to construct applications that ensure functionality or performance properties required by
users [1]. To run efficiently and achieve high performance, applications must adapt to these
changes. (To use popular terminology, they must incorporate autonomic [2] capabilities.)
However, the logic required to perform this autonomic adaptation can be complex and hard to
implement and debug, especially when embedded deeply within application code. Thus, we ask:
might it be possible to reduce the complexity of distributed applications by using higher-level,
declarative approaches to specifying adaptation logic?
The following example illustrates some of the challenges. The Laser Interferometer Gravitational
Wave Observatory (LIGO), a multi-site national research facility, has faced a data management
challenge. They needed to replicate approximately 1 TB/day of data to multiple sites on two
continents securely, efficiently, robustly, and automatically. They also needed to keep track of
replica locations and to use the data in a multitude of independent analysis runs [3]. Yet while
the high-level goal is simple (ensure that data is replicated in a timely manner), its
implementation is difficult due to the fact that individual sites, network links, storage systems,
and other components can all fail independently. To address this problem, LIGO developed
the Lightweight Data Replicator (LDR) [4], an integrated solution that combines several basic
Grid components with other tools to provide an end-to-end system for managing data. Using
LDR, over 50 terabytes of data were replicated to sites in the U.S. and Europe between
2002 and 2005 [3]. LDR makes use of Globus Toolkit components to transfer data using the
GridFTP high-performance data transport protocol. LDR works well, but its replication logic is
embedded within a substantial body of code.
2 BACKGROUND
We first provide some background on rule engines, autonomic computing, and data replication,
and review prior related research.
2.1 Rule engines
A system with a large number of rules and facts may have many rules whose conditions are satisfied
by the same facts; such rules are said to be in conflict. Different rule engines use different conflict
resolution strategies to determine the order in which conflicting rules are executed.
A rule engine can execute rules using one of two methods: forward chaining and backward chaining
[5]. An engine can implement either method or, in a hybrid rule system, both. Forward chaining is a
"data-driven" method. When facts are inserted or updated, the rule engine uses the available facts and
inference rules to derive more facts until a goal is reached, at which point one or more matching rules
become true and are scheduled for execution. Hence the rule engine starts with facts and ends with a
conclusion.
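To make forward chaining concrete, the following minimal Drools rule is a sketch (using hypothetical
Temperature and Alarm fact classes that are not part of our system): inserting a matching fact activates
the rule, and the fact derived by the rule may in turn activate other rules.

rule "High temperature raises an alarm"
dialect "java"
when
    # data-driven: activated when a matching Temperature fact is inserted or updated
    $t : Temperature( value > 100 )
then
    # derive a new fact; rules that match Alarm facts may now become active in turn
    insert( new Alarm( $t.getSensorId() ) );
end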
Self-configuration: For a large computing system, installing and configuring the system is
error-prone and challenging. An autonomic system can configure itself automatically by
following predefined, high-level policies. The policies should specify what the components
in the system should accomplish, not how. For example, when a new component is introduced
into the system, it should learn the system configuration and adjust itself to fit into the whole
system.
Self-healing: When errors or failures occur in a large computing system, it usually takes
administrators and users considerable time and effort to diagnose and troubleshoot the problems.
Sometimes a problem disappears without its root cause ever being identified. An autonomic
computing system should be able to detect, diagnose, and repair such problems to some extent.
If it cannot fully repair the system, it should alert administrators or developers to the failure.
Self-optimization: An autonomic system continually looks for ways to improve its operation. It
should be able to tune itself toward better performance or lower cost. Through monitoring and
self-learning, the system should become progressively more efficient; this is a challenge for human
tuning in large, complex systems with hundreds of tuning parameters and configuration options.
Self-protection: An autonomic system should be able to identify and protect against malicious
attacks or cascading failures that are not repaired by self-healing. The system should, where
possible, avoid such attacks and failures through log monitoring and other methods.
2.1.2 Drools Rule Engine
The Drools rule engine [6] that we use in this work implements an extended Rete algorithm [7].
The Rete algorithm is a pattern-matching algorithm for implementing production rule systems; it
is more efficient than the naive approach of checking rules serially against the set of facts. The
Rete-based algorithm creates a generalized trie of nodes, in which each node corresponds to one
pattern in the conditional part of a rule. A path from the root of the trie to a leaf corresponds to
the complete conditional part of one rule. When a fact is inserted or updated, it is propagated
along the trie and nodes with a matching pattern are annotated. If all nodes on a path from the
root to a leaf are annotated, the corresponding rule is satisfied and triggered. The Drools Rete
implementation is called ReteOO, an optimized implementation of the Rete algorithm for
object-oriented systems.
Figure 1 shows the Drools architecture. Drools stores rules in its Production Memory and facts in
its Working Memory. Facts are asserted into the Working Memory, where they may then be
modified or retracted. Drools uses its Agenda to manage the execution order of conflicting rules.
The default conflict resolution strategies employed by Drools are salience (or priority, where the
user assigns a priority number to each rule and the conflicting rule with the highest priority is
executed first) and LIFO (based on an internally assigned action counter value).
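For example, the following sketch (two hypothetical rules, not taken from our rule file) shows how
salience orders the execution of two rules that are both activated by the same DataCatalog fact:

rule "Schedule replication"
dialect "java"
salience 10   # higher salience: executed first when both rules are activated
when
    $data : DataCatalog( requiredReplicaCount > replicaCount )
then
    # create a transfer for the missing replica ...
end

rule "Log replication backlog"
dialect "java"
salience 1    # lower salience: executed after the scheduling rule
when
    $data : DataCatalog( requiredReplicaCount > replicaCount )
then
    # record the backlog for monitoring ...
end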
Currently, Drools uses only the forward-chaining method (as of version 5.0); backward-chaining
support is planned for future releases.
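As a sketch of how an application embeds the engine (the classes below are the standard Drools 5 API
as we understand it; the DRL file name and the DataCatalog constructor are illustrative), rules are
compiled from a DRL file, facts are asserted into a stateful session, and fireAllRules() performs the
forward-chaining evaluation:

import org.drools.KnowledgeBase;
import org.drools.KnowledgeBaseFactory;
import org.drools.builder.KnowledgeBuilder;
import org.drools.builder.KnowledgeBuilderFactory;
import org.drools.builder.ResourceType;
import org.drools.io.ResourceFactory;
import org.drools.runtime.StatefulKnowledgeSession;

public class ReplicationEngineSketch {
    public static void main(String[] args) {
        // Compile the rule file into a knowledge base.
        KnowledgeBuilder kbuilder = KnowledgeBuilderFactory.newKnowledgeBuilder();
        kbuilder.add(ResourceFactory.newClassPathResource("replication.drl"), ResourceType.DRL);
        KnowledgeBase kbase = KnowledgeBaseFactory.newKnowledgeBase();
        kbase.addKnowledgePackages(kbuilder.getKnowledgePackages());

        // Assert facts into the Working Memory and run forward chaining.
        StatefulKnowledgeSession ksession = kbase.newStatefulKnowledgeSession();
        ksession.insert(new DataCatalog("/data/file1", 3));  // hypothetical fact constructor
        ksession.fireAllRules();
        ksession.dispose();
    }
}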
2.2 Related work
Several programming frameworks have been developed that seek to implement autonomic
management of parallel/distributed/grid applications, although in different ways. While
AutoMate [8], K-Components [9], SAFRAN [10], and the CoreGRID Component Model [11] all
provide distributed system-based component frameworks with autonomic capability, each
framework has been developed for a specific application. In our project, we want to use a
commodity off-the-shelf rule engine to show the generality and applicability of rule engines to
distributed systems management.
Based on the MAPE-K model for autonomic computing, there are many implementations in
both research and production areas. We review some of those implementations below.
2.2.2 Autonomic Toolkit
The Autonomic Toolkit provides a practical framework and reference implementation for
incorporating autonomic capabilities into software systems [1]. It is an open set of Java class
libraries, plug-ins, and tools created for the Eclipse development environment. It is implemented
in Java, using XML messages to communicate with other applications, for example when
analyzing the logs of a managed application. At the core, the Automated Management Engine (AME) hosts
deployed resource models. Resource models define event types, polling intervals, thresholds, and
actions to take when thresholds are crossed. The engine executes resource model scripts within a
control loop. It also stores operational data in an embedded local database.
The developers of the Autonomic Toolkit describe an application development suite that
provides software developers with a technology to develop autonomic applications, including
dynamically self-configuring network services such as DHCP, DNS, LDAP, and other server
platforms [1]. However, they do not present any performance measures.
2.2.3 ABLE toolkit
The authors describe the set of functionality provided in the ABLE toolkit and demonstrate its
utility via three application case studies: system administration, a diagnostic application, and an
auto-tuning agent for Apache web servers.
Figure 4 AbleRuleSet bean and inference engines
2.2.4 Kinesthetics eXtreme (KX)
KX [14] is an implementation of an easily integrable external monitoring infrastructure; Figure 5
gives an overview of the system. KX can be used to add autonomic self-management and
self-healing functionality to legacy systems that were not designed with autonomic properties in
mind. Its developers describe three use cases, in failure detection, load balancing, and email
processing, to demonstrate their solution. KX is implemented in Java, using the Little-JIL [15]
formalism and the ACME ADL [16].
The Event Distiller performs sophisticated cross-stream temporal event pattern analysis and
correlation to monitor desirable or undesirable behaviors by performing time-based pattern
matching. Internally, according to the authors, the Event Distiller uses a collection of
nondeterministic state engines for temporal complex event pattern matching.
2.2.5 Challenges
So far there has been no comprehensive work on evaluation criteria or metrics for autonomic
computing; what it means for an autonomic system to perform well depends on the system.
Evaluation criteria can be challenging to define, as the evaluation may not be based on the
increased performance of the system but on its ability to meet a certain SLA. One possible
metric is convergence, together with the time the system takes to converge to some predefined
stable state. Alternatively, a representative Grand Challenge Application could be established
(e.g., keep this system running for a week without any human intervention) that would allow
differing techniques to be compared and rated.
2.3 Data replication
2.3.1 The Replica Location Service (RLS)
An RLS deployment consists of Local Replica Catalog (LRC) and Replica Location Index (RLI)
services, as in Figure 6. An LRC stores the mappings between logical file names and the physical
locations of replicas, and is responsible for discovering the replicas corresponding to each logical
file name. An RLI stores information about the logical-name mappings held by one or more
LRCs. It is used in a distributed RLS deployment and can answer user queries about the LRCs: a
user queries the RLI to find which LRCs contain mappings for a logical file name, and then
queries those LRCs for the physical locations of the replicas.
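The two-level lookup can be sketched as follows (the client interfaces and method names are
hypothetical placeholders, not the actual Globus RLS client API):

import java.util.ArrayList;
import java.util.List;

// Sketch: ask the RLI which LRCs hold mappings for a logical file name,
// then ask each of those LRCs for the physical locations of the replicas.
public class RlsLookupSketch {
    interface LrcClient { List<String> physicalLocations(String logicalFileName); }
    interface RliClient { List<LrcClient> lrcsWithMappingsFor(String logicalFileName); }

    static List<String> findReplicas(RliClient rli, String logicalFileName) {
        List<String> locations = new ArrayList<String>();
        for (LrcClient lrc : rli.lrcsWithMappingsFor(logicalFileName)) {
            locations.addAll(lrc.physicalLocations(logicalFileName));
        }
        return locations;
    }
}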
The RLI state is soft state that gets periodically refreshed by subsequent updates. To reduce the
network and update delays, RLS uses Bloom filter [18] bitmaps to compress the updates.
RLS performance has been measured with millions of entries and one hundred requesting
threads, both for a single RLS server and for a distributed RLS with multiple LRCs and RLIs [19].
The LRC achieves query rates of 1700 to 2100 per second, add rates of 600 to 900 per second,
and delete rates of 470 to 570 per second.
However, we note that the RLS does not check the correctness or consistency of RLS entries.
The RLS is just a registry that allows users to register mappings; it is up to users or other
applications to determine what, how, and where to replicate and to register the result with the
RLS. Also, if replicas are modified, the users must inform the RLS to update the mappings.
2.3.2 Lightweight Data Replicator (LDR)
In a typical LDR deployment (Figure 7), each site runs a GridFTP server with local storage, a
Globus RLS service with one LRC and one RLI, a Metadata Catalog, a Scheduler Daemon, and a
Transfer Daemon for file transport.
The Transfer Daemon periodically checks the Priority Queue, uses the RLS to find the locations
of a logical file name, and then chooses randomly among the available remote sites to retrieve
the file in a pull model.
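The pull model can be sketched as the following loop (a simplified approximation with hypothetical
helper interfaces; the actual LDR daemon is considerably more elaborate):

import java.util.List;
import java.util.Random;

// Sketch of the Transfer Daemon's pull loop: take the next wanted file from the
// priority queue, look up its replica locations, and pull it from a random remote site.
public class TransferDaemonSketch {
    interface PriorityQueueClient { String nextWantedLogicalFile(); }
    interface ReplicaCatalog { List<String> locate(String logicalFileName); }
    interface Fetcher { void pull(String sourceUrl); }

    private final Random random = new Random();

    void runOnce(PriorityQueueClient queue, ReplicaCatalog rls, Fetcher gridftp) {
        String lfn = queue.nextWantedLogicalFile();
        if (lfn == null) {
            return;  // nothing to replicate right now
        }
        List<String> sources = rls.locate(lfn);  // replica catalog lookup of remote copies
        if (sources.isEmpty()) {
            return;  // no remote replica available yet
        }
        String source = sources.get(random.nextInt(sources.size()));  // pick a random available site
        gridftp.pull(source);  // retrieve the file (pull model)
    }
}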
Although LDR can be considered a minimal collection of components necessary for fast,
efficient, robust, and secure replication of data, it lacks the flexibility needed by users with more
complicated scenarios.
2.3.3 Data Replication Service (DRS)
The Globus Data Replication Service [20] is a set of flexible, composable, general-purpose,
higher-level data management services to support Grid applications. DRS was designed with the
aim of generalizing LDR's publication functionality to achieve independence from the LIGO
infrastructure. DRS is built on the GT4 Delegation Service, RFT, RLS, and GridFTP services.
In the discovery phase, the Replicator locates existing replicas of the requested files. Next, in the
transfer phase (8, 9, 10, 11, 12), the Replicator passes control to the RFT resource and waits for
the GridFTP transfer to complete. Then, in the registration phase (13, 14), it adds the new
mappings to the RLS services.
DRS performance can vary considerably on operations such as discovery (from 307 to 5371
milliseconds) and registration (from 295 to 4305 milliseconds) [20]. Note that any user
replication request must specify the desired files, identified by their logical file names, and the
desired destination locations, identified by URLs. Hence there is no automation in the selection
of remote replication sites or any consistency check for replicas.
2.3.4 Autonomous systems with data replication
One recent work on autonomous data replication is [21]. This replication system is designed to
provide a suitable replica location that minimizes file access time according to a user-specified
Round Trip Time (RTT) requirement; Figure 9 gives an overview.
The ant algorithm is self-organized, adaptive, and distributed. The system uses the ant algorithm
to explore participating nodes without any prior configuration of the environment, initial
conditions, or topology. The ants walk along the DKS ring, collecting information about each
place they pass and recording the best position (according to the RTT) in their statuses. The
destinations of these ants are the nodes in the first level (level 0) of the DKS routing table of the
node from which the ants are sent out. At each step, the default next destination for an ant is the
successor of the current node; hence it will eventually cover the entire DKS ring.
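Under our reading of [21], one ant's walk can be sketched roughly as follows (the Node interface
and RTT measurement are hypothetical simplifications):

// Sketch: an ant walks successor-by-successor along the DKS ring towards its
// destination, recording the node with the best (lowest) RTT seen so far.
public class AntWalkSketch {
    interface Node {
        Node successor();         // next node on the DKS ring
        long measureRttMillis();  // RTT from the requesting client to this node
    }

    static Node walk(Node start, Node destination) {
        Node best = start;
        long bestRtt = start.measureRttMillis();
        Node current = start;
        while (current != destination) {
            current = current.successor();        // default next hop: successor of the current node
            long rtt = current.measureRttMillis();
            if (rtt < bestRtt) {                  // remember the best position so far in the ant's status
                bestRtt = rtt;
                best = current;
            }
        }
        return best;
    }
}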
This paper gives some thought to using an autonomous system for data replication. However,
the use of the ant algorithm ties it to a specific application, which is not flexible enough for a
general framework for autonomous data management.
Figure 9 Autonomous system with data replication overview
The rule engine's working memory maintains facts describing the following attributes (see the
Java sketch after this list):
File status
o file replication status
o number of replications
o location of replications
Replication site status
o site availability
o number of files replicated on that site
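A minimal sketch of the corresponding fact classes follows (field and method names are
illustrative; the actual classes carry additional bookkeeping):

import java.util.ArrayList;
import java.util.List;

// Facts held in the engine's working memory: one DataCatalog per file to be
// replicated and one ReplicationSite per remote site.
class DataCatalog {
    static final int STATUS_AVAILABLE = 1;

    private final String logicalFileName;
    private final int requiredReplicaCount;                 // desired number of replicas
    private int status = STATUS_AVAILABLE;                  // file replication status
    private final List<ReplicationSite> replicationSites    // locations of existing replicas
            = new ArrayList<ReplicationSite>();

    DataCatalog(String logicalFileName, int requiredReplicaCount) {
        this.logicalFileName = logicalFileName;
        this.requiredReplicaCount = requiredReplicaCount;
    }

    public int getStatus() { return status; }
    public int getRequiredReplicaCount() { return requiredReplicaCount; }
    public int getReplicaCount() { return replicationSites.size(); }  // number of replications
    public void addReplicationSite(ReplicationSite site) { replicationSites.add(site); }
}

class ReplicationSite {
    private boolean available = true;                        // site availability
    private final List<DataCatalog> dataCatalogs             // files replicated on this site
            = new ArrayList<DataCatalog>();

    public boolean isAvailable() { return available; }
    public void setAvailable(boolean available) { this.available = available; }
    public int getReplicatedFileCount() { return dataCatalogs.size(); }
    public void addDataCatalog(DataCatalog data) { dataCatalogs.add(data); }
}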
Figure 10 Core system design
Figure 11 Web service module
3.2 Implementation
We describe the control, tool, and web service modules, and the rules that implement the data
replication logic. We note that our implementation requires the following components:
3.2.3 Web service module
The web service wrappers (server and client) of the system use the Globus Toolkit Java Web
Service Core. This module provides the communication protocol between a user's client and the
replication service on a server. In our experiments, however, the interaction between client and
server is not counted towards system performance.
3.2.4 Data replication system rules
We present one rule in detail to give a flavor of what our rules look like. The following rule
specifies the name "New DataCatalog" and indicates that the rule is written in the Java dialect.
The rule contains two parts: the conditional part, defined in the when clause, and the consequence
part, defined in the then clause.
rule ":ew DataCatalog"
dialect "java"
when
# total number of replicas and in-progress replicas does not meet requirement
$data : DataCatalog(
status == DataCatalog.STATUS_AVAILABLE,
requiredReplicaCount > replicaCount )
then
Config.appendLog("INFO: JOB START: start replicate " + $data + " to " + $site);
insert ( new DataTransfer ( $data, $site, $session ) );
modify ( $data ) { addReplicationSite ($site) };
modify ( $site ) { addDataCatalog ($data) };
modify ($tranferCounter) { inc() };
end
The rule is re-evaluated whenever any of the following facts change:
A new DataCatalog object $data is inserted or updated in the WorkingMemory of the rule engine.
A ReplicationSite object $site is inserted or updated.
A DataTransfer object is updated or removed.
A transferCounter is updated.
The conditional part is designed to evaluate to true if a new replica should and can be created at a
particular site. More specifically, it evaluates to true if all of the following conditions hold:
The DataCatalog has available status and has fewer than the required number of replicas.
The ReplicationSite is available.
There is no DataTransfer object that represents a replica of this $data on the $site.
The number of parallel transfers is less than some setting (here, the value 20).
If the conditional part evaluates to true, then the consequence (then) part is executed. In this
rule, the consequence part will:
Insert a new DataTransfer object to perform the transfer. In the implementation of the
DataTransfer class, upon construction, the DataTransfer object starts a new thread to transfer
the given file to the given remote replication site. Once the transfer is finished, the DataTransfer
object updates its status (success/failure) in the WorkingMemory of the engine (a sketch follows
this list).
Modify the DataCatalog and ReplicationSite objects to update the new replica information.
Update the counter of current parallel transfers.
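A minimal sketch of the DataTransfer behavior described above is shown below (the transfer call
is a placeholder for our GridFTP tool module, and the Session interface is a simplified stand-in for
the handle that lets the object update itself in working memory):

// Sketch: a DataTransfer starts a worker thread on construction and, when the
// transfer completes, pushes its new status back into the engine's working memory.
class DataTransfer {
    interface Session { void update(Object fact); }  // simplified working-memory handle

    static final int IN_PROGRESS = 0, SUCCESS = 1, FAILURE = 2;

    private volatile int status = IN_PROGRESS;

    DataTransfer(final DataCatalog data, final ReplicationSite site, final Session session) {
        new Thread(new Runnable() {
            public void run() {
                boolean ok = doGridFtpTransfer(data, site);  // placeholder transfer call
                status = ok ? SUCCESS : FAILURE;
                session.update(DataTransfer.this);           // triggers rule re-evaluation
            }
        }).start();
    }

    public int getStatus() { return status; }

    private boolean doGridFtpTransfer(DataCatalog data, ReplicationSite site) {
        return true;  // stands in for the GridFTP client call in the tool module
    }
}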
4 EXPERIMENT AND DISCUSSION
4.1 Implementation complexity
Most of the application code is contained within the modules that implement the client interface
and perform the file transfer operations. The business logic proper is expressed concisely in the
Drools declarative language. All system logic is maintained in one file that is cleanly separated
from the rest of the application code.
Ideally, we would have compared the code size of our implementation with that of other systems.
In practice, this was not easy to do. Nevertheless, our review of technical descriptions of other
data replication systems leads us to believe that our implementation is significantly less complex
and much easier to extend to incorporate new functionality.
4.2 Execution performance
In these experiments we vary the following parameters:
number of files
file size
replication ratio
network failure rate (intermittent failures of individual transfers)
replication site failure rate (a replication site goes down completely)
The runtime experiments mainly aim to assess the overhead due to management of the rule
engine.
We run our experiments on Teraport, an IBM e1350 eServer cluster based upon the AMD
Opteron architecture. We run the replication service on one IBM e325 node with two 2.2 GHz
AMD64 processors, 4 GB RAM, and 80 GB local disk. The data that is to be replicated is
located on the local disk of this node. Four GridFTP servers are located on four other nodes with
the same hardware configuration. These nodes are connected via a switch with available
bandwidth of 1 Gb/s per node. For simple comparison and configuration, GridFTP transfers are
not striped, and the application can only have a pre-defined maximum number of concurrent
connections.
Each experiment proceeds in two phases. First, the files that are to be replicated must be
identified. At the start of the experiment, the system is given a set of directories for replication.
The system crawls those directories to add all files in those directories to the WorkingMemory of
the rule engine. This part of the experiment is the Add-File period in the result graphs. During
this period, no rule is fired; only the WorkingMemory of the engine is filled with file
information.
The second part of the experiment is the Transfer-File period. In this period, the system
repeatedly fires all the rules in the engine. If no rule is satisfied because of currently ongoing
transfers, the system waits five seconds before trying to trigger all rules again. In our
experiments, the system stops once no transfer is scheduled after a pass in which all rules are
fired. Each experiment is run five times; we record the average run time in the graphs.
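The two phases can be sketched as the following control loop (helper names and the DataCatalog
constructor are illustrative; the actual control module also handles timing and logging):

import java.io.File;
import org.drools.runtime.StatefulKnowledgeSession;

// Sketch of one experiment run: an Add-File phase that only fills working memory,
// followed by a Transfer-File phase that repeatedly fires rules until nothing is scheduled.
class ExperimentSketch {
    static void run(StatefulKnowledgeSession ksession, File[] directories) throws InterruptedException {
        // Add-File period: crawl the directories and insert one fact per file; no rules are fired yet.
        for (File dir : directories) {
            for (File f : dir.listFiles()) {
                ksession.insert(new DataCatalog(f.getAbsolutePath(), 3));  // replicate each file 3 times
            }
        }
        // Transfer-File period: fire all rules; if work is still outstanding, wait and retry.
        while (true) {
            int fired = ksession.fireAllRules();
            if (fired == 0 && pendingTransfers(ksession) == 0) {
                break;  // no transfer scheduled after a full pass: the run is complete
            }
            Thread.sleep(5000);  // wait five seconds before trying to trigger all rules again
        }
    }

    static int pendingTransfers(StatefulKnowledgeSession ksession) {
        return 0;  // placeholder: would count unfinished DataTransfer facts in working memory
    }
}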
4.2.1 No failure during transfer, no replication site failure
We first present the results of experiments in which we measure the performance of the system
when replicating large numbers of files via a variety of methods. These experiments are designed
to measure the overheads associated with the rule engine.
4.2.1.1 Simulation Transfer
This first set of experiments is designed to evaluate rule engine performance. Each time the rule
New DataCatalog is fired, the construction of a DataTransfer object will result in a new
connection to a remote replication site and a start of the transfer. However, to exclude the
influence of network instability and other environment variables, we do not make the real
connection to replication sites. Thus, the actual transfer takes zero seconds to finish, allowing us
to observe the performance of the application without any external influence.
Our results, in Figure 12, show that even when performing no actual transfers, the add-file period
is less than 2% of the transfer-file period. As, in addition, the add-file time appears to grow
roughly linearly with the number of files, we conclude that the add-file operation will likely not
be a significant contributor to overall execution time, even for a large number of files. Thus, in
all of the graphs and discussion that follow, we present only total runtimes: we no longer
separate transfer-file and add-file times.
4.2.1.2 Small file transfer using GridFTP
In this second set of experiments, we transfer files via the Globus Toolkit implementation of the
GridFTP protocol. Each file is transferred over a distinct connection and thus incurs the
authentication, startup, and teardown costs of the GridFTP protocol. The file size ranges from
1 KByte to 5 KBytes, and the average total transfer time for each file is less than one second
(including directory creation time if needed). Future development of this system should reuse
GridFTP connections to remove the overhead of authentication and connection startup.
[Chart omitted: runtime in seconds (0-4500) versus number of files (1000-10000) for small file
transfer using GridFTP]
Figure 14 Comparison of replication times when using GridFTP and simulated transfers
We see in Figure 13 and Figure 14 that runtime displays a roughly linear trend when replicating
between 10^3 and 10^4 files. The ~250-second difference between the simulated-transfer case and
the GridFTP-transfer case is surprisingly large and seems likely to involve more than the delay
of the GridFTP transfers. Perhaps we are seeing an increase in memory usage and stress on the
system when calling the GridFTP library. This difference should be investigated further.
Since the experiments using simulation and the experiments using GridFTP show great similarity,
all of the following experiments are based on simulation with zero transfer time.
4.2.2 Some failure during transfer, no replication site failure
Figures 15 and 16 show our results. We see that runtime increases roughly linearly as the transfer
failure rate changes from 0% to 20%. At an 18% transfer failure rate, the runtime increases by
30%. With an 18% failure rate and failed transfers retried until they succeed, the expected
fraction of additional re-transfers is 0.18 / (1 - 0.18), or roughly 22%.
The actual runtime increase of 30% is presumably due to extra processing performed by the
replication system: for example, facts being retracted (to delete failed transfers) and inserted (to
add new transfers), and rule matching/triggering. More investigation may be needed to explain
the difference.
4.2.3 Some transfer failures, one replication site failure
In this experiment, replication site failure is detected by an external script and reported to the
rule engine via the rule engine web service interface. The unavailability of the replication site is
reflected in the rule engine working memory as a change in the availability attribute of the
corresponding object.
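A sketch of how such a report is applied to the working memory is shown below (the web service
plumbing is omitted, the fact-lookup step is simplified, and method names follow the Drools 5 API
as we understand it):

import org.drools.runtime.StatefulKnowledgeSession;
import org.drools.runtime.rule.FactHandle;

// Sketch: when the external script reports a failed site, flip the availability
// attribute of the corresponding ReplicationSite fact and update it in working
// memory, which re-activates any rules that depend on site availability.
class SiteFailureHandlerSketch {
    static void reportSiteFailure(StatefulKnowledgeSession ksession, ReplicationSite site) {
        site.setAvailable(false);                        // mark the site as down
        FactHandle handle = ksession.getFactHandle(site);
        ksession.update(handle, site);                   // propagate the change to the rule engine
        ksession.fireAllRules();                         // recovery rules can now schedule new replicas
    }
}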
We evaluate three different scenarios. In each case, we perform some action after the replication
process has stabilized and measure the time that the system takes to respond to that action.
In the first experiment, we delete a single file from a monitored directory. We observe that the
replica management system takes 0.01 seconds to respond to this deletion by deleting the three
replicas that have been created for that file.
In the second experiment, we add a single file to a monitored directory. We observe that the
replica management system takes 0.01 seconds to respond to this addition by creating three
replicas for the new file. Note that in these experiments, we are simulating transfers, so that 0.01
seconds does not include the time required to transfer the replicated file.
In the third experiment, we take down an entire replication site. Figure 17 shows the time
required to recover from the loss of a site as a function of the number of files that are being
replicated. As noted above, we have four replica sites in this experiment, and each file is to be
replicated three times. Thus, if we are replicating N files, we will have 3N replicas after
stabilization, and will lose 3N/4 replicas when a single site is removed. Therefore, recovery
requires the creation of 3N/4 new replicas. This activity should have a cost similar to ~3/4 of a
replication process with a replication rate of 1, and thus we also give that data in Figure 17.
Surprisingly, the times for the two activities are quite different. One possible explanation is the
different states of the system in the recovery vs. the warm-up processes. In the replication
process (warming up before stabilizing), each new object is inserted into the Working Memory
and must be pattern-matched against a data set that is still being constructed; whereas in the
recovery process, each replacement object is pattern-matched against a data set that has already
been organized. These behaviors demand more investigation.
5 CONCLUSIONS
We have explored the feasibility of using a rule engine to implement distributed systems
management functionality by using a specific rule engine (Drools) to implement a particular
distributed systems management function (replica management). Our Drools-based replica
management system allows the user to specify, in a declarative fashion, high-level objectives
(e.g., that a specified number of replicas should be maintained for each file) and associated
business logic (e.g., if too few replicas exist for a file, a new replica should be created; if too
many replicas exist, one should be deleted). The Drools-based system then evaluates these rules
against a database of facts representing the current state of the overall system, and executes
appropriate actions (e.g., create or delete replicas) as required. We have evaluated our solution
from the perspectives of both complexity and performance, with satisfactory results.
We conclude from this experiment that it is indeed feasible to use a rule engine - and Drools in
particular - to implement distributed system management logic. We have not engaged in any
usability studies, but the compact and readable nature of the rules that underpin our
implementation makes us feel that this approach should be highly attractive to developers. In
future work, we should both implement yet more sophisticated behaviors and measure the
effectiveness of developers as they add new capabilities.
We have also evaluated the performance of the Drools rule engine from the perspective of our
application. We note that runtime increases linearly, as we might expect, with the number of files
that must be replicated. In some settings (e.g., if many files are to be replicated and sites
frequently change availability), we may want to explore alternative implementation
approaches: e.g., the replication of collections rather than individual files.
Our results also suggest a range of other topics for future work. From the perspective of
performance, we would like to investigate the upper limit on the number of files and replication
sites a rule engine can handle, and the stability and performance of the replication system after
the warm-up process. It would also be interesting to explore other rule engine implementations
(perhaps not in Java), and to explore opportunities for distributed rule engine implementations.
From a semantic perspective, we would like to investigate more complex replication policies,
such as policies that seek to maximize replication performance by taking into account network
topology or that vary replication rates based on recent loss rates. We are also interested in
exploring situations in which multiple stakeholders impose policies that must be satisfied
simultaneously, as for example when individual sites impose constraints on the maximum space
that can be used for different purposes.
7 REFERENCES
3. Large-scale data replication for LIGO. Available from:
http://www.globus.org/solutions/data_replication/.
7. Forgy CL. Rete: A fast algorithm for the many pattern/many object pattern match problem.
Artif Intell. 1982;19(1):17-37.
8. Agarwal M, Bhat V, Liu H, Matossian V, Putty V, Schmidt C, et al. AutoMate: Enabling
autonomic applications on the grid. In: Autonomic Computing Workshop; 2003. p. 48-57.
11. CoreGRID. D.PM.04 basic features of the grid component model (assessed). CoreGRID
NoE deliverable series, Institute on Programming Model. Feb. 2007.
12. IBM. An architectural blueprint for autonomic computing. Tech. rep., IBM; 2003.
13. Bigus JP, Schlosnagle DA, Pilgrim JR, Mills III WN, Diao Y. ABLE: A toolkit for building
multiagent autonomic systems. IBM Systems Journal. 2002;41(3):350-71.
14. Kaiser G, Parekh J, Gross P. Kinesthetics eXtreme: An external infrastructure for
monitoring distributed legacy systems; 2003. p. 22-30.
15. Wise A, Cass AG, Lerner BS, McCall EK, Osterweil LJ, Sutton SM Jr. Using Little-JIL to
coordinate agents in software engineering. In: Proceedings of the Fifteenth IEEE International
Conference on Automated Software Engineering (ASE 2000); 2000. p. 155-63.
16. Schmerl B, Garlan D. Exploiting architectural design knowledge to support self-repairing
systems. In: SEKE '02: Proceedings of the 14th International Conference on Software
Engineering and Knowledge Engineering; Ischia, Italy. New York, NY, USA: ACM; 2002. p. 241-8.
18. Bloom BH. Space/time trade-offs in hash coding with allowable errors. Commun ACM.
1970;13(7):422-6.
19. Chervenak AL, Palavalli N, Bharathi S, Kesselman C, Schwartzkopf R. Performance and
scalability of a replica location service. In: HPDC '04: Proceedings of the 13th IEEE International
Symposium on High Performance Distributed Computing; Washington, DC, USA: IEEE
Computer Society; 2004. p. 182-91.
20. Chervenak A, Schuler R, Kesselman C. Wide area data replication for scientific
collaborations. In: Proceedings of the 6th International Workshop on Grid Computing; 2005.
22. Ghodsi A. Distributed K-ary system: Algorithms for distributed hash tables [dissertation].
Stockholm, Sweden: KTH Royal Institute of Technology; 2006.
23. Resnick M. Turtles, termites, and traffic jams: Explorations in massively parallel
microworlds (Complex Adaptive Systems). The MIT Press; 1997.