
Fault-Injection and Dependability Benchmarking for Grid Computing Middleware

Sébastien Tixeuil1, Luis Moura Silva2,
William Hoarau1, Gonçalo Jesus2, João Bento2, Frederico Telles2

1 LRI – CNRS UMR 8623 & INRIA Grand Large, Université Paris Sud XI, France
Email: [email protected]
2 Departamento Engenharia Informática, Universidade de Coimbra, Polo II, 3030-Coimbra, Portugal
Email: [email protected]

Abstract. In this paper we present some work on dependability benchmarking for Grid Computing that represents a common view between two groups of CoreGRID: INRIA Grand Large and the University of Coimbra. We present a brief overview of the state of the art, followed by a presentation of the FAIL-FCI system from INRIA, which provides a tool for fault injection in large distributed systems. Then we present DBGS, a Dependability Benchmark for Grid Services. We conclude the paper with some considerations about the avenues of research ahead to which both groups would like to contribute on behalf of the CoreGRID network.

1 Introduction

One of the topics of paramount importance in the development of Grid middleware is the impact of faults, since their probability of occurrence in a Grid infrastructure and in large-scale distributed systems is actually very high. It is therefore mandatory that Grid middleware be reliable itself and provide comprehensive support for fault-tolerance mechanisms, such as failure detection, checkpointing, replication, software rejuvenation, and component-based reconfiguration, among others. One technique to evaluate the effectiveness of those fault-tolerance mechanisms and the reliability level of the Grid middleware is to use a fault-injection tool and robustness tester to conduct an experimental assessment of the dependability metrics of the target system. In this paper, we present a software fault-injection tool and a workload generator for Grid Services that can be used for dependability benchmarking in Grid Computing.
The final goal of our common work is to contribute to the definition of a dependability benchmark for Grid computing and to provide a set of tools and techniques that can be used by the developers of Grid middleware and Grid-based applications to conduct dependability benchmarking of their systems.
In this paper we present a fault-injection tool for large-scale distributed systems (developed by INRIA Grand Large) and a workload generator for Grid Services (being developed by the University of Coimbra) that together cover the four components of a dependability benchmark: a workload, a faultload, a set of dependability measures, and the benchmark procedures. To the best of our knowledge, the combination of these two tools represents the most complete testbed for dependability benchmarking of Grid applications.
The remainder of this paper is organized as follows. Section 2 summarizes the related work. Section 3 describes the FAIL-FCI infrastructure from INRIA. Section 4 briefly describes DBGS, a dependability benchmarking tool for Grid Services. Section 5 concludes the paper.

2 Related Work

In this section we present a summary of the state-of-the-art in the two main topics of
this paper: dependability benchmarking and fault-injection tools.

2.1 Dependability Benchmarking

The idea of dependability benchmarking is now a hot topic of research [1] and there are already several publications in the literature. In [2], a dependability benchmark for transactional systems (DBench-OLTP) is proposed. Another dependability benchmark for transactional systems is proposed in [3]; this one considers a faultload based on hardware faults. A dependability benchmark for operating systems is proposed in [4]. Research work developed at Berkeley University has led to the proposal of a dependability benchmark to assess human-assisted recovery processes [5]. The work carried out in the context of the Special Interest Group on Dependability Benchmarking (SIGDeB), created by the IFIP WG 10.4, has resulted in a set of standardized availability classes to benchmark database and transactional servers [6]. Research work at Sun Microsystems defined a high-level framework [7] dedicated specifically to availability benchmarking. Within this framework, they have developed two benchmarks: one [8] addresses specific aspects of a system's robustness in handling maintenance events, such as the replacement of a failed hardware component or the installation of a software patch, while the other is related to system recovery [9]. At IBM, the Autonomic Computing initiative is also developing benchmarks to quantify a system's level of autonomic capability, addressing the four main areas of IBM's self-management model: self-configuration, self-healing, self-optimization, and self-protection [10]. We are looking at this initiative in detail, and our aim is to introduce some of these metrics into Grid middleware in order to reduce the maintenance burden and increase the availability of Grid applications in production environments. Finally, in [11] the authors present a dependability benchmark for web servers. In some way we follow this trend by developing a benchmark for SOAP-based Grid services.

2.2 Fault-injection Tools

When considering solutions for software fault injection in distributed systems, there are several important parameters to consider. The main criterion is the usability of the fault-injection platform: if it is more difficult to write fault scenarios than to write the tested applications themselves, those fault scenarios are likely to be dropped from the set of performed tests. The issues in testing component-based distributed systems have already been described, and methodologies for testing components and systems have already been proposed [12-13]. However, testing for fault tolerance remains a challenging issue. Indeed, in available systems the fault-recovery code is rarely executed in the test-bed, because faults rarely get triggered. Since the ability of a system to perform well in the presence of faults depends on the correctness of this fault-recovery code, it is mandatory to actually test it. Testing based on fault injection can be used to test for fault tolerance by injecting faults into a system under test and observing its behavior. The most obvious requirement is that simple tests (e.g. every few minutes or so, a randomly chosen machine crashes) should be simple to write and deploy. On the other hand, it should be possible to inject faults in very specific cases (e.g. in a particular global state of the application), even if this requires a better understanding of the tested application. Also, decoupling the fault-injection platform from the tested application is a desirable property, as it lets different groups concentrate on different aspects of fault tolerance.

Decoupling requires that no modification of the tested application's source code be necessary to inject faults. Moreover, having experts in fault tolerance test particular scenarios for applications they have no knowledge of favors describing fault scenarios in a high-level language that abstracts practical issues such as communication and scheduling. Finally, to properly evaluate a distributed application in the presence of faults, the impact of the fault-injection platform should be kept low, even if the number of machines is high. Of course, this impact is doomed to increase with the complexity of the fault scenario: when every action of every processor is likely to trigger a fault action, for example, injecting those faults will induce an overhead that is certainly not negligible.
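As an illustration of the first requirement, the kind of "simple test" mentioned above (every few minutes, a randomly chosen machine crashes) should take no more than a few lines to express. The following Python sketch is purely illustrative and is not FAIL code; it assumes passwordless SSH access to the nodes and a tested process hypothetically named "worker".

import random
import subprocess
import time

NODES = ["node01", "node02", "node03", "node04"]   # machines taking part in the experiment (assumption)
TARGET = "worker"        # name of the tested process (hypothetical)
PERIOD = 300             # seconds between two injected crashes

while True:
    time.sleep(PERIOD)
    victim = random.choice(NODES)
    # Crash the target process on the chosen node (requires passwordless SSH).
    subprocess.run(["ssh", victim, "pkill", "-9", TARGET], check=False)
    print("injected crash of", TARGET, "on", victim)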
Several fault injectors for distributed systems already exist. Some of them are dedicated to distributed real-time systems, such as DOCTOR [14]. ORCHESTRA [15] is a fault-injection tool that allows the user to test the reliability and the liveness of distributed protocols. ORCHESTRA is a "message-level fault injector", because a fault-injection layer is inserted between two layers in the protocol stack. This kind of fault injector makes it possible to inject faults without requiring any modification of the protocol source code. However, the expressiveness of the fault scenarios is limited, because there is no communication between the various state machines executed on each node. Moreover, as fault injection is based on exchanged messages, knowledge of the type and the size of these messages is required. In any case, these approaches do not fit the cluster and Grid category of applications.

The NFTAPE project [16] arose from the double observation that no single tool is sufficient to inject all fault models and that it is difficult to port a particular tool to different systems. Although NFTAPE is modular and very portable, its completely centralized decision process makes it very intrusive: its execution strongly perturbs the system being tested. Finally, writing a scenario quickly becomes complex because of the centralized nature of the decisions taken during tests that involve numerous nodes.
LOKI [17] is a fault injector dedicated to distributed systems. It is based on a partial view of the global state of the distributed system. An a posteriori analysis is executed at the end of the test to infer a global schedule from the various partial views and then verify whether faults were injected correctly (i.e. according to the planned scenario). However, LOKI requires modification of the source code of the tested application. Furthermore, fault scenarios are based only on the global state of the system, and it is difficult (if not impossible) to specify more complex fault scenarios (for example, injecting "cascading" faults). Also, LOKI provides no support for randomized fault injection.
Mendosus [18] is a fault-injection tool for system-area networks that is based on the emulation of clusters of computers and of different network configurations. This tool made some first steps in the injection and assessment of faults in large distributed systems, although FCI has gone some steps further. Finally, [19] presents a fault-injection tool that was specially developed to assess the dependability of Grid (OGSA) middleware. This is the work most closely related to ours, and we welcome the first contributions made by those authors in the area of grid middleware dependability. However, the tool described in that paper is very limited, since it only allows the injection of faults into the XML messages of the OGSA middleware, which seems rather far from the real faults experienced in real systems.
In the rest of the paper we present two tools, for fault injection and workload generation, that complement each other quite well and, used together, may represent an interesting package for developers of Grid middleware and applications.

3 FAIL-FCI Framework from INRIA

In this section, we describe the FAIL-FCI framework from INRIA. First, FAIL (FAult Injection Language) is a language that makes it easy to describe fault scenarios. Second, FCI (FAIL Cluster Implementation) is a distributed fault-injection platform whose input language for describing fault scenarios is FAIL. Both components are developed as part of the Grid eXplorer project [20], which aims at emulating large-scale networks on smaller clusters or grids.

The FAIL language allows the definition of fault scenarios. A scenario describes, using a high-level abstract language, state machines which model fault occurrences. The FAIL language also describes the association between these state machines and a computer (or a group of computers) in the network. The FCI platform (see Figure 1) is composed of several building blocks:

1. The FCI compiler: The fault scenarios written in FAIL are pre-compiled
by the FCI compiler which generates C++ source files and default
configuration files.
2. The FCI library: The files generated by the FCI compiler are bundled
with the FCI library into several archives, and then distributed across the
network to the target machines according to the user-defined
configuration files. Both the FCI compiler generated files and the FCI
library files are provided as source code archives, to enable support for
heterogeneous clusters.
3. The FCI daemon: The source files that have been distributed to the target
machines are then extracted and compiled to generate specific executable
files for every computer in the system. Those executables are referred to
as the FCI daemons. When the experiment begins, the distributed
application to be tested is executed through the FCI daemon installed on
every computer, to allow its instrumentation and its handling according to
the fault scenario.
Our approach is based on the use of a software debugger. Like the Mantis parallel debugger [21], FCI communicates to and from gdb (the Free Software Foundation's portable sequential debugging environment) through Unix pipes. But contrary to the Mantis approach, communication with the debugger is kept to a minimum to guarantee a low overhead of the fault-injection platform: in our approach, the debugger is only used to trigger and inject software faults. The tested application can be interrupted when it calls a particular function or when it executes a particular line of its source code, and its execution can be resumed depending on the considered fault scenario.

With FCI, every physical machine is associated with a fault-injection daemon. The fault scenario is described in a high-level language and compiled into C++ code, which is distributed to the machines participating in the experiment. This C++ code is compiled on every machine to generate the fault-injection daemon. Once this preliminary task has been performed, the experiment is ready to be launched. The daemon associated with a particular computer consists of:
1. a state machine implementing the fault scenario,
2. a module for communicating with the other daemons (e.g. to inject faults based on a global state of the system),
3. a module for time management (e.g. to allow time-based fault injection),
4. a module to instrument the tested application (by driving the debugger), and
5. a module for managing events (to trigger faults).

FCI is thus a debugger-based fault injector, because the injection of faults and the instrumentation of the tested application are performed through a debugger. This makes it possible to avoid modifying the source code of the tested application, while still enabling the injection of arbitrary faults (modification of the program counter or of local variables to simulate a buffer-overflow attack, etc.). From the user's point of view, specifying a fault scenario written in FAIL is sufficient to define an experiment. The source code of the fault-injection daemons is generated automatically. These daemons communicate with each other explicitly according to the user-defined scenario. This allows the injection of faults based either on a global state of the system or on more complex mechanisms involving several machines (e.g. a cascading fault injection). In addition, the fully distributed architecture of the FCI daemons makes the platform scalable, which is necessary in the context of emulating large-scale distributed systems.


[Figure 1 (diagram): a fault scenario written in FAIL is translated by the FAIL compiler into source code, which a C++ compiler turns into the FCI daemons deployed on every node of the experiment.]

Figure 1: The FCI platform.

FCI daemons have two operating modes: a random mode and a deterministic mode. These two modes allow fault injection based either on a probabilistic fault scenario (in the first case) or on a deterministic and reproducible fault scenario (in the second case). Using a debugger to trigger faults also limits the intrusiveness of the fault injector during the experiment. Indeed, the debugger places breakpoints which correspond to the user-defined fault scenario and then runs the tested application. As long as no breakpoint is reached, the application runs normally and the debugger remains inactive.

FAIL-FCI has been used to assess the dependability of XtremWeb [22], and results are being collected that allow us to assess the effectiveness of some fault-tolerance techniques that can be applied to desktop grids.

4 DBGS: Dependability Benchmark for Grid Services

DBGS is a dependability benchmark for Grid Services that follow the OGSA specification [23]. Since the OGSA model is based on SOAP technology, we have developed a benchmark tool for SOAP-based services. This benchmark includes the four components mentioned in Section 1: (a) the definition of a workload for the system under test (SUT); (b) the optional definition of a faultload for the SUT; (c) the collection and definition of the benchmark measures; and (d) the definition of the benchmark procedures. The DBGS is composed of the components presented in Figure 2.

 
V WV
T XH
3 5H
6 2$

Figure 2: Experimental setup overview of the DBGS benchmark.

The system under test (SUT) consists of a SOAP server running some Grid or Web Service. From the point of view of the benchmark, the SUT corresponds to an application server, a SOAP router, and a Grid service that will execute under some workload and, optionally, will be affected by some faultload.
The Benchmark Management System (BMS) is a collection of software tools that allows the automatic execution of the benchmark. It includes a module for the definition of the benchmark (its procedures and rules) and of the workload that will be applied to the SUT, as well as a module that collects all the benchmark results and produces the final results, expressed as a set of dependability metrics. The BMS may activate a set of clients (running on separate machines) that inject the defined workload into the SUT by making SOAP requests to the target Grid Service. The execution of these client machines is synchronized in time, and the partial results collected by each individual client are merged into a global set of results from which the final dependability metrics are computed. The BMS also includes a reporting tool that presents the final results in a readable, graphical format.

The results generated by each benchmark run are expressed as throughput over time (requests per second along a time axis), the total turnaround time of the execution, the average latency, the functionality of the services, the occurrence of failures in the Grid service/server, the characterization of those failures (crash, hang, zombie server), the correctness of the final results at the server side, and the failure scenarios observed at the client machines (explicit SOAP error messages or time-outs).
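The sketch below shows how some of these measures (throughput over time, average latency, and a breakdown of client-side failure scenarios) could be derived from the per-request records collected by the clients; the record format is the one assumed in the previous sketch.

from collections import Counter

def summarize(results):
    # results: list of (timestamp, latency, outcome) records merged from all clients.
    if not results:
        return {}
    t0 = min(ts for ts, _, _ in results)
    per_second = Counter(int(ts - t0) for ts, _, _ in results)   # requests per second
    latencies = [lat for _, lat, _ in results]
    outcomes = Counter(outcome for _, _, outcome in results)     # e.g. ok, time-out, SOAP error
    return {
        "throughput_over_time": dict(sorted(per_second.items())),
        "average_latency": sum(latencies) / len(latencies),
        "failure_scenarios": {k: v for k, v in outcomes.items() if k != "ok"},
    }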

On the SUT side, four modules are also part of the DBGS benchmark: a faultload injector, a configuration manager, a collector of benchmark results, and a watchdog of the SUT system.

The faultload injector does not inject faults directly into the software, unlike the fault-injection tools previously mentioned in Section 2. This injector only produces an impact at the operating-system level: it consumes operating-system resources such as memory, threads, file handles, database connections, and sockets. We have observed that Grid and Web-Services middleware is not robust enough, because the underlying middleware (e.g. the application server and the SOAP implementation) becomes very unreliable when operating-system resources are scarce, for example due to memory leakage, memory exhaustion, or the over-creation of threads. These are the scenarios we want to generate with this faultload module. This means that software bugs are not directly emulated by this module, but rather by a tool like FAIL-FCI.
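A minimal sketch of the kind of resource-exhaustion faultload described above is shown below; the chunk sizes, thread counts and durations are arbitrary assumptions and would have to be tuned to the SUT machine.

import threading
import time

def exhaust_memory(chunk_mb=50, chunks=100, hold_seconds=60):
    # Gradually allocate memory and hold it, emulating memory leakage/exhaustion.
    hog = [bytearray(chunk_mb * 1024 * 1024) for _ in range(chunks)]
    time.sleep(hold_seconds)
    del hog

def over_create_threads(count=2000, hold_seconds=60):
    # Spawn a large number of idle threads to stress the middleware.
    threads = [threading.Thread(target=time.sleep, args=(hold_seconds,))
               for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Example: run both stressors on the SUT machine while the workload is active.
# exhaust_memory(); over_create_threads()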

The configuration manager helps in the definition of the configuration parameters of the SUT middleware. It is clear that the configuration parameters may have a considerable impact on the robustness of the SUT system. By changing those parameters in different runs of the benchmark, we can assess their impact on the results expressed as dependability metrics.

Finally, the SUT system should also be installed with a module to collect raw data from the benchmark execution. This data is then sent to the BMS server, which merges and compares it with the data collected from the client machines. The final module is a SUT watchdog that detects when the SUT system crashes or hangs while the benchmark is executing. When a crash or hang is detected, the watchdog restarts the SUT system and its associated applications, thereby allowing an automatic execution of the benchmark runs without user intervention.
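A minimal watchdog of this kind could be sketched as follows; the health-check URL and the restart command are hypothetical and depend on the SUT installation.

import subprocess
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8080/grid-service"    # hypothetical health-check endpoint
RESTART_CMD = ["/etc/init.d/app-server", "restart"]  # hypothetical restart command

def sut_alive(timeout=5):
    try:
        urllib.request.urlopen(HEALTH_URL, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True      # the server answered, even if with an HTTP error code
    except Exception:
        return False     # crash, hang (time-out) or refused connection

while True:
    if not sut_alive():
        # Restart the SUT and its associated applications, then give it time to come up.
        subprocess.run(RESTART_CMD, check=False)
        time.sleep(30)
    time.sleep(10)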

We have been collecting a large set of experimental results with this tool. The results are not presented here for lack of space but, in summary, this benchmark tool has allowed us to spot some of the resource leaks that can be found in SOAP implementations currently used in Grid services; these problems may completely undermine the dependability of Grid applications.

5 Conclusions and Future Work

This paper presented a fault-injection tool for large-scale distributed systems that is currently being used to measure the fault-tolerance capabilities included in XtremWeb, and a second tool that can be used directly for dependability benchmarking of Grid Services that follow the OGSA model and are therefore implemented using SOAP technology. These two tools fit together quite well, since their targets are truly complementary. We believe that these two CoreGRID groups will provide a valuable contribution in the area of dependability benchmarking for Grid Computing, and our cooperative work has a long avenue ahead with several research challenges. At the end of the road, we hope to have contributed to increasing the dependability of Grid middleware and applications through the deployment of these tools to the community.

6 Acknowledgements

This research work is carried out in part under the FP6 Network of Excellence
CoreGRID funded by the European Commission (Contract IST-2002-004265).

References

1. P. Koopman, H. Madeira. "Dependability Benchmarking & Prediction: A Grand Challenge Technology Problem", Proc. 1st IEEE Int. Workshop on Real-Time Mission-Critical Systems: Grand Challenge Problems, Phoenix, Arizona, USA, Nov. 1999.
2. M. Vieira and H. Madeira, “A Dependability Benchmark for OLTP Application
Environments”, Proc. 29th Int. Conf. on Very Large Data Bases (VLDB-03), Berlin,
Germany, 2003.
3. K. Buchacker and O. Tschaeche, “TPC Benchmark-c version 5.2 Dependability
Benchmark Extensions”, http://www.faumachine.org/papers/tpcc-depend.pdf, 2003.
4. A. Kalakech, K. Kanoun, Y. Crouzet and A. Arlat. “Benchmarking the Dependability
of Windows NT, 2000 and XP”, Proc. Int. Conf. on Dependable Systems and
Networks (DSN 2004), Florence, Italy, IEEE CS Press, 2004.
5. A. Brown, L. Chung, W. Kakes, C. Ling, D. A. Patterson, "Dependability
Benchmarking of Human-Assisted Recovery Processes", Dependable Computing and
Communications, DSN 2004, Florence, Italy, June, 2004

6. D. Wilson, B. Murphy and L. Spainhower. "Progress on Defining Standardized Classes for Comparing the Dependability of Computer Systems", Proc. DSN 2002 Workshop on Dependability Benchmarking, Washington, D.C., USA, 2002.
7. J. Zhu, J. Mauro, I. Pramanick. "R3 – A Framework for Availability Benchmarking", Proc. Int. Conf. on Dependable Systems and Networks (DSN 2003), USA, 2003.
8. J. Zhu, J. Mauro, and I. Pramanick. "Robustness Benchmarking for Hardware Maintenance Events", Proc. Int. Conf. on Dependable Systems and Networks (DSN 2003), pp. 115-122, San Francisco, CA, USA, IEEE CS Press, 2003.
9. J. Mauro, J. Zhu, I. Pramanick. “The System Recovery Benchmark,” in Proc. 2004
Pacific Rim Int. Symp. on Dependable Computing, Papeete, Polynesia, 2004.
10. S. Lightstone, J. Hellerstein, W. Tetzlaff, P. Janson, E. Lassettre, C. Norton, B. Rajaraman and L. Spainhower. "Towards Benchmarking Autonomic Computing Maturity", 1st IEEE Conf. on Industrial Informatics (INDIN 2003), Canada, August 2003.
11. J. Durães, M. Vieira and H. Madeira. "Dependability Benchmarking of Web-
Servers", Proc. 23rd International Conference, SAFECOMP 2004, Potsdam,
Germany, September 2004. Lecture Notes in Computer Science, Volume 3219/2004
12. S. Ghosh, A. P. Mathur. "Issues in Testing Distributed Component-Based Systems", 1st Int. ICSE Workshop on Testing Distributed Component-Based Systems, 1999.
13. H. Madeira, M. Zenha Rela, F. Moreira, and J. G. Silva. "RIFLE: A General Purpose Pin-level Fault Injector". In European Dependable Computing Conference, pages 199–216, 1994.
14. S. Han, K. Shin, and H. Rosenberg. "DOCTOR: An Integrated Software Fault Injection Environment for Distributed Real-Time Systems", Proc. Computer Performance and Dependability Symposium, Erlangen, Germany, 1995.
15. S. Dawson, F. Jahanian, and T. Mitton. "ORCHESTRA: A Fault Injection Environment for Distributed Systems". Proc. 26th International Symposium on Fault-Tolerant Computing (FTCS), pages 404–414, Sendai, Japan, June 1996.
16. D. T. Stott et al. "NFTAPE: A Framework for Assessing Dependability in Distributed Systems with Lightweight Fault Injectors". In Proceedings of the IEEE International Computer Performance and Dependability Symposium, pages 91–100, March 2000.
17. R. Chandra, R. M. Lefever, M. Cukier, and W. H. Sanders. "LOKI: A State-Driven Fault Injector for Distributed Systems". In Proc. of the Int. Conf. on Dependable Systems and Networks, June 2000.
18. X. Li, R. Martin, K. Nagaraja, T. Nguyen, B. Zhang. "Mendosus: A SAN-based Fault-Injection Test-Bed for the Construction of Highly Available Network Services", Proc. 1st Workshop on Novel Uses of System Area Networks (SAN-1), 2002.
19. N. Looker, J. Xu. "Assessing the Dependability of OGSA Middleware by Fault-Injection", Proc. 22nd Int. Symposium on Reliable Distributed Systems (SRDS), 2003.
20. http://www.lri.fr/~fci/GdX
21. S. Lumetta and D. Culler. “The Mantis parallel debugger”. In Proceedings of
SPDT’96: SIGMETRICS Symposium on Parallel and Distributed Tools, pages 118–
126, Philadelphia, Pennsylvania, May 1996.
22. G. Fedak, C. Germain, V. Néri, and F. Cappello. “XtremWeb: A generic global
computing system”. Proc. of IEEE Int. Symp. on Cluster Computing and the Grid,
2001.
23. I.Foster, C. Kesselman, J.M. Nick and S. Tuecke. “Grid Services for Distributed
System Integration”, IEEE Computer June 2002.
24. J. Kephart. “Research Challenges of Autonomic Computing”, Proc. ICSE05,
International Conference on Software Engineering, May 2005
