Automatic Data Structure Repair for Self-Healing Systems

Brian Demsky, Martin Rinard

Laboratory for Computer Science
Massachusetts Institute of Technology
Cambridge, MA 02139

∗ This research was supported in part by a fellowship from the Fannie and John Hertz Foundation, DARPA Contract F33615-00-C-1692, NSF Grant CCR00-86154, and NSF Grant CCR00-63513.

ABSTRACT
We have developed a system that accepts a specification of key data structure constraints, then dynamically detects and repairs violations of these constraints, enabling the program to recover from otherwise crippling errors to continue to execute productively. We present our experience using our system to repair violated constraints in a simplified version of the ext2 file system and in the CTAS air-traffic control program. Our experience indicates that the specifications are relatively straightforward to develop and that our technique enables the applications to effectively recover from data structure corruption errors.

1. INTRODUCTION
Any system that operates successfully for an extended period of time inevitably sustains and must recover from some form of damage. Development errors make software systems vulnerable to self-inflicted damage that may cause the system to crash, corrupt key data structures, or otherwise execute unacceptably. Data structure corruption can become especially problematic for persistent data structures since the corruption persists across system reboots and, unless repaired, can permanently impair the ability of the system to execute acceptably.

This paper proposes a new approach to recovering from data structure corruption. We have developed a tool that accepts a specification of key data structure consistency properties [7]. It uses this specification to automatically detect and repair violations of these consistency properties, enabling the system to recover from the inconsistency and continue to execute successfully within its designed operating envelope. This technique promises to dramatically increase the ability of software systems to automatically detect and recover from data structure corruption errors without the need for external operator intervention.

Our tool supports several different usage scenarios. It can be used in stand-alone mode to repair persistent data structures. It can also be used to repair the volatile data structures of a running program, with the repair applied either on program demand or to recover from an execution error such as an addressing violation.

Our approach involves two data structure views: a concrete view at the level of the bits in memory and an abstract view at the level of relations between abstract objects. The abstract view facilitates both the specification of the data structure consistency properties and the reasoning required to repair any inconsistencies.

Each specification contains model definition rules, a set of internal consistency constraints, and a set of external consistency constraints. The model definition rules identify the different kinds of objects and relations in the abstract view and define a translation from the concrete data structure to the abstract model. The internal consistency constraints capture the consistency properties of the data structure; these constraints are expressed at the level of abstract objects and relations in the model. The external consistency constraints capture the relationship between the model and the concrete data structure; our tool uses the external consistency constraints to translate any repairs from the model back into the concrete data structure. The repair algorithm operates as follows:

• Inconsistency Detection: It evaluates the constraints in the context of the current data structures to find consistency violations.

• Disjunctive Normal Form: It converts each violated constraint into disjunctive normal form; i.e., a disjunction of conjunctions of basic propositions. Each basic proposition has a repair action that will make the proposition true. For the constraint to hold, all of the basic propositions in at least one of the conjunctions must hold.

• Repair: The algorithm repeatedly selects a violated constraint, chooses one of the conjunctions in that constraint's normal form, then applies repair actions to all of the basic propositions in that conjunction that are false. A repair cost heuristic biases the system toward choosing the repairs that perturb the existing data structures the least.

We have developed a complete implementation of the data structure repair tool. The implementation consists of approximately 13,000 lines of C++ code. The source code for the tool and sample specifications are available at http://www.cag.lcs.mit.edu/~bdemsky/repair.

2. FILE SYSTEM CASE STUDY
We next discuss how our approach can be used to automatically detect and repair inconsistencies in a simplified version of the ext2 Linux file system. Like many Unix file systems, this file system has a superblock, an inode for every file, and uses bitmaps to facilitate the allocation and deallocation of inodes and disk blocks.
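Before turning to the details of the file system, the three-step repair algorithm outlined in the introduction can be sketched in Python. This is only an illustration under stated assumptions, not the tool's C++ implementation: the `Proposition` class, its `cost` field, the `repair_all` driver, and the toy model are all hypothetical names introduced here, and each constraint is assumed to be already in disjunctive normal form.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Model = Dict[str, set]


@dataclass
class Proposition:
    """A basic proposition plus a repair action that makes it true (hypothetical)."""
    holds: Callable[[Model], bool]
    repair: Callable[[Model], None]
    cost: int = 1  # assumed cost of running the repair action


# A constraint in disjunctive normal form: a list of conjunctions,
# each conjunction being a list of basic propositions.
DNF = List[List[Proposition]]


def is_satisfied(constraint: DNF, model: Model) -> bool:
    # The constraint holds if every proposition in at least one conjunction holds.
    return any(all(p.holds(model) for p in conj) for conj in constraint)


def repair_cost(conj: List[Proposition], model: Model) -> int:
    # Only propositions that are currently false need a repair action.
    return sum(p.cost for p in conj if not p.holds(model))


def repair_all(constraints: List[DNF], model: Model, max_rounds: int = 100) -> None:
    for _ in range(max_rounds):
        violated = [c for c in constraints if not is_satisfied(c, model)]
        if not violated:
            return
        constraint = violated[0]
        # Cost heuristic: choose the conjunction that perturbs the model the least.
        conj = min(constraint, key=lambda cj: repair_cost(cj, model))
        for p in conj:
            if not p.holds(model):
                p.repair(model)
    if any(not is_satisfied(c, model) for c in constraints):
        raise RuntimeError("repair did not converge within max_rounds")


def move_free_block(model: Model) -> None:
    # Repair action: move the lowest-numbered free block into InodeBitmapBlock.
    block = min(model["FreeBlocks"])
    model["FreeBlocks"].remove(block)
    model["InodeBitmapBlock"].add(block)


# Toy constraint: sizeof(InodeBitmapBlock) = 1, a single conjunction with one proposition.
non_empty = Proposition(
    holds=lambda m: len(m["InodeBitmapBlock"]) == 1,
    repair=move_free_block,
)
model = {"InodeBitmapBlock": set(), "FreeBlocks": {2, 5, 7}}
repair_all([[[non_empty]]], model)
print(model)
```

The bounded loop is a design choice of this sketch; it is unrelated to the termination analysis that the tool itself performs on specifications.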
2.1 Data Layout Declarations
The declarations in Figure 1 specify the layout of data for selected portions of the file system. The structure definition language is similar to C with extensions to support packed bit arrays, a form of structural inheritance, and variable sized arrays.

The Disk struct declaration specifies that the disk consists of an array of Block objects, with the superblock stored in the first disk block and the groupblock stored in the second disk block. The superblock defines the parameters of the disk: the size and number of the disk blocks in the file system, the number of inodes, and the inode for the root directory. The groupblock contains references to the inode table and to the inode and block bitmaps. There is a bit in the block bitmap for each block in the file system and a bit in the inode bitmap for each inode in the file system. If the block or inode is currently used, this bit is set to true. Otherwise, it is set to false.

    Disk disk;

    struct Disk {
      Block b[disk.superblock.NumberofBlocks];
      label b[0]: Superblock superblock;
      label b[1]: Groupblock groupblock;
    }

    struct Block {
      reserved byte[disk.superblock.BlockSize];
    }

    struct Superblock subtype of Block {
      int FreeBlockCount;
      int FreeInodeCount;
      int NumberofBlocks;
      int NumberofInodes;
      int RootDirectoryInode;
      int BlockSize;
    }

    struct Groupblock subtype of Block {
      int BlockBitmapBlock;
      int InodeBitmapBlock;
      int InodeTableBlock;
      int GroupFreeBlockCount;
      int GroupFreeInodeCount;
    }

Figure 1: Data Layout Declarations

This file system has many consistency properties and can become corrupted in many ways. We focus on the following properties:

1. Presence of File System Structures: Basic file system structures should be present.

2. Bitmap Consistency: The inode and block bitmaps should be consistent with the use of the inode and blocks on the disk.

3. Reference Count Consistency: An inode's reference count should be consistent with the number of directory entries referencing it.

4. Free Counts Correct: The counts for free blocks and inodes should be consistent.

5. Block Usage Consistency: A given block should be used by at most one disk structure.

Notice that these constraints are stated at the level of abstract concepts such as blocks and inodes and not at the level of bits on the disk. We believe that this is the natural way that developers think about such constraints and that they would like to express their consistency properties at this level of abstraction. The abstract data structure view allows the developer to think about the data structures at this level.

2.2 Object Model
Figure 2 presents the object and relation declarations for the abstract representation used in our example. This abstraction contains three main categories of objects: Blocks, Inodes, and DirectoryEntries. The first declaration in Figure 2 specifies that the abstract model uses integers to identify the Blocks (this simplifies the correspondence between the abstract blocks in the model and the concrete disk blocks) and that the set of Blocks is partitioned into UsedBlocks and FreeBlocks. The set of UsedBlocks is further partitioned into different sets corresponding to the uses of blocks in the file system. These sets are the SuperBlock set, the GroupBlock set, the FileDirectoryBlock set, the InodeTableBlock set, the InodeBitmapBlock set, and the BlockBitmapBlock set. The FileDirectoryBlocks set is further partitioned into the FileBlocks set and the DirectoryBlocks set. The use of partitions ensures that a block cannot be used simultaneously in two different disk structures. Similarly, we have chosen to represent Inodes in the abstract representation with their integer index in the inode table and appropriately partitioned this set. Finally, the DirectoryEntries set contains the set of directory entries in the disk. This set represents the used directory entries in the file system.

    set Blocks of integer: partition UsedBlocks | FreeBlocks
    set UsedBlocks of integer: partition SuperBlock |
        GroupBlock | FileDirectoryBlocks | InodeTableBlock |
        InodeBitmapBlock | BlockBitmapBlock
    set FileDirectoryBlock of integer: DirectoryBlocks |
        FileBlocks
    set Inodes of integer: partition UsedInodes | FreeInodes
    set UsedInodes of integer: partition FileInodes |
        DirectoryInodes
    set DirectoryInodes of integer: RootDirectoryInode
    set DirectoryEntries of DirectoryEntry:
    relation blockstatus: Blocks -> string
    relation contents: UsedInodes -> FileDirectoryBlocks
    relation inodestatus: Inodes -> string
    relation referencecount: Inodes -> integer
    relation filesize: Inodes -> integer
    relation inodeof: DirectoryEntries -> UsedInodes

Figure 2: Object and Relation Declarations

The model uses relations to capture important properties of the objects in the model and to represent relationships between the objects. So, the blockstatus relation captures information in the block bitmap by mapping blocks to the set {Free, Used}. Similarly, the inodestatus relation maps Inodes to the set {Free, Used}, capturing the information in the inode bitmap. The relation referencecount maps Inodes to the corresponding reference count. The relation filesize maps Inodes to their corresponding size in bytes. The contents and inodeof relations capture relationships between the objects in the model. Specifically, contents maps UsedInodes to the FileDirectoryBlocks that contain the contents of the file or directory, and inodeof maps DirectoryEntries to their corresponding UsedInodes.
2.3 Model Construction
Given the declarations of the objects and relations, we are now in a position to present the model definition, which translates the concrete data structure into the abstract model. Figure 3 presents part of these model definition declarations, which our tool uses to construct the abstract model. The intention is that the key high level consistency constraints will be enforced in the abstract model. Low level validity constraints (for example, that block references are in a valid range) are checked during the translation process so that low level errors are not a concern at the model level.

    [], true => 0 in SuperBlock
    [], true => 1 in GroupBlock
    [], disk.groupblock.InodeTableBlock<
        disk.superblock.NumberofBlocks =>
        disk.groupblock.InodeTableBlock in InodeTableBlock
    [], disk.groupblock.InodeBitmapBlock<
        disk.superblock.NumberofBlocks =>
        disk.groupblock.InodeBitmapBlock in InodeBitmapBlock
    [for i in UsedInode, for itb in InodeTableBlock,
        for j=0 to 11], cast(InodeTable,disk.b[itb]).itable[i].
        Blockptr[j]<disk.superblock.NumberofBlocks and
        !cast(InodeTable,disk.b[itb]).itable[i].Blockptr[j]=0
        => cast(InodeTable,disk.b[itb]).itable[i].Blockptr[j]
        in FileDirectoryBlocks
    [for j=0 to disk.superblock.NumberofBlocks-1],
        !(j in UsedBlocks) => j in FreeBlocks

Figure 3: Model Definition Declarations

The first set of declarations specifies the sets of Blocks in the file system. For example, the first two declarations set up the SuperBlock set and the GroupBlock set. The omitted declarations define the various sets of Inodes in the file system, the DirectoryEntries set, and the various relations in the abstract model. Our tool interprets these definitions to derive an algorithm that constructs the objects and relations in the model, setting the stage for the definition and enforcement of the consistency properties.

2.4 Internal Constraints
Internal constraints capture the consistency properties that can be expressed using the model alone. We anticipate that these constraints will typically be used to capture the most important structural constraints. Figure 4 presents the set of internal constraints in our example.

    [for u in UsedInodes], u.inodestatus=Used
    [for f in FreeInodes], f.inodestatus=Free
    [for i in UsedInodes], i.referencecount=
        sizeof(inodeof.i)
    [for i in UsedInodes], i.filesize<=
        sizeof(i.contents)*8192
    [],sizeof(RootDirectoryInode)=1
    [for u in UsedBlocks], u.blockstatus=Used
    [for f in FreeBlocks], f.blockstatus=Free
    [for b in FileDirectoryBlocks],sizeof(contents.b)=1
    [],sizeof(BlockBitmapBlock)=1 and sizeof(SuperBlock)=1
    [],sizeof(InodeTableBlock)=1 and sizeof(GroupBlock)=1
    [],sizeof(InodeBitmapBlock)=1

Figure 4: Internal Consistency Constraints

The first two constraints in Figure 4 ensure that the inodestatus relation is consistent with the use of the Inodes. The third constraint ensures that the referencecount function returns the number of times an Inode is referenced by the inodeof relation. The fourth constraint ensures that the filesize function is consistent with the number of Blocks an Inode has. The fifth constraint ensures that the file system has a root directory. The next two constraints ensure that the blockstatus relation is consistent with the use of the Blocks. The next constraint ensures that a given block is in at most one file or directory. And the remaining constraints ensure that various disk structures exist. As this example illustrates, the internal constraints typically capture the important structural properties of the data structures and their relationships, and ensure that various parts of the data structures are consistent with each other.

2.5 Inconsistency Detection and Repair
Figure 5 presents a diagram of an inconsistent file system: the index of the block containing the inode bitmap is corrupted and two inodes reference the same block.

The first step in the inconsistency detection and repair process is to use the layout declarations, the model declarations, and the model definition rules to construct the abstract model. For the corrupt file system shown in Figure 5 the tool would generate the sets and relations in Figure 6.

    SuperBlock={0}
    GroupBlock={1}
    FileDirectoryBlock={6,8}
    BlockBitmapBlock={3}
    InodeTableBlock={4}
    InodeBitmapBlock={}
    FileBlocks={6,...}
    DirectoryBlocks={8}
    FreeBlocks={2,5,7,...}
    FileInodes={0,1}
    RootDirectoryInode={2}
    FreeInodes={...}
    DirectoryEntries={D0,D1...}
    blockstatus={⟨0,Used⟩, ⟨1,Used⟩, ⟨2,Free⟩, ⟨3,Used⟩,
        ⟨4,Used⟩, ⟨5,Used⟩, ⟨6,Used⟩, ⟨7,Free⟩, ⟨8,Used⟩, ...}
    contents={⟨0,6⟩, ⟨1,6⟩, ⟨2,8⟩}
    inodestatus={}
    referencecount={⟨0,1⟩, ⟨1,1⟩, ⟨2,0⟩}
    filesize={⟨0,100⟩, ⟨1,200⟩, ⟨2,8192⟩}
    inodeof={⟨D0,0⟩, ⟨D1,1⟩}

Figure 6: Abstract Representation for Corrupted File System

    InodeBitmapBlock={2}
    FreeBlocks={5,7,...}
    blockstatus={⟨0,Used⟩, ⟨1,Used⟩, ⟨2,Used⟩, ⟨3,Used⟩,
        ⟨4,Used⟩, ⟨5,Free⟩, ⟨6,Used⟩, ⟨7,Free⟩, ⟨8,Used⟩, ...}
    contents={⟨0,6⟩, ⟨2,8⟩}
    inodestatus={⟨0,Used⟩, ⟨1,Used⟩, ⟨2,Used⟩, ...}
    filesize={⟨0,100⟩, ⟨1,0⟩, ⟨2,8192⟩}

Figure 7: Changes to the Abstract Representation of the Corrupted File System

The abstract representation shown in Figure 6 violates many of the constraints in Figure 4. For example, the empty InodeBitmapBlock set violates the last constraint in Figure 4. Furthermore, the fact that the relation blockstatus has the tuple ⟨2, Used⟩ and the set FreeBlocks contains the block 2 violates the seventh constraint shown in Figure 4; the block is not used for anything but is marked Used. The fact that the contents relation contains two tuples that reference block 6 violates the eighth constraint.

The inconsistency detection algorithm evaluates the internal constraints over the model. In our example, this evaluation uncovers the consistency violations described above.
[Figure 5: Corrupted ext2 file system. Disk layout diagram with blocks labeled Super Block, Group Block, Block Bitmap, Inode Table, and Inode Bitmap; one field is marked "Corrupted Value".]

For each such violation, it identifies the violated constraint and the specific objects and relations that violate the constraint. For example, the tool discovers that (among other violations) block 6 is referenced by multiple inodes, indicating a violation of the constraint that each block is in at most one file or directory. When the tool discovers a violation, it executes a sequence of actions to repair the violation. It first converts the constraint into disjunctive normal form, i.e., a disjunction (ors) of conjunctions (ands) of basic propositions and negated basic propositions. The basic propositions capture basic numerical requirements on the values involved in relations (for example, a certain value must be less than a certain expression), constraints on the sizes of sets and relations, and objects or pairs that must be included in specific sets or relations. Each basic proposition comes with an action that is guaranteed to make the proposition true and an action that is guaranteed to make the proposition false. Depending on the form of the proposition, these repair actions may calculate a value that ensures that the constraint does or does not hold, or insert or remove objects or pairs from sets or relations.

To repair a violated constraint, the tool chooses one of the conjunctions in the normal form, then repairs all of the violated basic propositions in the conjunction. At this point the constraint is no longer violated, and the tool proceeds on to the next violated constraint.¹

¹ The reader may be concerned that the repair process may not terminate. We have implemented a specification analysis algorithm that determines if the repair process will always terminate for a given specification; the tool uses this algorithm to reject any specifications that might generate repair sequences that do not terminate.

In our example, the tool discovers that the set InodeBitmapBlock is empty, which violates the last constraint in Figure 4. In this case, there is only one conjunction in the disjunctive normal form of the constraint; the tool must therefore insert an object into this set to satisfy the constraint. The repair action moves a block from the FreeBlocks set to the InodeBitmapBlock set (because the object and relation declarations in Figure 2 specify that FreeBlocks and UsedBlocks partition the set of blocks, and InodeBitmapBlock is a UsedBlock, the repair action must remove the new InodeBitmapBlock from the set of FreeBlocks). The tool then moves on to repair the other violations, producing the repaired model in Figure 7.

2.6 External Constraints
External constraints constrain the relation between the abstract model and the concrete data structures. Figure 8 presents several of the external constraints for our example. The first four constraints ensure that any newly created structures in the model are written back out to disk. The next two constraints translate the model repairs of the InodeTableBlock back to the disk.

    [for u in InodeTableBlock], true =>
        disk.groupblock.InodeTableBlock=u
    [for u in InodeBitmapBlock], true =>
        disk.groupblock.InodeBitmapBlock=u
    [for u in RootDirectoryInode], true =>
        disk.superblock.RootDirectoryInode=u
    [for i in UsedInode, for itb in InodeTableBlock,
        for j=0 to 11], j<sizeof(i.contents) => cast(
        InodeTable, disk.b[itb]).itable[i].Blockptr[j]=
        element j of i.contents
    [for i in UsedInode, for itb in InodeTableBlock,
        for j=0 to 11], !j<sizeof(i.contents) =>
        cast(InodeTable,disk.b[itb]).itable[i].Blockptr[j]=0

Figure 8: External Consistency Constraints

For our example, these constraints translate the repairs made in the abstract representation (shown in Figure 7) to the concrete data structures on the disk. The repaired version of the concrete data structure shown in Figure 5 is shown in Figure 9. Notice that our tool has regenerated the inode bitmap. Furthermore, the illegal block sharing has been removed, and the block bitmap is consistent with the use of blocks in the file system.

2.7 Experience
We developed a fault insertion strategy designed to simulate the effect of potential inconsistencies.² Our fault insertion mechanism simulates the effect of a system crash: it shuts down the file system (potentially in the middle of an operation that requires several disk writes), then discards the cached state. Our workload opens and writes several files, closes the files, then reopens the files to verify that the data was written correctly. We crash the system part of the way through writing the files, then rerun the workload. The second run overwrites the partially written files and checks that the final versions are correct.

² Fault insertion was originally developed in the context of software testing to help evaluate the coverage of testing processes [17]. It has also been used by other researchers for the purposes of evaluating standard failure recovery techniques such as duplication, checkpointing, and fast reboot [2]. The rationale behind fault insertion is that faults, while serious when they do occur, occur infrequently enough to seriously complicate the experimental investigation of failure recovery techniques. Fault insertion makes it practical to evaluate proposed recovery techniques on a range of faults.

In all of our tested cases, the algorithm is able to repair the file system and the workload correctly runs to completion. Without repair, files end up sharing inodes and disk blocks and the file contents are incorrect. For a file system with 1024 8KB blocks, our repair tool takes 1.5 seconds on an IBM ThinkPad X23 with an 866 MHz Pentium III processor and 384 MB of RAM running RedHat Linux 7.2 to construct the file system model, check the consistency of the model, and repair the file system.
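To round off the file system case study, the write-back performed by the external constraints of Section 2.6 can be sketched as follows. The `ConcreteDisk` class and its field names are hypothetical stand-ins for the on-disk fields named in Figure 8, the value 999 merely plays the role of the corrupted index, and ordering a file's blocks by sorting them is an assumption of this sketch; the source only says "element j of i.contents".

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ConcreteDisk:
    """Hypothetical stand-in for the concrete fields referenced in Figure 8."""
    inode_bitmap_block: int = 0
    inode_table_block: int = 0
    # blockptr[i] holds the 12 direct block pointers of inode i
    blockptr: Dict[int, List[int]] = field(default_factory=dict)


def write_back(model: dict, disk: ConcreteDisk) -> None:
    # [for u in InodeTableBlock], true => disk.groupblock.InodeTableBlock=u
    for u in model["InodeTableBlock"]:
        disk.inode_table_block = u
    # [for u in InodeBitmapBlock], true => disk.groupblock.InodeBitmapBlock=u
    for u in model["InodeBitmapBlock"]:
        disk.inode_bitmap_block = u
    # Blockptr[j] = element j of i.contents if j < sizeof(i.contents), else 0
    for i in model["UsedInodes"]:
        blocks = sorted(model["contents"].get(i, ()))  # assumed ordering
        disk.blockptr[i] = [blocks[j] if j < len(blocks) else 0 for j in range(12)]


# Repaired model values taken from Figures 6 and 7.
model = {
    "InodeTableBlock": {4},
    "InodeBitmapBlock": {2},
    "UsedInodes": {0, 1, 2},
    "contents": {0: [6], 2: [8]},
}
disk = ConcreteDisk(inode_bitmap_block=999)  # 999 stands in for the corrupted index
write_back(model, disk)
print(disk.inode_bitmap_block, disk.blockptr[0][:2], disk.blockptr[1][:2])
```

After the write-back, inode 1 no longer points at block 6, which mirrors the removal of the illegal block sharing described in Section 2.6.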
[Figure 9: Repaired ext2 file system. Disk layout diagram with blocks labeled Super Block, Group Block, Inode Bitmap, Block Bitmap, and Inode Table.]

3. CTAS CASE STUDY
The Center-TRACON Automation System (CTAS) is a set of air-traffic control tools developed at the NASA Ames research center [1, 16]. The system is designed to help air traffic controllers visualize and manage the complex air traffic flows at centers surrounding large metropolitan airports. The goal is to automate much of the aircraft traffic management, reducing traffic delays and increasing safety. The current source code consists of over 1 million lines of C and C++ code. Versions of this source code are deployed at centers surrounding major metropolitan airports.

Our fault insertion methodology attempts to mimic errors in the flight plan processing routine that produce illegal values in the flight plan data structures. When the program uses these illegal values to access the array of airport data, the array access is out of bounds, which typically leads to the program failing due to an addressing error. Our specification captures the constraint that the flight plan indices must be within the bounds of the airport data array. The specification itself consists of 100 lines, of which 83 lines contain structure definitions. The primary difficulty in developing this specification was understanding the flight plan data structures.

We used a recorded midday radar feed from the Dallas-Ft. Worth center as a workload. We identified consistency points within the application, then configured the system to catch addressing exceptions, perform the consistency checks and repair in the fault handler, then restart from the last consistency point. Each consistency check and repair takes approximately 3 milliseconds, which is an acceptable repair time in that it imposes no performance degradation that is visible in the graphical user interface that displays the aircraft information.

Without repair, CTAS fails because of an addressing exception. With repair, it continues to execute in a largely acceptable state. Specifically, the effect of the repair is to potentially change the origin or destination airport of the aircraft with the faulty flight plan processing. Even with this change, continued operation is clearly a better alternative than failing. First, one of the primary purposes of the system (visualizing aircraft flow) is unaffected by the repair, and continued execution enables the system to provide this functionality to the controller even in the presence of flight plan processing errors. Second, only the origin or destination airport of the plane whose flight plan triggered the error is affected. All other aircraft (during the recorded feed, the system is processing flight plans for several hundred aircraft) are processed with no errors at all, enabling the system to deliver useful trajectory prediction and scheduling functionality for those aircraft. And finally, once the aircraft in question leaves the center, its data structures are deallocated from the system, which is then back to a completely correct state.

The standard alternative to repair is to fail and reboot. This solution is problematic for this application because rebooting the system can take several minutes as the system acquires enough flight plans and radar data history to make reasonable trajectory predictions. And for the particular error we explored in our experiments, rebooting is futile. When the system reacquires and attempts to process the flight plan that caused the preceding failure, it will simply fail again.

CTAS illustrates that data structure repair can enable systems to recover from otherwise fatal data structure corruption errors and enable the program to continue to execute successfully. This property may be especially important for safety-critical applications in which potentially degraded execution is far preferable to no execution at all.

The CTAS system in particular illustrates some of the reasons why continued execution can be the best choice for some applications. The absence of repair makes the entire computation vulnerable to errors, even if the error would have no effect on the data and functionality of much of the system. Repair enables the program to continue to execute and generate useful results from the correct parts of the data and the unaffected parts of the computation. Note also that repair followed by continued execution may eventually flush any anomalies out of the system to restore the data structures to a completely correct state.

4. DEVELOPER CONTROL OF REPAIRS
The repair algorithm often has multiple options to satisfy a given constraint; these options may translate into different repaired data structures. We recognize that some repair actions may produce more desirable data structures than other repair actions, and that the developer may wish to influence the repair process. We have therefore provided the developer with several mechanisms that he or she can use to control how the repair algorithm chooses to repair an inconsistent data structure. Specifically, the developer may specify the costs of given repair actions, provide a procedure which decides which repair action to perform for a given constraint violation, or supply a hand-coded repair routine for a given constraint.

5. RELATED WORK
Software reliability has been an important area for many years. Most research has focused on preventing or eliminating software errors, with the approaches ranging from enhanced software testing and validation to full program verification. Software error detection has become an especially active area in recent years [6, 10, 5]. In contrast, our research goal is to enable software to survive errors by repairing damaged data structures.

5.1 Manual Detection and Repair Systems
Researchers have manually developed several systems that find and repair data structure inconsistencies. File systems have many characteristics that motivate the development of such programs (they are persistent, store important data, and acquire disabling inconsistencies in practice). Developers have responded with utilities such as Unix fsck and the Norton Utilities that attempt to fix inconsistent file systems.

The Lucent 5ESS telephone switch and IBM MVS operating systems are two examples of critical systems that use inconsistency detection and repair to recover from software failures [11, 13]. The software in both of these systems contains a set of manually coded procedures that periodically inspect their data structures to find and repair inconsistencies. The reported results indicate an order of magnitude increase in the reliability of the system [8]. Researchers have also developed a domain-specific language for specifying these procedures for the 5ESS system [9]. The goal is to enhance the reliability and reduce the development time of the inconsistency detection and repair software.

5.2 Recovery Oriented Computing
Researchers in the area of recovery oriented computing have developed a variety of techniques to help software recover from runtime errors [14]. One of these techniques, recursive restartability, composes large systems out of many smaller modules that are individually rebootable [4]. The goal is to build systems in which faults can be isolated at the module level by rebooting.

In some cases, the consequences of an error may not be immediately apparent and the system may run ahead, generating an unacceptable execution. In such cases, the ability to undo an application's operations to return to an earlier state, repair the error in the earlier state, and then replay the application's operations would be useful. Operation Undo provides an application-neutral framework for building systems that support undo [3].

5.3 Specification Languages
The basic concepts in our internal constraint language are similar to those in constraint languages for object modeling formalisms such as UML [15] and Alloy [12]. Object models have traditionally been used to help developers ex[...]

[...] therefore see our research as taking an important step toward the effective construction of robust, self-healing systems that can successfully recover from the damage that they will inevitably experience during their long lifetimes.

7. REFERENCES
[1] Center-TRACON automation system. http://www.ctas.arc.nasa.gov/.
[2] P. Broadwell, N. Sastry, and J. Traupman. FIG: A prototype tool for online verification of recovery mechanisms. In Workshop on Self-Healing, Adaptive and self-MANaged Systems, June 2002.
[3] A. Brown and D. A. Patterson. Undo for operators: Building an undoable e-mail store. In Proceedings of the 2003 USENIX Annual Technical Conference, June 2003.
[4] G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), pages 110–115, Schloss Elmau, Germany, May 2001.
[5] J.-D. Choi et al. Efficient and precise datarace detection for multithreaded object-oriented programs. In Proceedings of the SIGPLAN '02 Conference on Program Language Design and Implementation, 2002.
[6] J. Corbett, M. Dwyer, J. Hatcliff, C. Pasareanu, Robby, S. Laubach, and H. Zheng. Bandera: Extracting finite-state models from java source code. In Proceedings of the 22nd International Conference on Software Engineering, 2000.
[7] B. Demsky and M. Rinard. Automatic detection and repair of errors in data structures. Technical Report MIT-LCS-TR-875, MIT, Massachusetts Institute of Technology, Dec. 2002.
[8] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.
[9] N. Gupta, L. Jagadeesan, E. Koutsofios, and D. Weiss. Auditdraw: Generating audits the FAST way. In Proceedings of the 3rd IEEE International Symposium on Requirements Engineering, 1997.
[10] S. Hallem, B. Chelf, Y. Xie, and D. Engler. A system
plore conceptual design properties in the absence of any and language for building system-specific, static
specific implementation. Our approach, in contrast, estab- analyses. In Proceedings of the SIGPLAN ’02
lishes a precise connection between the low-level, highly en- Conference on Program Language Design and
coded data structures that appear in many programs and Implementation, 2002.
the high-level conceptual properties captured in our internal [11] G. Haugk, F. Lax, R. Royer, and J. Williams. The
constraint language. This kind of connection may become 5ESS(TM) switching system: Maintenance
especially important for future design conformance systems, capabilities. AT&T Technical Journal, 64(6 part
2):1385–1416, July-August 1985.
which check that a program conforms to its design.
[12] D. Jackson. Alloy: A lightweight object modelling
notation. Technical Report 797, Laboratory for
6. CONCLUSION Computer Science, Massachusetts Institute of
Technology, 2000.
Data structure inconsistencies are an important source of
[13] S. Mourad and D. Andrews. On the reliability of the
software errors. Our implemented system attacks this prob- IBM MVS/XA operating system. IEEE Transactions
lem by accepting a data structure consistency specification, on Software Engineering, September 1987.
then automatically detecting and repairing data structures [14] D. A. Patterson and et al. Recovery-oriented
that violate this specification. Our experience indicates that computing (ROC): Motivation, definition, techniques ,
our system is able to deliver repaired data structures that and case studies. Technical Report
enable the corresponding programs to continue to execute UCB//CSD-02-1175, UC Berkeley Computer Science,
successfully within their designed operating envelope. With- March 15, 2002.
[15] Rational Inc. The unified modeling language.
out repair, the programs usually fail. http://www.rational.com/uml.
As the field of computer science continues to mature, there [16] B. D. Sanford, K. Harwood, S. Nowlin, H. Bergeron,
is an increasing need to deliver systems that can contin- H. Heinrichs, G. Wells, and M. Hart. Center/tracon
uously operate for very long, even unbounded, periods of automation system: Development and evaluation in
time. Repair is a central aspect of almost all long-lived sys- the field. In 38th Annual Air Traffic Control
tems in other fields, and we believe that the development Association Conference Proceedings, October 1993.
of effective repair technology is a necessary prerequisite for [17] J. M. Voas and G. McGraw. Software Fault Injection.
the construction of robust, long-lived computer systems. We Wiley, 1998.
