Application checkpointing
Checkpointing is a technique to add fault tolerance into computing systems. It basically consists of saving a snapshot of the application's state, so that it can restart from that point in case of failure. This is particularly important for long running application that are executed in vulnerable computing system.
Checkpointing in distributed systems
In distributed computing, checkpointing is a technique that helps tolerate failures that otherwise would force long-running application to restart from the beginning. The most basic way to implement checkpointing, is to stop the application, copy all the required data from the memory to reliable storage (e.g., Parallel file system) and then continue with the execution. Checkpointing implementations should preserve system consistency. There are two main approaches for checkpointing in such systems: coordinated checkpointing and uncoordinated checkpointing. In the coordinated checkpointing approach, processes must ensure that their checkpoints are consistent. This is usually achieved by some kind of two-phase commit protocol algorithm. In uncoordinated checkpointing, each process checkpoints its own state independently. It must be stressed that simply forcing processes to checkpoint their state at fixed time intervals is not sufficient to ensure global consistency. The need for establishing a consistent state (i.e., no missing messages or duplicated messages) may force other processes to roll back to their checkpoints, which in turn may cause other processes to roll back to even earlier checkpoints, which in the most extreme case may mean that the only consistent state found is the initial state (the so-called domino effect).