Checkpointing strategies for a fixed-length execution
SC24-W: Workshops of the International Conference for High …, 2024•ieeexplore.ieee.org
This work considers checkpointing strategies for a parallel application executing on a large-scale platform whose nodes are subject to failures. The application executes for a fixed duration, namely the length of the reservation that it has been granted. We start with small examples that show the difficulty of the problem: it turns out that the optimal checkpointing strategy neither always uses periodic checkpoints nor always takes its last checkpoint exactly at the end of the reservation. Then, we introduce a dynamic heuristic that is periodic and decides on the checkpointing frequency based upon thresholds for the time left; we determine threshold times T_n such that it is best to plan for exactly n checkpoints if the time left (or initially the length of the reservation) is between T_n and T_{n+1}. Next, we use time discretization and design a (complicated) dynamic programming algorithm that computes the optimal solution, without any restriction on the checkpointing strategy. Finally, we report the results of an extensive simulation campaign that shows that the optimal solution is far more efficient than the Young/Daly periodic approach for short or mid-size reservations.
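For context on the Young/Daly baseline named in the abstract, here is a minimal sketch of the standard first-order period, sqrt(2 * MTBF * C). The abstract itself gives no formula. The function names, the parameters `mtbf` and `checkpoint_cost`, and the convention that one segment is a work period followed by one checkpoint are assumptions made for illustration, not the paper's notation or method.

```python
import math


def young_daly_period(mtbf: float, checkpoint_cost: float) -> float:
    """Standard first-order Young/Daly work period: sqrt(2 * MTBF * C).

    Both arguments are in the same time unit (e.g. seconds).
    """
    if mtbf <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtbf and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def periodic_checkpoint_count(reservation: float, mtbf: float,
                              checkpoint_cost: float) -> int:
    """Number of complete (work + checkpoint) segments that fit in a
    reservation when no failure occurs.

    Assumption: a segment is one Young/Daly work period followed by
    one checkpoint of length checkpoint_cost.
    """
    segment = young_daly_period(mtbf, checkpoint_cost) + checkpoint_cost
    return int(reservation // segment)


if __name__ == "__main__":
    # Hypothetical values: MTBF of one day, 60-second checkpoints.
    print(young_daly_period(86400.0, 60.0))
    print(periodic_checkpoint_count(3600.0, 86400.0, 60.0))
    print(periodic_checkpoint_count(3000.0, 86400.0, 60.0))
```

With these hypothetical values, a reservation shorter than one segment completes no checkpoint at all under the periodic schedule, which illustrates why a fixed period can fit short reservations poorly.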