Job migration in HPC clusters by means of checkpoint/restart

M. Rodríguez-Pascual et al.

* José A. Moríñigo
  [email protected]
1 Department of Technology, CIEMAT, Avda. Complutense 40, 28840 Madrid, Spain
2 Department of Electrical and Computer Engineering, Northeastern University, 360 Huntington Avenue, Boston, MA 02115, USA
https://doi.org/10.1007/s11227-019-02857-y
Abstract
Until now, jobs running on HPC clusters have been tied to the node where their execution started. We have removed that limitation by integrating a user-level checkpoint/restart library into a resource manager, fully transparently to both the user and the running application. This opens the door to a whole new set of tools and scheduling possibilities based on the fact that jobs can be migrated, checkpointed, and restarted in a different place or at a different moment, while providing fault tolerance for every job running on the cluster. This is of utmost importance for the coming generation of exascale HPC clusters, where the increasing scale and complexity of efficient scheduling make it challenging to obtain the degree of parallelism demanded by the applications.
1 Introduction
The question then arises whether more can be achieved beyond simple resilience, once the capability of saving and restoring the state of a job is available. In particular, if the job scheduler is aware of this possibility, it can employ it in the decision-making process, thus allowing the allocation of jobs to be modified in real time. For example, jobs can be moved inside the cluster to concentrate them in the minimum possible number of nodes, or jobs can be distributed evenly within a partially filled cluster. Other uses derive from preemption: removing a running job from its resources so that another job with a higher priority can be allocated there.
The same approach can be applied to system administration: Any maintenance operation on a node can be performed immediately, by checkpointing all jobs running on that node and placing them back in the job queue. The users do not need to be notified, since the checkpoint and later restart will be completely transparent to them. Since there is no need to shut down the whole service, the impact of the process is greatly reduced.
All of the above can result in better computing and/or energy efficiency in the cluster, and it constitutes a novel contribution of this work, as it paves the way for adding customized artificial intelligence capabilities to the resource manager (RM) depending on the cluster characteristics and its general use. To support this new approach, the C/R library must satisfy certain minimum requirements, identified in the analysis carried out as part of this work:
– First, it must be able to save the state of all or most applications running on the
cluster.
– Second, it must be transparent at the application level. The end user should not
have to recompile or modify the application in any way, and the checkpoint and
restart must be completely transparent to that end user.
– And third, of course, the overhead induced by the checkpoint library should be
kept to a minimum.
13
Job migration in HPC clusters by means of checkpoint/restart
One of the main reasons for selecting Slurm is that the architecture of this RM provides a plug-in capability for integrating new functionality. This is used here to add a checkpoint/restart plug-in to Slurm, thus enabling new scheduling algorithms.
The architecture used here is able to manage: nodes (the compute resource in Slurm); partitions (groupings of nodes into logical sets); jobs (allocations of resources assigned to a user for a specific time slot); and job steps (sets of, possibly parallel, tasks within a job). Several parameters can be defined on these partitions (job time limit, job size limit, user access permissions, etc.). These parameters enable the second objective of this work: the use of C/R to provide new features, including more accurate scheduling algorithms, dynamic migration of tasks, preemption, and maintenance operations transparent to the user.
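As an illustration, such parameters are declared per partition in slurm.conf, in the following style (the partition name, node list, and values here are ours, not taken from any particular deployment):

PartitionName=short Nodes=node[01-08] MaxTime=02:00:00 MaxNodes=4 AllowGroups=hpcusers State=UP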
In the rest of this document, we first describe the broad framework enabling this research: a full, transparent integration between Slurm and DMTCP. This integration is then used to explore novel features built on top of it.
2 Related work
As mentioned earlier, joint opportunistic user scheduling and power allocation are topics of major importance in modern HPC systems [11]. Among other aspects, they can be designed to achieve throughput optimization and fair resource sharing [12]. Beyond those, there is a link between fault tolerance and improvement in cluster throughput that this work tackles, i.e., enhancing the computational efficiency of a cluster by profiting from checkpointing methodologies in order to design better scheduling algorithms, perform job preemption, make administration and maintenance operations transparent to the user, etc.
A description follows of the related work regarding checkpointing/restart libraries
as well as live job migration.
With the increasing use of parallelism in HPC, checkpoint libraries are becoming increasingly valuable.
Checkpointing can be accomplished either at the system level (transparently to
the application), or at the application level (integrated into the application). While
the first type is easier to apply for the end user, the latter is typically more efficient
[13].
In system-level checkpointing, the state of a computation is saved by an external entity. The complete process information has to be stored, including memory contents, open files, CPU register contents, and so on. On restart, the state is carefully restored, so that the execution can seamlessly continue at the same point where it was interrupted.
System-level checkpoint solutions can either be implemented inside the kernel or
at the user level. The former has the advantage that the checkpointer has full access
to the target process as well as its resources, while user-level checkpointers have to
find other ways to gather this information. On the flip side, user-level schemes are
typically more portable and easier to deploy.
Regarding current projects, BLCR [14] is primarily a kernel-based implementation of checkpoint/restart, employing a Linux kernel module. It is fast and efficient. However, the kernel module must be re-compiled and possibly re-tuned for each particular Linux kernel version. Another weakness of BLCR is that it does not support the SysV enhancements, such as System V shared memory. Many MPI implementations employ System V shared memory as an optimization for message passing among MPI ranks on the same node. While most MPI implementations can be configured to avoid making use of System V shared memory, this is non-optimal. At the same time, coordinated checkpointing and rollback recovery for MPI-based parallel applications has been provided by integrating BLCR with LAM and other implementations of MPI through a checkpoint–restart service specific to each MPI implementation [15].
DMTCP [16] is a strictly user-space, system-level checkpointing package. It makes use of a simple, yet powerful idea in order to capture all state related to the running application: a DMTCP library is injected into each running process, and that library starts a DMTCP-specific checkpoint thread. With this configuration, DMTCP can monitor all activities in the process. As a drawback, this approach adds a thin software layer. In most applications, the overhead due to this software layer is smaller than the jitter, or variation in time, observed when the application is re-run. A notable exception is its support for InfiniBand, where the overhead, although a fraction of 1%, is not negligible; this overhead is quantified later in this work.
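To make the mechanism concrete, the fragment below sketches the typical DMTCP workflow (launch under the injected library, coordinator-triggered checkpoint, restart from the image) as driven from Python. It assumes DMTCP's standard command-line tools (dmtcp_launch, dmtcp_command, dmtcp_restart) are on the PATH and that a coordinator runs on its default port; the application name is illustrative.

import glob
import subprocess
import time

# Launch the application under DMTCP: dmtcp_launch injects the DMTCP
# library into the process, which starts the checkpoint thread.
app = subprocess.Popen(["dmtcp_launch", "./my_app"])  # app name is illustrative

time.sleep(60)  # let the application make some progress

# Ask the (default) coordinator to write a checkpoint image of every
# connected process.
subprocess.run(["dmtcp_command", "--checkpoint"], check=True)

app.terminate()  # pretend the original process was lost

# Restart from one of the generated images (DMTCP writes ckpt_*.dmtcp files).
image = max(glob.glob("ckpt_*.dmtcp"))
subprocess.run(["dmtcp_restart", image], check=True)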
CRIU [17] is a promising new project, but it is still not able to checkpoint parallel or distributed applications. It does, however, provide the possibility of checkpointing Docker containers, along with some other interesting features [18]. A similar conclusion can be drawn for compiler-based lightweight memory checkpointing (LMC) [19], which has shown low performance overhead with strictly bounded memory usage at runtime on server applications.
Another line of work addresses the malleability of MPI jobs running in Slurm [29]. After the introduction of some small modifications in the code to indicate where the execution can be arbitrarily distributed, their framework gathers all the available resources and increases the degree of parallelism of the application at runtime. This approach, however, requires modifications of the source code that are suitable only for certain kinds of applications, thus preventing a wider adoption.
Also, the inclusion of resilience capabilities into the MPI standard through the proposed user-level failure mitigation (ULFM) has enabled the implementation of resilient MPI applications [30]. Its low overhead when tolerating failures in one or several MPI processes has been shown [31]. That solution is built on top of the ComPiler for Portable Checkpointing (CPPC), an application-level checkpointing tool for MPI applications. The proposed development transparently makes MPI applications resilient by instrumenting the original application code; this does require a previous customization of the code to be checkpointed.
Regarding full jobs, the HTCondor scheduling system [32] applies checkpoint/restart
to complete jobs (although only serial ones) to achieve a better utilization of compute
clusters for high-throughput computing. Since most of the serial jobs are components
of larger Monte Carlo simulations, achieving fault tolerance is less important.
There also exists a formal approach that can be applied to multi- and many-core
chips. Several algorithms and mechanisms have been proposed [33, 34], although
their impact is beyond the scope of this work.
Similarly, the most widely used VM (virtual machine) managers (Xen, OpenVZ, KVM, VirtualBox, etc.) support live migration of full VMs. This possibility is regularly employed by platforms such as OpenNebula [35] and OpenStack [36] to rearrange running VMs inside clusters, usually with the objective of concentrating the VMs on the smallest possible number of resources to reduce energy consumption.
In the case of software containers, this technique has still not been widely adopted,
and to the authors’ knowledge, only Docker + CRIU offers this possibility [37].
Of course, the technologies employed vary depending on the software layer, but
the underlying ideas are always roughly the same.
From the existing work presented in this section, it can be inferred that the checkpointing of jobs in local clusters is ongoing work that has already provided useful results and production-ready tools. Live migration of different elements (tasks, virtual machines, etc.) is a mature technology, suitable for many situations. However, there is still a lack of a real and effective connection between these two areas in HPC environments. The fault tolerance manager (FTM) for coordinated checkpoint files is able to provide users with automatic recovery from failures when computing nodes are lost [38], though it is particularly useful in infrastructure-as-a-service cloud platform environments and is based on the RADIC architecture. Finally, task migration for fault tolerance in heterogeneous multi-cluster systems exists, but was developed primarily for grid computing infrastructures [39].
Motivated by the issues described above, the present work presents a transparent, novel approach to live migration of tasks based on job checkpoint and restart, which incurs low overhead. Under this approach, more accurate scheduling, dynamic migration of tasks, flexible preemption, and maintenance operations are possible, hence opening a new set of capabilities to be seamlessly performed
by system administrators. Such a design presents not only a successfully tested engineering solution, but also, to the authors' knowledge, a new approach and concept to better exploit supercomputers that had not previously been foreseen.
3.1 Software stack
For this work, we have chosen Slurm as the resource manager and DMTCP as the
checkpoint library.
Slurm is one of the most widely employed resource managers for supercomputers. In particular, it is the workload manager on about 60% of the TOP500 supercomputers according to its main developer, SchedMD. It is an open-source tool, so we are able to dig into its internals and modify them according to our needs. Further, due to Slurm's modular and plug-in-based design, these modifications of the internals are kept to a minimum.
DMTCP is chosen as the checkpoint library. As discussed in the "Related work" section, it is, to the authors' knowledge, the only checkpoint library that meets the requirements of this project: open source; fully transparent to users and applications; support for parallel applications (MPI and OpenMP); stable; and continuing active support by its development team. In addition, petascale-level checkpointing has been demonstrated by DMTCP through a new mechanism for virtualization of the InfiniBand UD (unreliable datagram) mode and for updating the remote address on each UD-based send. Results have demonstrated low overhead in tests with real applications and benchmarks running on more than thirty thousand MPI processes and CPU cores [16]. An extrapolation of those results to future SSD-based storage systems shows that this approach will remain practical in the exascale generation.
Note, also, that this project is not tied to DMTCP. The modular design of Slurm allows one to change the checkpoint library without affecting the rest of the tool, since the library is modularized through a well-defined API. Thus, a different, future library could be adopted by simply modifying a configuration file.
The integration of Slurm and DMTCP was performed using an existing checkpoint API present in Slurm. It was implemented by creating a shell wrapper for each of the three checkpoint functions (start, checkpoint, and restart). There were some
challenges due to concurrency issues when starting MPI jobs, but these were solved
through a lock mechanism on files in a shared folder.
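The following minimal sketch reproduces the idea of that lock mechanism (our illustration, not the actual wrapper code): an exclusive flock on a per-job file in a shared directory serializes the wrapper instances that Slurm starts concurrently for one MPI job. The lock directory is a hypothetical stand-in for the shared folder.

import fcntl
import os
from contextlib import contextmanager

LOCK_DIR = "/shared/ckpt_locks"  # hypothetical folder visible from every node

@contextmanager
def job_lock(job_id):
    # One lock file per job: the exclusive lock serializes the wrapper
    # instances that are started concurrently for the same job.
    os.makedirs(LOCK_DIR, exist_ok=True)
    with open(os.path.join(LOCK_DIR, "job_%s.lock" % job_id), "w") as fd:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is granted
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)

# Usage inside a start/checkpoint/restart wrapper:
# with job_lock(os.environ["SLURM_JOB_ID"]):
#     perform_critical_section()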
It is important to note that this plug-in behaves in the opposite way to the original Slurm design for checkpointing. Originally, Slurm was designed to start a job with checkpoint support only if requested by the user through a command-line flag during submission. Since we want to support all running jobs in order to more broadly support job migration, we changed this default behavior: the application is started with checkpoint support provided that the user does not disable checkpointing through a command-line flag during submission.
Aside from the plug-in, the changes made in Slurm were kept to a minimum. The
structure, APIs, and existing functionality were maintained for compatibility with
the official Slurm version. The modifications only add optional extensions to the
existing API calls.
The following section is devoted to describing the tools and functionality enabled by the availability of job migration inside clusters. We first present smigrate, a tool for cluster system administration that employs job migration to idle nodes in a fast and secure way, and then two different scheduling algorithms that use job migration to dynamically reallocate running tasks. Note that the objective of this section is not the creation of complex tools and algorithms, but to demonstrate how the implemented job migration inside clusters can enable a new and wide set of additional tools for cluster administration.
For the sake of completeness, it is worth noting that the optimization of task assignments on parallel computers is carried out in Slurm by a best-fit algorithm based on Hilbert curve scheduling or a fat-tree network topology [40].
The nodes composing a cluster need to be maintained (i.e., removed from active service on the cluster to perform operations related to system administration) on a regular basis. Reasons include software updates, reconfigurations, network issues, changes to the hardware, etc. In normal usage, users are notified of these operations in advance, since part or all of the cluster will be out of service.
This maintenance process greatly harms the computation throughput. Although
the update itself may be short, users are warned in advance not to submit long jobs
and the cluster remains not fully occupied immediately after the update, since users
have not yet submitted new jobs. Moreover, if the maintenance is urgent and there is
no time to notify users in advance, then the jobs running on those nodes are simply
killed and the corresponding computation time is lost. This is especially harmful in
the case of parallel applications.
As an alternative, job migration can be used to avoid the loss of computation time. If a particular slot, node, or part of a cluster must receive maintenance, then the tasks running there can be migrated. With this goal in mind, we have created smigrate, an application that allows administrators and users to manually migrate jobs.
After the desired tasks have been performed on the node and it is ready to go back
into production, it remains to update its status to “AVAILABLE.” Slurm will then
place the node into the resource queue again and submit the corresponding restart jobs.
Last, it is important to note that this process is not limited to one node at a time.
Instead, an arbitrary set of nodes can be emptied of jobs at the same time.
The smigrate tool is implemented in C, as Slurm itself is, and can be downloaded from [41], where the reader can find the complete set of files (source, object, header, makefile, etc.), a wiki page with instructions on how to install and execute it, and even a demo video.
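A simplified version of the resulting maintenance workflow can be sketched as follows. The scontrol and squeue invocations are standard Slurm commands; checkpoint_and_requeue is a hypothetical stand-in for the checkpoint step that smigrate performs through the C/R integration.

import subprocess

def checkpoint_and_requeue(job_id):
    # Stand-in: checkpoint the job via the C/R plug-in and place it back
    # in the queue, as smigrate does.
    raise NotImplementedError

def empty_node(node):
    # Keep Slurm from scheduling new work onto the node.
    subprocess.run(["scontrol", "update", "NodeName=%s" % node,
                    "State=DRAIN", "Reason=maintenance"], check=True)
    # Collect the IDs of the jobs currently running on the node.
    out = subprocess.run(["squeue", "-w", node, "-h", "-o", "%A"],
                         capture_output=True, text=True, check=True)
    for job_id in out.stdout.split():
        checkpoint_and_requeue(job_id)

def return_node(node):
    # Once maintenance is done, Slurm will restart the re-queued jobs.
    subprocess.run(["scontrol", "update", "NodeName=%s" % node,
                    "State=RESUME"], check=True)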
The preemption policy establishes that a running job may be canceled so that another one with a higher priority can use its resources.
Without the possibility of checkpoint/restart, the usage of preemption mechanisms was extremely limited, as it implied losing all the computational effort invested in the job being canceled. The ability to use checkpoint/restart is a game changer here, as low-priority jobs can be re-queued and their execution continued as soon as the higher-priority job completes or a resource first becomes available.
Preemption can now be used for a wider set of purposes, as the drawbacks that kept the mechanism from wider adoption have disappeared with the use of checkpoint/restart. In particular, we have identified and tested three different use cases.
A basic usage for preemption is the support for queues with different priorities. In
this use case, Slurm is configured so that if there are pending jobs on high-priority
queues, the ones running on low-priority queues are preempted and re-queued. In
this way, urgent jobs can begin their execution as soon as possible, with the only
cost being to delay the execution of low-priority jobs. Of course, users could abuse this system by submitting all their jobs with high priority, but the Slurm quota system makes it straightforward to avoid this situation.
A similar situation happens in clusters where resources are limited to a specific set of nodes, such as Xeon Phi accelerators being present only in some nodes. Currently there are two alternatives: leave these nodes with special resources idle for jobs requiring those resources; or use them for any job, with the ones with special requirements having to wait. By using preemption, we can execute any kind of job on these nodes and obtain full usage from the cluster, moving these jobs away when there is a specific job requiring the resource.
Perhaps the most interesting use case, only enabled by the use of checkpoint/restart-based preemption, is the creation of Eternal Jobs. The underlying idea is that some users have a computational demand that exceeds the available resources, especially in the shared environments typical of clusters. A solution is the creation of a low-priority queue of serial jobs that are preempted whenever a more important job arrives. There is no need to set a particular length for these jobs, hence the qualification of "eternal": They simply run whenever the cluster is not fully occupied, and they are preempted when new "normal" jobs arrive. In this way, the cluster increases its usage and the most demanding users can be assigned more CPU time, at the sole cost of yielding to other jobs with a higher priority.
Together, these use cases demonstrate a new approach to cluster administration with more flexible tools, through the use of checkpoint/restart.
Traditional scheduling algorithms determine where and when a job should run. This is decided taking into account several factors: job information provided by the user; the state of the cluster; and future demand based on the pending job queue. Once a decision is taken and the job starts its execution, the scheduling process is finished.
Live job migration adds another dimension to the scheduling process: the possibility of altering the execution of a job by saving its state, canceling it, and then restoring it in a different physical location and/or at a different moment. This way, the scheduler can adapt to changes in the infrastructure or in the demand.
Although creating sophisticated scheduling algorithms is out of the scope of this work, we consider it of high interest to demonstrate these new capabilities. To this end, we have designed two algorithms with different behaviors.
The first algorithm is devoted to job compaction. Its objective is to concentrate the jobs running on the cluster in as few nodes as possible, leaving the rest of the infrastructure idle. This can be done for several reasons, such as making the infrastructure available for parallel tasks, or reducing power consumption by switching off empty nodes or reducing their voltage.
Algorithm 1 describes this process from a high-level point of view.
The behavior of this algorithm is straightforward: If there are queued jobs in the cluster, we can safely assume that Slurm will automatically try to place them on partially filled nodes; if not, the algorithm checks whether compaction makes sense and, if so, checkpoints the jobs from the node to be emptied and restarts them on the remaining partially filled nodes.
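Since the listing of Algorithm 1 is typeset as a figure in the published version, the following Python-style pseudocode is our illustrative restatement of the behavior just described; every helper name (queued_jobs, partially_filled, checkpoint, restart_on, ...) is hypothetical.

def compact(cluster):
    # One pass of the compaction policy.
    if cluster.queued_jobs():
        return  # Slurm itself will fill partially used nodes

    partial = [n for n in cluster.nodes if n.partially_filled()]
    if len(partial) < 2:
        return  # nothing to compact

    # Try to empty the least-loaded node into the remaining partial nodes.
    donor = min(partial, key=lambda n: n.used_cores())
    receivers = [n for n in partial if n is not donor]
    if sum(n.free_cores() for n in receivers) >= donor.used_cores():
        for job in donor.running_jobs():
            checkpoint(job)             # via the Slurm/DMTCP integration
            restart_on(job, receivers)  # restart on one of the receivers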
Algorithm 2: Scheduling with priorities
A similar concept can be employed to migrate jobs among resources with different priorities. This tool can be very useful on facilities where certain nodes present different characteristics than others.
The first and most obvious example is a heterogeneous cluster where certain nodes are more powerful or consume less power than others. In this case, it makes sense to use these nodes as much as possible, while leaving the less powerful or more power-hungry ones idle. In particular, the performance of non-ported applications on KNC/KNL nodes will not be optimal.
Another straightforward use arises when certain resources are scarce, as described in the "preemption and eternal jobs" use case, but with a proactive approach: as soon as possible, empty the scarce resources so that they are available when needed.
This approach can also be handy in clusters composed of islands. In periods when an island is not fully used, jobs can be concentrated and a full island emptied and shut down. Also, by concentrating smaller jobs in a reduced number of islands, larger parallel jobs can be executed on a single island rather than split among several ones, thus reducing communication overhead.
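Analogously, the core loop of "scheduling with priorities" (Algorithm 2, whose listing is likewise not reproduced here) can be sketched as follows; again, all helper names are hypothetical.

def rebalance(high_nodes, low_nodes):
    # Keep the preferred ("highPriority") nodes as full as possible by
    # pulling running jobs from the "lowPriority" ones.
    for node in low_nodes:
        for job in node.running_jobs():
            target = next((h for h in high_nodes
                           if h.free_cores() >= job.cores), None)
            if target is None:
                return  # preferred nodes are full; nothing more to do
            checkpoint(job)            # via the Slurm/DMTCP integration
            restart_on(job, [target])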
4 Experimental results
4.1 DMTCP overhead
Several factors influence the time employed by a checkpoint/restart operation. Among the most important are the size of the application in terms of memory and temporary files, the storage speed, and the Slurm configuration; as a rough illustration, writing the checkpoint image of a job using 4 GB of memory to storage sustaining 1 GB/s takes on the order of 4 s. This section is devoted to measuring this overhead, so that its influence on the total execution time of an application can be determined.
The Slurm resource manager is designed to control the execution of thousands of jobs at the same time. To do so, it includes a sophisticated cache mechanism to store the job state, thus greatly reducing the database overhead. This information is stored and processed at a fixed rate. Until this happens, applications
4.2 smigrate
Comparing that value with that presented in Table 1, we can see that our
approach obtains performance gains of up to several orders of magnitude and in
every single case is significantly faster.
Once C/R is implemented and fully integrated into Slurm, job preemption is completely straightforward. The same happens with the Eternal Job paradigm, which can be considered a natural consequence of preemption. The corresponding slurm.conf fragment looks as follows, where "(...)" marks options elided here:

PartitionName=eternalJobs Nodes=<node list> PriorityTier=1 PreemptMode=CHECKPOINT (...)
PartitionName=normal Nodes=<node list> PriorityTier=2 (...)

With this configuration, jobs in the eternalJobs partition belong to the lowest priority tier and are checkpointed whenever a job submitted to the higher-tier normal partition needs their resources.
Fig. 1 Randomly generated workloads for this work representing 25%, 50%, 75%, and 90% of the system
resources
Fig. 3 Node occupancy with and without migration for job compaction with a total workload of 50%. The last pair of columns on the right of the figure depicts the total result of the test; the remaining columns are the per-node results

In a more general way, Fig. 4 presents the values obtained by repeating this experiment with different workloads. As can be seen, in all cases the migration has been useful to compact the running jobs into a smaller number of nodes. The percentage of nodes with a mixed behavior (partially occupied) has been reduced from 33.8 to 28.2% when the workload represents 25% of the cluster, from 53.5 to 41.7% for the
50% case, from 52.8 to 42% for the 75% case, and from 30.5 to 22.3% for the 90%
case. In other words, the difference in the percentage of partially used nodes without
and with migration shows two behaviors: When the cluster is either lightly loaded or almost fully loaded (25% and 90% of the workload, respectively), the algorithm for the migration
of tasks for compaction improves the cluster occupation by 5.6 percentage points (25% workload) and 8.2 points (90%). In turn, when the workload is in the mid-range, the improvement is larger: 11.8 points for the 50% workload case and 10.8 points for the 75% one.
The same methodology has been followed when testing the "scheduling with priorities" algorithm. In this case, the objective was to keep the nodes on which the "highPriority" queue was configured as full as possible, moving jobs there from the "lowPriority" one. To do so, we divided the cluster into two partitions with different priorities and created random jobs that could run on any of them. Slurm would then try to run every incoming job on the high-priority queue if there were free resources, and on the low-priority one if not.
This scenario could be of great interest in heterogeneous clusters containing nodes with different processors, i.e., where some nodes present a higher computing power than others. By defining an algorithm that takes into account the peak performance of these different zones of the whole cluster, for example, tasks could be migrated to/from the "highPriority" queue in order to obtain a better usage of the supercomputer. This concept could be extended to clusters composed of islands.
The first thing to take into account is the set of pre-fixed conditions for this test. In this case, we defined the following boundary conditions:
– The percentage of fully empty nodes was reduced to a minimum on the nodes where the "highPriority" queue was configured, to around 1%. Thus, there were not many free cores to which jobs could be migrated.
– Only two out of the eight nodes were configured with the "highPriority" queue.
– The test was carried out with the highest workloads.
In doing so, we set up experimental conditions working against the migration of tasks, i.e., drawbacks limiting the possibility of migrating them. The aim of this test definition is to demonstrate that if we are able to improve the cluster efficiency in such an unfavorable scenario, it can be inferred that the possibility of migrating tasks will be useful under general conditions.
Table 2 shows a comparison of the Slurm default scheduling algorithm with a migration-based one. As can be seen, migration can be a helpful tool in this kind of environment.
First, the percentage of free nodes remains constant in those nodes where the "highPriority" queue was configured (1.05% and 1.25%), which indicates that the test has been properly carried out according to the first boundary condition mentioned above. Second, with the 75% load, there is an increase in the time during which the nodes related to the "highPriority" queue are fully used over the course of the test, moving from 25.58 to 31.64%. As the percentage of nodes with a mixed usage decreases, the cluster is more efficiently used, which complies with the purpose of the test. Third, when the cluster is almost fully used, the percentages are roughly the same (with even a small decrease in the test with migration of tasks), a result that is consistent with the fact that there is almost nothing to migrate to.
Table 2 Results of the test in which two queues with different priorities were configured

              Without migration          With migration
              Low prio.   High prio.     Low prio.   High prio.
75% load
  %empty        20.33        1.06          19.24        1.05
  %mixed        43.49       73.36          48.33       67.31
  %full         36.18       25.58          32.43       31.64
90% load
  %empty         7.72        1.25           7.72        1.25
  %mixed        31.65       57.56          36.85       58.02
  %full         60.63       41.19          55.43       40.73

The percentage of the nodes of the cluster without and with the migration of tasks via checkpoint/restart is shown
Again, it is worth mentioning that the scheduling algorithms used for the experiments cover two different scenarios in order to demonstrate the feasibility of live task migration; relying on more sophisticated algorithms, applying artificial intelligence and/or stochastic processes, would be conducive to even better results.
There is a clear need for checkpoint/restart in current and future exascale HPC systems. Platforms with millions of cores and thousands of nodes are expected to suffer from more errors, which should be overcome in an efficient way. The mean time between failures will be reduced, and codes will make use of a great number of resources. Checkpoint/restart is therefore a must.
In this work, we have presented the design, implementation, and further integration of a user-level checkpoint/restart library with a resource manager. By making this integration transparent and automatically available for all jobs, a whole new set of possibilities has been enabled. In this way, not only is fault tolerance enhanced, but also the scheduling mechanisms, dynamic migration of tasks, job preemption, easier cluster administration, and so on, can be seamlessly performed from now on. Results on a real cluster are provided to demonstrate this.
This demonstrates the possibilities for a more efficient and flexible use of supercomputers, in which new algorithms can be defined to improve the efficiency of the platform, reduce its energy consumption, or determine a trade-off between efficiency and energy reduction.
Regarding the checkpoint library, DMTCP is nowadays only able to save the state of standard CPUs, though there is a working version for Nvidia CUDA that has provided its first successful results [43]. Such support will greatly enhance the impact
of this work. With respect to the Xeon Phi, it seems that it is going to be discontinued; however, from the C/R operational point of view it behaves like a standard CPU, since KNL places the accelerator directly on the motherboard.
Regarding the migration of software containers inside HPC clusters, checkpointing of Docker is not supported by DMTCP at this time. Until this support is added, jobs employing such technologies are simply marked as not checkpointable and are not considered by the migration policies and tools. The integration of these accelerators and/or containers as resources supported by DMTCP will, in the future, enhance the impact of the work presented here. Finally, it should be reiterated that the solution shown in this work is not tied to DMTCP: another checkpoint library with the right properties could be integrated into the tool described here, while continuing to support the same functionality.
Most importantly, future scheduling algorithms can benefit from job migration. But there is still scarce literature on sophisticated scheduling algorithms that have been tested at scale and in practice. In part, this is because of the previous lack of a robust mechanism for checkpoint-based job migration in common usage. The scheduling algorithms presented in this work serve as a test case and a demonstration of our approach, but are still limited in their functionality and performance. This opens up the future possibility of using artificial intelligence both for pure scheduling design and for new resilience strategies, in which actual workload traces will be analyzed instead of randomly generated workloads in order to provide tailored and customized analyses of specific supercomputers. Approaches using simulation [44] and forecasting methodologies [45] can also be considered here.
Acknowledgements This work was partially funded by the Spanish State Research Agency projects
CODEC2 (TIN2015-63562-R) and CODEC-OSE (RTI2018-096006-B-I00) with FEDER funds and
the EU H2020 Project Enerxico (Grant Agreement No 828947) and supported by the RICAP Network
(517RT0529) with CYTED funds.
References
1. Flich J et al. (2017) MANGO: exploring manycore architectures for next-generation HPC systems.
In: Kubatova H, Novotny M, Skavhaug A (eds) Euromicro Conferences on Digital System Design
(DSD), pp 478–485
2. Wyngaard J, Inggs M, Collins J, Farrimond B (2013) Towards a many-core architecture for HPC.
In: Cardoso JMP, Morrow K, Diniz PC (eds) 23rd International Conference on Field Programmable
Logic and Applications (FPL2013)
3. European Technology Platform for High Performance Computing (2017) Strategic Research Agenda. www.etp4hpc.eu
4. Bailey C, Parry J (2017) Co-design, modelling and simulation challenges: from components to systems. In: Proceedings 23rd International Workshop on Thermal Investigations of ICs and Systems (THERMINIC), pp 1–4
5. Hill MD, Marty MR (2017) Retrospective on Amdahl’s law in the multicore Era. Computer
50(6):12–14
6. Martineau M, McIntosh-Smith S (2017) The arch project: physics mini-apps for algorithmic exploration and evaluating programming environments on HPC architectures. In: Proceedings IEEE International Conference on Cluster Computing (CLUSTER2017), pp 850–857
7. Aupy G et al (2016) Co-scheduling algorithms for high-throughput workload execution. J Sched
19(6):627–640
8. Rajan M, Doerfler D (2010) HPC application performance and scaling: understanding trends and future challenges with application benchmarks on past, present and future tri-lab computing systems. In: Psihoyios G, Tsitouras C (eds) Numerical Analysis and Applied Mathematics, vol I–III (AIP Conference Proceedings 1281), pp 1777–1780
9. Yoo AB, Jette MA, Grondona M (2003) SLURM: simple linux utility for resource management. In:
Feitelson D, Rudolph L, Schwiegelshohn U (eds) Job scheduling strategies for parallel processing
(JSSPP 2003), vol 2862. Lecture Notes in Computer Science. Springer, Berlin
10. Ansel J, Arya K, Cooperman G (2009) DMTCP: transparent checkpointing for cluster computations and
the desktop. In: IEEE International Symposium on Parallel & Distributed Processing, Rome, pp 1–12
11. Tao J, Kolodziej J, Ranjan R, Jayaraman PP, Buyya R (2015) A note on new trends in data-aware
scheduling and resource provisioning in modern HPC system. Future Gener Comput Syst 51:45–46
12. Ge X, Jin H, Leung VCM (2018) Joint opportunistic user scheduling and power allocation: throughput optimisation and fair resource sharing. IET Commun 12(5):634–640
13. Padua D (2011) Encyclopedia of parallel computing. Springer, New York
14. Hargrove PH, Duell JC (2006) Berkeley lab checkpoint/restart (BLCR) for linux clusters. J Phys
Conf Ser 46:494–499
15. Sankaran S et al (2005) The LAM/MPI checkpoint/restart framework: system-initiated checkpointing. Int J High Perform Comput Appl 19(4):479–493
16. Cao J, Arya K, Garg R, Matott S, Panda DK, Subramoni H, Vienne J, Cooperman G (2016) System-level scalable checkpoint-restart for petascale computing. In: Proceedings IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS), pp 932–941
17. https://criu.org/Main_Page
18. Li W, Kanso A, Gherbi A (2015) Leveraging linux containers to achieve high availability for cloud
services. In: Proceedings IEEE International Conference on Cloud Engineering, Tempe, AZ, pp 76–83
19. Vogt D, Giuffrida C, Bos H, Tanenbaum AS (2015) Lightweight memory checkpointing. In: Proceedings 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Rio de Janeiro, pp 474–484
20. Takizawa H, Amrizal MA, Komatsu K, Egawa R (2017) An application-level incremental checkpointing mechanism with automatic parameter tuning. In: 5th International Symposium on Computing and Networking (CANDAR), Aomori, pp 389–394
21. Ferreira KB, Riesen R, Bridges P, Arnold D, Brightwell R (2014) Accelerating incremental checkpointing for extreme-scale computing. Future Gener Comput Syst 30:66–77
22. Moody A, Bronevetsky G, Mohror K, Supinski BR (2010) Design, modeling, and evaluation of a
scalable multi-level checkpointing system. In: Proceedings ACM/IEEE International Conference for
High Performance Computing, Networking, Storage and Analysis, pp 1–11
23. Bautista-Gomez L, Tsuboi S, Komatitsch D, Cappello F, Maruyama N, Matsuoka S (2011) FTI: high performance fault tolerance interface for hybrid systems. In: Proceedings International Conference for High Performance Computing, Networking, Storage and Analysis, Seattle, WA, pp 1–12
24. Tiemeyer MP, Wong JSK (1998) A task migration algorithm for heterogeneous distributed computing systems. J Syst Softw 41(3):175–188
25. Tsakalozos K, Verroios V, Roussopoulos M, Delis A (2017) Live VM migration under time-constraints in share-nothing IaaS-clouds. IEEE Trans Parallel Distrib Syst 28(8):2285–2298
26. Jaswal T, Kaur K (2016) An enhanced hybrid approach for reducing downtime, cost and power consumption of live VM migration. In: Proceedings International Conference on Advances in Information Communication Technology & Computing, vol 72
27. Bargi A, Sarbazi-Azad H (2011) Task migration in three-dimensional meshes. J Supercomput
56(3):328–352
28. Kale LV, Krishnan S (1993) CHARM++: a portable concurrent object oriented system based on C++. In: Proceedings of the 8th Annual Conference on Object-Oriented Programming Systems, Languages, and Applications, pp 91–108
29. Iserte S, Mayo R, Quintana-Ortí SE, Beltran V, Peña JA (2017) Efficient scalable computing
through flexible applications and adaptive workloads. In: 46th International Conference on Parallel
Processing Workshops (ICPPW), pp 180–189
30. Losada N, Martín MJ, González P (2017) J Supercomput 73:316–329
31. Losada N, Cores I, Martín MJ et al (2017) J Supercomput 73:100
32. http://research.cs.wisc.edu/htcondor/
33. Afsharpour S, Patologhy A, Fazeli M (2016) Performance/energy aware task migration algorithm
for many-core chips. Comput Digit Tech 10:165–173
34. Holmbacka S et al (2014) A task migration mechanism for distributed many-core operating systems.
J Supercomput 68(3):1141–1162
35. Sotomayor B, Montero RS, Llorente IM, Foster I (2009) Virtual infrastructure management in private and hybrid clouds. IEEE Internet Comput 13(5):14–22
36. Sefraoui O, Aissaoui M, Eleuldj M (2012) OpenStack: toward an open-source solution for cloud
computing. Int J Comput Appl 55(3):38–42
37. Boucher R (2016) Cloning running services with docker and CRIU. In: Docker Conference
38. Villamayor J, Rexachs D, Luque E (2017) A fault tolerance manager with distributed coordinated
checkpoints for automatic recovery. In: International Conference on High Performance Computing
& Simulation (HPCS), Genoa, pp 452–459
39. Cabello U, Rodriguez J, Meneses A, Mendoza S, Decouchant D (2014) Fault tolerance in heterogeneous multi-cluster systems through a task migration mechanism. In: Proceedings 11th International Conference on Electrical Engineering, Computing Science and Automatic Control
40. Pascual JA, Navaridas J, Miguel-Alonso J (2009) Effects of topology-aware allocation policies on
scheduling performance. Lect Notes Comput Sci 5798:138–144
41. http://rdgroups.ciemat.es/web/sci-track/intranet
42. Feitelson DG (2015) Workload modeling for computer systems performance evaluation. Cambridge
University Press, Cambridge
43. Garg R, Mohan A, Sullivan M, Cooperman G (2018) CRUM: checkpoint-restart support for
CUDA’s unified memory. In: IEEE International Conference on Cluster Computing (CLUSTER), pp
302–313
44. Levy S, Topp B, Ferreira KB, Widener P, Arnold D, Hoefler T (2014) Using simulation to evaluate
the performance of resilience strategies and process failures, SANDIA report, SAND2014-0688
45. Fernández-Anta A et al (2018) Competitive analysis of fundamental scheduling algorithms on
a fault-prone machine and the impact of resource augmentation. Future Gener Comput Syst
78:245–256