DROM Preprint
DROM: Enabling Efficient and Effortless Malleability for Resource Managers
Marco D’Amico, Barcelona Supercomputing Center, Barcelona, Spain ([email protected])
Marta Garcia-Gasulla, Barcelona Supercomputing Center, Barcelona, Spain ([email protected])
Víctor López, Barcelona Supercomputing Center, Barcelona, Spain ([email protected])
ABSTRACT

In the design of future HPC systems, research in resource management is showing an increasing interest in a more dynamic control of the available resources. It has been proven that enabling jobs to change the number of computing resources at run time, i.e. their malleability, can significantly improve HPC system performance. However, job schedulers and applications typically do not support malleability due to the common belief that it introduces additional programming complexity and performance impact. This paper presents DROM, an interface that provides efficient malleability with no effort for program developers. The running application adapts its number of threads to the number of assigned computing resources in a way that is completely transparent to the user, through the integration of DROM with standard programming models such as OpenMP/OmpSs and MPI. We designed the API to be easily used by any programming model, application, and job scheduler or resource manager. Our experimental results, from the analysis of two realistic use cases based on malleability (reducing the number of cores a job uses per node, and job co-allocation), show the potential of DROM for improving the performance of HPC systems. In particular, workloads of two MPI+OpenMP neuro-simulators are tested, reporting improvements in system metrics, such as total run time and average response time, of up to 8% and 48%, respectively.

CCS CONCEPTS

• Software and its engineering → Software libraries and repositories;

1 INTRODUCTION

In High Performance Computing (HPC) systems the software stack consists of different layers, from the parallel runtime to the workload manager, each one responsible for a specific task. Application developers, focusing on the individual performance of their applications, use different programming models. This approach is a must to hide low-level architectural details from application developers and users and to extract the maximum performance from new systems. On the other hand, the objective of the workload manager is to maximize the efficient utilization of the computing resources. However, improving system efficiency is typically not well accepted by users and application developers, since their only objective is to speed up their application, even if some of the resources are left underutilized. We claim that these two objectives must coexist and that cooperation between the different stack layers is the way to reach this goal.

We propose to provide resource managers with tools that give them dynamic control of the resources allocated to an application and specific feedback about the utilization of these resources. In this paper we extend the DLB [17] [18] library with a new API designed to be used by resource managers. This new API offers a transversal layer in the HPC software stack to coordinate the resource manager and the parallel runtime. We call this API Dynamic Resource Ownership Management (DROM). DROM has been implemented as part of the DLB distribution and integrated with well-known programming models, i.e. MPI [31], OpenMP [33] and OmpSs [12], and with the SLURM [8] node manager.

By integrating DROM with the above programming models, the API works transparently to the application and, thus, to developers. By integrating the API with SLURM, we enable efficient co-scheduling and co-allocation of jobs. This means that jobs are scheduled to share compute nodes by dynamically partitioning the available resources in an effective way, improving hardware utilization and job response time.

This paper presents the following contributions:
• Definition of DROM, an API that allows cooperation between any job manager and any programming model.
• Integration of DROM with the SLURM node manager for effective resource distribution in the case of co-allocation.
• Integration of DROM with the MPI, OpenMP and OmpSs programming models.
• Evaluation of DROM with real use cases and applications motivated by needs in the Human Brain Project (HBP) [35].

The rest of the paper is organized as follows: Section 2 presents the related work, Section 3 describes the DROM API, Sections 4 and 5 present the DROM integration with programming models and SLURM, Section 6 shows the experiments done to validate the integration and demonstrate the potential of this proposal, and finally Section 7 presents the conclusions and future work.

2 RELATED WORK

The Malleable Parallel Task Scheduling (MPTS) problem has been explored for many years. Theoretical research shows its potential
benefits [27] [13] [32]. These works mainly pick the number of resources that best improves the performance of a parallel task based on a model of its performance available at schedule time. Feitelson [16] classifies a malleable job as a job that can adapt to changes in the number of processors at run time. Deciding on resizing a job at run time is not an easy task for a scheduler, and it is still not fully supported by any standard programming model. However, job scheduling simulations [21] showed the potential benefits of malleability with respect to response time.

Several studies propose malleability based on MPI [31], which allows, in different ways, spawning new MPI processes at run time, or use moldability and folding techniques [36]. These approaches are limited by the inherent partitioning of program data between processes. Data partitioning and redistribution are application dependent, so they need to be done by the application's developers. Furthermore, data transfer among nodes has a high impact on performance, making malleability very costly, especially when using checkpoint and restart techniques. To limit the amount of extra code users must write to obtain malleable applications, the structure of the MPI application is usually constrained: iterative applications using split/merge of MPI processes are used in [28], and master/slave applications are needed in [11]. Martin et al. [29] try to automate data redistribution, but only for vectors and matrices.

Recent work includes an effort in the Charm++ [19] programming model to support malleability. Charm++ enables malleability by implementing fine-grained threads encapsulated into Charm++ objects. This solution is not transparent to developers, i.e., they need to rewrite their applications using this programming model. Adaptive MPI [20] tries to solve this issue by virtualizing MPI processes into Charm++ objects, partially supporting the MPI standard. Charm++ lacks a set of APIs that would allow communication with the job scheduler, because its malleability features were studied for load balancing purposes. There was an effort to implement a Charm++ to Torque [15] communication protocol to enable malleability, but it is not comparable with DROM, because DROM provides generalized APIs that can communicate with any job scheduler or programming model.

Castain et al. [9] presented an extensive set of APIs, part of the PMIx project, including job expanding and shrinking features. It is an interesting attempt to create standardized APIs that applications can use to request more resources from the job scheduler. However, the main difference is that they are designed for evolving applications rather than malleable ones, because changes in resources are requested by the application itself, not by the resource manager.

Despite this tendency in research, users still have neither simple and efficient tools nor support from job schedulers in production HPC machines that would allow them to exploit malleability. We propose DROM, an API that enables malleability of applications inside computing nodes, with negligible overhead for developers and applications. We integrated the DROM API with the OpenMP [33] and OmpSs [12] programming models and the SLURM [8] job scheduler. However, DROM is independent of them, and it can be integrated with any other programming model or job scheduler. DROM manages computing resources by using CPUSETs, lightweight structures handled at the operating system level that are easy and fast to use and manipulate. A similar approach was presented in [14], based on dynamically changing the operating system CPUSETs for MPI processes, but in that case there was no integration with the programming model. That approach is equivalent to oversubscription of resources, i.e., more than one process running on the same core, which in general has a negative impact on applications' performance, as demonstrated in [26]. In our integration, we use the OpenMP/OmpSs programming models to adapt the number of threads to the change in the number of computing resources. OpenMP and OmpSs use threads instead of processes, which are easier to create and destroy, more efficient, and lighter than MPI processes. At the same time, we support hybrid MPI+OpenMP/OmpSs applications, which allows extending DROM capabilities to multi-node environments.

3 DROM: DYNAMIC RESOURCE OWNERSHIP MANAGEMENT

DROM is a new module included in the DLB library; it offers a new API to change the computing resources assigned to a process at run time. This module provides a communication channel between an administrator process and other processes, which adjust their number of threads accordingly.

In this section we briefly explain the structure of the DLB library in order to understand how DROM is integrated into it, we present the proposed DROM API, and we detail how we have integrated it with SLURM.
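As a minimal sketch of how an administrator process (for instance, a resource manager plugin) might drive this channel, the C fragment below changes the CPU mask of a running process. The header and function names (dlb_drom.h, DLB_DROM_Attach, DLB_DROM_SetProcessMask, DLB_DROM_Detach) are assumptions based on the public DLB distribution rather than the API definition given in this paper, and the target PID and mask are arbitrary examples.

    /* Hypothetical administrator-side sketch: shrink a running process to
     * CPUs 0 and 1. DLB header and function names are assumed from the DLB
     * distribution, not quoted from this paper. */
    #define _GNU_SOURCE
    #include <sched.h>        /* cpu_set_t, CPU_ZERO, CPU_SET */
    #include <dlb_drom.h>     /* assumed DROM public header */

    int shrink_to_two_cpus(int target_pid)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);    /* new CPU set: cores 0 and 1 only */
        CPU_SET(1, &mask);

        DLB_DROM_Attach();    /* open the communication channel */
        /* Publish the new mask; the DROM module inside the target process
         * reacts and adjusts its number of threads accordingly. */
        int err = DLB_DROM_SetProcessMask(target_pid, &mask, 0);
        DLB_DROM_Detach();
        return err;
    }

In essence, the administrator only manipulates a CPUSET-like mask; reacting to the change inside the application is left to the runtime, as described next.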
3.1 DLB Framework

DLB is a dynamic library that aims at improving the performance of individual applications and, at the same time, at maximizing the utilization of the computational resources within a node.

The DLB framework is transversal to the different layers of the HPC software stack, from the job scheduler to the operating system. The interaction with the different layers is always done through standard mechanisms, such as PMPI [2] or OMPT [3], explained in more detail in Section 4. Thus, as a general rule, applications do not need to be modified or recompiled to run with DLB as long as they use a supported programming model (MPI + OpenMP/OmpSs). Simply by pre-loading the library, these standard mechanisms can be used to intercept the calls to the programming models and modify the number of required resources as needed.

DLB was initially designed for the Lend When Idle (LeWI) module. This module acts as a dynamic load balancer for a single application that suffers from load imbalance among its processes, by adjusting the number of threads per process when needed. However, our claim is that in HPC systems there is also a need to dynamically balance the load among the processes of multiple jobs executed within the same reservation. In this way, the system can benefit from increased utilization, which would not be the case with separate job submissions. For this reason, we propose the DROM API and offer an implementation of it within the DLB library.

In Figure 1 we can see the DROM module within the DLB framework. DROM provides an API for external entities, such as a job scheduler, a resource manager, or a user, to re-assign the resources used by any application attached to DLB. The DROM module running in each process then reacts and modifies the computing resources allocated to the application. This procedure depends on the programming model, but in essence it implies two steps. First,

Figure 1: DLB Framework. (Diagram; components shown: Job scheduler, Resource manager, User, DROM API, Application, Prog. Model (MPI), Prog. Model (OpenMP), DROM, LeWI, Sh. Mem., Operating System, Hardware.)

alternative is to exploit the features that the programming model already provides. For instance, a private array whose scope is limited to the parallel construct, or a reduction clause where the programming model manages the auxiliary memory, are two solutions that solve this issue and keep the application malleable (see the sketch after this list).
• Hardware is finite. During job co-allocation, DLB may reduce the number of active threads of other processes and may rearrange the pinning of each thread to a new CPU, but it will not reduce the amount of memory allocated by any application. Therefore, the total memory capacity and bandwidth will be shared among the applications.
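As a minimal illustration of the programming-model alternative mentioned in the first point above, the following hypothetical C/OpenMP fragment (not taken from the evaluated applications) avoids any buffer sized by the number of threads, so the parallel region works unchanged for whatever thread count the process is given at run time:

    #include <omp.h>

    /* Instead of a per-thread partial-sum array sized with
     * omp_get_max_threads(), let the runtime manage the auxiliary
     * memory through a reduction clause; the loop is then valid for
     * any number of threads assigned to the process at run time. */
    double sum_array(const double *x, long n)
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < n; i++)
            sum += x[i];
        return sum;
    }

With this form, shrinking or growing the thread team at run time requires no change to the application code.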
Figure 3: In situ analytics example. (Timeline diagram comparing the Serial and DROM scenarios on 16 cores, with a simulation of 4 MPI processes x 4 threads, an analytics job of 2 MPI processes x 2 threads, and events marked (a) to (e) over time.)

In Figure 3 we can see a graphical representation of this use case. The horizontal axis represents time, while the vertical axis represents computational resources, i.e., the number of cores. At time (a) the simulation is launched and started on the available resources. At time (b) the analytics is submitted.

We compare two scenarios. The Serial one considers that the analytics must wait for the simulation to finish before it can start at point (d), because no resources are available. Thus the simulation runs using all the available cores, and only when it finishes is the analytics executed. The second scenario uses DROM to start the analytics immediately at time (b), reducing the number of resources assigned to the simulation. Once the analytics finishes at point (c), the simulation gets its resources back.

We evaluated the following two-application workloads. We use the notation simulation application + analytics application when naming the workloads, i.e. NEST + Pils, NEST + STREAM, CoreNeuron + Pils, CoreNeuron + STREAM. For this use case, we evaluate and analyze the total run time, the average response time, and each application's individual response time, followed by a discussion of the differences between the Serial and DROM scenarios and the various configurations.

Figure 4 shows the difference in the workload's total run time when running the two versions of NEST in one node with Pils. The Y-axis represents total run time in seconds, while on the X-axis we have the different configurations of the applications involved. For both

its data is statically partitioned according to the maximum number of computational resources during initialization, as explained in Section 3, when applying malleability to shrink NEST, the tasks not computed by the removed thread are computed by some of the remaining resources, creating imbalance, as shown in Figure 5. This is a limitation of the application and not an overhead introduced by DROM. In fact, when increasing the number of stolen computing resources, as in the case of Pils Conf. 3, the number of excess tasks increases, and they are better distributed among the remaining resources. In this situation, we improve total run time by up to 2.5% with respect to Pils Conf. 1, while for Conf. 2 it can reach -2.6%. A fully malleable NEST version that did not partition data according to the initial number of threads would improve this result.

Figure 5: Trace showing the simulator's threads on the Y-axis. When thread 16 is removed, its data is computed by the first 4 threads, while the others report lower utilization (white idle spaces).

Figure 6 shows each application's individual response time. Pils's response time, drawn with a line pattern, decreases by up to 96% because its waiting time is reduced to zero, while its run time remains approximately the same. The cost of this reduction is a small increase in NEST's response time, varying from 0% to 4.2%. A 0% increase is due to increased IPC when running on a reduced number of threads.

In Figure 7 we can see the workload's total run time and each application's response time for the NEST and STREAM use case. In this case we remove 2 CPUs from the simulation to run a memory-intensive application, gaining on average 1.84% (up to 3.5%) in terms of total run time. STREAM's response time decreases by up to 92% while
NEST's increases by up to 6.7% in the worst case. Total run time is

workloads, as CoreNeuron also presents the same data partitioning problem. Figure 9 shows improved run time when comparing with

Figure 6: Individual response time of NEST and Pils in the NEST + Pils workload.

Figure 9: Execution time of the CoreNeuron + Pils workload.