DROM: Enabling Efficient and Effortless Malleability for Resource Managers

Marco D'Amico, Marta Garcia-Gasulla, Víctor López, Ana Jokanovic, Raül Sirvent
Barcelona Supercomputing Center, Barcelona, Spain

Julita Corbalan
Universitat Politecnica de Catalunya, Barcelona, Spain

ABSTRACT

In the design of future HPC systems, research in resource management is showing an increasing interest in a more dynamic control of the available resources. It has been proven that enabling jobs to change the number of computing resources at run time, i.e. their malleability, can significantly improve HPC system performance. However, job schedulers and applications typically do not support malleability, due to the common belief that it introduces additional programming complexity and a performance impact. This paper presents DROM, an interface that provides efficient malleability with no effort for program developers. The running application adapts the number of threads to the number of assigned computing resources in a way that is completely transparent to the user, through the integration of DROM with standard programming models, such as OpenMP/OmpSs, and MPI. We designed the APIs to be easily used by any programming model, application, and job scheduler or resource manager. Our experimental results from the analysis of two realistic use cases, based on malleability by reducing the number of cores a job is using per node and on job co-allocation, show the potential of DROM for improving the performance of HPC systems. In particular, workloads of two MPI+OpenMP neuro-simulators are tested, reporting improvements in system metrics, such as total run time and average response time, of up to 8% and 48%, respectively.

CCS CONCEPTS

• Software and its engineering → Software libraries and repositories;

1 INTRODUCTION

In High Performance Computing (HPC) systems the software stack consists of different layers, from the parallel runtime to the workload manager, each one being responsible for a specific task. Application developers, focusing on the individual performance of their applications, use different programming models. This approach is a must to hide low-level architectural details from application developers and users and to extract the maximum performance from new systems. On the other hand, the objective of the workload manager is to maximize the efficient utilization of the computing resources. However, improving the system efficiency is, typically, not well accepted by users and application developers, since their only objective is to speed up their application, even if some of the resources are left underutilized. We claim that these two objectives must coexist and that cooperation between the different stack layers is the way to reach this goal.

We propose to provide resource managers with more tools that give them dynamic control of the resources allocated to an application and specific feedback about the utilization of these resources. In this paper we extend the DLB [17] [18] library with a new API designed to be used by resource managers. This new API offers a transversal layer in the HPC software stack to coordinate the resource manager and the parallel runtime. We call this API Dynamic Resource Ownership Management (DROM). DROM has been implemented as a part of the DLB distribution and integrated with well-known programming models, i.e. MPI [31], OpenMP [33] and OmpSs [12], and with the SLURM [8] node manager.

By integrating DROM with the above programming models, the API works transparently to the application and, thus, to developers. By integrating the API with SLURM, we enable efficient co-scheduling and co-allocation of jobs. This means that jobs are scheduled to share compute nodes by dynamically partitioning the available resources in an effective way, improving hardware utilization and job response time.

This paper presents the following contributions:
• Definition of DROM, an API that allows cooperation between any job manager and any programming model.
• Integration of DROM with the SLURM node manager for effective resource distribution in the case of co-allocation.
• Integration of DROM with the MPI, OpenMP and OmpSs programming models.
• Evaluation of DROM with real use cases and applications motivated by needs of the Human Brain Project (HBP) [35].

The rest of the paper is organized as follows: Section 2 presents the related work, Section 3 describes the DROM API, Sections 4 and 5 present the DROM integration with programming models and SLURM, Section 6 shows the experiments done to validate the integration and demonstrate the potential of this proposal, and finally Section 7 presents the conclusions and future work.
2 RELATED WORK

The Malleable Parallel Task Scheduling (MPTS) problem has been explored for many years. The theoretical research shows its potential benefits [27] [13] [32]. These works mainly pick the number of resources that best improves the performance of the parallel task, based on a model of its performance given at schedule time. Feitelson [16] classifies a malleable job as a job that can adapt to changes in the number of processors at run time. Deciding on resizing a job at run time is not an easy task for a scheduler, and it is still not fully supported by any standard programming model. However, job scheduling simulations [21] showed the potential benefits of malleability concerning response time.

Several studies propose malleability based on MPI [31], which allows, in different ways, spawning new MPI processes at run time, or use moldability and folding techniques [36]. These approaches are limited by the inherent partitioning of program data between processes. Data partitioning and redistribution are application dependent, so they need to be done by the application's developers. Furthermore, data transfer among nodes has a high impact on performance, making malleability very costly, especially when using checkpoint and restart techniques. To limit the amount of extra code users must write to obtain malleable applications, the structure of the MPI application is usually constrained: iterative applications using split/merge of MPI processes are used in [28], and master/slave applications are needed in [11]. Martin et al. [29] try to automate data redistribution, but only for vectors and matrices.

Recent work includes an effort on the Charm++ [19] programming model to support malleability. Charm++ allows malleability for applications by implementing fine-grained threads encapsulated into Charm++ objects. This solution is not transparent to developers, i.e., they need to rewrite their applications using this programming model. Adaptive MPI [20] tries to solve this issue by virtualizing MPI processes into Charm++ objects, partially supporting the MPI standard. Charm++ lacks a set of APIs that would allow communicating with the job scheduler, because its malleability features were studied for load balancing purposes. There was an effort to implement a Charm++ to Torque [15] communication protocol to enable malleability, but it is not comparable with DROM, because DROM provides generalized APIs that can communicate with any job scheduler or programming model.

Castain et al. [9] presented an extensive set of APIs, part of the PMIx project, including job expanding and shrinking features. It is an interesting attempt to create standardized APIs that applications can use to request more resources from the job scheduler. However, the main difference is that they are designed for evolving applications, as opposed to malleable ones, because the change in resources is requested by the application itself, not by the resource manager.

Despite this tendency in the research, users still have neither simple and efficient tools nor the support from job schedulers in production HPC machines that would allow them to exploit malleability. We propose DROM, an API that enables malleability of applications inside computing nodes, with negligible overhead for developers and applications. We integrated the DROM APIs with the OpenMP [33] and OmpSs [12] programming models and the SLURM [8] job scheduler. However, DROM is independent of them, and it can be integrated with any other programming model or job scheduler.

DROM manages computing resources by using CPUSETs, lightweight structures used at the operating system level that are easy and fast to use and manipulate. A similar approach was presented in [14], based on dynamically changing the operating system CPUSETs for MPI processes, but in that case there was no integration with the programming model. That approach is equivalent to oversubscription of resources, i.e., more than one process running on the same core, which in general has a negative impact on the applications' performance, as demonstrated in [26]. In our integration, we use the OpenMP/OmpSs programming models to adapt the number of threads to the change in the number of computing resources. OpenMP and OmpSs use threads instead of processes; threads are easier to create and destroy, more efficient, and lighter than MPI processes. At the same time, we support hybrid MPI+OpenMP/OmpSs applications, which allows the expansion of DROM capabilities to multi-node environments.

3 DROM: DYNAMIC RESOURCE OWNERSHIP MANAGEMENT

DROM is a new module included in the DLB library; it offers a new API to change the computing resources assigned to a process at run time. The module provides a communication channel between an administrator process and the other processes, which adjust their number of threads accordingly.

In this section we briefly explain the structure of the DLB library to understand how DROM is integrated into it, we present the proposed DROM API, and we detail how we have integrated it with SLURM.

3.1 DLB Framework

DLB is a dynamic library that aims at improving the performance of individual applications and, at the same time, at maximizing the utilization of the computational resources within a node.

The DLB framework is transversal to the different layers of the HPC software stack, from the job scheduler to the operating system. The interaction with the different layers is always done through standard mechanisms such as PMPI [2] or OMPT [3], explained in more detail in Section 4. Thus, as a general rule, applications do not need to be modified or recompiled to run with DLB, as long as they use a supported programming model (MPI + OpenMP/OmpSs). Simply by pre-loading the library, these standard mechanisms can be used to intercept the calls to the programming models and modify the number of required resources as needed.

DLB was initially designed for the Lend When Idle (LeWI) module. This module acts as a dynamic load balancer for a single application that suffers from process load imbalance, by adjusting the number of threads per process when needed. However, our claim is that in HPC systems there is also a necessity to dynamically balance the load among the processes of multiple jobs executed within the same reservation. In this way, the system can benefit from increased utilization, which would not be the case with separate job submissions. For this reason, we propose the DROM API and offer an implementation of it within the DLB library.

In Figure 1 we can see the DROM module within the DLB framework. DROM provides an API for external entities, such as a job scheduler, a resource manager, or a user, to re-assign the resources used by any application attached to DLB. Then, the DROM module running in each process reacts and modifies the computing resources allocated to the application. This procedure depends on the programming model, but in essence it implies two steps. First, the application modifies the number of active threads running within the shared memory programming model (OpenMP, OmpSs, etc.). Lastly, each active thread is pinned to a specific CPU to avoid any oversubscription during the coexistence of the many processes in the node.

[Figure 1: DLB Framework. The job scheduler, resource manager, or user reaches the DROM API; the application (MPI, OpenMP) sits on top of the DROM and LeWI modules, which communicate through shared memory above the operating system and the hardware.]

DLB uses the node's shared memory for communication between the different processes, implemented as a common, lock-protected address space where all processes attached to DLB can read and write. While the communication can be asynchronous for the sender, the receiver, by default, uses a polling mechanism based on the interception interfaces. This mechanism produces a negligible overhead, but it relies exclusively on the frequency of the programming model invocations. Alternatively, DLB also implements an asynchronous mode for the receiver, using a helper thread and a callback system.

This design of the framework allows a user, or a developer, to add DLB support to an application with minimal effort. However, there are some considerations to take into account before running an application with DLB support, i.e. how the application reacts to an unintended change of the number of running threads, and whether other hardware resources, apart from CPUs, can perform when other applications are co-allocated. The former only depends on the application implementation, while the latter may depend on several factors such as total memory consumed, I/O bandwidth, etc.:

• Application's inherent non-malleability. DLB may change the number of active threads, or max_threads in OpenMP nomenclature, at any time during the execution. For this reason, the application should be malleable. We consider an application completely malleable when its design allows changing the number of threads at any time while its completion remains valid and successful. This condition requires a thread-based programming model with some level of malleability, although its effect does not need to be immediate. For example, OpenMP is not able to modify the number of threads until the next parallel construct, but we consider this acceptable. An application is not malleable when, at some arbitrary point of the execution, it obtains the number of threads and assumes it will not change in the future. For instance, a common practice in OpenMP applications is to allocate some auxiliary memory based on the current number of threads. At a later time, inside a parallel region, this memory will be indexed by the current thread identification number and may cause different errors, depending on whether the team size is smaller or larger than the one assumed by the application. The suggested alternative is to exploit the features that the programming model already provides. For instance, a private array whose scope is limited to the parallel construct, or a reduction clause where the programming model manages the auxiliary memory, are two solutions that solve this issue and keep the application malleable (see the sketch after this list).
• Hardware is finite. During job co-allocation, DLB may reduce the number of active threads of other processes and may rearrange the pinning of each thread to a new CPU, but it will not reduce the amount of allocated memory of any application. Therefore, the total memory capacity and bandwidth will be shared among the applications.
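To make the first consideration concrete, the following sketch (our illustration, not code from DLB) shows the thread-count assumption that breaks malleability, together with the suggested fix based on a feature the programming model already provides:

#include <stdlib.h>
#include <omp.h>

/* Non-malleable pattern: auxiliary memory is sized once, from the
 * thread count observed at allocation time. If DROM later enlarges
 * the team, threads index past the end of the array. */
void broken_sum(const double *x, int n, double *sum) {
    int nthreads = omp_get_max_threads();
    double *partial = calloc(nthreads, sizeof(double));
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();   /* out of bounds if the team grew */
        #pragma omp for
        for (int i = 0; i < n; i++)
            partial[tid] += x[i];
    }
    *sum = 0.0;
    for (int t = 0; t < nthreads; t++)
        *sum += partial[t];
    free(partial);
}

/* Malleable alternative: a reduction clause lets the runtime manage
 * the auxiliary storage, so no assumption on the team size is made. */
void malleable_sum(const double *x, int n, double *sum) {
    double acc = 0.0;
    #pragma omp parallel for reduction(+:acc)
    for (int i = 0; i < n; i++)
        acc += x[i];
    *sum = acc;
}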
support.
• Application’s inherent non-malleability. DLB may change the
number of active threads, or max_threads in OpenMP nomen- int DROM_Detach(void)
clature, at any time during the execution. For this reason, Detach current process from DROM system. If previously
the application should be malleable. We consider an applica- attached, a process must call this function to correctly close
tion completely malleable when its design allows changing file descriptors and clean data.
the number of threads at any time, and its completion is int DROM_GetPidList(int *pidlist, int *nelems,
still valid and successful. This condition requires a thread
int max_len)
based programming model with some level of malleability, al- Obtain the list of running processes registered in the DROM
though its effect does not need to be immediate. For example, system.
OpenMP is not able to modify the number of threads until
the next parallel construct, but we consider it acceptable. int DROM_GetProcessMask(int pid, dlb_cpu_set_t
An application is not malleable when, at any arbitrary point mask, dlb_drom_flags_t flags)
of the execution, obtains the number of threads and assumes int DROM_SetProcessMask(int pid,
it will not change in the future. For instance, a common const_dlb_cpu_set_t mask, dlb_drom_flags_t
practice in OpenMP applications is to allocate some auxiliary flags)
memory based on the current number of threads. At a later Getter and Setter of the process mask for a given PID.
time, inside a parallel region, this memory will be indexed
by the current thread identification number and may cause int DROM_PreInit(int pid, const_dlb_cpu_set_t
different errors depending whether the team size is smaller mask, dlb_drom_flags_t flags, char ***
or larger than the assumed by the application. The suggested next_environ)
3
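As a concrete picture of the DROM_PreInit workflow just described, the following is a minimal sketch of an administrator launching a new process. The header name, the zero flags value, and the absence of error handling are our assumptions for illustration, not part of the interface definition above:

#include <unistd.h>
#include <sys/types.h>
#include "dlb_drom.h"   /* hypothetical header exposing the DROM calls */

/* Reserve the CPUs in 'mask' for a new process, then fork and exec
 * into it. DROM_PreInit registers the current PID and fills
 * next_environ so that the child can register itself using the
 * parent's process ID. */
pid_t spawn_with_mask(const_dlb_cpu_set_t mask, char *const argv[]) {
    char **next_environ = NULL;
    DROM_PreInit(getpid(), mask, 0 /* flags: assumed default */, &next_environ);

    pid_t child = fork();
    if (child == 0) {
        /* Child: exec the new program with the DROM-prepared environment. */
        execve(argv[0], argv, next_environ);
        _exit(127);  /* only reached if execve fails */
    }
    /* Parent: once the child has terminated, DROM_PostFinalize(child, 0)
     * should be called to clean up the shared memory entry. */
    return child;
}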
The non-standard C types used in this interface are:

• dlb_cpu_set_t is actually a void pointer provided as an opaque type, and it is cast back internally to cpu_set_t. This data set is a bitset where each bit represents a CPU. It is defined in the GNU C library [7].
• dlb_drom_flags_t is a custom bitset provided by DLB. This argument adds some flexibility to the interface by allowing options such as whether the function call is synchronous or asynchronous, whether to steal the CPUs from other processes, etc.
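Putting the interface together, a minimal administrator that lists the registered processes and shrinks the mask of one of them could look like the sketch below. This is our illustration under stated assumptions: the header name and the zero flags are hypothetical, and the opaque dlb_cpu_set_t is handled through the glibc cpu_set_t macros it wraps:

#define _GNU_SOURCE
#include <sched.h>      /* cpu_set_t, CPU_ZERO, CPU_SET */
#include "dlb_drom.h"   /* hypothetical header exposing the DROM calls */

int main(void) {
    DROM_Attach();

    /* Obtain the processes currently registered with DROM support. */
    int pids[64];
    int npids = 0;
    DROM_GetPidList(pids, &npids, 64);

    if (npids > 0) {
        /* Read the current mask of the first process... */
        cpu_set_t mask;
        DROM_GetProcessMask(pids[0], &mask, 0);

        /* ...and restrict it to CPUs 0-3, e.g. to make room in the
         * node for a second job. The target process applies the new
         * mask at its next malleability point. */
        CPU_ZERO(&mask);
        for (int cpu = 0; cpu < 4; cpu++)
            CPU_SET(cpu, &mask);
        DROM_SetProcessMask(pids[0], &mask, 0);
    }

    DROM_Detach();
    return 0;
}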
4 INTEGRATION OF DROM WITH PROGRAMMING MODELS

As previously explained, DLB applications need a shared memory programming model to modify the number of running threads, thus achieving the resource co-allocation. The currently supported thread-based programming models are OpenMP and OmpSs. MPI interception is also supported by DLB, to add more synchronization points between the application and DLB, as well as to gather more information about the application structure and improve the resource scheduling policies.

4.1 Integration with OpenMP

Any OpenMP application can use the DLB library without having to be recompiled, as long as the OpenMP runtime used supports OMPT. OMPT is a new interface introduced in OpenMP Technical Report 4 [3], and it will probably be included in the next OpenMP 5.0 specification. The interface allows external tools to monitor the execution of an OpenMP program. Even though the interface is not yet part of the OpenMP standard, several OpenMP runtimes already include it in their latest versions, such as the proprietary branch of Intel (2018.2.046) and their open-source branch based on LLVM's runtime [1].

If the OpenMP runtime implements this interface, DLB can register itself as a monitoring tool when the library is loaded. Then, DLB can set callbacks that will be automatically invoked for each parallel construct and implicit task creation, allowing it to modify the number of resources accordingly.
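For reference, the sketch below shows the general shape of such a tool registration through OMPT. We write it against the OMPT interface as it was later standardized in OpenMP 5.0, which differs in details from the TR4 draft available at the time, and the callback body is a placeholder, not DLB's actual code:

#include <omp-tools.h>

/* Invoked by the runtime at every parallel construct; a tool like DLB
 * can use this point to apply a pending resource change. */
static void on_parallel_begin(ompt_data_t *task_data,
                              const ompt_frame_t *task_frame,
                              ompt_data_t *parallel_data,
                              unsigned int requested_parallelism,
                              int flags, const void *codeptr_ra) {
    /* placeholder: poll for a new CPU mask and resize the team */
}

static int tool_initialize(ompt_function_lookup_t lookup,
                           int initial_device_num, ompt_data_t *tool_data) {
    ompt_set_callback_t set_callback =
        (ompt_set_callback_t) lookup("ompt_set_callback");
    set_callback(ompt_callback_parallel_begin,
                 (ompt_callback_t) on_parallel_begin);
    return 1;  /* non-zero: keep the tool active */
}

static void tool_finalize(ompt_data_t *tool_data) {}

/* The runtime looks this symbol up when the library is (pre)loaded. */
ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                          const char *runtime_version) {
    static ompt_start_tool_result_t result = {
        tool_initialize, tool_finalize, {0}
    };
    return &result;
}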
4.2 Integration with OmpSs

OmpSs is a task-based programming model also developed at BSC, forerunner of many features accepted in the OpenMP specification. The OmpSs runtime includes DLB support; if enabled, any compiled application can use the DLB features provided by the runtime by setting the appropriate option.

4.3 Integration with MPI

HPC applications often request several computing nodes; thus, shared memory programming models are not enough. A message passing interface is required for the communication among the different processes of the application, and the MPI standard is probably the most used for that purpose. Being aware of its importance, DLB implements an interception mechanism by using the MPI standard profiling interface, PMPI. PMPI allows any profiler, in this case DLB, to intercept any standard MPI call and run custom code before and after the real MPI call.

DLB supports MPI interception and acts as an application profiler, but it does not implement malleability at the process level, i.e., MPI processes are never decreased or increased, nor is any program data ever moved between processes. For DROM purposes, MPI interception is only used to poll DLB and check if there are pending actions to be taken. If the program runs with a new version of OpenMP implementing OMPT, or with OmpSs, the MPI layer is completely optional.
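The interception pattern relies only on the standard PMPI name shifting: the preloaded library defines the MPI entry point, runs its own code, and forwards to the real implementation, which remains reachable under the PMPI_ prefix. A minimal sketch of the pattern follows (our illustration; DLB's actual wrappers cover the whole MPI API, and the DLB calls shown follow the conventions of Listing 1 below):

#define _GNU_SOURCE
#include <sched.h>   /* cpu_set_t */
#include <mpi.h>
#include "dlb.h"     /* DLB_PollDROM, DLB_SUCCESS */

/* By defining MPI_Barrier itself, a preloaded library shadows the
 * MPI one; the real call stays available as PMPI_Barrier. */
int MPI_Barrier(MPI_Comm comm) {
    int ncpus;
    cpu_set_t mask;
    /* Use the interception point to poll DROM for pending actions. */
    if (DLB_PollDROM(&ncpus, &mask) == DLB_SUCCESS) {
        /* apply the new allocation, e.g. via omp_set_num_threads(ncpus) */
    }
    return PMPI_Barrier(comm);   /* forward to the real call */
}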
4.4 Integration with applications without a supported programming model

DROM has been designed to be easily used even by applications that do not run a supported programming model. The DLB library includes an interface for applications to become DROM-responsive and react to the reallocations performed by the manager process. Using the DLB interface in the application implies that it has to be recompiled, but it also offers more flexibility to call DLB only at those safe points where the application can change the number of threads, if it is not completely malleable.

Listing 1 shows an example of an iterative application manually modified to support DROM. The effort for developers is minimal. First, the application needs to initialize and finalize DLB correctly when appropriate. Then, just before entering the malleable parallel code, it should poll the DROM module to check whether the resources need to be readjusted and, if needed, perform the necessary actions. This adjustment needs to be done by the application. In the case of an OpenMP application, it may include a call to omp_set_num_threads and, optionally, a rebind of the threads if the runtime is configured to bind them to CPUs.

#include "dlb.h"

int main(int argc, char **argv) {
    /* initialization */
    DLB_Init();
    ...
    /* main loop */
    for (i = 0; i < end; i++) {
        if (DLB_PollDROM(&ncpus, &mask) == DLB_SUCCESS) {
            modify_num_resources(ncpus, &mask);
        }
        #pragma omp parallel
        ...
    }
    /* finalization */
    DLB_Finalize();
    ...
    return 0;
}

Listing 1: Iterative application manually invoking DROM.
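The modify_num_resources helper in Listing 1 is application code. For an OpenMP application, a minimal sketch could combine the thread-count update with the optional affinity rebind mentioned above, roughly as follows (ours; the DROM mask is treated as a glibc cpu_set_t, and error handling is omitted):

#define _GNU_SOURCE
#include <sched.h>   /* cpu_set_t, sched_setaffinity */
#include <omp.h>

/* Apply a resource change delivered by DLB_PollDROM: resize the
 * OpenMP team and re-pin the process to the new mask, so that no
 * thread remains on a CPU now owned by a co-allocated job. */
static void modify_num_resources(int ncpus, const cpu_set_t *mask) {
    omp_set_num_threads(ncpus);   /* effective at the next parallel construct */
    sched_setaffinity(0 /* calling process */, sizeof(*mask), mask);
}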

5 INTEGRATION OF DROM WITH SLURM

The resources of an HPC machine are managed by a job scheduler and a resource manager. These two pieces of software allow for fast and efficient exploitation of the system's computing resources. To execute their applications on a part of the machine, users need to submit a job in which they specify the type and amount of resources they need, and the period of time during which they need them. All user requests are collected as jobs, and they are usually managed in a priority queue. One of the most used job schedulers in research, as well as in production systems, is SLURM. It stands out for its efficiency and scalability, and because it is open source software.

The DROM APIs were integrated into SLURM to automate the placement of jobs' tasks inside computing nodes whenever one or more malleable jobs are scheduled inside the same nodes.

The following implementation only affects job placement inside nodes, i.e. selecting, for each node, the CPUs on which the job will run. Slurmctld, the cluster controller in charge of scheduling jobs and selecting the compute nodes on which they will run, is unchanged, as the purpose is to give a proof of integration of the DROM APIs, not to present new scheduling policies.

The SLURM structure grants portability of the code by the use of plugins, dynamic libraries that allow system administrators to avoid recompiling the SLURM core. For this reason, the implementation is enclosed in SLURM's task/affinity plugin, which is in charge of distributing the resources assigned by slurmctld to the job's tasks. Task/affinity is dynamically loaded by slurmd and slurmstepd, dividing the code flow into two parts. The first is done inside slurmd, in charge of managing a single computing node's resources and, thanks to the plugin, of calculating and distributing CPU masks to the tasks of the scheduled job. The second part is called by slurmstepd, a daemon that controls correct task launch and execution. At launch point, the plugin picks the mask assigned by slurmd and actually sets it.

In Figure 2 we give an example that clarifies the actions of DROM within SLURM. It illustrates the steps performed within DROM-enabled slurmd and slurmstepd. We present a scenario of two jobs starting to share a computing node. Job 1, to simplify the figure, is a one-task job already running in node 1, while job 2 is a two-task job just submitted and given resources on both node 1 and node 2. Initially, job 1 uses all the resources of node 1, until a part of them is taken by DROM and given to job 2.

On the left we have job 1 running task 1.1 in node 1, and on the right the start procedure for job 2. The vertical axis represents time for each involved component; red boxes are modified SLURM parts, blue boxes are unmodified parts, and green boxes are DROM calls.

[Figure 2: SLURM job launch procedure for DROM malleable applications in two computational nodes.]

Starting from the top, node 1's slurmd executes the submitted batch script, which uses srun to launch a parallel malleable application, i.e., job 2. Srun sends requests to launch the tasks to the two slurmds involved in job 2's allocation. Both slurmds call the launch_request (1) function, which calculates the CPU mask for the starting task. In this part of the code, since job 1 is running in the node, our implementation calculates a new mask for both the new and the running job, where the mask of the running job is a subset of its original mask. In this case, the CPU distribution is done to keep running and new processes balanced in the number of CPUs per task, assuming that imbalance in hybrid MPI+OpenMP/OmpSs applications degrades performance. The algorithm also distributes CPUs trying to keep applications in separate sockets, in order to improve data locality. In this scenario, for fairness, computational resources are equally partitioned among the running jobs.
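As an illustration of the mask computation just described, an equipartition between one running and one starting job could be written as below. This is a simplified sketch; the real plugin also balances CPUs across the tasks of each job:

#define _GNU_SOURCE
#include <sched.h>   /* cpu_set_t, CPU_ZERO, CPU_SET */

/* Equipartition of a node between one running and one starting job,
 * assuming the running job originally owned all 'total_cpus' CPUs:
 * it keeps the lower half (a subset of its original mask), and the
 * new job receives the upper half. */
static void equipartition(int total_cpus,
                          cpu_set_t *running_mask, cpu_set_t *new_mask) {
    CPU_ZERO(running_mask);
    CPU_ZERO(new_mask);
    for (int cpu = 0; cpu < total_cpus; cpu++) {
        if (cpu < total_cpus / 2)
            CPU_SET(cpu, running_mask);
        else
            CPU_SET(cpu, new_mask);
    }
}

On the two-socket, 16-core nodes used later in the evaluation, the two halves coincide with the sockets, so this split also matches the locality criterion described above.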
After calculating the masks for both the new and the running tasks, slurmd forks and executes slurmstepd. Slurmstepd calls a pre_launch (2) function, in charge of setting the mask calculated by slurmd for the controlled task and, eventually, of updating the mask of the other running task 1.1, if necessary. This is done using the DROM_PreInit (2.1) function. At the next malleability point, when task 1.1 runs DLB_PollDROM (3), it gets the new CPU mask from shared memory and applies it, reducing the number of assigned CPUs per task. In Figure 2 we see the reduction in CPUs as the shrinkage of the blue line. If job 1 also runs on node 2, coordination is implicit in slurmd's CPU distribution, which gives the same placement on both nodes.

When a task ends, post_term (4) is invoked, which involves a call to DROM_PostFinalize (4.1). This function can return CPUs to the job that is the initial owner of the CPUs, i.e., job 1. Of course, this is only possible if that job is still running and keeps calling DLB_PollDROM (3).

When a job completes, slurmd calls release_resources (5), which redistributes the freed CPUs to the still-running tasks. In case the job owning the CPUs, in this case task 1.1, completes before job 2, the CPUs will be acquired by job 2, which will expand its mask to increase node utilization. This is done by using the DROM_GetPidList, DROM_GetProcessMask and DROM_SetProcessMask (5.1) APIs.
6 EVALUATION OF DROM-ENABLED SYSTEM'S PERFORMANCE

To evaluate the potential and utility of the DROM API, we perform two types of experiments that follow two realistic use case scenarios, supported by HBP:

(1) In-Situ Analytics. The workload consists of two jobs: 1) a big and long job that we will refer to as the simulation, and 2) a small and short job that we will refer to as the analytics. This scenario corresponds to a use case of HBP in HPC machines, where a neuro-simulation is running, and a visualizer or a data analytics program can periodically check partial simulation results instead of waiting for the simulation to complete. To run an analytics application within a standard system, the user would launch a second job asking for resources and wait until they are available. Using DROM, the analytics uses part of the resources allocated to the simulation, by temporarily shrinking the simulation's number of used resources. This permits running the analytics in the same node, avoiding reading and writing data to disk in case the analytics is able to exchange data with the simulation in memory, or avoiding data transfer in case the analytics runs on a local machine.

(2) High-priority job. In the second use case, also part of the HBP use cases, we consider the scenario of two jobs: 1) a long-running simulation, and 2) a new high-priority long-running job, e.g., an interactive job or an urgent simulation, arriving in the queue. In the absence of available resources, the high-priority job needs to wait in the job queue, or the already running job needs to be preempted or oversubscribed, which would degrade the performance, as previously explained in Section 2. With other malleability implementations, the simulation would need to shrink in the number of nodes, creating overhead due to data movement and checkpoint/restart operations. In the DROM case, the application can keep executing on the same number of nodes, but on a reduced number of resources per node, while the high-priority job is scheduled to run in the same job allocation.

We use a set of real applications, two neuro-simulation applications and two synthetic benchmarks:

• NEST [24] is a simulator for spiking neural network models. It is parallelized with MPI and OpenMP. We have modified the code of NEST, based on version 2.12.0, to make it malleable [5]. Additionally, we have added calls to poll_DROM at the safe points where the number of threads can be changed.
• CoreNeuron is a simulator for modeling neurons and networks [23]. It is parallelized with MPI and OpenMP. We have modified the code to add calls to poll_DROM at safe points for malleability [4].
• Pils [18] is a synthetic benchmark doing computation-intensive operations. It is parallelized with MPI + OmpSs. It can be configured to run with different numbers of MPI processes and OpenMP/OmpSs threads. In our experiments, we use it to simulate a compute-bound parallel data analytics.
• STREAM is a benchmark intended to measure sustainable memory bandwidth [30]. The dataset size can be adjusted; we configured it to run multiple iterations with an 8 GB dataset. The application is parallelized with MPI + OpenMP. We used this benchmark to simulate memory-bound analytics software.

The Pils and STREAM benchmarks are used to reproduce the behavior of the in-situ visualizers and analytics used in the HBP project, at this point still at an early stage.

All the experiments are real-machine workload runs. For that purpose, we used the MareNostrum III (MN3) supercomputer [10], based on Intel SandyBridge processors, with each node containing two sockets with eight cores per socket and 128 GB of DDR3 memory. The operating system is a SLES distribution, with the Platform LSF [22] resource manager. NEST and CoreNeuron were compiled using Intel 2017.1 compilers and OpenMPI libraries version 1.10; Pils and STREAM were compiled with Mercurium 2.0.0 and Nanos 0.13a. We ran the experiments using the original SLURM, based on version 15.08.11, and the modified version that uses DROM to exploit malleability, as described in Section 5. To run the modified SLURM version, we created an environment where we can launch the real SLURM as an LSF job on a portion of 3 nodes of the production machine: one for the controller, two as computing nodes. This environment allows us to run SLURM as a regular job scheduler for real-application jobs submitted to it, while being freely configurable by us.

All the reported results are an average of at least 3 runs performed on two MN3 nodes; we observed a maximum coefficient of variation of 3.4% in the run time measurements. We analyzed the use cases from a system and an application perspective, by measuring the following (formalized in the equations after this list):

• Total run time: time to complete the workload, calculated as the last job's end time minus the first job's submission time.
• Response time: calculated as the sum of a job's wait time in the scheduler's queue and the job's execution time.
• Average response time: arithmetic mean of the response times of all the jobs in the workload.
• IPC: number of instructions completed per processor cycle by a specific thread.
• Cycles per microsecond: number of processor cycles per microsecond dedicated to a specific thread.
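In formula form, for a workload of n jobs, the three timing metrics above reduce to (our notation):

\[
T_{\mathrm{total}} = \max_{j}\, t^{\mathrm{end}}_{j} \;-\; \min_{j}\, t^{\mathrm{submit}}_{j},
\qquad
R_{j} = t^{\mathrm{wait}}_{j} + t^{\mathrm{exec}}_{j},
\qquad
\bar{R} = \frac{1}{n} \sum_{j=1}^{n} R_{j}
\]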
We obtained the system metrics from SLURM logs, and the application metrics by tracing the use cases with Extrae [25] and visualizing the traces with Paraver [34]. We compared the baseline and the DROM-enabled implementation by measuring run time on two exclusive MN3 nodes. We didn't find any visible overhead between them, so we can compare the two versions in our experiments.

Each of the use cases is evaluated for several different configurations regarding the number of MPI processes and OpenMP threads per MPI process, as summarized in Table 1. All applications ask for 2 nodes and distribute their MPI processes among them. We ran NEST and CoreNeuron with different configurations, and we observed increasing IPC when switching from Conf. 1 to Conf. 2. This is due to a different data access pattern and better data locality. We kept both configurations to check how the use cases perform. Regarding Pils, in Conf. 2 and Conf. 3 it requests and runs on only a part of the node resources, even if the node is free. Even though it is supposed to be a small application, we also run Pils in Conf. 1 to have a reference case in which the nodes are fully utilized, as a further case for comparison. Concerning STREAM, we don't need to change its configuration, as the application is memory bound and beyond two CPUs per node its performance stays constant. We will refer to the different configurations as App-name Conf. x, e.g. NEST Conf. 1 means NEST with 2 MPI processes and 16 OpenMP threads per process.

Application   Conf. 1 (MPI x OpenMP)   Conf. 2 (MPI x OpenMP)   Conf. 3 (MPI x OpenMP)
NEST          2 x 16                   4 x 8                    -
CoreNeuron    2 x 16                   4 x 8                    -
Pils          2 x 16                   2 x 1                    2 x 4
STREAM        2 x 2                    -                        -

Table 1: Use case application configurations.

6.1 Use Case 1: In Situ Analytics

[Figure 3: In situ analytics example. Over 16 cores, the Serial scenario runs the simulation (App 1, 4 MPI x 4 threads) alone and starts the analytics (App 2, 2 MPI x 2 threads) only after it ends; the DROM scenario co-allocates the analytics by shrinking the simulation between time points (b) and (c).]

In Figure 3 we can see a graphical representation of this use case. The horizontal axis represents time, while the vertical axis represents computational resources, i.e., the number of cores. At time (a) the simulation is launched and started on the available resources. At time (b) the analytics is submitted.

We compare two scenarios. The Serial one considers that the analytics must wait for the simulation to finish before it can start, at point (d), because no resources are available. Thus, the simulation runs using all the available cores, and only when it finishes is the analytics executed. The second scenario uses DROM to start the analytics immediately at time (b), reducing the number of resources assigned to the simulation. Once the analytics finishes, at point (c), the simulation gets its resources back.

We evaluated the following two-application workloads. We will use the notation simulation application + analytics application when naming the workloads, i.e. NEST + Pils, NEST + STREAM, CoreNeuron + Pils, CoreNeuron + STREAM. For this use case, we evaluate and analyze the total run time, the average response time, and the individual applications' response times, followed by a discussion of the differences between the Serial and DROM scenarios and the various configurations.

Figure 4 shows the difference in the workload's total run time when running the two versions of NEST in one node with Pils. The Y-axis represents total run time in seconds, while on the X-axis we have the different configurations of the applications involved. For both NEST configurations, the run time in the DROM case is on average 5.9% better than the Serial case for Pils Conf. 2 and Conf. 3, and comparable to the reference case Conf. 1. The average overhead of the DROM scenario over Pils Conf. 1 is 0.6%, varying with the analytics' configuration.

[Figure 4: Run time of the NEST + Pils workload. The Y-axis represents total run time in seconds, the X-axis shows the different configurations of the applications.]

We observed that this is because of the NEST implementation. Since its data is statically partitioned according to the maximum number of computational resources during initialization, as explained in Section 3, when applying malleability to shrink NEST, the tasks not computed by the removed thread are computed by some of the remaining resources, creating imbalance, as shown in Figure 5. This is a limitation of the application and not an overhead introduced by DROM. In fact, when increasing the number of stolen computing resources, as in the case of Pils Conf. 3, the number of excess tasks increases, and they are better distributed among the remaining resources. In this situation, we improve total run time by up to 2.5% with respect to Pils Conf. 1, while for Conf. 2 it can reach -2.6%. A fully malleable NEST version that doesn't partition data according to the initial number of threads would improve this result.

[Figure 5: Trace showing the simulator's threads on the Y-axis. When thread 16 is removed, its data is computed by the first 4 threads, while the others report lower utilization (white idle spaces).]

Figure 6 shows the individual applications' response times. Pils's response time, painted in a line pattern, decreases by up to 96%, due to the waiting time being reduced to zero, while its run time stays approximately the same. The cost of this reduction is a small increase in NEST's response time, varying from 0% to 4.2%. A 0% increase is due to increasing IPC when running on a reduced number of threads.

[Figure 6: Individual response time of NEST and Pils in the NEST + Pils workload.]

In Figure 7 we can see the workload's total run time and each application's response time for the NEST and STREAM use case. In this case, we remove 2 CPUs from the simulation to run a memory-intensive application, gaining on average 1.84% (up to 3.5%) in terms of total run time. STREAM's response time decreases by up to 92%, while NEST's increases by up to 6.7% in the worst case. The total run time is always better, because of the benefits of a memory-bound and a compute-bound application sharing the nodes.

[Figure 7: Run time (left) and response time (right) of the NEST + STREAM workload, varying the NEST configuration.]
Figure 8 shows the average response time. The gain in the DROM case is up to 48%, and never less than 37%, with respect to the Serial case.

[Figure 8: Average response time of the NEST workloads.]

Figures 9, 10, 11 and 12 show the same set of experiments, but with the CoreNeuron neuro-simulator. The results are very similar to the NEST workloads, as CoreNeuron also presents the same data partitioning problem. Figure 9 shows improved run time when comparing with Pils Conf. 2 and Conf. 3, and a maximum overhead of 5% compared to Pils Conf. 1. Compared to NEST, CoreNeuron shows slightly worse results when sharing with a compute-intensive analytics like Pils, even if it is less affected by the number of requested resources, showing a 2% variation versus 5% for NEST. In Figure 11 the total run time is always better than the Serial cases for the STREAM workloads (up to 8%); the response time decreases by up to 91%, while CoreNeuron's increase is 4% in the worst case. Compared to NEST, CoreNeuron performs slightly better when sharing the node with a memory-intensive application like STREAM, with an average run time gain of 5.3% vs 1.84% for NEST. The average response time in Figure 12 shows an average gain of 46.5% for the DROM scenario with respect to the Serial one.

[Figure 9: Execution time of the CoreNeuron + Pils workload.]

[Figure 10: Individual response time of CoreNeuron and Pils in the CoreNeuron + Pils workload.]

[Figure 11: Execution time (left) and response time (right) of the CoreNeuron + STREAM workload, varying the CoreNeuron configuration.]

[Figure 12: Average response time of the CoreNeuron workloads.]
6.2 Use Case 2: High-priority job

In this use case we analyze a single workload made up of two jobs: a long NEST and a long CoreNeuron simulation running on 2 MN3 nodes. Both jobs request Conf. 1, presented in Table 1.

Again, we compare a Serial scenario, in which the high-priority job can only start after the running job ends, and a DROM scenario, in which the same job starts immediately, by freeing some resources through the DROM interface.

Figure 13 presents traces for both scenarios. The X-axes represent time, with the same scale to compare total run time, while the Y-axes show the applications' threads. At time (a) NEST is submitted and runs on the entire two-node allocation. At time (b) CoreNeuron is submitted. In the top trace, representing the Serial scenario, CoreNeuron needs to wait for all the resources to be freed, starting at time (c). The bottom trace represents the DROM scenario, in which CoreNeuron starts at submission time, sharing the nodes with NEST. At time (d) NEST ends, freeing half of the available resources, and CoreNeuron expands its allocation to keep maximum node utilization. In the DROM scenario, as both applications ask for two entire nodes, SLURM applies the implemented automatic resource partitioning, reducing the resources used by both the new and the running job. Equipartition is applied, giving 16 CPUs per application out of a total of 32.

We present total run time and response time to discuss the system benefits of malleability for this use case, and application-related performance counters, like IPC and cycles per µs, to demonstrate that the applications do not interfere with each other when DROM is used for malleability.

[Figure 13: Traces showing cycles per µs of use case 2. The Serial scenario is presented on the top, DROM on the bottom.]

Looking at the total workload duration in Figure 13, in the case of DROM, better resource utilization leads to a total run time improvement of 2.5%. The same figure shows the cycles per µs using colors. The same color for both scenarios means there is no difference between the Serial and DROM scenarios, and a constant color during run time shows that there is no variation in this metric when applying malleability to expand and shrink applications. The green color at the beginning of the CoreNeuron simulation shows lower cycles in the memory-intensive initialization phase. In the Serial scenario, during initialization, all computational resources are underutilized, while in the DROM case NEST keeps running, increasing utilization and contributing to reducing the total workload run time.

Figure 14 shows the number of instructions per cycle for both configurations of use case 2. The plots are grouped by application to be easily comparable. The X-axes represent IPC in increasing order, the Y-axes the application's threads; blue dots show the most frequent IPC and represent the main information of the histograms. They demonstrate that the Serial and DROM scenarios are comparable in terms of IPC. Regarding NEST, we distinguish some noise in the Serial scenario, and two color variants for the most frequent IPC for DROM. This is due to the fact that the threads corresponding to the lighter color are removed to accommodate CoreNeuron at time (b) of Figure 13, distributing more computation onto the darker part of the graph. For CoreNeuron, the IPC in the Serial scenario is constant, but for DROM we can distinguish two blue zones, corresponding to the threads on which the application starts, reporting slightly higher IPC. This is due to higher parallel efficiency when running on a smaller number of OpenMP threads per MPI rank, improving total run time.

[Figure 14: Histogram of instructions per cycle for CoreNeuron and NEST running serially and with DROM.]

Finally, Figure 15 presents the average response time for this use case. The response time improves by 10% with respect to the Serial scenario, due to the gain in run time and because the high-priority job can start earlier, improving at the same time the user experience when the job is interactive, or giving earlier partial results when a simulation is able to start earlier.

[Figure 15: Average response time for the use case 2 workload. The DROM scenario improves response time by 10% with respect to the Serial scenario.]
7 CONCLUSIONS AND FUTURE WORK

In this paper, we presented an interface, named DROM, that allows exploiting malleability by creating communication between two, at the moment, unconnected parts of the HPC software stack that should work in a more connected and coordinated way. The presented API permits changing the computational resources allocated to a running application efficiently, without any overhead.

We implemented the proposed APIs within the DLB framework, designing the interface to be easily integrated into any programming model or directly into the application. We integrated them with different and widely used programming models, such as MPI and OpenMP. Additionally, we presented an integration of the API with the SLURM node manager, to achieve automatic distribution and placement of co-scheduled jobs inside nodes. We presented two use cases as a proof of concept, and we analyzed the results from the workload point of view and the application point of view. Our results show up to 48% improvement in average response time, and up to 8% in total run time, compared to the serial case.

With this study, we open future work in two directions. On one side, we want to expand the potential of DROM with new functionalities, like the collection of useful data from applications at run time. The collected information can be consulted by an external entity to obtain information about application performance and sent to the job scheduler to be taken into account for further scheduling decisions. On the other side, we want to tighten the communication between the different layers of the HPC software stack, i.e., by developing DROM-aware scheduling and resource management policies. The simplicity of the DROM APIs gives more freedom to the scheduler, which can implement malleable scheduling techniques, for instance by choosing one or multiple specific jobs to share computational nodes, or, at the resource management level, by choosing as "victim" nodes the ones with lower utilization. Combined with a job scheduler/resource manager, DROM can be used in many different ways, including implementing new scheduling policies based on malleability, e.g. policies based on co-scheduling, or as an alternative to job preemption.

Acknowledgments. This work is partially supported by the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), by the Spanish Ministry of Science and Technology through the TIN2015-65316-P project, by the Generalitat de Catalunya (contract 2017-SGR-1414) and by the European Union's Horizon 2020 under grant agreement No 785907 (HBP SGA2).

REFERENCES
[1] Intel OpenMP Runtime Library. 2016. https://www.openmprtl.org/
[2] PMPI profiling interface. 2016. https://www.open-mpi.org/faq/?category=perftools
[3] TR4: OpenMP Version 5.0 Preview 1. 2016. http://www.openmp.org/wp-content/uploads/openmp-tr4.pdf
[4] Malleable CoreNeuron source code. 2017. https://github.com/BlueBrain/CoreNeuron/tree/hbp_dlb
[5] Malleable NEST source code. 2017. https://github.com/mggasulla/nest-simulator/tree/malleability
[6] DLB-DROM source code. 2018. https://github.com/bsc-pm/dlb/
[7] The GNU C Library: CPU Affinity. 2018. https://www.gnu.org/software/libc/manual/html_node/CPU-Affinity.html
[8] A. B. Yoo, M. A. Jette, and M. Grondona. 2003. SLURM: Simple Linux Utility for Resource Management. In Job Scheduling Strategies for Parallel Processing. 44–60.
[9] R. H. Castain, D. Solt, J. Hursey, and A. Bouteiller. 2017. PMIx: Process Management for Exascale Environments. In Proceedings of the 24th European MPI Users' Group Meeting (EuroMPI '17). ACM, New York, NY, USA.
[10] Barcelona Supercomputing Center. 2014. MareNostrum 3. https://www.bsc.es/marenostrum/marenostrum/mn3
[11] I. Comprés, A. Mo-Hellenbrand, M. Gerndt, and H. Bungartz. 2016. Infrastructure and API Extensions for Elastic Execution of MPI Applications. In Proceedings of the 23rd European MPI Users' Group Meeting (EuroMPI 2016). 82–97.
[12] A. Duran, E. Ayguadé, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and J. Planas. 2011. OmpSs: a Proposal for Programming Heterogeneous Multi-Core Architectures. Parallel Processing Letters 21 (2011).
[13] J. Turek et al. 1994. Scheduling Parallelizable Tasks to Minimize Average Response Time. In Proceedings of the Sixth Annual ACM Symposium on Parallel Algorithms and Architectures. 200–209.
[14] M. Cera et al. 2010. Supporting Malleability in Parallel Architectures with Dynamic CPUSETs Mapping and Dynamic MPI. In Proceedings of the 11th International Conference on Distributed Computing and Networking. 242–257.
[15] S. Prabhakaran et al. 2015. A Batch System with Efficient Adaptive Scheduling for Malleable and Evolving Applications. In 2015 IEEE International Parallel and Distributed Processing Symposium. 429–438.
[16] Dror G. Feitelson and Larry Rudolph. 1996. Toward convergence in job schedulers for parallel supercomputers. In Job Scheduling Strategies for Parallel Processing. Berlin, Heidelberg, 1–26.
[17] M. Garcia, J. Corbalan, and J. Labarta. 2009. LeWI: A Runtime Balancing Algorithm for Nested Parallelism. In International Conference on Parallel Processing.
[18] M. Garcia, J. Labarta, and J. Corbalan. 2014. Hints to improve automatic load balancing with LeWI for hybrid applications. J. Parallel and Distrib. Comput. (2014).
[19] Abhishek Gupta, Bilge Acun, Osman Sarood, and Laxmikant V. Kale. 2014. Towards Realizing the Potential of Malleable Parallel Jobs. In Proceedings of the IEEE International Conference on High Performance Computing (HiPC '14). Goa, India.
[20] C. Huang, O. Lawlor, and L. V. Kalé. 2003. Adaptive MPI. In Proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing.
[21] J. Hungershofer. 2004. On the combined scheduling of malleable and rigid jobs. In 16th Symposium on Computer Architecture and High Performance Computing.
[22] IBM. 2014. Platform LSF. https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.2/lsf_welcome.html
[23] P. Kumbhar and M. Hines. 2016. CoreNeuron Neuronal Network Simulator Optimization Opportunities and Early Experience. In GPU Technology Conference.
[24] S. Kunkel et al. 2017. NEST 2.12.0. https://doi.org/10.5281/zenodo.259534
[25] G. Llort, H. Servat, J. González, J. Giménez, and J. Labarta. 2013. On the usefulness of object tracking techniques in performance analysis. In 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[26] V. Lopez, A. Jokanovic, M. D'Amico, M. Garcia, R. Sirvent, and J. Corbalan. 2017. DJSB: Dynamic Job Scheduling Benchmark. In Job Scheduling Strategies for Parallel Processing: 21st International Workshop, JSSPP 2017, Orlando, FL, USA, June 2, 2017, Revised Selected Papers.
[27] Walter Ludwig and Prasoon Tiwari. 1994. Scheduling Malleable and Nonmalleable Parallel Tasks. In SODA, Vol. 94. 167–176.
[28] K. El Maghraoui, T. J. Desell, B. K. Szymanski, and C. A. Varela. 2007. Dynamic Malleability in Iterative MPI Applications. In Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07). 591–598.
[29] Gonzalo Martín, David E. Singh, Maria-Cristina Marinescu, and Jesús Carretero. 2015. Enhancing the Performance of Malleable MPI Applications by Using Performance-aware Dynamic Reconfiguration. Parallel Comput. 46 (July 2015).
[30] John D. McCalpin. 1995. Memory Bandwidth and Machine Balance in Current High Performance Computers. Technical Committee on Computer Architecture (1995).
[31] Message Passing Interface Forum. 2015. MPI Specifications 3.1. http://mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf
[32] G. Mounie, C. Rapine, and D. Trystram. 1999. Efficient Approximation Algorithms for Scheduling Malleable Tasks. In Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '99). 23–32.
[33] OpenMP. OpenMP 4.5 Specifications. http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf (accessed 27/4/2018).
[34] Vincent Pillet, Jesús Labarta, Toni Cortes, and Sergi Girona. 1995. Paraver: A tool to visualize and analyze parallel code. In Proceedings of WoTUG-18: Transputer and Occam Developments, Vol. 44. IOS Press, 17–31.
[35] Human Brain Project. 2017. https://www.humanbrainproject.eu/en/
[36] G. Utrera, J. Corbalan, and J. Labarta. 2004. Implementing Malleability on MPI Jobs. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques.