Programming Model Examples ECMWF
OVERVIEW
An application can be built to use multiple processors (cores, CPUs, nodes) in various ways, and the High Performance Computing (HPC) community in particular has long experience of using multithreaded and multi-process programming models to develop fast, scalable applications on a range of hardware platforms. Typical APIs in use are OpenMP (source threading markup) and MPI (message passing). The examples provided and described here can be used to compare these models when applied to a very simple problem, and as a platform for learning how to build and launch parallel applications on the Cray XC supercomputer.
The demonstrator problem, a calculation of pi, was chosen for its simplicity and lack of subtle issues across the range of programming models.
CODE EXAMPLES PROVIDED
The algorithm previously mentioned was coded using various programming models. You will
need to unpack Cray_pm_examples.tar or access a location provided to you. Once unpacked you
will find a directory with a README, sample job scripts and subdirectories C and Fortran which
contain source for multiple versions of the pi program. The following versions are available:
Note that the pthreads version uses OMP_NUM_THREADS to pick up the number of threads to
use so that it can be run from a job set up to run an OpenMP application.
Note in particular that no attempt has been made to optimize these examples, as the goal was to provide the simplest possible code to illustrate the programming models. In some cases idioms are used that illustrate a feature of the programming model even if you would not normally code that way (for example, the SHMEM version uses an atomic update instead of a reduction).
In each language directory (Fortran and C) there is a Makefile that will build most of the examples with the Cray CCE compiler. The SHMEM and OpenACC examples need specific modules to be loaded before they will compile. In addition, a more complex version of the source is supplied (Fortran_timing and C_timing) with timer code added; this is more useful for scaling studies as the internal timing measures the elapsed time of the computation only.
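For example, before building the SHMEM version you would typically load the SHMEM library module first (the exact module name, such as cray-shmem below, depends on your installation):

module load cray-shmem
make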
Job scripts are provided in the jobscripts directory, which you can use as templates to create your own scripts. They work with one particular installation of PBS, and the PBS headers and directory definition may need to be changed for your environment. These scripts are…
The following sections contain more information about the programming environment, which you can use if you need it, along with some very basic information about the programming models covered.
Have a look over those sections, or alternatively just dive in, working as suggested by your instructor or guided by your own interests.
MULTITHREADED MODELS
The model familiar to a C systems programmer would be pthreads, but this has little traction in the HPC community because threads must be created and managed explicitly and because it is a C-only API. More prevalent is the approach of adding compiler directives that advise the compiler how to map computation onto a set of threads. OpenMP is the example we choose, and only a single OpenMP directive/pragma is needed, which parallelizes the outer I loop.
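As a rough illustration only (the supplied sources in the C and Fortran directories are the reference, and the exact algorithm and variable names there may differ), a pi program that counts grid points inside a quarter circle could be parallelized with a single OpenMP directive like this:

#include <stdio.h>

int main(void)
{
    const long n = 10000;      /* grid resolution; illustrative value */
    long count = 0;
    long i, j;

    /* The single OpenMP directive: parallelize the outer i loop and
       combine each thread's partial count with a reduction.          */
    #pragma omp parallel for private(j) reduction(+:count)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            if (i * i + j * j < n * n)    /* inside the quarter circle? */
                count++;

    printf("pi is approximately %f\n", 4.0 * (double)count / ((double)n * n));
    return 0;
}

The same loop structure appears in the other versions; only the way the work is divided and the way the partial counts are combined changes.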
MESSAGE PASSING MODELS
If you look at the examples, they follow the same pattern. The API is used to determine the number of processes/ranks/PEs (npes, size) and which one is executing (mype, rank). The API is also used to coordinate the processes (barriers and synchronization) and to add up the contributions to the final count from each process.
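A minimal sketch of that pattern in MPI is shown below; it is illustrative only and is not the supplied pi_mpi source, whose details may differ:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const long n = 10000;                   /* grid resolution; illustrative */
    int npes, mype;
    long count = 0, total = 0;
    long i, j;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);   /* how many ranks in total */
    MPI_Comm_rank(MPI_COMM_WORLD, &mype);   /* which rank am I         */

    /* Each rank handles a strided subset of the outer i loop. */
    for (i = mype; i < n; i += npes)
        for (j = 0; j < n; j++)
            if (i * i + j * j < n * n)
                count++;

    /* Combine the per-rank counts on rank 0. */
    MPI_Reduce(&count, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (mype == 0)
        printf("pi is approximately %f\n", 4.0 * (double)total / ((double)n * n));

    MPI_Finalize();
    return 0;
}

Here MPI_Comm_size and MPI_Comm_rank provide npes and mype, the strided loop divides the work, and MPI_Reduce adds up the per-rank counts.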
PGAS MODELS
These models are more complex and are typically languages that support direct access to both local and remote data (this is why we put SHMEM in the previous section: it is an explicit API rather than a language).
Fortran coarrays are the simplest; UPC allows data distribution across “threads” and provides special syntax to operate collectively on distributed (or shared) variables; and Chapel is the most fully featured, with advanced support for a range of parallel programming paradigms.
THE PROGRAMMING ENVIRONMENT
Only one of the available tool chains, or Programming Environments (PrgEnvs), can be active at any one time; to see the list of currently loaded modules type:
module list
A list will be displayed on screen, one of which will be of the form PrgEnv-*, e.g.
20) PrgEnv-cray/5.2.14
To see which other modules are available to load run the command
module avail
On most systems this may be a very long list; to be more selective, run:
module avail PrgEnv
which will list all available modules whose names start with PrgEnv.
To swap from the current PrgEnv to another PrgEnv (e.g. to select a different tool chain), run:
module swap PrgEnv-cray PrgEnv-intel
This will unload the Cray PrgEnv and replace it with the Intel PrgEnv.
COMPILING AN APPLICATION
To build applications to run on the XC30, users should always use the Cray compiler driver wrappers, ftn (Fortran), cc (C) and CC (C++), to compile their applications. These wrappers interface with the Cray Programming Environment and make sure the correct libraries and flags are supplied to the underlying compiler.
When compiling an application make sure that the appropriate modules are loaded before
starting to compile, then modify the build system to call the appropriate wrapper scripts in
place of the direct compilers.
If you wish to try other compilers you may need to add compiler flags, for example to enable
OpenMP.
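For example, assuming the C sources are called pi_mpi.c and pi_openmp.c (the actual file names in the tar file may differ), the wrappers would be invoked roughly as follows:

cc -o pi_mpi pi_mpi.c                  # MPI headers and libraries added by the wrapper
cc -o pi_openmp pi_openmp.c            # Cray compiler: OpenMP typically enabled by default
cc -fopenmp -o pi_openmp pi_openmp.c   # e.g. under PrgEnv-gnu an OpenMP flag is needed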
RUNNING AN APPLICATION
Example scripts for the PBS Pro (version 12+) batch scheduler, plus very simple ECMWF job directives, are provided in the example directory:
./jobscripts/
These scripts include special #PBS comments which are used to provide information to the batch system about the job resource requirements, e.g.
#PBS -N pi
gives the job a name, and
#PBS -l EC_nodes=1
specifies the number of nodes that are required (in this case just one node).
#PBS -l walltime=0:05:00
specifies the maximum wall clock time the job may take (here no longer than five minutes).
Some systems, like ECMWF, may require additional queue information, e.g.
#PBS -q np
Jobs are then submitted for execution using the qsub command:
qsub jobscripts/pi.job
which will return a job id that uniquely identifies the job. Any of the #PBS settings contained
within a file can be overridden at submission time, e.g.
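For instance, to request a longer wall clock limit for a single submission without editing the script, something like the following could be used (the option chosen here is just an example):

qsub -l walltime=0:10:00 jobscripts/pi.job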
To view the status of all the jobs in the system use the qstat command. To find information about a specific job (via its jobid) or about all jobs belonging to a particular user, run:
qstat <jobid.sdb>
qstat -u <username>
By default, when a run has completed, the output from the run is left in the run directory in the
form:
<jobname>.o<jobid> for stdout
<jobname>.e<jobid> for stderr
Commands in the script that are not launched with aprun, for example
cd ${PBS_O_WORKDIR}
(which changes the current working directory to the directory from which the original qsub command was executed), are actually run in serial on the PBS MOM node.
Only programs/commands that are preceded by an aprun statement will run in parallel on the
XC30 compute nodes, e.g.
aprun -n 16 -N 16 ./pi_mpi
Therefore users should take care that long or computationally intensive jobs are only ever run
inside an appropriate aprun command.
Each batch job may contain multiple programs with aprun statements, as long as:
- an individual parallel program does not request more resources than have been allocated by the batch scheduler, and
- the program does not exceed the wall clock time limit specified in the submission script.
See the lecture notes and the manual page (man aprun) for a full explanation of the command line arguments to aprun.
For a typical OpenMP (or hybrid MPI+OpenMP) application the script should include settings like the following:
export OMP_NUM_THREADS=4
aprun -n ${NPROC} -N ${NTASK} -d ${OMP_NUM_THREADS} ./pi_mpi_openmp
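Putting the pieces together, a minimal hybrid MPI+OpenMP job script might look like the sketch below. The queue name, rank and thread counts and executable name are illustrative (the comments assume a 24-core Ivybridge node) and should be adapted to your system:

#!/bin/bash
#PBS -N pi
#PBS -q np                    # site-specific queue name
#PBS -l EC_nodes=1            # one node (ECMWF resource directive)
#PBS -l walltime=0:05:00      # at most five minutes

cd ${PBS_O_WORKDIR}           # runs in serial on the PBS MOM node

export OMP_NUM_THREADS=4
# -n total ranks, -N ranks per node, -d cpus (threads) per rank
aprun -n 6 -N 6 -d ${OMP_NUM_THREADS} ./pi_mpi_openmp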