Introductory Supercomputing
Course Objectives
• Understand basic parallel computing concepts and
workflows
• Understand the high-level architecture of a
supercomputer
• Use a basic workflow to submit a job and monitor it
• Understand a resource request and know what to
expect
• Check basic usage and accounting information
WORKFLOWS
eResearch Workflows
Research enabled by computers is called eResearch.
eResearch examples:
– Simulations (e.g. prediction, testing theories)
– Data analysis (e.g. pattern matching, statistical analysis)
– Visualisation (e.g. interpretation, quality assurance)
– Combining datasets (e.g. data mashups, data mining)
Collect
• Kilobytes of parameters, or petabytes of data from a telescope or atomic collider
Stage
• Data transfer from manual transfer of small files, to custom data management frameworks for large
projects
Process
• Single large parallel job with check pointing, or many smaller parallel or serial jobs
Visualise
• Examining values, generating images or movies, or interactive 3D projections
Archive
• Storage of published results, or publicly accessible research datasets
The Pawsey Centre
• Provides multiple components of the workflow in one location
• Capacity to support significant computational and data requirements
[Diagram: Pawsey Centre components: vis cluster, display, archive, databases]
Parallelism in Workflows
Exploiting parallelism in a workflow allows us to
• get results faster, and
• break up big problems into manageable sizes.
[Diagram: baking analogy for staging data: supermarket, kitchen pantry, kitchen bench]
Baking a Cake - Process
[Diagram: serial vs. parallel ordering of baking steps, such as mixing the icing while the cake bakes, then icing the cake]
Baking a Cake - Finished
Either:
• Sample it for quality.
• Put it in a shop for others to browse and buy.
• Store it for later.
• Eat it!
Clean up!
Then make next cake.
Levels of Parallelism
Coarse-grained parallelism (high level)
• Different people baking cakes in their own kitchens.
• Preheating oven while mixing ingredients.
Greater autonomy, can scale to large problems and many
helpers.
High Throughput:
• For many cakes, get many people to bake
independently in their own kitchens – minimal
coordination.
• Turn it into a production line. Use specialists and
teams in some parts.
SUPERCOMPUTERS
Abstract Supercomputer
• Login nodes
• Data movers
• Scheduler
• Compute nodes
Data Movers
• Data movers are shared, but most users will not notice.
• Performance depends on the other end, distance, encryption algorithms, and other concurrent transfers.
• Data movers see all the global filesystems.
• Hostname: hpc-data.pawsey.org.au
Scheduler
The scheduler feeds jobs into the compute nodes. It has a queue
of jobs and constantly optimizes how to get them all done.
PAWSEY SYSTEMS
Pawsey Supercomputers
Name    System       Nodes  Cores/node  RAM/node   Notes
Magnus  Cray XC40    1488   24          64 GB      Aries interconnect
Galaxy  Cray XC30    472    20          64 GB      Aries interconnect, +64 K20X GPU nodes
Zeus    HPE Cluster  90     28          98-128 GB  +20 vis nodes, +15 K20/K40/K20X GPU nodes, +11 P100 GPU nodes, +80 KNL nodes, +1 6TB node
LOGGING IN
Command line SSH
Within a terminal window, type:
ssh [email protected]
https://support.pawsey.org.au/documentation/display/US/Logging+in+with+SSH+keys
For graphical applications, enable X11 forwarding:
ssh -X [email protected]
Common login problems
• Forgot password
• Self service reset
https://support.pawsey.org.au/password-reset/
• Scheduled maintenance
• Check your email or
https://support.pawsey.org.au/documentation/display/US/Maintenance+and+Incidents
• Blacklisted IP due to too many failed login attempts
• This is a security precaution.
• Email the helpdesk with your username and the
machine you are attempting to log in to.
Exercise: Log in
Log in to Zeus via ssh:
ssh [email protected]
Use git to download the exercise material:
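The repository address is not reproduced in these notes; substitute the URL given in the course material:

git clone <exercise-repository-url>
cd <exercise-directory>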
Modules
Software at Pawsey is provided through environment modules. Applications that are widely used by a large number of groups may also be provided.
Most modules depend on an architecture and compiler, so that one module works for multiple combinations.
On Zeus and Zythos, switch to the desired architecture and compiler first:
module swap sandybridge broadwell
module swap gcc intel
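To check which modules are loaded and which are available, the standard module commands can be used:

module list     # modules currently loaded
module avail    # modules available to load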
In the Pawsey Centre there are multiple SLURM clusters, each with
multiple partitions.
• The clusters approximately map to systems (e.g. magnus, galaxy, zeus).
• You can submit a job to a partition in one cluster from another cluster.
• This is useful for pre-processing, post-processing or staging data.
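For example, while logged in to Magnus, a data-staging job could be submitted to the Zeus copyq (the script name here is a placeholder):

sbatch -M zeus -p copyq stage_data.slurm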
Querying SLURM Partitions
To list the partitions when logged into a machine:
sinfo
For example:
username@magnus-1:~> sinfo
PARTITION AVAIL JOB_SIZE TIMELIMIT CPUS S:C:T NODES STATE NODELIST
workq* up 1-1366 1-00:00:00 24 2:12:1 2 idle* nid00[543,840]
workq* up 1-1366 1-00:00:00 24 2:12:1 1 down* nid00694
workq* up 1-1366 1-00:00:00 24 2:12:1 12 reserved nid000[16-27]
workq* up 1-1366 1-00:00:00 24 2:12:1 1457 allocated nid0[0028-0063, …
workq* up 1-1366 1-00:00:00 24 2:12:1 8 idle nid0[0193, …
debugq up 1-6 1:00:00 24 2:12:1 4 allocated nid000[08-11]
debugq up 1-6 1:00:00 24 2:12:1 4 idle nid000[12-15]
Pawsey Partitions
It is important to use the correct system and partition for each part of a workflow:
System Partition Purpose
Magnus workq Large distributed memory (MPI) jobs
Magnus debugq Debugging and compiling on Magnus
Galaxy workq ASKAP operations astronomy jobs
Galaxy gpuq MWA operations astronomy jobs
Zeus workq Serial or shared memory (OpenMP) jobs
Zeus zythos Large shared memory (OpenMP) jobs
Zeus debugq Debugging and development jobs
Zeus gpuq, gpuq-dev GPU-accelerated jobs
Zeus knlq, knlq-dev Knights Landing evaluation jobs
Zeus visq Remote visualisation jobs
Zeus copyq Data transfer jobs
Zeus askap ASKAP data transfer jobs
Querying the Queue
squeue displays the status of jobs in the local cluster
squeue
squeue -u username
squeue -p debugq
charris@zeus-1:~> squeue
JOBID USER ACCOUNT PARTITION NAME EXEC_HOST ST REASON START_TIME END_TIME TIME_LEFT NODES PRIORITY
2358518 jzhao pawsey0149 zythos SNP_call_zytho zythos R None Ystday 11:56 Thu 11:56 3-01:37:07 1 1016
2358785 askapops askap copyq tar-5182 hpc-data3 R None 09:20:35 Wed 09:20 1-23:01:09 1 3332
2358782 askapops askap copyq tar-5181 hpc-data2 R None 09:05:13 Wed 09:05 1-22:45:47 1 3343
2355496 pbranson pawsey0106 gpuq piv_RUN19_PROD n/a PD Priority Tomorr 01:53 Wed 01:53 1-00:00:00 2 1349
2355495 pbranson pawsey0106 gpuq piv_RUN19_PROD n/a PD Resources Tomorr 01:52 Wed 01:52 1-00:00:00 4 1356
2358214 yyuan pawsey0149 workq runGet_FQ n/a PD Priority 20:19:00 Tomorr 20:19 1-00:00:00 1 1125
2358033 yyuan pawsey0149 gpuq 4B_2 n/a PD AssocMaxJo N/A N/A 1-00:00:00 1 1140
2358709 pbranson pawsey0106 workq backup_RUN19_P n/a PD Dependency N/A N/A 1-00:00:00 1 1005
A job request specifies (1) the resources needed and (2) what to run.
• You cannot submit an application directly to SLURM.
Instead, SLURM executes on your behalf a list of shell
commands.
• In batch mode, SLURM executes a jobscript which
contains the commands.
• In interactive mode, type in commands just like when
you log in.
• These commands can include launching programs onto
the compute nodes assigned for the job.
Jobscripts
A jobscript is a bash or csh script.
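A minimal sketch is shown below (the job name and program are placeholders; complete examples follow later in this section):

#!/bin/bash -l
#SBATCH --job-name=myjob
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --export=NONE

# commands to run on the allocated resources
srun -n 1 ./myprogram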
To use a reservation:
sbatch --reservation=reservation-name myscript
Or in your jobscript:
#SBATCH --reservation=reservation-name
SLURM Output
Standard output and standard error from your jobscript
are collected by SLURM, and written to a file in the
directory you submitted the job from when the job
finishes/dies.
slurm-jobid.out
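The default file name can be changed with the standard sbatch output options, for example:

#SBATCH --output=myjob-%j.out    # %j expands to the job id
#SBATCH --error=myjob-%j.err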
Exercise: Hostname
Launch the job with sbatch:
cd hostname
sbatch --reservation=courseq hostname.slurm
# load modules
module load python/3.6.3
The program can be compiled and the script can be submitted to the scheduler with:
cd hello-openmp
make
sbatch hello-openmp.slurm
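The hello-openmp.slurm script itself is not reproduced in these notes; a rough sketch, assuming one process with 28 OpenMP threads on a Zeus node, is:

#!/bin/bash -l
#SBATCH --job-name=hello-openmp
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=28
#SBATCH --time=00:05:00
#SBATCH --export=NONE

# one process using all 28 cores of the node as OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun -n 1 -c $OMP_NUM_THREADS ./hello-openmp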
MPI example
This will run 28 MPI processes on 1 node on Zeus:
#!/bin/bash -l
#SBATCH --job-name=hello-mpi
#SBATCH --nodes=1
#SBATCH --tasks-per-node=28
#SBATCH --cpus-per-task=1
#SBATCH --time=00:05:00
#SBATCH --export=NONE
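The directives above only request the resources; the body of the jobscript then launches the program, roughly as follows (the executable name hello-mpi is an assumption based on the job name):

# launch 28 MPI processes on the allocated node
srun -n 28 ./hello-mpi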
Interactive allocations with salloc can be used, for example, to compile on a compute node:
salloc --tasks=16 --time=00:10:00
srun make -j 16
Interactive Jobs
If there are no free nodes, you may need to wait while the job is in the queue.
For small interactive jobs on Magnus use the debugq to wait less.
salloc --tasks=1 --time=10:00 -p debugq
Exercise: Interactive session
Run hello-serial.py interactively on a Zeus compute node.
Start an interactive session (you may need to wait while it is in the queue):
salloc --reservation=courseq --tasks=1 --time=00:10:00
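Once the allocation starts, load the Python module seen earlier and launch the script on the compute node (assuming hello-serial.py is in the current exercise directory):

module load python/3.6.3
srun -n 1 python hello-serial.py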
ACCOUNTING
Merit allocations are divided evenly between the four quarters of the year, to avoid end-of-year congestion. Priorities reset at the start of each quarter for merit allocations.
Director share allocations are typically awarded for up to 12 months, or until the time is consumed, and do not reset automatically.
The pawseyAccountBalance tool shows the allocation and usage for a project, for example:
charris@magnus-2:~> pawseyAccountBalance -p pawsey0001 -u
Compute Information
-------------------
Project ID Allocation Usage % used
---------- ---------- ----- ------
pawsey0001 250000 124170 49.7
--mcheeseman 119573 47.8
--mshaikh 2385 1.0
--maali 1109 0.4
--bskjerven 552 0.2
--ddeeptimahanti 292 0.1
Job Information
The sacct tool provides high-level information on the jobs that have been run:
sacct
For example:
charris@magnus-1:~> sacct -a -A pawsey0001 -S 2017-12-01 -E 2017-12-02 -X
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2461157 bash debugq pawsey0001 24 COMPLETED 0:0
2461543 bubble512 debugq pawsey0001 24 FAILED 1:0
2461932 bash workq pawsey0001 24 FAILED 2:0
2462029 bash workq pawsey0001 24 FAILED 127:0
2462472 bash debugq pawsey0001 24 COMPLETED 0:0
2462527 jobscript+ workq pawsey0001 960 COMPLETED 0:0
DATA TRANSFER
Data Transfer Nodes
• All transfers handled via secure copies
• scp, rsync, etc.
Supercomputer            Hostname
Magnus / Zeus / Zythos   hpc-data.pawsey.org.au
Galaxy                   hpc-data.pawsey.org.au
Data Transfer Nodes
For file transfers, run scp from the remote system using the data transfer nodes.
There are many scp clients programs with graphical interfaces, such as MobaXTerm,
FileZilla, and WinSCP, particularly for Windows users.
Ensure username and passwords are correct for these programs, as they can
automatically retry and trigger IP blocking. This setting should be disabled in the
software.
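For example, pushing a directory of results from a local machine (the username and destination path are placeholders):

scp -r my_results username@hpc-data.pawsey.org.au:/group/projectname/username/

# rsync can resume interrupted transfers and skip files that are already up to date
rsync -av my_results username@hpc-data.pawsey.org.au:/group/projectname/username/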
copyq
• Batch job access to data via the data transfer nodes (a jobscript sketch follows below)
• Serial job
• No srun needed
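A sketch of a copyq jobscript, assuming placeholder project and path names:

#!/bin/bash --login
#SBATCH --job-name=stage-data
#SBATCH --partition=copyq
#SBATCH --clusters=zeus
#SBATCH --ntasks=1
#SBATCH --time=06:00:00
#SBATCH --export=NONE

# serial transfer job: no srun needed
rsync -av /scratch/projectname/username/results/ /group/projectname/username/results/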
Exercise: Backing Up
Practice data transfer by backing up the course material.
Exit the session, then use scp to copy via a data transfer node:
exit
scp -r cou###@hpc-data.pawsey.org.au:/group/courses01/cou### $PWD
Cost
Amount of resources used.
Cost = walltime * nodes.
E.g. 12 hours * 100 nodes = 1,200 node hours.
(= 28,800 core hours on a 24 cores per node system)
• This is also what you have prevented other people
from being able to use!!
Real-world Example
NWChem timings on Galaxy: C60 molecule heat of formation using double-hybrid DFT.
Collaborative project w/ Amir Karton, UWA.

Cores  Nodes  Walltime (hours)  Cost (node hours)
128    8      11.2              90
256    16     4.4               70
512    32     2.1               67
4096   256    0.75              192                Fastest, and highest cost

Doubling the cores roughly halves the walltime; going from 128 to 256 cores even reduced the cost (2x the cores for ~0.8x the cost).
So How Many Cores to Use?
Fast turnaround:
• Weigh up between turnaround time and cost
• Total time = runtime + queue time

High throughput:
• Use an efficient core count (usually low)
• Each job may run longer
• Run many jobs
Tackle Larger Problems
• For many workloads the parallel portion expands
faster than the serial portion when the problem size
is increased
• Humans tend to accept a certain delay for answers
• Many computational workloads can expand to
consume more compute power if provided
Getting Help
When contacting the helpdesk, include for example:
• Which resource
• Error messages
• Location of files
• SLURM job id
• Your username and IP address if having login issues
• Never tell us (or anyone) your password!
Applications Support Team
Team expertise:
• Access to Pawsey
facilities
• High-performance
computing
• Parallel programming
• Computational science
• GPU accelerators
• Cloud computing
• Scientific visualisation
• Data-intensive computing
• Filesystem performance
• Compiling code
• Scientific libraries
• Advanced job launching
• SLURM workflows
• Job arrays