Introduction To The Gadi Supercomputer
respects to the Ngunnawal and Ngambri people of the Canberra region and to all First
Nations Australians on whose traditional lands we meet and work, and whose cultures
nci.org.au 2
Agenda
• Introduction
• Account
• Login
• Storage and Data Transfer
• Applications
• Jobs
• Q&A
About NCI
Data centre
Gadi Artwork
Artist: Lynnice Letty Church – Tribes: Ngunnawal, Wiradjuri & Kamilaroi (ACT and NSW)
Gadi - "to search for" in Ngunnawal language - January 2020 for NCI Gadi Supercomputer
Interfaces to Gadi
[Diagram: interfaces to Gadi, with ssh and NIRIN among the labelled components]
Gadi Specifications
• 640 NVIDIA V100 GPUs in 160 nodes, 2 NVIDIA DGX A100 nodes
• 22 PiB of high speed scratch storage with max IO speeds of 490 GiB/s
• 15.14 (peak) / 9.26 (sustained) Pflop system ranked 24th fastest in the world on debut in 2020
(currently #83) – https://www.top500.org/lists/top500/2023/11/
External systems
• Global data filesystems (gdata)
• A collection of Lustre parallel filesystem blocks for storing large data files over longer periods
• Massdata
• 70 Petabytes of archival project data in state-of-the-art magnetic tape libraries
myNCI
NCI Account
• Account is for a lifetime
• Recertify once a year. This includes changing your password and accepting the Conditions of Use
agreement. A reminder email is sent to the registered email address one month before the
"Recertification due date"
• If not recertified in time, the account is suspended for 120 days. After that it is
deactivated
Login
ssh to Gadi
• ssh
• Mac: XQuartz
Login Environment
• Round-robin login
• Message of the day (motd)
• Account status information
• Environment check : whoami, hostname, default shell, gadi project, home dir
• Linux commands quick reference : pwd, ls, cd, mkdir, cp, mv, cat, less, vim, man, etc.
• Access restrictions
• …
Storage and Data Transfer
Storage areas
Filesystems | Path | Critical Info
home | /home/institutionCode/username, e.g. /home/777/jjj777 | Personal space; backed up
mass data | mdss -P c25 ls -l | Tape based backup system for archiving large data files of a project; need to use mdss utility to access dirs in massdata store
applications | /apps/software/version, e.g. /apps/python3/3.10.4 | Centrally installed software applications and their module files; read-only access
Storage areas
Filesystems | Data Ownership | Allocations | iNode Limit
scratch | Project | 1TiB default; managed by NCI; more space allocated if reasonable justification is provided | Limited
global data | Project | Managed by sponsoring scheme/institution; for more space discuss with project CI / scheme manager | Limited
applications | NCI | |
Storage utilities
Util | Information
lquota | Provides quota and usage for all connected project spaces on scratch and/or gdata filesystem. Example: lquota
nci-files-report | Gives the data footprint of a project's data on scratch and/or gdata. Example: nci-files-report -p c25 -f scratch
Data transfer utilities
• Filezilla, WinSCP
• …
Applications
Applications
• An application that is not in the central repository can be downloaded and installed in your
home or project directory
• NCI recommends Intel compilers and OpenMPI to compile and run applications
Applications: Modules
• module load
• modifies search/exec path
• loads dependencies
• handles conflicts
• configures environment to define how the application runs
• Do:
• Always start working in a clean environment
• Always load specific version of application
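One way to follow both recommendations in a script is sketched below. The helper name load_pinned is hypothetical and not an NCI tool. It assumes only the standard module command, and where that command is unavailable (for example on a laptop) it prints what it would have run. openmpi/4.1.3 is used purely as an example of a versioned module name.

```shell
# Hypothetical helper (not an NCI tool): load a module only when a version is pinned.
load_pinned() {
    case "$1" in
        ?*/?*) ;;   # looks like name/version, e.g. openmpi/4.1.3
        *) echo "refusing to load '$1': no version given" >&2
           return 1 ;;
    esac
    if command -v module >/dev/null 2>&1; then
        module load "$1"
    else
        # No module command here: just show what would run.
        echo "would run: module load $1"
    fi
}

# Start from a clean environment when the module command is available.
if command -v module >/dev/null 2>&1; then
    module purge
fi
load_pinned openmpi/4.1.3
```

Rejecting a bare name such as openmpi means the script fails early instead of silently picking up whatever the default version happens to be.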
Applications: License module and software group
• Restricted modules available to specific group of users
• Software groups control access to license modules
• Example: matlab, ansys
• License modules tell the application where to check out a license
• Software groups control access to applications
• Example: vasp
Data transfer example
#!/bin/bash
#PBS -P c25
#PBS -q copyq
#PBS -l ncpus=1
#PBS -l mem=4GB
#PBS -l walltime=00:30:00
#PBS -l storage=gdata/c25
#PBS -l wd
export SOURCEDIR=/g/data/c25/jjj777/archive
export DSTDIR=/scratch/c25/jjj777/test
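The script above stops after defining the source and destination directories. A minimal sketch of a transfer step follows, assuming a plain recursive copy is sufficient. The function name copy_tree is hypothetical, and the original slide may well have used a different tool.

```shell
# Hypothetical helper: copy the contents of one directory tree into another,
# preserving permissions and timestamps.
copy_tree() {
    src=$1
    dst=$2
    if [ ! -d "$src" ]; then
        echo "source directory not found: $src" >&2
        return 1
    fi
    # "$src"/. copies the contents of src, not src itself, into dst.
    mkdir -p "$dst" && cp -Rp "$src"/. "$dst"/
}

# In the job script this would be called as:
#   copy_tree "$SOURCEDIR" "$DSTDIR"
```

Checking the source directory first makes the job fail with a clear message in the error file rather than partway through the copy.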
PBS commands
• qcat is useful for viewing a job's error and output files while the job is running
Compute resource
• To run a job, a project needs a compute allocation, i.e. service units (SU)
• PBS will calculate and reserve the total number of SUs required to run your job:
SUs = charging rate in SU × number of CPUs (or MemUnits) × walltime
• Once compute allocations are exhausted, jobs are held in the queue until the project gets more SU
• Compute allocations are usually made on a quarterly basis, but can be increased, decreased or transferred to another project (under
the same stakeholder) at any time during the quarter:
• Discuss with the project chief investigator (CI) and/or the allocation scheme manager of your institution
• If allocations are not expected to be used within a quarter, they can be rolled over to the next quarter during the first two weeks of the
current quarter
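The charging formula above can be tried out with a small helper. This is only a sketch: the function name su_estimate is hypothetical, the rate of 2 SU per CPU-hour in the example is an assumed placeholder rather than a quoted Gadi rate, walltime is taken in hours, and the memory-based (MemUnits) case is not modelled.

```shell
# su_estimate RATE NCPUS HOURS
# Prints RATE * NCPUS * HOURS with two decimal places.
su_estimate() {
    awk -v rate="$1" -v ncpus="$2" -v hours="$3" \
        'BEGIN { printf "%.2f\n", rate * ncpus * hours }'
}

# Assumed rate of 2 SU per CPU-hour: 48 CPUs for 1.5 hours.
su_estimate 2 48 1.5   # prints 144.00
```

Because PBS reserves SUs for the requested walltime, an estimate like this is most useful with the walltime you plan to request rather than the runtime you expect.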
[Queue table, partially recovered: expresssr, 6, 6 (5GiB)]
Compute resource : Accounting with nci_account
• Provides compute allocation and usage to-date for a project for a given quarter
• Also prints:
• Total storage allocation and usage for scratch and/or gdata project space
• Massdata usage
• Application(s)
module load openmpi/4.1.3
Job monitoring
• nqstat_anu <job id>
%CPU WallTime Time Lim RSS mem memlim cpus
12345678 R abc123 x11 myTest 33 10:53:56 20:00:00 58.7GB 58.7GB 200GB 96
19145286 R abc123 x11 atmos_ma 96 01:32:41 03:30:00 369GB 369GB 2625GB 768
19149497 R abc123 x11 coupled. 84 00:34:25 04:30:00 320GB 320GB 1440GB 720
19149708 R abc123 x11 netcdf_c 71 00:36:30 02:00:00 12.0GB 12.0GB 12.0GB 1
19150248 R abc123 x11 atmos_ma 86 00:22:27 03:30:00 345GB 345GB 2625GB 768
• qps <job id>
  • prints the snapshot of the current processes in the job
  • launches a ps query on each node running the job
  • accepts most flags ps would take
• qps_gpu <job id>
• qcat <job id>
  • print the job’s standard streams
• Realtime using top
  • Login to the compute node and run top utility
• Compile program with -g (gcc) or -g -traceback (Intel compilers).
module load padb
padb -X pbs_job_id
• pstack
• attach gdb and get a stacktrace
Job submission options
Why my job…
• has waited so long ? qstat -u $USER -Esw
• Insufficient amount of resource: ncpus
• Project doesn’t have sufficient allocation to run job
• One of the project areas is already over disk quota
• Waiting for software licenses
• Job would not finish before dedicated time
Help us help you
• Gadi User Guide
• [email protected]
• When writing to the helpdesk, always include the following information:
• Username, project code
• For job related queries:
• Include job id or absolute paths to jobscript, error and output files
• Avoid attachments; Screenshots are ok
• For additional allocations:
• Compute: discuss with project chief investigator (CI) / scheme manager
• Storage, gdata/massdata: discuss with CI / scheme manager
• Storage, scratch: discuss with NCI
NCI Training and Educational Events