PBS-Documentation May17
Introduction
Most jobs will require greater resources than are available on individual nodes. All jobs must
be scheduled via the batch job system. The batch job system in use is PBS Pro. Jobs are
submitted to PBS specifying required resources, including the queue, number of CPUs, the
amount of memory, and the length of time needed. PBS will then run a job or jobs when the
resources are available, subject to constraints on maximum resource usage.
Command              Description
qsub commandfile     Submit a job to the queues. In its simplest form, qsub
                     takes only the name of the command file.
qstat -u username    Display the status of PBS jobs and queues for the user
                     username. See man qstat for details of the options.
qdel jobid           Delete your job from a queue. The jobid is returned by
                     qsub at job submission time and is also displayed in
                     the qstat output.
qhold jobid          Place a hold on your job in the queue, which stops it
                     from running.
qrls -h u jobid      Release a user hold on your job, which allows it to run.
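The commands above can be combined into a short session. The following is only a sketch: "commandfile" is a placeholder for your own PBS script, and the function name job_lifecycle is made up for this illustration. The commands run only where qsub is installed.

```shell
#!/bin/bash
# Submit a command file, then hold, release and finally delete the job.
# "commandfile" is a placeholder name; job_lifecycle is a hypothetical helper.
job_lifecycle() {
    jobid=$(qsub "$1") || return 1   # qsub prints the job ID on submission
    qstat -u "$USER"                 # list your own jobs
    qhold "$jobid"                   # place a hold on the job
    qrls -h u "$jobid"               # release the user hold
    qdel "$jobid"                    # remove the job from the queue
}

# Run only where PBS is installed.
if command -v qsub >/dev/null 2>&1; then
    job_lifecycle commandfile
fi
```

In practice you would not delete a job straight after releasing it; the steps are listed together here only to show how the jobid returned by qsub is reused.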
qsub Options
#!/bin/bash -l
#PBS -N Example_Job
#PBS -q default
#PBS -l select=2:ncpus=16
#PBS -l walltime=<hh:mm:ss>
#PBS -o <output-file>
#PBS -e <error-file>
The line "-l select=2:ncpus=16" sets the number of processors required for the
job: select specifies the number of nodes (or chunks of resource) required, and ncpus
indicates the number of CPUs required per chunk.
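As a quick check of the arithmetic, the total number of CPUs requested is the product of select and ncpus. The variable names below are chosen for this illustration only:

```shell
#!/bin/bash
# Total CPUs requested = number of chunks x CPUs per chunk.
select=2    # chunks (nodes) requested
ncpus=16    # CPUs per chunk
total=$(( select * ncpus ))
echo "select=${select}:ncpus=${ncpus} requests ${total} CPUs in total"
```

With the values from the example script this reports 32 CPUs.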
As this is not the most intuitive command, the following table is provided as a guide to how
this command works:
The line "-l walltime=<hh:mm:ss>" is the time limit for the job. If your job exceeds this time,
the scheduler will terminate it. It is recommended to measure a typical runtime for the job
and add a margin (say 20%). For example, if a job took approximately 10 hours, the
walltime limit could be set to 12 hours, e.g. "-l walltime=12:00:00". Setting the walltime
lets the scheduler plan jobs more efficiently. It also reduces the occasions where an error
leaves a job stalled but still holding resources until the default walltime limit, which is
much longer, is reached (for the queue walltime defaults, run the "qstat -q" command).
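The 20% rule can be worked out in the shell. This is a minimal sketch that assumes the measured runtime is known in seconds; the variable names are illustrative:

```shell
#!/bin/bash
# Turn a measured runtime into a walltime request with a 20% margin.
runtime_s=36000                         # example: a run of about 10 hours
padded_s=$(( runtime_s * 120 / 100 ))   # add 20%
walltime=$(printf '%02d:%02d:%02d' \
    $(( padded_s / 3600 )) $(( padded_s % 3600 / 60 )) $(( padded_s % 60 )))
echo "#PBS -l walltime=${walltime}"
```

For the 10 hour example this prints "#PBS -l walltime=12:00:00", matching the value suggested above.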
Job management
The qstat command displays the status of the PBS scheduler and queues. The flags -Qa
show the queue partitions available. If no queue is specified, the job uses the queue called
default. The following table shows the commonly used queues:
express:
serial:
short:
long:
Job State   Description
B           Job arrays only: job array has Begun.
E           Job is Exiting after having run.
F           Job has Finished exiting and execution. The job was completed
            successfully and had no application errors.
            Job has Finished exiting and execution; however, the job
            experienced application errors.
H           Job is Held. A job is put into a held state by the server or by a
            user or administrator. A job stays in a held state until it is
            released by a user or administrator.
Q           Job is Queued, eligible to run or be routed.
R           Job is Running.
S           Job is Suspended by server. A job is put into the suspended state
            when a higher priority job needs the resources.
T           Job is in Transition (being moved to a new location).
U           Job is User-suspended.
W           Job is Waiting for its requested execution time to be reached, or
            the job specified a staging request which failed for some reason.
X           Sub jobs only: sub job is finished (expired).
Resources are strictly allocated, so jobs will not start unless there are sufficient free
resources (e.g. CPUs and memory).
Queued jobs are shuffled so that jobs from different users are "interleaved". This
means your first job should appear near the top of the queue even if there are many
jobs in the queue.
From a user's perspective, it is very important that you minimize your requests for resources
(e.g. walltime, memory and cpus). Otherwise your job may be queued or suspended longer
than necessary. Of course, make sure you ask for sufficient resources - a little
experimentation might help.
PBS Variables
PBS sets multiple environment variables at submission time. The following PBS variables are
commonly used in command files:
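As a hedged illustration, the sketch below uses three variables that PBS Professional sets for a job: PBS_O_WORKDIR (the directory qsub was run from), PBS_JOBID (the job identifier) and PBS_NODEFILE (a file listing the allocated nodes). The fallback values and the names workdir and jobid are assumptions added so the fragment can also be tried outside a job.

```shell
#!/bin/bash
#PBS -N Example_Job

# Fall back to local values when the script is run outside PBS (assumption
# made for this illustration only).
workdir="${PBS_O_WORKDIR:-$PWD}"
jobid="${PBS_JOBID:-no-job-id}"

# Jobs do not start in the submission directory, so change to it explicitly.
cd "$workdir" || exit 1
echo "Job ${jobid} running in ${workdir}"

# Count the distinct nodes allocated to the job, if the node file exists.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    nodes=$(sort -u "$PBS_NODEFILE" | wc -l)
    echo "Allocated nodes: ${nodes}"
fi
```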
Use of PBS is not limited to batch jobs only. It also allows users to use the compute nodes
interactively, when needed. For example, users can work with the developer environments
provided by Matlab or R on compute nodes, and run their jobs (until the walltime expires).
Instead of preparing a submission script, users pass the job requirements directly to the
qsub command. For instance, the following PBS script:
#PBS -l nodes=7:ppn=4
#PBS -l mem=2gb
#PBS -l walltime=15:00:00
#PBS -q default
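One way to express the same requirements interactively, assuming each directive above is simply passed as a qsub option (the variable name qsub_opts is illustrative), is sketched below. The session is started only where qsub is installed.

```shell
#!/bin/bash
# Same requirements as the directives above, passed as qsub options.
qsub_opts="-I -X -q default -l nodes=7:ppn=4 -l mem=2gb -l walltime=15:00:00"
echo "qsub $qsub_opts"

# Start the interactive session only where PBS is installed.
if command -v qsub >/dev/null 2>&1; then
    # $qsub_opts is left unquoted on purpose so it splits into separate options.
    qsub $qsub_opts
fi
```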
Here, -I (as in 'I'ndia) stands for 'interactive' and -X allows for GUI applications.
In some situations a job or jobs will be dependent on the output of another job in order to
run. To add a job dependency, the option -W [additional attributes] is used when submitting
a job. In the example below the afterok rule will be used, but there are several other rules
that may be useful. In this example two PBS command files will be used:
Without the dependency, the error output from order.pbs will be:
order: open failed: number.list: No such file or directory
To avoid this, order.pbs can be submitted with a dependency on number.pbs, as in:
[username@hpc-login-prd-t1 ~]$ qsub number.pbs
4674.hpc-admin-prd-t1
[username@hpc-login-prd-t1 ~]$ qsub -W depend=afterok:4674 order.pbs
4675.hpc-admin-prd-t1
[username@hpc-login-prd-t1 ~]$ qstat -u $USER
hpc-admin-prd-t1.usq.edu.au:
Notice that order.pbs is in a hold state. Once the job it depends on completes, the order
job runs as:
hpc-admin-prd-t1.usq.edu.au:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
4675.hpc-admi username short order.pbs 1 1 -- 48:00 R --
afterany:jobid[:jobid...] implies that the job may be scheduled for execution after
jobs jobid have terminated, with or without errors.
afterok:jobid[:jobid...] implies that the job may be scheduled for execution only after
jobs jobid have terminated with no errors.
afternotok:jobid[:jobid...] implies that the job may be scheduled for execution only after
jobs jobid have terminated with errors.
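The two submissions can also be chained in a script by capturing the job ID that qsub prints. This is a minimal sketch using the afterok rule and the file names from the example above; the function name submit_chain is made up for this illustration.

```shell
#!/bin/bash
# Submit number.pbs, then submit order.pbs so that it becomes eligible to run
# only if the first job terminates without errors.
# submit_chain is a hypothetical helper name.
submit_chain() {
    first_id=$(qsub number.pbs) || return 1
    second_id=$(qsub -W depend=afterok:"$first_id" order.pbs) || return 1
    echo "$first_id $second_id"
}

# Run only where PBS is installed.
if command -v qsub >/dev/null 2>&1; then
    submit_chain
fi
```

Replacing afterok with afterany or afternotok selects one of the other rules listed above.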