7. Job Management

On the HPC cluster, you don’t run your programs directly in the shell on the login nodes. Instead, you submit them as jobs to run on the compute nodes. These jobs are managed by SLURM, which queues them and executes them when resources become available. Conceptually, each job is a two-step process:

  1. You request certain resources from the system. The most common resources are CPU cores.

  2. With the assigned resources, you run your computational tasks.

HPC supports several job types to match the resource and compute requirements of its users. Below are the most commonly used ones.

7.1. Batch Jobs

Users should use batch jobs for the most part, unless their requirements can’t be met without direct shell access.
A complete batch job workflow:

  1. Write a job script, which consists of 2 parts:

    1. Resources requirement.

    2. Commands to be executed.

  2. Submit the job.

  3. Relax, have a coffee, log off if you wish. The computer will do the work.

  4. Come back to examine the result.

7.1.1. Batch Job Script

A job script is a text file describing the job. As discussed, the first part specifies the resources you want; the second part contains the commands you want to run.

Points to be noted:

  • Request only what you need

  • Serial jobs would need only one CPU (#SBATCH -n 1)

  • Make sure the walltime specified is not greater than the allowed time limit.

Difference between CPUs, Cores and Tasks

  • On the Greene HPC cluster, one CPU is equivalent to one core.

  • In SLURM, resources (CPUs) are allocated in terms of tasks, denoted by -n or --ntasks.

  • By default, the value of -n or --ntasks is one if left undefined.

  • By default, each task is allocated one CPU.

  • But if you have defined -c or --cpus-per-task in your job script, the total number of CPUs allocated to you is the product of -n and -c (see the example below).
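
A minimal sketch of how -n and -c combine (the numbers here are illustrative, not a recommendation):

#SBATCH --ntasks=2            # -n: 2 tasks
#SBATCH --cpus-per-task=4     # -c: 4 CPU cores per task
# Total CPUs allocated: 2 x 4 = 8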

7.1.1.1. Syntax

#!/bin/bash

# Define the resource requirements here using the #SBATCH tag
# (the '#' before SBATCH is required; do not remove it).

#------ Resource requirement commands start here

#SBATCH <option> <value>
#SBATCH <option> <value>
#SBATCH <option> <value>
...

#------ Resource requirement commands end here

#------ Commands to be executed

<command executable on shell>
<command executable on shell>
<command executable on shell>
...

Save the job script with a .sbatch file extension.

7.1.1.2. Options

The options tell SLURM information about the job, such as what resources will be needed. They can be specified in the job script as #SBATCH directives, on the command line as options when submitting the job, or both (in which case the command-line options take precedence should the two contradict each other).

For each option there is a corresponding #SBATCH directive. For example, to request two nodes you can put the directive in the job script:

#!/bin/bash
#SBATCH --nodes=2

and then submit the script:

sbatch abc.sbatch

or, equivalently, pass the option on the command line when you submit the job:

sbatch --nodes=2 abc.sbatch

Available options:

-J, --job-name=<str>
    Give the job a name. The default is the filename of the job script.
    Within the job, $SLURM_JOB_NAME expands to the job name.

-o, --output=<path>
    Send stdout to the specified path. The default filename is
    slurm-${SLURM_JOB_ID}.out, e.g. slurm-12345.out, in the directory
    from which the job was submitted.

-e, --error=<path>
    Send stderr to the specified path.

--mail-user=<email>
    Send email to the specified address when certain events occur.

--mail-type=<type>
    Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL…

--export=[VARS]
    Pass variables to the job, either with a specific value (the VAR= form)
    or from the submitting environment (without “=”).

-t, --time=<time>
    Set a limit on the total run time. Acceptable formats include “minutes”,
    “minutes:seconds”, “hours:minutes:seconds”, “days-hours”,
    “days-hours:minutes” and “days-hours:minutes:seconds”.

--mem=<int><unit>
    Maximum memory per node the job will need; the unit can be MB or GB,
    e.g. 3GB.

--mem-per-cpu=<int><unit>
    Memory required per allocated CPU; the unit can be MB or GB.

-N, --nodes=<int>
    Number of nodes required. The default is 1 node.

-n, --ntasks=<int>
    Specify the number of tasks to run, e.g. -n 4. The default is one task
    per node, with one CPU core per task.

--ntasks-per-node=<int>
    Request that <int> tasks be invoked on each node.

-c, --cpus-per-task=<int>
    Request <int> CPU cores per task. Without this option, one core is
    allocated per task.

--pty
    (Used with srun) Execute the first task in pseudo-terminal mode,
    e.g. --pty /bin/bash to start a bash command shell.

--x11
    (Used with srun) Enable X forwarding, so programs using a GUI can be
    used during the session (provided you have X forwarding to your
    workstation set up).

--begin=<time>
    Delay starting this job until after the specified date and time,
    e.g. --begin=9:42:00 to start the job at 9:42:00 am.

-a, --array=<indexes>
    Submit an array of jobs with the specified array IDs. IDs can be given
    as a numerical range, a comma-separated list of numbers, or a
    combination of the two. Each job instance has the environment variables
    SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID.
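
As an illustration of the --array option described above, here is a minimal sketch of an array job script (the script name process.py and its --index argument are placeholders for your own workload):

#!/bin/bash
#SBATCH --job-name=array-demo
#SBATCH --array=0-3                  # four instances with task IDs 0,1,2,3
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --output=array_%A_%a.out     # %A: array job ID, %a: array task ID

# Each instance can use SLURM_ARRAY_TASK_ID to pick its own input or parameter
echo "Array job ${SLURM_ARRAY_JOB_ID}, task ${SLURM_ARRAY_TASK_ID}"
python process.py --index "${SLURM_ARRAY_TASK_ID}"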

Note

A more comprehensive list of options can be found in the official SLURM sbatch documentation.

7.1.2. Submitting a batch Job

To submit a job, use the sbatch command.

# sbatch [options] <filename>
sbatch abc.sbatch

Attention

Requesting the resources you need as accurately as possible allows your job to start at the earliest opportunity and helps the system schedule work efficiently, to everyone’s benefit.

7.1.3. Reading Outputs

After a job finishes, a file containing the job’s stdout (the shell output) is generated under the name specified in the job script (or the default slurm-<jobid>.out). You can read the output by opening that file.

cat job_1323234.out
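
While the job is still running, you can also follow the output file as it grows (using the same illustrative filename):

tail -f job_1323234.out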

7.1.4. Examples

Job with one core:

Create a file python_program.sbatch to run abc.py.

#!/bin/bash

#SBATCH --ntasks=1              # Set number of tasks to run
#SBATCH --time=00:30:00         # Walltime format hh:mm:ss
#SBATCH -o job_%J.out           # Output file (%J: JobID)
#SBATCH -e job_%J.err           # Error file

# ---- Put all #SBATCH directives above this line! ---- #
# ---- Otherwise they will not be effective!       ---- #

# ---- Actual commands start here ---- #
# Load modules here (safety measure)
module purge
module load python

python abc.py

Submit the job:

sbatch python_program.sbatch

A typical batch job script with GPU:

#!/bin/bash

#SBATCH --nodes=1               
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=5:00:00          # Walltime (maximum run time);
                                # SLURM can kill the job after
                                # this limit is reached
#SBATCH --mem=2GB               # RAM
#SBATCH --gres=gpu:1            # 1 GPU

#SBATCH --job-name=myTest       # Assigned Job Name
#SBATCH --mail-type=END                 # Email on completion
#SBATCH --mail-user=bob.smith@nyu.edu   # Email
#SBATCH --output=slurm_%j.out           # Output file
 
# Clean environment
module purge
module load cuda/11.3.1
module load anaconda3/2020.07 
eval "$(conda shell.bash hook)"

conda activate pytorch_env

cd $SCRATCH/Projects/DLProject
python resnet.py

Tip

If you want to request a specific GPU model, use --gres=gpu:rtx8000:1 for a Quadro RTX8000 or --gres=gpu:v100:1 for a Tesla V100.
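
For example, as a directive in a job script:

#SBATCH --gres=gpu:rtx8000:1    # request one Quadro RTX8000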

Warning

A job that has been allocated a GPU will be terminated if its GPU utilization stays very low for 2 continuous hours.

7.2. Interactive Sessions

Instead of submitting a batch job, you can request an interactive session on a compute node from your terminal, where you can run your programs in the shell directly.

Warning

Only short interactive jobs should be run this way (e.g., experimenting with new hyper-parameters in your source code, where each execution takes only a short time).

To start an interactive session, use the srun command.
For example, to request a session with 4 CPU cores:

srun -c 4  --pty /bin/bash

Expected output:

[wz22@login-0-1]$ srun -c 4  --pty /bin/bash
srun: job 775175 queued and waiting for resources
srun: job 775175 has been allocated resources
[wz22@compute-21-1 ~]$

Then you can run your applications on the terminal directly.

Warning

In practice, the cluster may be busy and have no resources immediately available; in that case your request will wait in the queue until resources free up.

For interactive sessions, srun accepts the same resource options that you would put in a job script for sbatch.

To request a GPU session with 32 GB RAM and 10 CPU cores for 1 hour:

srun -c 10 --mem=32GB --gres=gpu:1 -t 1:00:00 --pty /bin/bash

To leave an interactive session, type exit at the command prompt.

7.3. Checking Job Status

7.3.1. Running or Pending Job

This command shows all your current jobs.

squeue -u $USER

Example output:

JOBID PARTITION     NAME     USER ST       TIME NODES NODELIST(REASON)
31408   ser_std  job1.sh     wz22  R       0:02     1 compute-21-4

This means the job with ID 31408 is running (ST: R) and has been running for 2 seconds (TIME 0:02) on cluster node compute-21-4.
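
For jobs that are still pending, you can also ask SLURM for an estimated start time (the scheduler may not always provide one):

squeue -u $USER --start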

You can also get information about job status by running:

sstat --format=TresUsageInMax%80,TresUsageInMaxNode%80 -j <JobID> --allsteps

For more verbose information, use scontrol show job.

scontrol show job <jobid>

7.3.2. Completed Job

Once a job has finished, it can no longer be inspected with squeue or scontrol show job. At that point, you can inspect it with sacct.

sacct -j <jobid>

The following command gives you very verbose information on a job.

sacct -j <jobid> -l
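
If the full -l output is too much, you can pick specific fields instead; the column names below are standard sacct format fields:

sacct -j <jobid> --format=JobID,JobName,Partition,Elapsed,MaxRSS,State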

7.4. Cancelling a Job

If you decide to end a job prematurely, use the scancel command.

scancel <jobid>

Use with caution!

To cancel all jobs under your account, run this on the HPC terminal:

scancel -u $USER
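
scancel can also cancel jobs more selectively, e.g. by state or by job name (the job name here is illustrative):

scancel -u $USER --state=PENDING     # cancel only your pending jobs
scancel -u $USER --name=myTest       # cancel only jobs named myTest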

7.5. Job Resource Usage Statistics

7.5.1. Completed Job

A useful command that allows you to better understand how resources were utilized by completed jobs is seff:

seff <job-ID>

Example Output:

Job ID: 8932105
Cluster: greene
User/Group: NetID/GROUPID
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 02:22:45
CPU Efficiency: 99.99% of 02:22:46 core-walltime
Job Wall-clock time: 02:22:46
Memory Utilized: 2.18 GB
Memory Efficiency: 21.80% of 10.00 GB

This example shows statistics for a completed job that was run with a request of 1 CPU core and 10 GB of RAM. While CPU utilization was nearly 100%, memory utilization was poor: only 2.18 GB of the requested 10 GB was used. This job’s batch script should be adjusted to request something like #SBATCH --mem=2250MB.

7.5.2. Running Job

For a running job, first find the compute node it is running on (shown in the NODELIST column of squeue), ssh to that node, and then run:

top -u $USER

This shows how fully your job is using its CPUs and how much RAM it is consuming.
To exit, press q (or Ctrl+C).
For a GPU job, also run:

nvidia-smi

This shows how much GPU processing power and GPU memory your job is using.
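
Putting it together, a minimal sketch of the whole check (the node name is illustrative):

squeue -u $USER        # note the NODELIST entry, e.g. compute-21-4
ssh compute-21-4       # log in to the node running your job
top -u $USER           # CPU and RAM usage of your processes
nvidia-smi             # (GPU jobs) GPU utilization and memory
exit                   # return to the login node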

7.5.3. Visualize Job Statistics

You can use the dashboard below to visualize the efficiency and utilization of your jobs:

Note

You will have to log in using your NYU email, and you need to be on the NYU network or connected to the NYU VPN.

7.6. Resource Limitations

Within SLURM there are multiple resource limits defined on different levels and applied to different objects. Some of the important limits are listed below:

Note

These limits are updated frequently by the HPC team based on cluster usage patterns. Because of this, the numbers below are not exact and should only be used as general guidelines.

7.6.1. Job Limitations

Resource / Object per User    Limit
--------------------------    -------------------------------
Concurrent Jobs               2000
Job Lifetime                  7 days / 168 hours (extendible)

7.6.2. CPU, GPU, RAM Limitations

These limitations are account-specific; run the command below to check yours:

sacctmgr list qos format=name,maxwall,maxtresperuser%40,flags%40 where name=interact,cpu48,cpu168,gpu48,gpu168,gpuamd,cds,cpuplus,gpuplus

Example Output:

      Name     MaxWall                MaxTRESPU           Flags
---------- ----------- ------------------------ ---------------
     cpu48  2-00:00:00       cpu=3000,mem=6000G
    cpu168  7-00:00:00       cpu=1000,mem=2000G
     gpu48  2-00:00:00              gres/gpu=24
    gpu168  7-00:00:00               gres/gpu=4

From this output you can see that in the “short queue” (under 48 hours, or 2 days) each user is allowed to use up to 3000 cores, while for jobs in the “long queue” (under 168 hours, or 7 days) you can use up to 1000 cores. The basic idea is that users can run more short jobs and fewer long jobs. The same logic applies to GPU resources.

7.6.3. CPU with GPU Limitations

For Tesla V100:

# GPUs    Max CPUs    Max Memory (GB)
1         20          200
2         24          300
3         44          350
4         48          369

For Quadro RTX8000:

# GPUs    Max CPUs    Max Memory (GB)
1         20          200
2         24          300
3         44          350
4         48          369

From these tables you can see, for example, that a job asking for 8 V100 GPUs will not be queued. Likewise, a request for 2 V100s and 48 cores will not be granted.

7.7. HPC Resource Status

You can check the current compute resource status of the whole HPC cluster using the dashboards below:

Note

You need to be on the NYU network or connected to the NYU VPN.