CIS Computing & Information Services

Running Jobs

A "job" refers to a program running on the compute nodes of the Oscar cluster. Jobs can be run on Oscar in two different ways:

  • An interactive job allows you to interact with a program by typing input, using a GUI, etc. But if your connection is interrupted, the job will abort. These are best for small, short-running jobs where you need to test out a program, or where you need to use the program's GUI.
  • A batch job allows you to submit a script that tells the cluster how to run your program. Your program can run for long periods of time in the background, so you don't need to be connected to Oscar. The output of your program is continuously written to an output file that you can view both during and after your program runs.

Jobs are scheduled to run on the cluster according to your account priority and the resources you request (e.g. cores, memory, and runtime). For batch jobs, these resources are specified in a script, referred to as a batch script, which is passed to the scheduler with the sbatch command. When you submit a job, it is placed in a queue, where it waits until the required compute nodes become available.

NOTE: Please do not run CPU-intensive or long-running programs directly on the login nodes! The login nodes are shared by many users, and heavy use of them disrupts other users' work.

We use the Simple Linux Utility for Resource Management (SLURM) from Lawrence Livermore National Laboratory as the job scheduler on Oscar. With SLURM, jobs that only need part of a node can share the node with other jobs (this is called "job packing"). When your program runs through SLURM, it runs in its own container, similar to a virtual machine, that isolates it from the other jobs running on the same node. By default, this container has 1 core and a portion of the node's memory.

The following sections have more details on how to run interactive and batch jobs through SLURM, and how to request more resources (either more cores or more memory).


Batch jobs

To run a batch job on the Oscar cluster, you first have to write a script that describes what resources you need and how your program will run. Example batch scripts are available in your home directory on Oscar, in the directory:

~/batch_scripts

To submit a batch job to the queue, use the sbatch command:

$ sbatch <jobscript>

A batch script starts by specifying the bash shell as its interpreter, with the line:

#!/bin/bash

Next, a series of lines starting with #SBATCH define the resources you need, for example:

#SBATCH -n 4
#SBATCH -t 1:00:00
#SBATCH --mem=16G

Note that all the #SBATCH instructions must come before the commands you want to run. The above lines request 4 cores (-n), an hour of runtime (-t), and 16GB of memory per node (--mem). By default, a batch job will reserve 1 core and 2.8GB of memory per core.
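
Putting this together, a minimal batch script might look like the following sketch (my_program is a placeholder for your own executable):

#!/bin/bash

# Resource requests: all #SBATCH lines must come before any commands
#SBATCH -J my_job
#SBATCH -n 4
#SBATCH -t 1:00:00
#SBATCH --mem=16G

# Commands to run with the requested resources
./my_program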

Alternatively, you could set the resources as command-line options to sbatch:

$ sbatch -n 4 -t 1:00:00 --mem=16G <jobscript>

The command-line options will override the resources specified in the script, so this is a handy way to reuse an existing batch script when you just want to change a few of the resource values.
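
For example, to reuse the same script with a longer time limit, you could leave the script unchanged and override only the walltime:

$ sbatch -t 4:00:00 <jobscript>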

The sbatch command will return a number, which is your job ID. You can view the output of your job in the file slurm-<jobid>.out in the directory where you ran the sbatch command. For instance, you can view the last 10 lines of output with:

$ tail -10 slurm-<jobid>.out
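
Because the output file is written continuously, you can also follow it in real time while the job runs:

$ tail -f slurm-<jobid>.out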

Alternatively, you can specify the files to which standard output and standard error are written using the -o and -e flags.
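
For example, the following directives send standard output and standard error to separate files (the file names are placeholders; %j is replaced by the job ID):

#SBATCH -o my_job-%j.out
#SBATCH -e my_job-%j.err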

Useful sbatch options:

-J Specify the job name that will be displayed when listing the job.
-n Number of tasks (equal to the number of cores if the "--cpus-per-task" or "-c" option is not specified).
-c Number of CPUs or cores per task (on the same node).
-N Number of nodes.
-t Runtime, as HH:MM:SS.
--mem= Requested memory per node.
-p Request a specific partition.
-o Filename for standard output from the job.
-e Filename for standard error from the job.
-C Add a feature constraint (a tag that describes a type of node). You can view the available features on Oscar with the nodes command.

--mail-type= Specify the events you should be notified of by email: BEGIN, END, FAIL, REQUEUE, or ALL.
--mail-user= Email address to which notifications should be sent.
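
As an example, the following directives combine several of these options; the job name, email address, and feature tag are placeholders:

#SBATCH -J my_analysis
#SBATCH -N 1
#SBATCH -C e5-2600
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=username@brown.edu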

You can read the full list of options at http://slurm.schedmd.com/sbatch.html or with the command:

$ man sbatch

Interactive jobs

To start an interactive session for running serial or threaded programs on an Oscar compute node, simply run the command interact from the login node:

$ interact

By default, this will create an interactive session that reserves 1 core, 4GB of memory, and 30 minutes of runtime.

You can change these default limits with the following command line options:

usage: interact [-n cores] [-t walltime] [-m memory] [-q queue]
                [-o outfile] [-X] [-f featurelist] [-h hostname] [-g ngpus]

Starts an interactive job by wrapping the SLURM 'salloc' and 'srun' commands.

options:
  -n cores        (default: 1)
  -t walltime     as hh:mm:ss (default: 30:00)
  -m memory       as #[k|m|g] (default: 4g)
  -q queue        (default: 'batch')
  -o outfile      save a copy of the session's output to outfile (default: off)
  -X              enable X forwarding (default: no)
  -f featurelist  CCV-defined node features (e.g., 'e5-2600'),
                  combined with '&' and '|' (default: none)
  -h hostname     only run on the specific node 'hostname'
                  (default: none, use any available node)
  -a account      SLURM accounting account name to use

For example:

$ interact -n 20 -t 01:00:00 -m 10g

This will request 20 cores, 1 hour of time and 10 GB of memory (per node).
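
Other options can be combined in the same way. For instance, to start a session with X forwarding for a GUI program on a particular node type (the feature name here is only an example):

$ interact -n 2 -t 02:00:00 -m 8g -X -f e5-2600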


Managing jobs

Canceling a job:

$ scancel <jobid>
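
scancel can also select jobs by other criteria; for example, to cancel all of your own jobs at once (replace the username with your own):

$ scancel -u <username>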

Listing running and queued jobs:

The squeue command will list all jobs scheduled in the cluster. We have also written wrappers for squeue on Oscar that you may find more convenient:

myq List only your own jobs.
myq <user> List another user's jobs.
allq List all jobs, organized by partition, with a summary of the nodes in use in each partition.
allq <partition> List all jobs in a single partition.
myjobinfo Get the time and memory used for your jobs.

Listing completed jobs

The sacct command will list all of your running, queued and completed jobs since midnight of the previous day. To pick an earlier start date, specify it with the -S option:

$ sacct -S 2012-01-01

To find out more information about a specific job, such as its exit status or the amount of runtime or memory it used, specify the -l ("long" format) and -j options with the job ID:

$ sacct -lj <jobid>
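
The long format prints many columns. If you only need a few fields, you can pick them with sacct's --format option, for example:

$ sacct -j <jobid> --format=JobID,JobName,Elapsed,ReqMem,MaxRSS,State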

The myjobinfo command uses the sacct command to display the "Elapsed Time", "Requested Memory", and "Maximum Memory used on any one Node" for your jobs. Use this information to tune your requests: asking for only slightly more time and memory than your jobs actually use helps them start as early as possible.

$ myjobinfo

Info about jobs for user 'mdave' submitted since 2017-05-19T00:00:00
Use option '-S' for a different date
 or option '-j' for a specific Job ID.

       JobID    JobName              Submit      State    Elapsed     ReqMem     MaxRSS
------------ ---------- ------------------- ---------- ---------- ---------- ----------
1861                ior 2017-05-19T08:31:01  COMPLETED   00:00:09     2800Mc      1744K
1862                ior 2017-05-19T08:31:11  COMPLETED   00:00:54     2800Mc     22908K
1911                ior 2017-05-19T15:02:01  COMPLETED   00:00:06     2800Mc      1748K
1912                ior 2017-05-19T15:02:07  COMPLETED   00:00:21     2800Mc      1744K

'ReqMem' shows the requested memory:
 A 'c' at the end of the number represents Memory Per CPU, an 'n' represents Memory Per Node.
'MaxRSS' is the maximum memory used on any one node.
Note that memory specified to sbatch using '--mem' is Per Node.
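
For example, these two directives request memory in the two different ways (use one or the other in a script, not both); the first would be reported with an 'n' suffix and the second with a 'c' suffix:

#SBATCH --mem=16G
#SBATCH --mem-per-cpu=4G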

Partitions

When submitting a job to the Oscar compute cluster, you can choose different partitions depending on the nature of your job. You can specify one of the partitions listed below either in your sbatch command:

$ sbatch -p <partition> <batch_script>

or as an SBATCH option at the top of your batch script:

#SBATCH -p <partition>

Partitions available on Oscar:

batch Default partition with most of the compute nodes: 16-, 20-, or 24-core; 55GB to 188GB of memory; all Intel-based.
gpu Specialized compute nodes with GPU accelerators (see the example after this list).
gpu-debug Specialized compute nodes with GPU accelerators for debugging GPU code, i.e. fast turn-around, but with a short time limit of 30 minutes.
bibs-gpu Specialized compute nodes with GPU accelerators for BIBS use only.
debug Dedicated nodes for fast turn-around, but with a short time limit of 30 minutes and a CPU limit of 16.
smp AMD-architecture nodes with a large number of CPUs and a large amount of memory, primarily meant for shared-memory parallel programs.
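
For example, a batch script for the gpu partition would typically also request a GPU through SLURM's generic-resource option (the count of 1 is just an example):

#SBATCH -p gpu
#SBATCH --gres=gpu:1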

You can view a list of all the Oscar compute nodes broken down by type with the command:

$ nodes

Job priority

The scheduler considers many factors when determining the run order of jobs in the queue. These include:

  • the size of the job;
  • the requested walltime;
  • the amount of resources you have used recently (known as "fair sharing");
  • the priority of your account type.

The account priority has three tiers:

  • Low (Exploratory)
  • Medium (Premium)
  • High (Condo)

Both Exploratory and Premium accounts can be affiliated with a Condo, and the Condo priority only applies to a portion of the cluster equivalent in size to the Condo. Once the Condo affiliates have requested more nodes than are available in the Condo, their priority drops to either medium or low, depending on whether they have a Premium or Exploratory account.

Backfilling: When a large or long-running job is near the top of the queue, the scheduler begins reserving nodes for it. If you queue a smaller job with a walltime shorter than the time required for the scheduler to finish reserving resources, the scheduler can backfill the reserved resources with your job to better utilize the system. Here is an example:

  • User1 has a 64-node job with a 24 hour walltime waiting at the top of the queue.
  • The scheduler can't reserve all 64 nodes until other currently running jobs finish, but it has already reserved 38 nodes and will need another 10 hours to reserve the final 26 nodes.
  • User2 submits a 16-node job with an 8 hour walltime, which is backfilled into the pool of 38 reserved nodes and runs immediately.

By requesting a shorter walltime for your job, you increase its chances of being backfilled and running sooner. In general, the more accurately you can predict the walltime, the sooner your job will run and the better the system will be utilized for all users.