A "job" refers to a program running on the compute nodes of the Oscar cluster. Jobs can be run on Oscar in two different ways:
Jobs are scheduled to run on the cluster according to your account priority and the resources you request (e.g. cores, memory, and runtime). For batch jobs, these resources are specified in a script referred to as a batch script, which is passed to the scheduler using a command. When you submit a job, it is placed in a queue where it waits until the required compute nodes become available.
NOTE: please do not run CPU-intensive or long-running programs directly on the login nodes! The login nodes are shared by many users, and you will interrupt other users' work.
We use the Simple Linux Utility for Resource Management (SLURM) from Lawrence Livermore National Laboratory as the job scheduler on Oscar. With SLURM, jobs that only need part of a node can share the node with other jobs (this is called "job packing"). When your program runs through SLURM, it runs in its own container, similar to a virtual machine, that isolates it from the other jobs running on the same node. By default, this container has 1 core and a portion of the node's memory.
The following two sections have more details on how to run interactive and batch jobs through SLURM, and how to request more resources (either more cores or more memory).
To start an interactive session for running serial or threaded programs on an Oscar compute node, simply run the command interact from the login node:
By default, this will create an interactive session that reserves 1 core, 4GB of memory, and 30 minutes of runtime.
You can change these default limits with the following command line options:
usage: interact [-n cores] [-t walltime] [-m memory] [-q queue] [-o outfile] [-X] [-f featurelist] [-h hostname] [-g ngpus]

Starts an interactive job by wrapping the SLURM 'salloc' and 'srun' commands.

options:
  -n cores        (default: 1)
  -t walltime     as hh:mm:ss (default: 30:00)
  -m memory       as #[k|m|g] (default: 4g)
  -q queue        (default: 'batch')
  -o outfile      save a copy of the session's output to outfile (default: off)
  -X              enable X forwarding (default: no)
  -f featurelist  CCV-defined node features (e.g., 'e5-2600'), combined with '&' and '|' (default: none)
  -h hostname     only run on the specific node 'hostname' (default: none, use any available node)
  -a account      use SLURM accounting account name
$ interact -n 20 -t 01:00:00 -m 10g
This will request 20 cores, 1 hour of runtime, and 10 GB of memory (per node).
If you need access to GPUs, see https://www.ccv.brown.edu/doc/gpu.
To run an MPI program interactively, first create an allocation from the login nodes using the salloc command:
$ salloc -N <# nodes> -n <# MPI tasks> -p <partition> -t <minutes>
Once the allocation is fulfilled, you will be placed in a new shell where you can run MPI programs with the srun command:
$ srun ./my-mpi-program ...
When you are finished running MPI commands, you can release the allocation by exiting the shell:
$ exit
For more info on MPI programs, see https://www.ccv.brown.edu/doc/mpi.
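Putting these steps together, a complete interactive MPI session might look like the following sketch (the node and task counts are illustrative, and my-mpi-program is a hypothetical executable):

```shell
# Request an allocation: 2 nodes, 4 MPI tasks, batch partition, 30 minutes
$ salloc -N 2 -n 4 -p batch -t 30

# salloc places you in a new shell once the allocation is fulfilled;
# launch the MPI program across the allocated tasks with srun
$ srun ./my-mpi-program

# Release the allocation when finished
$ exit
```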
To run a batch job on the Oscar cluster, you first have to write a script that describes what resources you need and how your program will run. Example batch scripts are available in your home directory on Oscar, in the directory:
To submit a batch job to the queue, use the sbatch command:
$ sbatch <jobscript>
A batch script starts by specifying the bash shell as its interpreter, with the line:
#!/bin/bash
Next, a series of lines starting with #SBATCH define the resources you need, for example:
#SBATCH -n 4
#SBATCH -t 1:00:00
#SBATCH --mem=16G
Note that all the #SBATCH instructions must come before the commands you want to run. The lines above request 4 cores (-n), an hour of runtime (-t), and 16GB of memory per node, shared by all the cores (--mem). By default, a batch job will reserve 1 core and a proportional amount of memory on a single node.
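For instance, a complete minimal batch script built from these pieces might look like this (serial_program is a placeholder for whatever you want to run):

```shell
#!/bin/bash
#SBATCH -n 4
#SBATCH -t 1:00:00
#SBATCH --mem=16G

# Your commands go after all of the #SBATCH directives
./serial_program
```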
Alternatively, you could set the resources as command-line options to sbatch:
$ sbatch -n 4 -t 1:00:00 --mem=16G <jobscript>
The command-line options will override the resources specified in the script, so this is a handy way to reuse an existing batch script when you just want to change a few of the resource values.
The sbatch command will return a number, which is your job ID. You can view the output of your job in the file slurm-<jobid>.out in the directory where you ran the sbatch command. For instance, you can view the last 10 lines of output with:
$ tail -10 slurm-<jobid>.out
Alternatively, you can specify the files where the standard output and standard error should be written using the -o and -e options:
| Option | Description |
|---|---|
| -J jobname | Specify the job name that will be displayed when listing the job. |
| -n tasks | Number of cores. |
| -N nodes | Number of nodes. |
| -t HH:MM:SS | Runtime, as HH:MM:SS. |
| --mem=<size> | Requested memory per node. |
| -p partition | Request a specific partition. |
| -C feature | Add a feature constraint (a tag that describes a type of node). You can view the available features on Oscar. |
| --mail-type=<events> | Specify the events that you should be notified of by email: BEGIN, END, FAIL, REQUEUE, and ALL. |
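As a sketch of how these options combine, a batch script header might look like the following (the job name, sizes, and notification events are arbitrary example choices):

```shell
#!/bin/bash
#SBATCH -J my-analysis         # job name shown when listing jobs
#SBATCH -n 8                   # number of cores
#SBATCH -N 1                   # number of nodes
#SBATCH -t 2:00:00             # runtime as HH:MM:SS
#SBATCH --mem=32G              # memory per node
#SBATCH -p batch               # partition
#SBATCH --mail-type=END,FAIL   # email notification events
```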
You can read the full list of options at http://slurm.schedmd.com/sbatch.html or with the command:
$ man sbatch
To cancel a job, use the scancel command with the job ID:
$ scancel <jobid>
The squeue command will list all jobs scheduled in the cluster. We have also written wrappers for squeue on Oscar that you may find more convenient:
| Command | Description |
|---|---|
| myq | List only your own jobs. |
| myq <user> | List another user's jobs. |
| allq | List all jobs, organized by partition, with a summary of the nodes in use in each partition. |
| allq <partition> | List all jobs in a single partition. |
The sacct command will list all of your running, queued, and completed jobs since midnight of the previous day. To pick an earlier start date, specify it with the -S option:
$ sacct -S 2012-01-01
To find out more information about a specific job, such as its exit status or the amount of runtime or memory it used, specify the -l ("long" format) and -j options with the job ID:
$ sacct -lj <jobid>
When submitting a job to the Oscar compute cluster, you can choose different partitions depending on the nature of your job. You can specify one of the partitions listed below either on the sbatch command line:
$ sbatch -p <partition> <batch_script>
or as an #SBATCH option at the top of your batch script:
#SBATCH -p <partition>
Partitions available on Oscar:
| Partition | Description |
|---|---|
| batch | Default partition with most of the compute nodes: 8-, 12-, 16-, 20-core or SMP; 64GB to 128GB of memory (505GB on SMP); all Intel-based except the SMP nodes. |
| gpu | Specialized compute nodes (8-core, 24GB, Intel), each with 2 NVIDIA GPU accelerators. |
| debug | Dedicated nodes for fast turn-around, but with a short time limit of 40 node-minutes. |
You can view a list of all the Oscar compute nodes broken down by type with the command:
The scheduler considers many factors when determining the run order of jobs in the queue. These include the:
The account priority has three tiers: Condo (highest), Premium (medium), and Exploratory (lowest).
Both Exploratory and Premium accounts can be affiliated with a Condo, and the Condo priority only applies to a portion of the cluster equivalent in size to the Condo. Once the Condo affiliates have requested more nodes than available in the Condo, their priority drops down to either medium or low, depending on whether they are a Premium or Exploratory account.
Backfilling: When a large or long-running job is near the top of the queue, the scheduler begins reserving nodes for it. If you queue a smaller job with a walltime shorter than the time required for the scheduler to finish reserving resources, the scheduler can backfill the reserved resources with your job to better utilize the system. Here is an example:
By requesting a shorter walltime for your job, you increase its chances of being backfilled and running sooner. In general, the more accurately you can predict the walltime, the sooner your job will run and the better the system will be utilized for all users.
Users who are affiliated with a Condo group will automatically use that Condo's priority when submitting jobs with sbatch.
Users who are Condo members and also have Premium accounts will by default use their Premium priority when submitting jobs. This is because the core limit for a Premium account is per user, while the limit for a Condo is per group. Submitting jobs under the Premium account therefore leaves more cores available to the Condo group.
Since Premium accounts have slightly lower priority, a user in this situation
may want to instead use the Condo priority. This can be accomplished with the
--qos option, which stands for "Quality of Service" (the mechanism in SLURM
that CCV uses to assign queue priority).
Condo QOS names are typically <groupname>-condo, and you can view a full list with the condos command on Oscar. The command to submit a job with Condo priority is:
$ sbatch --qos=<groupname>-condo ...
Alternatively, you could place the following line in your batch script:
#SBATCH --qos=<groupname>-condo
To be pedantic, you can also select the priority QOS with:
$ sbatch --qos=pri-<username> ...
although this is unnecessary, since it is the default QOS for all Premium accounts.
A job array is a collection of jobs that all run the same program, but on different values of a parameter. It is very useful for running parameter sweeps, since you don't have to write a separate batch script for each parameter setting.
To use a job array, add the option:
#SBATCH --array=<range>
in your batch script. The range can be a comma-separated list of integers, along with ranges separated by a dash, for example 1,2,5-10. A job will be submitted for each value in the range.
The values in the range will be substituted for the variable $SLURM_ARRAY_TASK_ID in the remainder of the script. Here is an example of a script for running a serial MATLAB function on 16 different parameters by submitting 16 different jobs as an array:
#!/bin/bash
#SBATCH -J MATLAB
#SBATCH -t 1:00:00
#SBATCH --array=1-16
# Use '%A' for array-job ID, '%J' for job ID and '%a' for task ID
#SBATCH -e arrayjob-%a.err
#SBATCH -o arrayjob-%a.out

echo "Starting job $SLURM_ARRAY_TASK_ID on $HOSTNAME"
matlab -r "MyMatlabFunction($SLURM_ARRAY_TASK_ID); quit;"
You can then submit all the jobs with a single sbatch command:
$ sbatch <jobscript>
For more info: https://slurm.schedmd.com/job_array.html
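To see how the task ID drives a parameter sweep, you can mimic the substitution locally in plain bash; this is only a sketch (SLURM itself sets SLURM_ARRAY_TASK_ID for each task, and the parameter list here is made up):

```shell
#!/bin/bash
# SLURM sets SLURM_ARRAY_TASK_ID for each array task; here we set it by
# hand to preview which parameter a given task would receive.
params=(0.1 0.2 0.5 1.0)
SLURM_ARRAY_TASK_ID=3
param=${params[$((SLURM_ARRAY_TASK_ID - 1))]}
echo "task $SLURM_ARRAY_TASK_ID uses parameter $param"
```

Running this prints `task 3 uses parameter 0.5`, matching how task 3 of an array job would pick the third parameter.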
How is a job identified?
By a unique JobID, e.g. 13180139.
Which of my jobs are running/pending?
Use the command myq.
How do I check the progress of my running job?
You can look at the output file. The default output file is slurm-%j.out, where %j is the JobID. If you specified an output file with #SBATCH -o output_filename and/or an error file with #SBATCH -e error_filename, you can check these files for any output from your job.
You can view the contents of a text file using the program less, e.g. less slurm-<JobID>.out. Use the spacebar to move down the file, b to move back up the file, and q to quit.
My job is not running how I intended it to. How do I cancel the job?
Use scancel <JobID>, where <JobID> is the job allocation number, e.g. 13180139.
How do I save a copy of an interactive session?
You can use interact -o outfile to save a copy of the session's output to "outfile".
I've submitted a bunch of jobs. How do I tell which one is which?
myq will list the running and pending jobs with their JobID and the name of the job. The name of the job is set in the batch script with
#SBATCH -J jobname.
For jobs that are in the queue (running or pending), you can use the command scontrol show job <JobID>, where <JobID> is the job allocation number, e.g. 13180139, to get more detail about what was submitted.
How do I ask for a haswell node?
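Feature constraints (described in the options table above) handle this; assuming 'haswell' is one of the CCV-defined node features on Oscar, a sketch would be:

```shell
# In a batch script (assuming 'haswell' is a defined node feature):
#SBATCH -C haswell

# Or for an interactive session, using interact's feature-list option:
$ interact -f haswell
```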
What are the nodes with names starting with "smp", e.g. "smp013"?
SMP stands for symmetric multiprocessing. These nodes are intended for jobs that use a large number of CPUs on the same node for shared-memory parallelism. For serial work, however, they can be much slower, because their architecture is older.
How do I avoid running on the SMP nodes?
The SMP nodes are all AMD nodes; all the others are Intel. Hence, you can avoid the SMP nodes by requesting only Intel nodes:
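A feature constraint can express this; assuming 'intel' is a defined node feature on Oscar, add the following to your batch script:

```shell
#SBATCH -C intel
```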
Why won't my job start?
When your job is pending (PD) in the queue, SLURM will display a reason why your job is pending. The table below shows some common reasons for which jobs are kept pending.
| Reason | What this means |
|---|---|
| (None) | You may see this for a short time when you first submit a job. |
| (Resources) | There are not enough free resources to fulfill your request. |
| (QOSGrpCpuLimit) | All your condo cores are currently in use. |
| (JobHeldUser) | You have put a hold on the job. The job will not run until you lift the hold. |
| (Priority) | Jobs with higher priority are using the resources. |
| (ReqNodeNotAvail) | The resources you have requested are not available. Note this normally means you have requested something impossible, e.g. 100 cores on 1 node, or a 24-core sandy bridge node. Double-check your batch script for any errors. Your job will never run if you are requesting something that does not exist on Oscar. |
| (PartitionNodeLimit) | You have asked for more nodes than exist in the partition. For example, if you make a typo and specify -N (nodes) when you meant -n (tasks), you may have asked for more than 64 nodes. Your job will never run. Double-check your batch script. |