Oscar User Manual (RedHat7)

Welcome to CCV's user manual!

This manual is primarily a guide for using Oscar, a compute cluster maintained by CCV for use by Brown researchers.

We recommend that all new users read through the "Getting Started" page.

Conventions

We use angle brackets to denote command-line options that you should replace with an appropriate value. For example, the placeholder <user> should be replaced with your own username. The $ sign at the beginning of commands represents the command prompt; do not copy it when pasting commands into your shell.
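
For example, if the manual shows

$ ssh <user>@ssh.ccv.brown.edu

and your username is jsmith (a made-up name used here only for illustration), you would type

ssh jsmith@ssh.ccv.brown.edu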

Getting Started

This guide assumes you have an Oscar account. To request an account, see Create an Account.

OSCAR

Oscar is the shared compute cluster operated by CCV.

Oscar runs the Linux RedHat7 operating system. General Linux documentation is available from The Linux Documentation Project. We recommend you read up on basic Linux commands before using Oscar.

Oscar has two login nodes and several hundred compute nodes. When you log in through SSH, you are first placed on one of the login nodes, which are shared among several users at a time. You can use the login nodes to compile your code, manage files, and launch jobs on the compute nodes. Running computationally or memory-intensive programs on a login node slows down the system for all users, and any process taking up too much CPU or memory on a login node will be killed. Please do not run Matlab on the login nodes.

What username and password should I be using?

  • If you are at Brown and have requested a regular CCV account, your Oscar login is authenticated with your Brown credentials, i.e. the same username and password you use to log in to any Brown service such as Canvas. We have seen login problems with Brown credentials for some users, so accounts moved to the Red Hat 7 system after September 1st 2018 can also log in to Red Hat 7 with their CCV password.

  • If it is a temporary guest account (e.g. as part of a class), you should have been provided with a username of the format "guestxxx" along with a password.

  • If you are an external user, you will have to get a sponsored ID at Brown through the department with which you are associated, before requesting an account on Oscar. Once you have the sponsored ID at Brown, you can request an account on Oscar and use your Brown username and password to login.

Connecting to Oscar for the first time

As of 9am on September 24th 2018, the Red Hat 7 system is the default login from ssh.ccv.brown.edu.

To log in to Oscar you need Secure Shell (SSH) on your computer. Mac and Linux machines normally have SSH available. To log in to Oscar, open a terminal and type

ssh <username>@ssh.ccv.brown.edu

Windows users need to install an SSH client. We recommend PuTTY, a free SSH client for Windows. In PuTTY, use <username>@ssh.ccv.brown.edu or <username>@ssh4.ccv.brown.edu for the Host Name.

The first time you connect to Oscar you will see a message like:

The authenticity of host 'ssh.ccv.brown.edu (138.16.172.8)' can't be established.
RSA key fingerprint is SHA256:Nt***************vL3cH7A.
Are you sure you want to continue connecting (yes/no)? 

You can type yes. You will then be prompted for your password. Note that nothing will show up on the screen when you type your password; just type it and press Enter. You will now be in your home directory on Oscar. In your terminal you will see a prompt like this:

[mhamilton@login004 ~]$ 

Congratulations, you are now on one of the Oscar login nodes.

Note: Please do not run computations or simulations on the login nodes, because they are shared with other users. You can use the login nodes to compile your code, manage files, and launch jobs on the compute nodes.

Changing Passwords

This section is only relevant for guest accounts, as regular accounts simply use their Brown password.

To change your Oscar login password, use the command:

$ passwd

You will be asked to enter your old password, then your new password twice.

Password reset rules:

  • minimum length: 8 characters
  • should have characters from all 4 classes: upper-case letters, lower-case letters, numbers and special characters
  • a character cannot appear more than twice in a row
  • cannot have more than 3 upper-case, lower-case, or number characters in a row
  • at least 3 characters should be different from the previous password
  • cannot be the same as username
  • should not include any of the words in the user's "full name"

File system

Users on Oscar have three places to store files.

  • home
  • scratch
  • data

Note guest and class accounts may not have a data directory. Users who are members of more than one research group may have access to multiple data directories.

To see how much space you have, use the command myquota. Below is example output:

                   Block Limits                              |           File Limits              
Type    Filesystem           Used    Quota   HLIMIT    Grace |    Files    Quota   HLIMIT    Grace
-------------------------------------------------------------|--------------------------------------
USR     home               8.401G      10G      20G        - |    61832   524288  1048576        -
USR     scratch              332G     512G      12T        - |    14523   323539  4194304        -
FILESET data+apollo        11.05T      20T      24T        - |   459764  4194304  8388608        -

A good practice is to configure your application to read any initial input data from ~/data and write all output into ~/scratch. Then, when the application has finished, move or copy the data you would like to save from ~/scratch to ~/data. For more information on which directories are backed up and best practices for reading/writing files, see Managing Files. You can go over your quota up to the hard limit for a grace period (14 days). This grace period gives you time to manage your files. When the grace period expires you will be unable to write any files until you are back under quota.


Software modules

CCV uses the PyModules package for managing the software environment on OSCAR. To see the software available on Oscar use the command module avail. The command module list shows what modules you have loaded. Below is an example of checking which versions of the module 'workshop' are available and loading a given version.

[mhamilton@login001 ~]$ module avail workshop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ name: workshop*/* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
workshop/1.0  workshop/2.0  
[mhamilton@login001 ~]$ module load workshop/2.0
module: loading 'workshop/2.0'
[mhamilton@login001 ~]$ 

For a list of all PyModule commands see Software Modules. If you have a request for software to be installed on Oscar, email support@ccv.brown.edu.


Using a Desktop on Oscar

You can connect remotely to a graphical desktop environment on Oscar using CCV's VNC client. The CCV VNC client integrates with the scheduling system on Oscar to create dedicated, persistent VNC sessions that are tied to a single user.

Using VNC, you can run graphical user interface (GUI) applications like Matlab, Mathematica, etc. while having access to Oscar's compute power and file system.

For download and installation instructions, click here.


Running Jobs

You are on Oscar's login nodes when you log in through SSH. You should not (and would not want to) run your programs on these nodes as these are shared by all active users to perform tasks like managing files and compiling programs.

With so many active users, a shared cluster has to use a "job scheduler" to assign compute resources to users for running programs. When you submit a job (a set of commands) to the scheduler along with the resources you need, it puts your job in a queue. The job is run when the required resources (cores, memory, etc.) become available. Note that since Oscar is a shared resource, you must be prepared to wait for your job to start running; it cannot be expected to start straight away.

Oscar uses the SLURM job scheduler. Batch jobs are the preferred mode of running programs, where all commands are placed in a "batch script" along with the required resources (number of cores, wall-time, etc.). However, there is also a way to run programs interactively.

For information on how to submit jobs on Oscar, see Running Jobs. There is also extensive documentation on the web on using SLURM. For instance, here is a quick start guide.


Where to get help

Managing Files

CCV offers a high-performance storage system for research data called RData, which is accessible as the /gpfs/data file system on all CCV systems.

You can transfer files between Department File Servers (Isilon) and Oscar with smbclient. For instructions, see Copying Files from Department File Servers.

Note: RData is not designed to store confidential data (information about an individual or entity). If you have confidential data that needs to be stored please contact support@ccv.brown.edu.


File systems

CCV uses IBM's General Parallel File System (GPFS) for users' home directories, data storage, scratch/temporary space, and runtime libraries and executables. A separate GPFS file system exists for each of these uses, in order to provide tuned performance. These file systems are mounted as:

~ → /gpfs/home/<user>
Your home directory:
  • optimized for many small files (<1MB)
  • nightly backups (30 days)
  • 10GB quota

~/data → /gpfs/data/<group>
Your data directory:
  • optimized for reading large files (>1MB)
  • nightly backups (30 days)
  • quota is by group (usually >=256GB)

~/scratch → /gpfs/scratch/<user>
Your scratch directory:
  • optimized for reading/writing large files (>1MB)
  • NO BACKUPS
  • purging: files older than 30 days may be deleted
  • 512GB quota: contact us to increase on a temporary basis

A good practice is to configure your application to read any initial input data from ~/data and write all output into ~/scratch. Then, when the application has finished, move or copy data you would like to save from ~/scratch to ~/data.
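
A minimal sketch of this pattern in a batch script (the program name and file names below are placeholders, not real files on Oscar):

#!/bin/bash
#SBATCH -t 1:00:00

# Read input from the backed-up data directory,
# write output to the large (but not backed-up) scratch directory
./my_program --input ~/data/input.dat --output ~/scratch/results.dat

# Keep the results you need by copying them back to ~/data
cp ~/scratch/results.dat ~/data/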

Note: class or temporary accounts may not have a ~/data directory!

To see how much space you have on Oscar, use the command myquota. Below is example output:

                   Block Limits                              |           File Limits              
Type    Filesystem           Used    Quota   HLIMIT    Grace |    Files    Quota   HLIMIT    Grace
-------------------------------------------------------------|--------------------------------------
USR     home               8.401G      10G      20G        - |    61832   524288  1048576        -
USR     scratch              332G     512G      12T        - |    14523   323539  4194304        -
FILESET data+apollo        11.05T      20T      24T        - |   459764  4194304  8388608        -

You can go over your quota up to the hard limit for a grace period (14 days). This grace period gives you time to manage your files. When the grace period expires you will be unable to write any files until you are back under quota.


File transfer

To transfer files from your computer to Oscar, you can use:

  1. command line functions like scp and rsync, or
  2. GUI software

Use the transfer nodes for file transfers to/from Oscar:

transfer3.ccv.brown.edu
transfer4.ccv.brown.edu

If you have access to a terminal, as on a Mac or Linux computer, you can conveniently use scp to transfer files. For example, to copy a file from your computer to Oscar:

 scp /path/to/source/file <username>@transfer3.ccv.brown.edu:/path/to/destination/file

To copy a file from Oscar to your computer:

 scp <username>@transfer3.ccv.brown.edu:/path/to/source/file /path/to/destination/file
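
rsync, mentioned above, is convenient for copying whole directories and for resuming interrupted transfers; a minimal sketch, with placeholder paths:

 rsync -av /path/to/source/directory <username>@transfer3.ccv.brown.edu:/path/to/destination/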

On Windows, if you have PuTTY installed, you can use its pscp utility from the command prompt.

There are also GUI programs for transferring files using the SCP or SFTP protocol, such as WinSCP for Windows and Fugu or Cyberduck for Mac. FileZilla is another GUI program for FTP which is available on all platforms.

Globus Online provides a transfer service for moving data between institutions such as Brown and XSEDE facilities. You can transfer files using the Globus web interface or the command line interface.


Restoring files

Nightly snapshots of the file system are available for the trailing seven days.

Home directory snapshot

/gpfs_home/.snapshots/<date>/<username>/<path_to_file> 

Data directory snapshot

/gpfs/.snapshots/<date>/data/<groupname>/<path_to_file> 

Scratch directory snapshot

/gpfs/.snapshots/<date>/scratch/<username>/<path_to_file> 
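
For example, to restore a file from a home-directory snapshot, copy it back from the snapshot path (the date and file path below are placeholders):

cp /gpfs_home/.snapshots/April_03/<username>/projects/notes.txt ~/projects/notes.txt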

Do not use the links in your home directory snapshot to try to retrieve snapshots of data and scratch. The links will always point to the current versions of these files. An easy way to check what a link points to is to use ls -l, e.g.

ls -l /gpfs_home/.snapshots/April_03/ghopper/data
lrwxrwxrwx 1 ghopper navy 22 Mar  1  2016 /gpfs_home/.snapshots/April_03/ghopper/data -> /gpfs/data/navy

If files to be restored were modified or deleted more than 7 days ago (and less than 30 days ago) and were in your HOME or DATA directory, you can contact us to retrieve them from nightly backups by providing the full path. Note that home and data directory backups are kept for the trailing 30 days only.


Best Practices for I/O

Efficient I/O is essential for good performance in data-intensive applications. Often, the file system is a substantial bottleneck on HPC systems, because CPU and memory technology has improved much more drastically in the last few decades than I/O technology.

Parallel I/O libraries such as MPI-IO, HDF5 and netCDF can help parallelize, aggregate and efficiently manage I/O operations. HDF5 and netCDF also have the benefit of using self-describing binary file formats that support complex data models and provide system portability. However, some simple guidelines can be used for almost any type of I/O on Oscar:

  • Try to aggregate small chunks of data into larger reads and writes. For the GPFS file systems, reads and writes in multiples of 512KB provide the highest bandwidth.
  • Avoid using ASCII representations of your data. They will usually require much more space to store, and require conversion to/from binary when reading/writing.
  • Avoid creating directory hierarchies with thousands or millions of files in a directory. This causes a significant overhead in managing file metadata.

While it may seem convenient to use a directory hierarchy for managing large sets of very small files, this causes severe performance problems due to the large amount of file metadata. A better approach might be to implement the data hierarchy inside a single HDF5 file using HDF5's grouping and dataset mechanisms. This single data file would exhibit better I/O performance and would also be more portable than the directory approach.

Software

Many scientific and HPC software packages are already installed on Oscar, and additional packages can be requested by submitting a ticket to support@ccv.brown.edu. If you want a particular version of the software, please mention it in the email along with a link to the web page from which it can be downloaded.

CCV cannot, however, supply funding for the purchase of commercial software. This is normally attributed as a direct cost of research, and should be purchased with research funding. CCV can help in identifying other potential users of the software to potentially share the cost of purchase and maintenance. Several commercial software products that are licensed campus-wide at Brown are available on Oscar, however.

For software that requires a Graphical User Interface (GUI) we recommend using CCV's VNC client rather than X-Forwarding.

All programs are installed under /gpfs/runtime/opt/<software-name>/<version>. Example files and other files can be copied to your home, scratch or data directory if needed.


Software modules

CCV uses the PyModules package for managing the software environment on OSCAR. The advantage of the modules approach is that it allows multiple versions of the same software to be installed at the same time. With the modules approach, you can "load" and "unload" modules to dynamically control your environment.

Module commands

module list Lists all modules that are currently loaded in your software environment.
module avail Lists all available modules on the system. Note that a module can have multiple versions.
module help name Prints additional information about the given software.
module load name Adds a module to your current environment. If you load using just the name of a module, you will get the default version. To load a specific version, load the module using its full name with the version: "module load gcc/6.2"
module unload name Removes a module from your current environment.

Notes

Looking for available modules and versions

Note that the module avail command allows searching modules based on partial names. For example:

 $ module avail bo

will list all available modules whose name starts with "bo".

Output:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ name: bo*/* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
boost/1.49.0        boost/1.63.0        bowtie2/2.3.0
boost/1.62.0-intel  bowtie/1.2.0

This feature can be used for finding what versions of a module are available.

Auto-completion using tab key

Moreover, the module load command supports auto-completion of the module name using the "tab" key. For example, writing "module load bo" on the shell prompt and hitting "tab" key a couple of times will show results similar to that shown above. Similarly, the module unload command also auto completes using the names of modules which are loaded.

Modules loaded at startup

You can also customize the default environment that is loaded when you log in. Simply put the appropriate module commands in the .modules file in your home directory. For instance, if you edit your .modules file to contain

module load matlab

then the default module for Matlab will be available every time you log in.

What modules actually do...

They simply set the relevant environment variables, such as PATH, LD_LIBRARY_PATH and CPATH. For example, PATH contains all the directory paths (colon-separated) that are searched for executable programs. By setting PATH through a module, you can run a program from anywhere in the file system; otherwise, you would have to type the full path to the executable, which is very inconvenient. Similarly, LD_LIBRARY_PATH lists the directories the runtime linker searches for libraries when running a program, and so on. To see the value of an environment variable, use the echo command. For instance, to see what's in PATH:

$ echo $PATH
/gpfs/runtime/opt/perl/5.18.2/bin:/gpfs/runtime/opt/python/2.7.3/bin:/gpfs/runtime/opt/java/7u5/bin:
/gpfs/runtime/opt/intel/2013.1.106/bin:/gpfs/runtime/opt/centos-updates/6.3/bin:/usr/lib64/qt-3.3/bin:
/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/ibutils/bin:/gpfs/runtime/bin

Installing Python Packages

We get a lot of requests to install python packages as environment modules. While we continue doing that, we often find that it is better for users to install python packages locally instead of turning to us. Possible motivations for this include:

  • No need to wait for us to install the package. This is especially true for smaller packages which require minimal effort.
  • Some packages need frequent updating. Installing locally will avoid waiting for us to install it as a module.
  • Using virtual environments can be a cleaner way to install software for a specific workflow.

Following are some ways users can install python packages on Oscar by themselves.


Install locally: pip install --user

The locations where we install software for environment modules are not writable by users. Hence, using a command such as pip install <package-name> will give an error saying you do not have permission to install software in that location. For example:

IOError: [Errno 13] Permission denied: '/gpfs/runtime/opt/python/2.7.3/lib/python2.7/site-packages/ordereddict.py'

The solution is to use the --user option with pip to indicate that the package should be installed "locally":

pip install --user <package>

This will install the package under the following path in your home directory:

~/.local/lib/python<version>/site-packages
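
To confirm where a package was installed, you can ask pip (the package name here is only an example; older pip versions may format the output slightly differently):

$ pip show ordereddict | grep Location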

Install at custom location: pip install --target=

Users have a limit of 10GB for their home directories on Oscar, so you might want to use your data directory instead for installing software. Another motivation is to give your whole research group shared access to the software.

pip can install software to a "custom" location by using the --target option:

pip install --target=</path/to/install/location> <package>

This install location will have to be added to the PYTHONPATH environment variable so that python can find the installed modules. Note: this is not necessary for software installed using the --user option.

export PYTHONPATH=</path/to/install/location>:$PYTHONPATH

The export line can be added at the end of your .bashrc file in your home directory, which updates PYTHONPATH each time you start a shell. Alternatively, you can update PYTHONPATH in your SLURM batch script as required. This can be cleaner than the former method: if you have a lot of python installs at different locations, adding everything to PYTHONPATH can create conflicts and other issues.
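
For example, a batch script could set the path just before running your program; the install location and script name below are placeholders:

#!/bin/bash
#SBATCH -t 1:00:00

# Make packages installed at the custom location visible to this job only
export PYTHONPATH=/gpfs/data/<groupname>/python-libs:$PYTHONPATH

python my_script.py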

A caveat of this method is that pip will install the package (along with its requirements) even if it is already installed under the global install or the default local install location. Hence, this is more of a brute-force method and not the most efficient one.

Example: if your package depends on numpy or scipy, you might want to use the numpy and scipy from our global install, as those have been compiled with MKL support. Using the --target option will reinstall numpy with default optimizations and without MKL support at the specified location.


Using virtualenv (recommended)

Virtual environments are a more efficient and cleaner way to install python packages for a specific workflow. The virtualenv user guide gives a good explanation of the use cases:

"virtualenv creates an environment that has its own installation directories, that doesn’t share libraries with other virtualenv environments (and optionally doesn’t access the globally installed libraries either)."

virtualenv user guide

To create a virtual environment, use the following command where you want to create the corresponding directory:

virtualenv --system-site-packages <env-name>

This will create a directory named <env-name> with sub-directories bin, lib, etc. The --system-site-packages option gives the virtual environment access to packages from the global install, which overcomes the caveat mentioned in the previous section. Skip this option if you want a completely isolated environment.

Run the corresponding "activate" shell script to use the virtual environment:

source <path-to-env>/bin/activate

Now you can access all the corresponding packages. This can, of course, also be done in the batch script in the case of batch jobs. After activating the virtual environment, any new installs done using pip will be placed inside it. Users can have multiple such environments for different workflows.

Use the deactivate command to leave the currently active virtual environment.

Example of using virtualenv with python3

  1. Create the virtual environment with access to globally installed packages
[guest099@login002 ~]$ module load python/3.6.1
module: unloading 'python/2.7.3'
module: loading 'python/3.6.1'
[guest099@login002 ~]$ virtualenv --system-site-packages myenv
Using base prefix '/gpfs/runtime/opt/python/3.6.1'
New python executable in /gpfs_home/guest099/myenv/bin/python3.6
Also creating executable in /gpfs_home/guest099/myenv/bin/python
Installing setuptools, pip, wheel...done.
  2. Activate the environment and check
[guest099@login002 ~]$ source myenv/bin/activate
(myenv) [guest099@login002 ~]$ which python
/gpfs_home/guest099/myenv/bin/python
(myenv) [guest099@login002 ~]$ which python3
/gpfs_home/guest099/myenv/bin/python3
(myenv) [guest099@login002 ~]$ which pip3
/gpfs_home/guest099/myenv/bin/pip3
(myenv) [guest099@login002 ~]$ which pip
/gpfs_home/guest099/myenv/bin/pip
  3. Install required software (note how it does not re-install the "six" package which is a requirement)
(myenv) [guest099@login002 ~]$ pip install nltk
Collecting nltk
  Using cached nltk-3.2.5.tar.gz
Requirement already satisfied: six in /gpfs/runtime/opt/python/3.6.1/lib/python3.6/site-packages (from nltk)
Building wheels for collected packages: nltk
  Running setup.py bdist_wheel for nltk ... done
  Stored in directory: /users/guest099/.cache/pip/wheels/18/9c/1f/276bc3f421614062468cb1c9d695e6086d0c73d67ea363c501
Successfully built nltk
Installing collected packages: nltk
Successfully installed nltk-3.2.5
(myenv) [guest099@login002 ~]$ ls myenv/lib/python3.6/site-packages/
easy_install.py  nltk-3.2.5.dist-info  pip-9.0.1.dist-info  __pycache__  setuptools-38.5.1.dist-info  wheel-0.30.0.dist-info
nltk             pip                   pkg_resources        setuptools   wheel
  4. Run python, using the required packages
(myenv) [guest099@login002 ~]$ python3
Python 3.6.1 (default, Apr  6 2017, 14:28:51)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>>
>>> exit()
(myenv) [guest099@login002 ~]$ deactivate
[guest099@login002 ~]$


Installing from source

Sometimes python software is not packaged by the developers to be installed with pip, or you may want to use a development version that has not been packaged yet. In these cases, the python module can be installed by downloading the source code itself.

Most python packages can be built and installed from source by running the setup.py script that should be included in the downloaded files. By default, it will try to install the package as part of the global install which will fail because of permission errors. You can provide a "prefix path" to work around this issue:

python setup.py install --prefix=</path/to/install/location>

This will create the sub-directories bin, lib, etc. at the location provided above and install the packages there. The environment will have to be set up accordingly to use the package:

export PATH=</path/to/install/location>/bin:$PATH
export PYTHONPATH=</path/to/install/location>/lib/python<version>/site-packages:$PYTHONPATH

Installing R packages

We get a lot of requests to install R packages. While we will continue to do this, users should be aware that they can also install R packages themselves. This documentation shows you how to install R packages locally (without root access) on Oscar.

Installing an R package

First load the R version that you want to use the package with:

module load R/3.4.3_mkl

Start an R session

R

Note that some packages require code to be compiled, so it is best to install R packages on the login nodes.

To install the package 'wordcloud':

> install.packages("wordcloud", repos="http://cran.r-project.org")

You will see a warning:

Warning in install.packages("wordcloud", repos = "http://cran.r-project.org") :
  'lib = "/gpfs/runtime/opt/R/3.4.2/lib64/R/library"' is not writable
Would you like to use a personal library instead?  (y/n) 

Answer y. If you have not installed any R packages before, you will see the following message:

Would you like to create a personal library
~/R/x86_64-pc-linux-gnu-library/3.4
to install packages into?  (y/n) 

Answer y. The package will then be installed. If the install is successful you will see a message like:

** R
** data
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (wordcloud)

If the installation was not successful you will see a message like:

Warning message:
In install.packages("wordcloud", repos = "http://cran.r-project.org") :
  installation of package ‘wordcloud’ had non-zero exit status

There is normally information in the message that gives the reason why the install failed. Look for the word ERROR in the message.

Possible reasons for an installation failing include:

  • Other software is needed to build the R package: for example, the R package rgdal needs gdal, so you have to run module load gdal first (see the example after this list).
  • A directory left over from a previous failed installation needs to be deleted.
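
For example, to install rgdal you would load the gdal module before starting R (the module names follow the examples above; versions on Oscar may differ):

$ module load gdal
$ module load R/3.4.3_mkl
$ R

Then run install.packages("rgdal", repos="http://cran.r-project.org") inside the R session as described above.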

Removing an R package

Start an R session:

R

To remove the 'wordcloud' package:

> remove.packages("wordcloud")

MATLAB

MATLAB is very popular as a scientific computing tool because of its IDE, ease of programming and comprehensive library of high-level functions. It is used extensively on clusters for post-processing of simulation results, analysis of large amounts of experimental data, etc.

Matlab is available as a software module on Oscar. The default version of Matlab is loaded automatically when you log in.

Kindly make sure you do not run Matlab on a login node.


matlab-threaded command

On Oscar, the command matlab is actually a wrapper that sets up MATLAB to run as a single-threaded, command-line program, which is the optimal way to pack multiple Matlab scripts onto the Oscar compute nodes.

To run the actual multi-threaded version with JVM and Display enabled, use:

$ matlab-threaded

Similarly, to run this without the display enabled:

$ matlab-threaded -nodisplay

MATLAB GUI

VNC

The VNC client provided by CCV is the best way to launch GUI applications on Oscar, including Matlab. From the terminal emulator in VNC, first load the module corresponding to the intended version of Matlab. Then use the matlab-threaded command to launch the Matlab GUI. For example,

$ module load matlab/R2016a
$ matlab-threaded

X11 Forwarding

You can also run the MATLAB GUI in an X-forwarded interactive session. This requires installing an X server on your workstation/PC and logging in to Oscar with X forwarding enabled. Use the interact command to get interactive access to a compute node. Again, to launch the GUI you need to use the matlab-threaded command, which enables the display and JVM. You may, however, experience a lag in response from the Matlab GUI in an X-forwarded session. Note that if Matlab does not find the X window system available, it will launch in command-line mode (next section).
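
A minimal sketch of this workflow, assuming the module version shown earlier and an X server already running on your machine:

$ ssh -X <username>@ssh.ccv.brown.edu
$ interact -X -n 1 -t 1:00:00
$ module load matlab/R2016a
$ matlab-threaded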

CIFS

A workaround in some situations may be to use CIFS to mount the Oscar filesystem on your PC and use the Matlab installation on your own computer. For example, if your simulation results reside on Oscar, this can be a quick way to do post-processing on the data instead of moving the data to your computer or using the Matlab GUI on Oscar. Note that users can connect via CIFS only from Brown computers or on Brown WiFi.


Matlab on Interactive Command Line

As mentioned earlier, please do not run Matlab on login nodes. You can request interactive access to compute nodes using the interact command.

Instead of the GUI, Matlab’s interpreter can be launched interactively on command line (text based interface):

$ matlab-threaded -nodisplay

This way, you do not have to worry about launching the display or a sluggish GUI, and the startup time is much shorter. It might take some time to get used to the command-line interface. We recommend the command-line version unless you need tools like the debugger or profiler, which are more convenient in the GUI, or need to see live plots. Ultimately, it is a personal choice.

Notes:

Set the $EDITOR environment variable before launching Matlab to be able to use the edit command, e.g.

$ export EDITOR=nano

nano is a basic command line editor. There are other command line editors like vim and emacs that users can choose.

From the Matlab command line (represented by the >> symbol below), you can directly type the command to run a script or function after changing the directory to where it is located:

>> cd path/to/work/dir
>> myscript

To check version, license info and list all toolboxes available with version:

>> ver

To run a Matlab function myfunc.m directly from the shell:

$ matlab-threaded -nodisplay -r "myfunc(arg1,arg2)"

Batch Jobs

The GUI and command-line interpreter may be suitable only for visualization, debugging or optimization. Batch jobs should be the preferred way of running programs (actual production runs) on a cluster, because production runs typically require large amounts of resources and have long run times, which means long waits in the queue and makes interactive use impractical. Moreover, batch jobs are much more convenient for running many programs simultaneously. Batch scripts are used for submitting jobs to the scheduler (SLURM) on Oscar, and are described in detail here.

Example Batch Script

Here is an example batch script for running a serial Matlab program on an Oscar compute node:

#!/bin/bash

# Request an hour of runtime:
#SBATCH --time=1:00:00

# Default resources are 1 core with 2.8GB of memory.

# Use more memory (4GB):
#SBATCH --mem=4G

# Specify a job name:
#SBATCH -J MyMatlabJob

# Specify an output file
#SBATCH -o MyMatlabJob-%j.out
#SBATCH -e MyMatlabJob-%j.out

# Run a matlab function called 'foo.m' in the same directory as this batch script.
matlab -r "foo(1), exit"

This is also available in your home directory as the file:

~/batch_scripts/matlab-serial.sh

Note the exit command at the end; it is very important to include it either there or in the Matlab function/script itself. If you don't make Matlab exit the interpreter, it will keep waiting for the next command until SLURM cancels the job when the requested walltime runs out. For example, if you requested 4 hours of walltime and your program completes in 1 hour, the SLURM job will still occupy its cores for the full 4 hours, which wastes resources and blocks your other jobs.

If the name of your batch script file is matlab-serial.sh, the batch job can be submitted using the following command:

$ sbatch matlab-serial.sh

Job Arrays

SLURM job arrays can be used to submit multiple jobs using a single batch script. E.g. when a single Matlab script is to be used to run analyses on multiple input files or using different input parameters. An example batch script for submitting a Matlab job array:

#!/bin/bash

# Job Name
#SBATCH -J arrayjob

# Walltime requested
#SBATCH -t 0:10:00

# Provide index values (TASK IDs)
#SBATCH --array=1-4

# Use '%A' for array-job ID, '%J' for job ID and '%a' for task ID
#SBATCH -e arrayjob-%a.err
#SBATCH -o arrayjob-%a.out

# single core
#SBATCH -n 1

# Use the $SLURM_ARRAY_TASK_ID variable to provide different inputs for each job

echo "Running job array number: "$SLURM_ARRAY_TASK_ID

module load matlab/R2016a

matlab-threaded -nodisplay -nojvm -r "foo($SLURM_ARRAY_TASK_ID), exit"

Index values are assigned to each job in the array. The $SLURM_ARRAY_TASK_ID variable represents these values and can be used to provide a different input to each job in the array. Note that this variable can be accessed from Matlab too using the getenv function:

getenv('SLURM_ARRAY_TASK_ID')

The above script can be found in your home directory as the file:

~/batch_scripts/matlab-array.sh

Improving Performance & Memory Management

Matlab programs often suffer from poor performance and from running out of memory. The Mathworks documentation on writing efficient code is a good place to look for best practices.

The first step to speeding up a Matlab application is identifying the part which takes up most of the run time. Matlab's "Profiling" tool can be very helpful in doing that. Further reading is available from Mathworks.


Parallel Programming in Matlab

You can explore GPU computing through Matlab if you think your program can benefit from massively parallel computations.

Finally, parallel computing features like parfor and spmd can be used by launching a pool of workers on a node. Note that the Parallel Computing Toolbox by itself cannot span multiple nodes; hence, requesting more than one node for such a job will result in wasted resources.

Running Jobs

A "job" refers to a program running on the compute nodes of the Oscar cluster. Jobs can be run on Oscar in two different ways:

  • An interactive job allows you to interact with a program by typing input, using a GUI, etc. But if your connection is interrupted, the job will abort. These are best for small, short-running jobs where you need to test out a program, or where you need to use the program's GUI.
  • A batch job allows you to submit a script that tells the cluster how to run your program. Your program can run for long periods of time in the background, so you don't need to be connected to Oscar. The output of your program is continuously written to an output file that you can view both during and after your program runs.

Jobs are scheduled to run on the cluster according to your account priority and the resources you request (e.g. cores, memory and runtime). For batch jobs, these resources are specified in a script referred to as a batch script, which is passed to the scheduler using a command. When you submit a job, it is placed in a queue where it waits until the required compute nodes become available.

NOTE: please do not run CPU-intense or long-running programs directly on the login nodes! The login nodes are shared by many users, and you will interrupt other users' work.

We use the Simple Linux Utility for Resource Management (SLURM) from Lawrence Livermore National Laboratory as the job scheduler on Oscar. With SLURM, jobs that only need part of a node can share the node with other jobs (this is called "job packing"). When your program runs through SLURM, it runs in its own container, similar to a virtual machine, that isolates it from the other jobs running on the same node. By default, this container has 1 core and a portion of the node's memory.

The following sections have more details on how to run interactive and batch jobs through SLURM, and how to request more resources (either more cores or more memory).


Batch jobs

To run a batch job on the Oscar cluster, you first have to write a script that describes what resources you need and how your program will run. Example batch scripts are available in your home directory on Oscar, in the directory:

~/batch_scripts

To submit a batch job to the queue, use the sbatch command:

$ sbatch <jobscript>

A batch script starts by specifying the bash shell as its interpreter, with the line:

#!/bin/bash

Next, a series of lines starting with #SBATCH define the resources you need, for example:

#SBATCH -n 4
#SBATCH -t 1:00:00
#SBATCH --mem=16G

Note that all the #SBATCH instructions must come before the commands you want to run. The above lines request 4 cores (-n), an hour of runtime (-t), and 16GB of memory per node (--mem). By default, a batch job will reserve 1 core and 2.8GB of memory per core.

Alternatively, you could set the resources as command-line options to sbatch:

$ sbatch -n 4 -t 1:00:00 --mem=16G <jobscript>

The command-line options will override the resources specified in the script, so this is a handy way to reuse an existing batch script when you just want to change a few of the resource values.

The sbatch command will return a number, which is your job ID. You can view the output of your job in the file slurm-<jobid>.out in the directory where you ran the sbatch command. For instance, you can view the last 10 lines of output with:

$ tail -10 slurm-<jobid>.out

Alternatively, you can specify the files where you want the standard output and standard error written, using the -o and -e flags.
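
For example (the file names are arbitrary; %j expands to the job ID):

#SBATCH -o myjob-%j.out
#SBATCH -e myjob-%j.err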

Useful sbatch options:

-J Specify the job name that will be displayed when listing the job.
-n Number of tasks (= number of cores, if "--cpus-per-task" or "-c" option is not mentioned).
-c Number of CPUs or cores per task (on the same node).
-N Number of nodes.
-t Runtime, as HH:MM:SS.
--mem= Requested memory per node.
-p Request a specific partition.
-o Filename for standard output from the job.
-e Filename for standard error from the job.
-C Add a feature constraint (a tag that describes a type of node). You can view the available features on Oscar with the nodes command.

--mail-type= Specify the events that you should be notified of by email: BEGIN, END, FAIL, REQUEUE, and ALL.
--mail-user= Email address where you should be notified.

You can read the full list of options at http://slurm.schedmd.com/sbatch.html or with the command:

$ man sbatch

Interactive jobs

To start an interactive session for running serial or threaded programs on an Oscar compute node, simply run the command interact from the login node:

$ interact

By default, this will create an interactive session that reserves 1 core, 4GB of memory, and 30 minutes of runtime.

You can change these default limits with the following command line options:

usage: interact [-n cores] [-t walltime] [-m memory] [-q queue]
                [-o outfile] [-X] [-f featurelist] [-h hostname] [-g ngpus]

Starts an interactive job by wrapping the SLURM 'salloc' and 'srun' commands.

options:
  -n cores        (default: 1)
  -t walltime     as hh:mm:ss (default: 30:00)
  -m memory       as #[k|m|g] (default: 4g)
  -q queue        (default: 'batch')
  -o outfile      save a copy of the session's output to outfile (default: off)
  -X              enable X forwarding (default: no)
  -f featurelist  CCV-defined node features (e.g., 'e5-2600'),
                  combined with '&' and '|' (default: none)
  -h hostname     only run on the specific node 'hostname'
                  (default: none, use any available node)
  -a account     user SLURM accounting account name

For example:

$ interact -n 20 -t 01:00:00 -m 10g

This will request 20 cores, 1 hour of time and 10 GB of memory (per node).


Managing jobs

Canceling a job:

$ scancel <jobid>

Listing running and queued jobs:

The squeue command will list all jobs scheduled in the cluster. We have also written wrappers for squeue on Oscar that you may find more convenient:

myq List only your own jobs.
myq <user> List another user's jobs.
allq List all jobs, but organized by partition, and a summary of the nodes in use in the partition.
allq <partition> List all jobs in a single partition.
myjobinfo Get the time and memory used for your jobs.

Listing completed jobs

The sacct command will list all of your running, queued and completed jobs since midnight of the previous day. To pick an earlier start date, specify it with the -S option:

$ sacct -S 2012-01-01

To find out more information about a specific job, such as its exit status or the amount of runtime or memory it used, specify the -l ("long" format) and -j options with the job ID:

$ sacct -lj <jobid>

The myjobinfo command uses the sacct command to display "Elapsed Time", "Requested Memory" and "Maximum Memory used on any one Node" for your jobs. This can be used to optimize the requested time and memory to have the job started as early as possible. Make sure you request a conservative amount based on how much was used.

$ myjobinfo

Info about jobs for user 'mdave' submitted since 2017-05-19T00:00:00
Use option '-S' for a different date
 or option '-j' for a specific Job ID.

       JobID    JobName              Submit      State    Elapsed     ReqMem     MaxRSS
------------ ---------- ------------------- ---------- ---------- ---------- ----------
1861                ior 2017-05-19T08:31:01  COMPLETED   00:00:09     2800Mc      1744K
1862                ior 2017-05-19T08:31:11  COMPLETED   00:00:54     2800Mc     22908K
1911                ior 2017-05-19T15:02:01  COMPLETED   00:00:06     2800Mc      1748K
1912                ior 2017-05-19T15:02:07  COMPLETED   00:00:21     2800Mc      1744K

'ReqMem' shows the requested memory:
 A 'c' at the end of number represents Memory Per CPU, a 'n' represents Memory Per Node.
'MaxRSS' is the maximum memory used on any one node.
Note that memory specified to sbatch using '--mem' is Per Node.

Partitions

When submitting a job to the Oscar compute cluster, you can choose different partitions depending on the nature of your job. You can specify one of the partitions listed below either in your sbatch command:

$ sbatch -p <partition> <batch_script>

or as an SBATCH option at the top of your batch script:

#SBATCH -p <partition>

Partitions available on Oscar:

batch Default partition with most of the compute nodes: 16-, 20- or 24-core; 55GB to 188GB of memory; all Intel based.
gpu Specialized compute nodes with GPU accelerators.
gpu-debug Specialized compute nodes with GPU accelerators for debugging GPU code, i.e. fast turn-around, but with a short time limit of 30 minutes.
bibs-gpu Specialized compute nodes with GPU accelerators for BIBS use only.
debug Dedicated nodes for fast turn-around, but with a short time limit of 30 minutes and CPU limit of 16.
smp AMD architecture nodes with large number of CPUs and memory primarily meant for shared memory parallel programs.

You can view a list of all the Oscar compute nodes broken down by type with the command:

$ nodes

Job priority

The scheduler considers many factors when determining the run order of jobs in the queue. These include the:

  • size of the job;
  • requested walltime;
  • amount of resources you have used recently (e.g., "fair sharing");
  • priority of your account type.

The account priority has three tiers:

  • Low (Exploratory)
  • Medium (Premium)
  • High (Condo)

Both Exploratory and Premium accounts can be affiliated with a Condo, and the Condo priority only applies to a portion of the cluster equivalent in size to the Condo. Once the Condo affiliates have requested more nodes than available in the Condo, their priority drops down to either medium or low, depending on whether they are a Premium or Exploratory account.

Backfilling: When a large or long-running job is near the top of the queue, the scheduler begins reserving nodes for it. If you queue a smaller job with a walltime shorter than the time required for the scheduler to finish reserving resources, the scheduler can backfill the reserved resources with your job to better utilize the system. Here is an example:

  • User1 has a 64-node job with a 24 hour walltime waiting at the top of the queue.
  • The scheduler can't reserve all 64 nodes until other currently running jobs finish, but it has already reserved 38 nodes and will need another 10 hours to reserve the final 26 nodes.
  • User2 submits a 16-node job with an 8 hour walltime, which is backfilled into the pool of 38 reserved nodes and runs immediately.

By requesting a shorter walltime for your job, you increase its chances of being backfilled and running sooner. In general, the more accurately you can predict the walltime, the sooner your job will run and the better the system will be utilized for all users.

Condo and Premium account (priority) Jobs

Condo jobs

Note: we do not provide users condo access by default even if their group/PI has a condo on the system. You will have to explicitly request condo access, and we will ask the PI for approval.

To use your condo account to submit jobs, include the following line in your batch script:

#SBATCH --account=<condo-name>

You can also provide this option on the command line while submitting the job using sbatch:

$ sbatch --account=<condo-name> <batch-script>

Similarly, you can change the account while asking for interactive access too:

interact -a <condo-name> ... <other_options>

Condo account names are typically <groupname>-condo, and you can view a full list with the condos command on Oscar.

Premium Account (priority) jobs

If you have a premium account, that should be your default QOS for submitting jobs. You can check this by running interact and then looking at the QOS used for the interactive job with the command myq; it should be "pri-<username>".

If you are interested in seeing all your accounts and associations, you can use the following command:

sacctmgr -p list assoc where user=<username>

Job Arrays

A job array is a collection of jobs that all run the same program, but on different values of a parameter. It is very useful for running parameter sweeps, since you don't have to write a separate batch script for each parameter setting.

To use a job array, add the option:

#SBATCH --array=<range>

in your batch script. The range can be a comma separated list of integers, along with ranges separated by a dash. For example:

1-20
1-10,12,14,16-20

A job will be submitted for each value in the range. The values in the range will be substituted for the variable $SLURM_ARRAY_TASK_ID in the remainder of the script. Here is an example of a script for running a serial Matlab script on 16 different parameters by submitting 16 different jobs as an array:

#!/bin/bash
#SBATCH -J MATLAB
#SBATCH -t 1:00:00
#SBATCH --array=1-16

# Use '%A' for array-job ID, '%J' for job ID and '%a' for task ID
#SBATCH -e arrayjob-%a.err
#SBATCH -o arrayjob-%a.out

echo "Starting job $SLURM_ARRAY_TASK_ID on $HOSTNAME"
matlab -r "MyMatlabFunction($SLURM_ARRAY_TASK_ID); quit;"

You can then submit the multiple jobs using a single sbatch command:

$ sbatch <jobscript>

For more info: https://slurm.schedmd.com/job_array.html

MPI Programs

There are many resources on the web for getting started with MPI.

MPI is a standard that dictates the semantics and features of "message passing". There are different implementations of MPI. Those installed on Oscar are

  • MVAPICH2
  • OpenMPI

We recommend using MVAPICH2 as it is integrated with the SLURM scheduler and optimized for the Infiniband network.


MPI modules on Oscar

The MPI module is called "mpi". The different implementations (mvapich2, openmpi, different base compilers) are in the form of versions of the module "mpi". This is to make sure that no two implementations can be loaded simultaneously, which is a common source of errors and confusion.

$ module avail mpi
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ name: mpi*/* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mpi/cave_mvapich2_2.3b_gcc    mpi/openmpi_1.10.7_gcc
mpi/cave_mvapich2_2.3b_intel  mpi/openmpi_1.8.3_gcc
mpi/mvapich2-2.3a_gcc         mpi/openmpi_2.0.3_gcc
mpi/mvapich2-2.3a_intel       mpi/openmpi_2.0.3_intel
mpi/mvapich2-2.3a_pgi         mpi/openmpi_2.0.3_pgi
mpi/mvapich2-2.3b_gcc

You can just use "module load mpi" to load the default version which is mpi/mvapich2-2.3a_gcc. This is the recommended version.

The module naming format is

mpi/<implementation>-<version>_<base compiler>
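
For example, to load a specific implementation and compiler combination from the listing above instead of the default:

$ module load mpi/openmpi_2.0.3_gcc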

srun instead of mpirun

Use srun --mpi=pmi2 to run MPI programs. All MPI implementations listed above, except openmpi_1.8.3_gcc and openmpi_1.10.7_gcc, are built with SLURM support. Hence, programs need to be run using SLURM's srun command, unless you are using one of those legacy versions.

The --mpi=pmi2 flag is also required to match the configuration with which MPI is installed on Oscar.


Running MPI programs - Interactive

To run an MPI program interactively, first create an allocation from the login nodes using the salloc command:

$ salloc -N <# nodes> -n <# MPI tasks> -p <partition> -t <minutes>

For example, to request 4 cores to run 4 tasks (MPI processes):

$ salloc -n 4 

Once the allocation is fulfilled, you can run MPI programs with the srun command:

$ srun --mpi=pmi2 ./my-mpi-program ...

When you are finished running MPI commands, you can release the allocation by exiting the shell:

$ exit

Also, if you only need to run a single MPI program, you can skip the salloc command and specify the resources in a single srun command:

$ srun -N <# nodes> -n <# MPI tasks> -p <partition> -t <minutes> --mpi=pmi2 ./my-mpi-program

This will create the allocation, run the MPI program, and release the allocation.

Note: It is not possible to run MPI programs on compute nodes by using the interact command.

salloc documentation: https://slurm.schedmd.com/salloc.html

srun documentation: https://slurm.schedmd.com/srun.html


Running MPI programs - Batch Jobs

Here is a sample batch script to run an MPI program:

#!/bin/bash

# Request an hour of runtime:
#SBATCH --time=1:00:00

# Use 2 nodes with 8 tasks each, for 16 MPI tasks:
#SBATCH --nodes=2
#SBATCH --tasks-per-node=8

# Specify a job name:
#SBATCH -J MyMPIJob

# Specify an output file
#SBATCH -o MyMPIJob-%j.out
#SBATCH -e MyMPIJob-%j.err

# Load required modules
module load mpi

srun --mpi=pmi2 MyMPIProgram

Hybrid MPI+OpenMP

If your program has multi-threading capability using OpenMP, you can attach several cores to a single MPI task using the --cpus-per-task or -c option with sbatch or salloc. The environment variable OMP_NUM_THREADS governs the number of threads that will be used.

#!/bin/bash

# Use 2 nodes with 2 tasks each (4 MPI tasks)
# And allocate 4 CPUs to each task for multi-threading
#SBATCH --nodes=2
#SBATCH --tasks-per-node=2
#SBATCH --cpus-per-task=4

# Load required modules
module load mpi

export OMP_NUM_THREADS=4
srun --mpi=pmi2 ./MyMPIProgram

The above batch script will launch 4 MPI tasks, 2 on each node, and allocate 4 CPUs to each task (16 cores in total for the job). Setting OMP_NUM_THREADS governs the number of threads to be used, although this can also be set inside the program.


Performance Scaling

The maximum theoretical speedup that can be achieved by a parallel program is governed by the proportion of the program that is sequential (Amdahl's law). Moreover, as the number of MPI processes increases, the communication overhead increases, i.e. the amount of time spent sending and receiving messages among the processes grows. Beyond a certain number of processes, this increase starts to dominate the decrease in computational run time, and the overall program slows down instead of speeding up as more processes are added.

Hence, MPI programs (or any parallel programs) do not keep running faster as the number of processes is increased beyond a certain point.

If you intend to carry out a lot of runs of a program, the correct approach is to find the optimum number of processes, i.e. the number that gives the lowest (or a reasonably low) run time. Start with a small number of processes, like 2 or 4, and first verify the correctness of the results by comparing them with sequential runs. Then increase the number of processes gradually to find the optimum number, beyond which the run time flattens out or starts increasing.

GPU Computing

To gain access to GPU nodes, please submit a support ticket and ask to be added to the 'gpu' group.


Interactive Use

To start an interactive session on a GPU node, use the interact command and specify the gpu partition. You also need to specify the requested number of GPUs using the -g option:

$ interact -q gpu -g 1

GPU Batch Job

For production runs with access to GPU nodes, please submit a batch job to the gpu partition. E.g. for using 1 GPU, you can include the following line in your batch script:

#SBATCH -p gpu --gres=gpu:1

You can view the status of the gpu partition with:

$ allq gpu

A sample batch script for a CUDA program is available in your home directory as the file:

~/batch_scripts/cuda.sh
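
A minimal sketch of what such a script might look like (the executable name is a placeholder, and any modules your program needs may have different names on Oscar):

#!/bin/bash

# Request one GPU on the gpu partition for one hour
#SBATCH -p gpu --gres=gpu:1
#SBATCH -t 1:00:00
#SBATCH -J MyCUDAJob
#SBATCH -o MyCUDAJob-%j.out

# Load any modules your program needs (module names may differ)
module load cuda

./my-cuda-program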

FAQ - Running Jobs

  • How is a job identified?
    By a unique JobID, e.g. 13180139

  • Which of my jobs are running/pending?
    Use the command myq

  • How do I check the progress of my running job?
    You can look at the output file. The default output file is slurm-%j.out, where %j is the JobID. If you specified an output file using #SBATCH -o output_filename and/or an error file using #SBATCH -e error_filename, you can check these files for any output from your job. You can view the contents of a text file using the program less, e.g.

    less output_filename
    

    Use the spacebar to move down the file, b to move back up the file, and q to quit.

  • My job is not running how I intended it to. How do I cancel the job?
    scancel <JobID> where <JobID> is the job allocation number, e.g. 13180139

  • How do I save a copy of an interactive session?
    You can use interact -o outfile to save a copy of the session's output to "outfile"

  • I've submitted a bunch of jobs. How do I tell which one is which?
    myq will list the running and pending jobs with their JobID and the name of the job. The name of the job is set in the batch script with #SBATCH -J jobname. For jobs that are in the queue (running or pending) you can use the command
    scontrol show job <JobID> where <JobID> is the job allocation number, e.g. 13180139, to give you more detail about what was submitted.

  • How do I ask for a haswell node?

    Use the --constraint (or -C) option:

    #SBATCH --constraint=haswell
    

    You can use the --constraint option to restrict your allocation according to other features too. The nodes command provides a list of "features" for each type of node.

  • What are the nodes in the "smp" partition?

    SMP stands for symmetric multiprocessing. These nodes are meant for jobs that use a large number of CPUs on the same node for shared-memory parallelism. However, for sequential work they can be much slower because their architecture is quite old.

  • Why won't my job start?
    When your job is pending (PD) in the queue, SLURM will display a reason why your job is pending. The table below shows some common reasons for which jobs are kept pending.

Reason What this means
(None) You may see this for a short time when you first submit a job
(Resources) There are not enough free resources to fulfill your request
(QOSGrpCpuLimit) All your condo cores are currently in use
(JobHeldUser) You have put a hold on the job. The job will not run until you lift the hold.
(Priority) Jobs with higher priority are using the resources
(ReqNodeNotAvail) The resources you have requested are not available. Note this normally means you have requested something impossible, e.g. 100 cores on 1 node, or a 24 core sandy bridge node. Double check your batch script for any errors. Your job will never run if you are requesting something that does not exist on Oscar.
(PartitionNodeLimit) You have asked for more nodes than exist in the partition. For example if you make a typo and have specified -N (nodes) but meant -n (tasks) and have asked for more than 64 nodes. Your job will never run. Double check your batch script.