Managing Files

CCV offers a high-performance storage system for research data called RData, which is accessible as the /gpfs/data file system on all CCV systems. It can also be mounted from any computer on Brown’s campus network using CIFS.

You can transfer files to Oscar and RData through a CIFS mount, or by using command-line tools like scp or rsync.
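As a rough sketch of the CIFS route from a Linux machine, mounting looks like the following; the server and share names are placeholders, so check the RData documentation or contact support@ccv.brown.edu for the actual share path:

 sudo mkdir -p /mnt/rdata
 sudo mount -t cifs //<cifs-server>/<share> /mnt/rdata -o username=<brown-username>

On a Mac, the equivalent is Finder's "Connect to Server" with an smb://<cifs-server>/<share> address.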

There are also GUI programs for transferring files using the scp protocol, like WinSCP for Windows and Fugu or Cyberduck for Mac.

Note: RData is not designed to store confidential data (information about an individual or entity). If you have confidential data that needs to be stored, please contact support@ccv.brown.edu.


File systems

CCV uses IBM's General Parallel File System (GPFS) for users' home directories, data storage, scratch/temporary space, and runtime libraries and executables. A separate GPFS file system exists for each of these uses, in order to provide tuned performance. These file systems are mounted as:

~ → /gpfs/home/<user>
Your home directory:
  • optimized for many small files (<1MB)
  • nightly backups (30 days)
  • 10GB quota

~/data → /gpfs/data/<group>
Your data directory:
  • optimized for reading large files (>1MB)
  • nightly backups (30 days)
  • quota is by group (usually >=256GB)

~/scratch → /gpfs/scratch/<user>
Your scratch directory:
  • optimized for reading/writing large files (>1MB)
  • NO BACKUPS
  • purging: files older than 30 days may be deleted
  • 512GB quota: contact us to increase on a temporary basis

A good practice is to configure your application to read any initial input data from ~/data and write all output into ~/scratch. Then, when the application has finished, move or copy data you would like to save from ~/scratch to ~/data.
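As a minimal sketch of this pattern in a batch script, where my_sim and the file names are placeholders for your own application and data:

 # read inputs from ~/data, write all output to ~/scratch
 cd ~/scratch
 ~/my_sim --input ~/data/input.dat --output ~/scratch/results.out

 # once the run has finished, copy anything worth keeping back to ~/data
 cp ~/scratch/results.out ~/data/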

Note: class or temporary accounts may not have a ~/data directory!

To see how much space you have on Oscar, use the myquota command. Below is an example of its output:

                   Block Limits                              |           File Limits              
Type    Filesystem           Used    Quota   HLIMIT    Grace |    Files    Quota   HLIMIT    Grace
-------------------------------------------------------------|--------------------------------------
USR     home               8.401G      10G      20G        - |    61832   524288  1048576        -
USR     scratch              332G     512G      12T        - |    14523   323539  4194304        -
FILESET data+apollo        11.05T      20T      24T        - |   459764  4194304  8388608        -

You can go over your quota, up to the hard limit, for a grace period of 14 days. This grace period gives you time to manage your files. When the grace period expires, you will be unable to write any files until you are back under quota.


File transfer

To transfer files from your computer to Oscar, you can use:

  1. command line functions like scp and rsync, or
  2. GUI software

If you need to transfer large amounts of data, you can use the transfer nodes on Oscar, which will speed up the process. To do so, use transfer.ccv.brown.edu instead of ssh.ccv.brown.edu as the host address.

If you have access to a terminal, as on a Mac or Linux computer, you can conveniently use scp to transfer files. For example, to copy a file from your computer to Oscar:

 scp /path/to/source/file <username>@transfer.ccv.brown.edu:/path/to/destination/file

To copy a file from Oscar to your computer:

 scp <username>@transfer.ccv.brown.edu:/path/to/source/file /path/to/destination/file
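rsync is often a better choice for large or repeated transfers, because it only sends files that have changed, so an interrupted transfer can simply be rerun. A typical invocation, with placeholder paths, might look like:

 rsync -avz --progress /path/to/source/directory/ <username>@transfer.ccv.brown.edu:/path/to/destination/directory/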

On Windows, if you have PuTTY installed, you can use its pscp utility from the command line.
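pscp uses the same syntax as scp; for example, from a Windows command prompt (the local path is only an illustration):

 pscp C:\path\to\file <username>@transfer.ccv.brown.edu:/path/to/destination/file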

There are also GUI programs for transferring files using the scp or sftp protocol, like WinSCP for Windows and Fugu or Cyberduck for Mac. FileZilla is another GUI program for FTP that is available on all platforms.

Globus Online provides a transfer service for moving data between institutions such as Brown and XSEDE facilities. You can transfer files using the Globus web interface or the command line interface.
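With the Globus command line interface installed and logged in, a transfer might look roughly like the sketch below; the endpoint UUIDs and paths are placeholders that you would look up in the Globus web interface:

 globus login
 globus transfer <source-endpoint-uuid>:/path/to/source/file <destination-endpoint-uuid>:/path/to/destination/file --label "my-transfer"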


Restoring files

Nightly snapshots of the file system are available for the trailing seven days.

Home directory snapshot

/gpfs_home/.snapshots/<date>/<username>/<path_to_file> 

Data directory snapshot

/gpfs/.snapshots/<date>/data/<groupname>/<path_to_file> 

Scratch directory snapshot

/gpfs/.snapshots/<date>/scratch/<username>/<path_to_file> 
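To restore a file yourself, copy it out of the appropriate snapshot back into place. For example, for a file in your home directory (the date, username, and path are placeholders):

 cp /gpfs_home/.snapshots/<date>/<username>/path/to/file ~/path/to/file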

Do not use the links in your home directory snapshot to try to retrieve snapshots of your data and scratch directories. The links will always point to the current versions of these files. An easy way to check what a link is pointing to is to use ls -l,

e.g.

ls -l /gpfs_home/.snapshots/April_03/ghopper/data 
lrwxrwxrwx 1 ghopper navy 22 Mar  1  2016 /gpfs_home/.snapshots/April_03/ghopper/data -> /gpfs/data/navy 

If the files to be restored were modified or deleted more than 7 days ago (and less than 30 days ago) and were in your home or data directory, you can contact us to retrieve them from the nightly backups by providing the full path. Note that home and data directory backups are kept for the trailing 30 days only.


Best Practices for I/O

Efficient I/O is essential for good performance in data-intensive applications. The file system is often a substantial bottleneck on HPC systems, because CPU and memory technology has improved much more rapidly over the past few decades than I/O technology.

Parallel I/O libraries such as MPI-IO, HDF5 and netCDF can help parallelize, aggregate and efficiently manage I/O operations. HDF5 and netCDF also have the benefit of using self-describing binary file formats that support complex data models and provide system portability. However, some simple guidelines can be used for almost any type of I/O on Oscar:

  • Try to aggregate small chunks of data into larger reads and writes. For the GPFS file systems, reads and writes in multiples of 512KB provide the highest bandwidth (see the example after this list).
  • Avoid using ASCII representations of your data. They will usually require much more space to store, and require conversion to/from binary when reading/writing.
  • Avoid creating directory hierarchies with thousands or millions of files in a directory. This causes a significant overhead in managing file metadata.
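As a simple illustration of the first point, even a plain file copy benefits from using a block size that is a multiple of 512KB; the paths below are placeholders:

 dd if=$HOME/data/large_input.dat of=$HOME/scratch/large_input.dat bs=4M

Here bs=4M (8 x 512KB) keeps each read and write in large, GPFS-friendly chunks rather than many small operations.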

While it may seem convenient to use a directory hierarchy for managing large sets of very small files, this causes severe performance problems due to the large amount of file metadata. A better approach might be to implement the data hierarchy inside a single HDF5 file using HDF5's grouping and dataset mechanisms. This single data file would exhibit better I/O performance and would also be more portable than the directory approach.