Oscar has 44 GPU nodes, which are regular compute nodes with two NVIDIA
GPUs (Fermi architecture) added. Each M2050 GPU has 448 CUDA cores and
3 GB of GDDR5 memory. To gain access to these nodes, please submit a support ticket
and ask to be added to the 'gpu' group. Please note that these GPU nodes can only be
used for single-node jobs.
To start an interactive session on a GPU node, use the
interact command and specify the gpu queue.
You also need to specify the requested number of GPUs using the -g option:
$ interact -q gpu -g 1
GPU Batch Job
For production runs with exclusive access to GPU nodes, please submit a
batch job to the
gpu partition. E.g. for using 1 GPU:
$ sbatch -p gpu --gres=gpu:1 <jobscript>
These options can also be specified inside the batch script:
#SBATCH -p gpu --gres=gpu:1
You can view the status of the
gpu partition with:
$ allq gpu
Sample batch script for CUDA program:
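A minimal sketch (the time limit, job name, and executable name here are placeholders):

```
#!/bin/bash
#SBATCH -p gpu --gres=gpu:1
#SBATCH -t 00:10:00
#SBATCH -J cuda-job

module load cuda
./program
```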
Getting started with GPUs
While you can program GPUs directly with CUDA, a
language and runtime library from NVIDIA, this can be daunting for programmers
who do not have experience with C or with the details of computer architecture.
You may find the easiest way to tap the computation power of GPUs is to link
your existing CPU program against numerical libraries that target the GPU:
- CUBLAS is a drop-in replacement for BLAS libraries that runs BLAS routines on the GPU
instead of the CPU.
- CULA is a similar library for LAPACK routines.
- The CUDA Toolkit also provides FFT, sparse matrix, and random number generation
routines that run on the GPU (CUFFT, CUSPARSE, and CURAND).
- MAGMA combines custom GPU kernels, CUBLAS, and a CPU BLAS library to use both the GPU
and CPU simultaneously; it is available in the
'magma' module on Oscar.
- Matlab has a GPUArray
feature, available through the Parallel Computing Toolbox, for creating arrays
on the GPU and operating on them with many built-in Matlab functions. The
toolbox is licensed by CIS and is available to any Matlab session running on
Oscar or workstations on the Brown campus network.
- PyCUDA is an interface to CUDA from
Python. It also has a GPUArray
feature and is available in the
cuda module on Oscar.
Introduction to CUDA
CUDA is an extension of the C
language, as well as a runtime library, to facilitate general-purpose
programming of NVIDIA GPUs. If you already program in C, you will probably find
the syntax of CUDA programs familiar. If you are more comfortable with C++, you
may consider instead using the higher-level Thrust
library, which resembles the Standard Template Library and is included with
the CUDA toolkit.
In either case, you will probably find that because of the differences between
GPU and CPU architectures, there are several new concepts you will encounter
that do not arise when programming serial or threaded programs for CPUs. These
are mainly to do with how CUDA uses threads and how memory is arranged on the
GPU, both described in more detail below.
There are several useful documents from NVIDIA that you will want to consult as you become more proficient with CUDA, and there are also many CUDA tutorials available online.
Threads in CUDA
CUDA uses a data-parallel programming model, which allows you to program at the
level of what operations an individual thread performs on the data that it
owns. This model works best for problems that can be expressed as a few
operations that all threads apply in parallel to an array of data. CUDA allows
you to define a thread-level function, then execute this function by mapping
threads to the elements of your data array.
A thread-level function in CUDA is called a kernel. To launch a kernel on
the GPU, you must specify a grid, and a decomposition of the grid into
smaller thread blocks. A thread block usually has around 32 to 512 threads,
and the grid may have many thread blocks totalling thousands of threads. The
GPU uses this high thread count to help it hide the latency of memory
references, which can take 100s of clock cycles.
Conceptually, it can be useful to map the grid onto the data you are processing
in some meaningful way. For instance, if you have a 2D image, you can create a
2D grid where each thread in the grid corresponds to a pixel in the image. For
example, you may have a 512x512 pixel image, on which you impose a grid of
512x512 threads that are subdivided into thread blocks with 8x8 threads each,
for a total of 64x64 thread blocks. If your data does not allow for a clean
mapping like this, you can always use a flat 1D array for the grid.
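As a concrete sketch of the 512x512 image example above (the kernel name, the brightness operation, and the device pointer d_img are hypothetical):

```cuda
/* One thread per pixel of a 512x512 image. */
__global__ void brighten(float *img, int width, float gain)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   /* pixel column */
    int y = blockIdx.y * blockDim.y + threadIdx.y;   /* pixel row */
    img[y * width + x] *= gain;
}

/* Host side: 8x8 threads per block, 64x64 blocks cover the whole image. */
dim3 block(8, 8);
dim3 grid(512 / block.x, 512 / block.y);
brighten<<<grid, block>>>(d_img, 512, 1.5f);
```

Each thread computes its own pixel coordinates from its block and thread indices, so the kernel body only describes the work of a single thread.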
The CUDA runtime dynamically schedules the thread blocks to run on the
multiprocessors of the GPU. The M2050 GPUs available on Oscar each have 14
multiprocessors. By adjusting the size of the thread block, you can control how
much work is done concurrently on each multiprocessor.
Memory on the GPU
The GPU has a separate memory subsystem from the CPU. The M2050 GPUs have GDDR5
memory, which is a higher bandwidth memory than the DDR2 or DDR3 memory used by
the CPU. The M2050 can deliver a peak memory bandwidth of almost 150 GB/sec,
while a multi-core Nehalem CPU is limited to more like 25 GB/sec.
The trade-off is that there is usually less memory available on a GPU. For
instance, on the Oscar GPU nodes, each M2050 has only 3 GB of memory shared by
14 multiprocessors (219 MB per multiprocessor), while the dual quad-core
Nehalem CPUs have 24 GB shared by 8 cores (3 GB per core).
Another bottleneck is transferring data between the GPU and CPU, which happens
over the PCI Express bus. For a CUDA program that must process a large dataset
residing in CPU memory, it may take longer to transfer that data to the GPU
than to perform the actual computation. The GPU offers the largest benefit over
the CPU for programs where the input data is small, or there is a large amount
of computation relative to the size of the input data.
CUDA kernels can access memory from three different locations with very
different latencies: global GDDR5 memory (100s of cycles), shared memory (1-2
cycles), and constant memory (1 cycle). Global memory is available to all
threads across all thread blocks, and can be transferred to and from CPU
memory. Shared memory can only be shared by threads within a thread block and
is only accessible on the GPU. Constant memory is accessible to all threads and
the CPU, but is limited in size (64KB).
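A sketch of how the three memory spaces appear in a kernel (the kernel, array names, and coefficients are hypothetical):

```cuda
__constant__ float coeff[16];        /* constant memory: 64KB total, visible to all threads */

__global__ void scale(const float *in, float *out, int n)
{
    __shared__ float tile[256];      /* shared memory: private to one thread block */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];   /* stage a slow global-memory load in fast shared memory */
        __syncthreads();             /* wait until the whole block has loaded its values */
        out[i] = tile[threadIdx.x] * coeff[0];
    }
}

/* Host side: global memory is allocated on the GPU and filled over the PCIe bus. */
cudaMalloc(&d_in, n * sizeof(float));
cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpyToSymbol(coeff, h_coeff, sizeof(coeff));
```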
Compiling with CUDA
To compile a CUDA program on Oscar, first load the CUDA module with:
$ module load cuda
The CUDA compiler is called
nvcc, and for compiling a simple CUDA program it
uses syntax similar to a typical C compiler:
$ nvcc -o program source.cu
Optimizations for Fermi
The Oscar GPU nodes feature NVIDIA M2050 cards with the Fermi architecture,
which supports CUDA's "compute capability" 2.0. To fully utilize the hardware
optimizations available in this architecture, add the
-arch=sm_20 flag to
your compile line:
$ nvcc -arch=sm_20 -o program source.cu
This means that the resulting executable will not be backwards-compatible with
earlier GPU architectures, but this should not be a problem since CCV nodes
only use the M2050.
The Fermi architecture has two levels of memory cache, similar to the L1 and L2
caches of a CPU. The 768KB L2 cache is shared by all multiprocessors, while the
L1 cache by default uses only 16KB of the available 64KB of shared memory on each
multiprocessor.
You can change this split to 48KB of L1 cache and 16KB of shared memory by
calling cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferL1) in your host code
before launching the kernel. Caching of global memory loads in L1 is controlled
by the -Xptxas -dlcm compile option; the default, -dlcm=ca, caches loads in both
L1 and L2:
$ nvcc -Xptxas -dlcm=ca -o program source.cu
If your kernel primarily accesses global memory and uses less than 16KB of
shared memory, you may see a benefit by increasing the L1 cache size.
If your kernel has a simple memory access pattern, you may have better results
by explicitly caching global memory into shared memory from within your kernel.
You can turn off the L1 cache for global memory loads using the flags
-Xptxas -dlcm=cg, which caches data only in L2.
Mixing MPI and CUDA
Mixing MPI (C) and CUDA (C++) code requires some care during linking because of
differences between the C and C++ calling conventions and runtimes.
One option is to compile and link all source files with a C++ compiler, which
will enforce additional restrictions on C code. Alternatively, if you wish to
compile your MPI/C code with a C compiler and call CUDA kernels from within an
MPI task, you can wrap the appropriate CUDA-compiled functions with the
extern "C" keyword, as in the following example.
These two source files can be compiled with the C and C++ compilers,
respectively, and linked into a single executable on Oscar using:
$ module load mvapich2 cuda
$ mpicc -c main.c -o main.o
$ nvcc -c multiply.cu -o multiply.o
$ mpicc main.o multiply.o -lcudart
The CUDA/C++ compiler
nvcc is used only to compile the CUDA source file, and
the MPI C compiler
mpicc is used to compile the C code and to perform the final linking.
/* multiply.cu */

#include <cuda.h>
#include <cuda_runtime.h>

__global__ void __multiply__ (const float *a, float *b)
{
    const int i = threadIdx.x + blockIdx.x * blockDim.x;
    b[i] *= a[i];
}

extern "C" void launch_multiply(const float *a, float *b)
{
    /* ... load CPU data into GPU buffers a_gpu and b_gpu */

    __multiply__ <<< ...block configuration... >>> (a_gpu, b_gpu);

    /* ... transfer data from GPU to CPU */
}
Note that the use of
extern "C" around the function
instructs the C++ compiler (nvcc in this case) to make that function callable
from the C runtime. The following C code shows how the function could be called
from an MPI task.
/* main.c */

#include <mpi.h>

void launch_multiply(const float *a, float *b);

int main (int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &nprocs);

    /* ... prepare arrays a and b */

    launch_multiply (a, b);

    MPI_Finalize ();
    return 0;
}
OpenACC is a portable, directive-based parallel programming model. You can parallelize loops and code segments simply by inserting compiler directives, which are ignored as comments if OpenACC is not enabled at compile time. It works on CPUs as well as GPUs. The PGI compiler suite installed on Oscar supports compiling OpenACC directives. To get you started with OpenACC:
GPU Programming in Matlab