Bridges-AI User Guide

Early User Program

 

This User Guide is intended for the early user program of Bridges-AI. If you are not an identified early user of Bridges-AI, you cannot use the Bridges-AI nodes during this time, but you can apply for an allocation on Bridges-AI.  See "Get access to Bridges" for information on applying.

This document is not a complete guide to Bridges and does not include all of the information in the Bridges User Guide.  If you are not familiar with Bridges, please refer to the Bridges User Guide. In particular, if you are not comfortable with the interact and sbatch commands and their options, refer to the Running Jobs section of the Bridges User Guide for important information.

Check back often for updates.  And please help us improve this document by sending comments and suggestions to doc@psc.edu. 

Introduction

Bridges-AI is ideally suited for deep learning, other kinds of machine learning, graph analytics, and data science. This guide is intended to help users who are part of the Early User Program for Bridges-AI. It is assumed that users already have an account on Bridges. AI and machine learning frameworks for Bridges-AI are provided through NVIDIA-optimized Singularity containers.

Questions regarding the program can be sent to: bridges-dl-early@psc.edu. Technical support can be reached at: bridges-dl@psc.edu.

Hardware description

Bridges-AI introduces 88 NVIDIA Volta GPUs in the following new nodes:

  • An NVIDIA DGX-2 enterprise research AI system, which tightly couples 16 NVIDIA Tesla V100 (Volta) GPUs with 32 GB of GPU memory each, connected by NVLink and NVSwitch, to provide maximum capability for the most demanding AI challenges
  • 9 HPE Apollo 6500 servers, each with 8 NVIDIA Tesla V100 GPUs with 16 GB of GPU memory each, connected by NVLink 2.0, to balance AI capability and capacity

 

Using Bridges-AI

To use the software on Bridges-AI, we strongly recommend that you use a Singularity container image. You can use Singularity images already available locally on Bridges, or download an image, either from the NVIDIA GPU Cloud (NGC) Registry or another registry, or create your own.  To get the best performance from the nodes with Volta GPUs, use software from an NGC container.

For security reasons, Bridges-AI supports Singularity as a container technology but not Docker.

 

NVIDIA GPU Cloud (NGC) Containers

NVIDIA GPU Cloud (NGC) is a GPU-accelerated cloud platform optimized for deep learning and scientific computing. NVIDIA optimizes these containers for Volta GPUs and subjects them to rigorous quality assurance.

The containers on the NGC Registry are Docker images, but we have converted many of them to Singularity for you to use on Bridges-AI. These containers may be run on Bridges-AI nodes or on Bridges’ NVIDIA Tesla P100 GPUs, but they are not compatible with Bridges’ Tesla K80 GPUs.

NVIDIA requests that you create an account at http://ngc.nvidia.com if you will use any of these containers.

Containers installed on Bridges

These containers are installed on Bridges as Singularity images for you to use. Multiple versions of each are available, differing in the versions of the software they contain.  For details on which containers are installed and the software each contains, see Singularity images on Bridges.

Package                      Path on Bridges                            NVIDIA Documentation
Caffe /pylon5/containers/ngc/caffe https://ngc.nvidia.com/registry/nvidia-caffe
Caffe2 /pylon5/containers/ngc/caffe2 https://ngc.nvidia.com/registry/nvidia-caffe2
CNTK /pylon5/containers/ngc/cntk https://ngc.nvidia.com/registry/nvidia-cntk
DIGITS /pylon5/containers/ngc/digits https://ngc.nvidia.com/registry/nvidia-digits
Inference Server /pylon5/containers/ngc/inferenceserver https://ngc.nvidia.com/registry/nvidia-inferenceserver
MXNet /pylon5/containers/ngc/mxnet https://ngc.nvidia.com/registry/nvidia-mxnet
PyTorch /pylon5/containers/ngc/pytorch https://ngc.nvidia.com/registry/nvidia-pytorch
TensorFlow /pylon5/containers/ngc/tensorflow https://ngc.nvidia.com/registry/nvidia-tensorflow
TensorRT /pylon5/containers/ngc/tensorrt https://ngc.nvidia.com/registry/nvidia-tensorrt
TensorRT Inference Server /pylon5/containers/ngc/tensorrtserver https://ngc.nvidia.com/registry/nvidia-tensorrtserver
Theano /pylon5/containers/ngc/theano https://ngc.nvidia.com/registry/nvidia-theano
Torch /pylon5/containers/ngc/torch https://ngc.nvidia.com/registry/nvidia-torch

 

Containers available on the NGC Registry

The table below lists the packages in the NGC Registry that are not installed on Bridges. Multiple versions of each are available. Visit the NGC Registry for more information.

If you want to use an NGC container that is not installed on Bridges, you can access it directly from the NGC Registry, but you must first create an account on the NGC Registry to do so.

Package    NVIDIA Documentation
CUDA       https://ngc.nvidia.com/registry/nvidia-cuda

Accessing Container Images

Containers already on Bridges

A subset of the container images provided by NGC is already available on Bridges under the directory /pylon5/containers/ngc/. For details on which containers are installed and the software each contains, see Singularity images on Bridges.

Each software package (tensorflow, caffe, etc.) has its own directory containing several images. These images correspond to different versions of the software. For example, to see all of the tensorflow images available, type:

ls /pylon5/containers/ngc/tensorflow/

In this case, the output shows that four images are available. Note that two of the containers are built with python2 and two with python3.

18.09-py2.simg  18.10-py2.simg 
18.09-py3.simg  18.10-py3.simg

These are all Singularity images which are ready to use on Bridges. You can use these containers directly from the /pylon5/containers/ngc directory; there is no need to copy them to another directory.

Using your own containers

If you need a container that is not already available on Bridges, you may be able to find a suitable one on the NGC Registry. However this requires creating your own account with NGC. Go to http://ngc.nvidia.com to create an NGC Registry account.

If you have a Singularity container of your own, you can download it to Bridges and use it on Bridges-AI. If you have a Docker container you wish to use, download it to Bridges and then convert it to Singularity before using it.
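
If you have a Singularity image on your local machine, a minimal sketch of copying it to your pylon5 scratch space with scp follows. The hostname and directory shown are placeholders; use the file transfer hostnames and your own directories as described in the Bridges User Guide.

scp my-container.simg username@data.bridges.psc.edu:/pylon5/groupname/username/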

Converting Docker containers to Singularity 

Once you have downloaded a Docker image, you must convert it to Singularity before using it on Bridges.  You can do this in an interactive or batch session on one of the Bridges-AI nodes.  See the Running Jobs section of the Bridges User Guide for information on starting an interactive session or submitting a batch job.

Whether you are using an interactive session or a batch job, you must use the GPU-AI partition.  Once your interactive session has started, or inside your batch script, load the Singularity module and use the singularity build command to convert your Docker container.

To convert a Docker container from the NGC to Singularity, use these commands in an interactive session or in a batch script:

source /etc/profile.d/modules.sh                     # make the module command available
module load singularity
export SINGULARITY_DOCKER_USERNAME='$oauthtoken'     # the username is the literal string $oauthtoken
export SINGULARITY_DOCKER_PASSWORD=your-key-string   # your NGC API key
export SINGULARITY_CACHEDIR=$SCRATCH/.singularity    # keep the Singularity cache in scratch, not $HOME
singularity build $SCRATCH/new-container.simg docker://nvcr.io/nvidia/old-container

where SINGULARITY_DOCKER_USERNAME is set to the literal string $oauthtoken and SINGULARITY_DOCKER_PASSWORD is set to the API key that NVIDIA gives you when you register at http://ngc.nvidia.com.
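
As a concrete illustration only (the container name and tag are assumptions based on the images listed above; check the NGC Registry for the tags that actually exist), building a Singularity image from the NGC TensorFlow container would look like:

singularity build $SCRATCH/tensorflow-18.10-py3.simg docker://nvcr.io/nvidia/tensorflow:18.10-py3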

Running Jobs

Once you have a Singularity image available locally, you are ready to use it for your application. You can run jobs either interactively or as batch jobs, as with any other job on Bridges. For more details, please see the Running Jobs section of the Bridges User Guide.  You must use the GPU-AI partition.

 

GPU-AI partition summary

Partition name              GPU-AI
                            Volta 16                                           DGX-2
Node type                   Tesla V100 (Volta) GPUs with 16 GB of GPU memory   Tesla V100 GPUs with 32 GB of GPU memory
GPUs/node                   8                                                  16
Default # of nodes          1                                                  1
Max # of nodes              1                                                  1
Min GPUs per job            1                                                  1
Max GPUs per job            8                                                  16
Max GPUs in use per user    8                                                  16
Walltime default            1 hour                                             1 hour
Walltime max                12 hours                                           12 hours

 

Running interactively in the GPU-AI partition

To run in an interactive session on Bridges-AI, use the interact command and specify the GPU-AI partition. An example interact command to request 1 GPU on an Apollo 6500 node is:

interact -p GPU-AI --gres=gpu:volta16:1

Where:

-p indicates the intended partition

--gres=gpu:volta16:1  requests the use of 1 V100 GPU on an Apollo 6500 node
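
To request a GPU on the DGX-2 node instead, substitute the volta32 resource type:

interact -p GPU-AI --gres=gpu:volta32:1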

Once your interactive session has started, you can run the Singularity image. 

 

To start the Singularity image and then fire up a shell, type

singularity shell --nv singularity-container-name.simg

where 

 --nv makes the host's NVIDIA drivers and GPU devices available inside the container

singularity-container-name is the container you wish to use

Type any commands you like at the prompt.
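
For example, to confirm that the GPUs allocated to your job are visible inside the container, you might type the following at the container prompt (the second line assumes a TensorFlow image and is illustrative only):

nvidia-smi
python -c "import tensorflow as tf; print(tf.__version__)"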

 

Alternatively, you can create a bash shell script and run it inside the container. To do so, once your interactive session has started, type

singularity exec --nv singularity-container-name.simg  bash_script.sh

where: 

--nv makes the host's NVIDIA drivers and GPU devices available inside the container

singularity-container-name is the container you wish to use

bash_script.sh is your bash script
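
For reference, a minimal, hypothetical bash_script.sh might look like the following; everything in it runs inside the container, and my_program.py is a placeholder for your own application:

#!/bin/bash
nvidia-smi               # confirm the GPUs allocated to the job are visible
python my_program.py     # placeholder for your own application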

 

Running a batch job in the GPU-AI partition

 

Using module files on Bridges-AI

The Module package provides for the dynamic modification of a user's environment via module files. Module files manage necessary changes to the environment, such as adding to the default path or defining environment variables, so that you do not have to manage those definitions and paths manually. Before you can use module files in a batch job on Bridges-AI, you must issue one of the following commands:

If you are using bash or ksh:

source /etc/profile.d/modules.sh      

If you are using csh or tcsh:

source /etc/profile.d/modules.csh
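
Once the appropriate file has been sourced, module commands work as usual. For example:

module avail singularity      # list the Singularity modules installed on Bridges
module load singularity       # add the singularity command to your environment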

See the Module documentation for information on the module command.

 

The sbatch command

To run a batch job, you must create a batch script and submit it to the GPU-AI partition using the sbatch command.  Please see the Running Jobs section of the Bridges User Guide for information on batch scripts, the sbatch command and its options, and more.

 

A sample sbatch command to submit a job to run on one of the Apollo servers and use all eight GPUs would be

sbatch -p GPU-AI -N 1 --gres=gpu:volta16:8 -t 1:00:00 myscript.job

where 

-p GPU-AI requests the GPU-AI partition

-N 1 requests one node

--gres=gpu:volta16:8 requests an Apollo server with V100 GPUs, and specifies that you will use all 8 GPUs on that node

-t 1:00:00 requests one hour of running time

myscript.job is your batch script.

 

Here is an example batch script intended to run on one Apollo server, using all eight V100 GPUs. This script specifies the same sbatch directives as the sbatch command above. You can specify directives either way, but options given on the command line take precedence over directives in a batch script.

Note that the script uses the bash shell and includes the command needed to make the module command available.

#!/bin/bash
#SBATCH --partition=GPU-AI
#SBATCH --nodes=1
#SBATCH --gres=gpu:volta16:8
#SBATCH --time=1:00:00
source /etc/profile.d/modules.sh
cd $SCRATCH/ngc
module load singularity
singularity exec --nv $SCRATCH/tensorflow.simg $SCRATCH/ngc/matrix.s

 

Environment variables

Environment variables can make your life easier. Defining one variable to hold the file path of the image you want to use and another to hold the command that runs that Singularity image saves you from retyping those strings later. For example, if you wish to use the tensorflow 18.10-py3 image, define a variable SIMG with the command:

SIMG=/pylon5/containers/ngc/tensorflow/18.10-py3.simg

Then define another environment variable that will run the Singularity image using NVIDIA optimizations:

S_EXEC="singularity exec --nv ${SIMG}"

Assuming that you have defined SIMG and S_EXEC as shown above, a sample sbatch command to request the use of 1 GPU on the DGX-2 node would be:

sbatch -p GPU-AI --gres=gpu:volta32:1 ${S_EXEC} myscript.job

Where:

-p indicates the intended partition

--gres=gpu:volta32:1  requests the use of 1 V100 GPU on the DGX-2 

myscript.job is the name of your batch script
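
For reference, a hedged sketch of a batch script that defines and uses these variables itself follows; train.py is a placeholder for your own application:

#!/bin/bash
#SBATCH --partition=GPU-AI
#SBATCH --gres=gpu:volta32:1
#SBATCH --time=1:00:00
source /etc/profile.d/modules.sh
module load singularity
SIMG=/pylon5/containers/ngc/tensorflow/18.10-py3.simg     # image to use
S_EXEC="singularity exec --nv ${SIMG}"                    # shorthand for running it
${S_EXEC} python $SCRATCH/train.py                        # placeholder for your own application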

 

Example scripts

Example scripts using TensorFlow are available on Bridges in /opt/packages/examples/tensorflow/AI.

 


Questions

Questions regarding the early user program can be sent to: bridges-dl-early@psc.edu, and technical support can be reached at: bridges-ai@psc.edu.
