TensorFlow

 

TensorFlow is an open source software library for numerical computation using data flow graphs. It is widely used in machine learning. TensorFlow is sometimes used as the computational backend for other AI software, including Keras.

Sample scripts for TensorFlow use are available on Bridges in the directory /opt/packages/examples/tensorflow. Among the examples, in the directory /opt/packages/examples/tensorflow/AI, are scripts that use a TensorFlow Singularity container on Bridges' Volta or P100 GPU nodes.

Documentation

TensorFlow documentation is available on the TensorFlow website, https://www.tensorflow.org.

Usage

When using TensorFlow, keep these things in mind:

  • TensorFlow cannot be run on Bridges' login nodes. The SLURM scheduler  (Simple Linux Utility for Resource Management) manages and allocates all of Bridges' compute nodes. All of your production computing must be done on Bridges' compute nodes.

  • Choose the correct partition and module for the type of node you want to use.  TensorFlow can be run on Bridges' GPU nodes or on CPU nodes. Be certain to load the correct module and use the correct partition.  See the Bridges User Guide for information on Bridges'  partitions.

  • Use the EGRESS option if your TensorFlow job will need to communicate with external sites.  See the sections below on Batch use or Interactive use for specifics.

  • Activate the Python virtual environment after you load the tensorflow module.  Typically this is done with the source activate command.

    However, the activation command can vary with older versions of TensorFlow.  Use the module help command for guidance. 

  • If you want to use TensorBoard to visualize your TensorFlow data, you must add lines to your TensorFlow code to log TensorFlow summaries.  See below for TensorBoard instructions.

  • Singularity containers for TensorFlow are available for use on Bridges in the directory /pylon5/containers/ngc/tensorflow. Multiple containers are available, for different versions of Python and supporting software. These containers can be used on the Volta or P100 GPU nodes. See Singularity images on Bridges for more information on the available containers, and see the sketch after this list for an example of running one.
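
As a hedged sketch of that workflow (the image file name and script name below are placeholders; list the contents of /pylon5/containers/ngc/tensorflow to see which images are actually installed), a TensorFlow program could be run inside one of these containers on a GPU node with a command like:

singularity exec --nv /pylon5/containers/ngc/tensorflow/tensorflow_XX.XX-py3.simg python my_script.py

Here the --nv flag makes the node's GPUs visible inside the container, and my_script.py stands in for your own TensorFlow program.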

Choose and load the appropriate module

To see which versions of TensorFlow are available, type

module avail tensorflow

The module name indicates which version of TensorFlow it uses and whether or not it is for use on GPU nodes.

The module help command displays information about the TensorFlow version it uses and any additional steps that are needed. To see this information and more, type

module help tensorflow/xxx

where xxx is the specific name of the module you want to use.  To use TensorFlow, include a command like this in your batch script or interactive session to load the TensorFlow module:

module load tensorflow/xxx

 

Batch use

TensorFlow can be run in batch mode on Bridges.  You will need to create a batch job and then submit it to one of Bridges' partitions using the sbatch command.  See the Running Jobs section of the Bridges User Guide for more information on running batch jobs.

If your TensorFlow job needs to communicate with sites external to Bridges, use the -C EGRESS option to the sbatch command.  See the section on sbatch in the Running Jobs section of the Bridges User Guide for more information on options to sbatch, including -C.
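
As an illustration only (the partition name, node count, time limit, and module version below are placeholders, and a GPU request directive may also be required; confirm the exact settings against the Bridges User Guide and the example scripts described below), a minimal TensorFlow batch script might look like this:

#!/bin/bash
#SBATCH -p GPU                   # placeholder partition; choose one per the Bridges User Guide
#SBATCH -N 1                     # number of nodes
#SBATCH -t 01:00:00              # placeholder walltime
#SBATCH -C EGRESS                # include only if the job must reach external sites

module load tensorflow/xxx       # substitute the module name you chose above
source activate                  # activate the Python virtual environment

python my_tensorflow_script.py   # placeholder name for your own TensorFlow program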

Multiple example batch jobs for TensorFlow can be found on Bridges in subdirectories under /opt/packages/examples/tensorflow. For more information and instructions, check the README file in the appropriate subdirectory.

 

Interactive use

TensorFlow can be used in an interactive session on Bridges.  You will need to request an interactive session with the interact command.  See the Running Jobs section of the Bridges User Guide for more information on interact.

If your TensorFlow job needs to communicate with sites external to Bridges, use the --egress option to the interact command.  See the section on interact in the Running Jobs section of the Bridges User Guide for more information on options to interact, including --egress.
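
For example, assuming the two options can be combined in a single request, a GPU session with external network access could be started with a command like:

interact --gpu --egress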

Once your session has started, load the TensorFlow module you need.  Below is an example of an interactive session on a GPU node that uses TensorFlow to add two numbers.

If you are running on GPU nodes, information about the GPUs being used is displayed when you launch a session with the tf.Session() command. If there are issues with the GPU or CUDA, TensorFlow will often complain about it at this point.  If this happens, make sure that you are on a GPU node by checking the prompt which appeared when the interactive session started.  In the example below, you can see that it is running on gpu017. If you are on a GPU and the issue persists, email bridges@psc.edu to report it.

 

[joeuser@br018 ~]$ interact --gpu
A command prompt will appear when your session begins
"Ctrl+d" or "exit" will end your session
[joeuser@gpu017 ~]$ module load tensorflow/1.5_gpu
[joeuser@gpu017 ~]$ source activate
(tf1.5) [joeuser@gpu017 ~]$ python
Python 2.7.14 (default, Feb 18 2018, 23:54:30)
[GCC 5.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> a=tf.constant(1.5)
>>> b=tf.constant(4.2)
>>> c=a+b
>>> print a,b,c
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32) Tensor("add:0", shape=(), dtype=float32)
>>> ses=tf.Session()
2018-02-22 16:34:49.666998: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:87:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-02-22 16:34:49.667058: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:87:00.0, compute capability: 6.0)
>>> a_value,b_value,c_value=ses.run([a,b,c])
>>> print("Values are a=%s,b=%s,c=%s"%(a_value,b_value,c_value))
Values are a=1.5,b=4.2,c=5.7
>>> exit()
(tf1.5) [joeuser@gpu017 ~]$ deactivate
[joeuser@gpu017 ~]$ exit
exit
[joeuser@br018 ~]$

 

TensorBoard Usage

You must add commands to your TensorFlow code to log summaries, which TensorBoard will then read and display.
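
As a minimal sketch of that logging step, using the TensorFlow 1.x summary API that matches the 1.5-era modules shown above (the tensor names and the logs directory below are arbitrary choices for illustration):

import tensorflow as tf

a = tf.constant(1.5)
b = tf.constant(4.2)
c = tf.add(a, b, name="sum")

tf.summary.scalar("sum", c)                              # record c as a scalar summary
merged = tf.summary.merge_all()

with tf.Session() as sess:
    writer = tf.summary.FileWriter("logs", sess.graph)   # event files are written under ./logs
    summary, _ = sess.run([merged, c])
    writer.add_summary(summary, global_step=0)
    writer.close()

Running a script like this creates a logs directory of TensorBoard event files in the working directory; that is the directory the steps below point TensorBoard at.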

Once the summaries are created, follow these steps to use TensorBoard:

  1. Get an interactive session on a Bridges GPU node
    interact -gpu
  2. After the session starts, move to the parent directory of where the summary logs are stored.  For example, if the summaries are stored in myTensorFlowdirectory/logs, type
    cd myTensorFlowdirectory
  3. Load the TensorFlow module
    module load tensorflow
  4. Launch TensorBoard. Use the --logdir option to point to the directory where the summary logs are stored.
    tensorboard --logdir=logs

    The output will depend on what you have done before with TensorBoard, but will end with a line like:

    TensorBoard 1.5.1 at http://gpu048.pvt.bridges.psc.edu:6006 (Press CTRL+C to quit)
    
  5. From this last line of output, note the port and node you are running on; here, port 6006 of Bridges node gpu048.
  6. Set up local port forwarding to connect a port on your local machine to the port on Bridges where your TensorBoard job is running. You may be able to do this through a terminal window on your local machine. Alternatively, you can do it through the PuTTY application.
    • To do it through the terminal window
      1. Bring up a terminal (command) window on your local machine (e.g., CMD or Powershell prompt on a Windows machine)
      2. From the terminal window, set up local port forwarding to connect a port (say, 6006) on your local machine to the Bridges port where TensorBoard is running (in this example, port 6006 on Bridges node gpu048).
        ssh -L 6006:gpu048.pvt.bridges.psc.edu:6006 yourBridgesusername@bridges.psc.edu
        Password: yourBridgespassword
        
    • If you are connected to Bridges via PuTTY
      1. Choose a port number on your local machine (e.g. 6006) where PuTTY should listen for incoming connections.
      2. Go to the PuTTY settings
      3. Choose Connections > SSH > Tunnels
      4. Check the "Local ports accept connections from other hosts" button.
      5. Enter the port number you chose (e.g., 6006) in the Source port field.
      6. Enter the destination port in the Destination field. In this example, that is gpu048.pvt.bridges.psc.edu:6006.
      7. Click the Add button. The details of your port forwarding should appear in the list box.
  7. Open up a browser window and navigate to http://localhost:6006. TensorBoard will open in that window.
