Running Jobs on Bridges

Common error messages, improving turnaround

There are several techniques you can use to try to improve your turnaround, although, since the scheduler is FIFO with backfill, you may just have to wait your turn.

  1. Estimate your walltime as closely as possible without underestimating. An accurate (and therefore shorter) request makes it easier for the backfill scheduler to slot your job into gaps between larger jobs.
  2. Use flexible walltime by giving the --time-min option to the sbatch command. The use of this option is described in the Running Jobs section of the Bridges User Guide.
  3. Use job bundling (also called job packing) to combine several smaller jobs into one. A single bundled job often has a better chance of starting than many small jobs do individually. Job bundling is described in the Sample Scripts section of the Bridges User Guide; a sketch combining flexible walltime and bundling follows this list.
  4. Space out your job submissions. Bridges runs a fairshare scheduler. If you have run a lot of jobs recently, the priority of your queued jobs will be reduced slightly until they have accumulated some waiting time in the queue. 
  5. Consider whether this task needs to be run on Bridges. If you can conveniently complete it on your local system, you do not have to wait.
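
As an illustration of items 2 and 3, here is a minimal sketch of a batch script that combines flexible walltime with job bundling. The partition, core count, walltime values, program name, and input files are placeholders; adapt them to your own work and take the authoritative examples from the Running Jobs and Sample Scripts sections of the Bridges User Guide.

#!/bin/bash
#SBATCH -p RM                     # regular memory partition (placeholder)
#SBATCH -N 1
#SBATCH --ntasks-per-node=28
#SBATCH --time=08:00:00           # longest walltime you will accept
#SBATCH --time-min=04:00:00       # shortest walltime you will accept (flexible walltime)

# Job bundling: several smaller runs inside one job.
# "mycode" and its input files are placeholders.
./mycode input1 &
./mycode input2 &
./mycode input3 &
wait                              # do not let the job exit until all bundled runs finish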
on May 10, 2019

"Invalid qos specification" means that you asked for a resource that you do not have access to.  For example:

  • submitting a job to one of the GPU partitions when you don't have a Bridges_GPU allocation
  • submitting a job to one of the LM partitions when you don't have a Bridges_Large allocation

Please check your grants to see what you have been allocated.  The projects command will list your grants and the resources allocated to each one.

This error can also occur when you have more than one active grant.  It's important to run jobs under the correct SLURM account id. Your jobs will run under your default SLURM account id unless you specify a different account id to use for a job. If the default grant does not have access to the resources you are requesting, you will get this error.

Use the projects command to find which grant is your default if you have more than one.
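
For example, assuming your preferred grant has the SLURM account id AB123456 (a made-up id), you could check your grants and then charge a single job to that account:

projects                          # lists your grants, their SLURM account ids, and your default
sbatch -A AB123456 myjob.job      # charge this job to AB123456 instead of your default account

The -A (or --account) option can also be given as an #SBATCH directive inside the batch script.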

See

  • The Account Administration section of the Bridges User Guide for information on finding your SLURM account ids, setting a nondefault account id for a specific job, and changing your default account id permanently.
on May 14, 2019

It is not possible to predict with any accuracy when your job will run. 

The scheduler on Bridges is largely FIFO with backfill. The squeue command lists the running and queued jobs on Bridges in FIFO order. However, a job can move up in the queue (be backfilled) when a slot becomes available on the machine, the job fits in that open slot, and the jobs ahead of it in the FIFO ordering do not. In addition, jobs can finish before their requested walltime for a variety of reasons, so the queue ahead of your job can shrink faster than expected.
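
You can, however, watch where your job sits relative to others. The commands below are standard SLURM; the --start estimate is only the scheduler's current guess and changes constantly as jobs finish early or are backfilled.

squeue -l                         # all running and queued jobs, in FIFO order
squeue -l -u $USER                # just your jobs, with their state and reason
squeue --start -u $USER           # the scheduler's current, very rough, start-time estimates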

on May 14, 2019

For help running TensorFlow on Bridges:

  • See the tensorflow page for detailed information
  • Check the example batch scripts on Bridges in the directory /opt/packages/examples/tensorflow. For more information and instructions, see the README file in the appropriate subdirectory.

In general, you must load a tensorflow module and then activate your Python virtual environment.

To see all the available tensorflow modules, type

module avail tensorflow

Load the appropriate module with 

module load tensorflow/version

After the module is loaded, activate the Python virtual environment. Typically this is done with the source activate command.
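
Putting those steps together, a typical session might look like the sketch below. The module version shown is a placeholder, and the exact activation command can vary by module, so follow the instructions printed by the module itself or the README in /opt/packages/examples/tensorflow.

module avail tensorflow            # list the available tensorflow modules
module load tensorflow/1.5_gpu     # placeholder; pick a module from the list above
source activate                    # activate the Python virtual environment the module sets up
python my_tf_script.py             # my_tf_script.py stands in for your own script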


on May 14, 2019

Reservations on Bridges set aside a specific set of nodes for a specific grant for a specific time period. They may be granted under special circumstances, typically for jobs that require real-time or interactive processing (e.g., processing of streaming external data, student exercises, etc.).

If you feel that you have such a use case, you may apply for a reservation using the Reservation Request form at least 48 hours before the proposed start time. A user consultant will contact you about your request.

Note that it may not be possible to honor all reservation requests. In addition, modifications to your request may be necessary, depending on the overall processing demands on the system.

If your reservation is accepted, your Bridges grant will be charged for all specified nodes for the entire specified time period, regardless of the actual usage your jobs incur.

Example: A reservation for 1024 RM cores that starts Monday 8:00 am EST and ends Wednesday 8:00 am EST will be charged 1024*48 = 49,152 SUs even if you run no jobs on Tuesday.

on May 14, 2019

There can be many reasons that a job is waiting in the queue; commands to check which case applies to your job appear after this list.

  • There are jobs ahead of yours in the queue. If the output of the command squeue -l shows 'Priority' in the status field, this is the case.
  • There is a maximum number of nodes that your RM jobs can cumulatively request. This maximum limit varies based on the load on the system. If the status field in the squeue -l output says 'QOSMaxCPUPerUserLimit', this is the case.
  • There is a maximum number of cores that your RM-shared jobs can cumulatively request. This maximum limit varies based on the load on the system. If the status field in the squeue -l output says 'QOSMaxCPUPerUserLimit', this is the case.
  • There is a maximum number of GPU nodes your GPU jobs can simultaneously use. This limit applies to the p100 and k80 nodes combined. If the status field in the squeue -l output says 'QOSMaxGRESPerUser', this is the case.
  • The partition you want to use is down. If the status field in the squeue -l output says 'PartitionDown', this is the case.
  • There are currently no nodes available to run your job. This can be because
    • They are already running jobs
    • There are a lot of reserved nodes. The sinfo command shows reserved nodes.
    • There are a lot of down nodes. The sinfo command shows down nodes.
    • They are being reserved for a job with a reservation that is about to start
    • They are being held idle because a system drain is targeted at them, for example ahead of scheduled maintenance. If the status field in the squeue -l output says 'ReqNodeNotAvailable', this is the case. The output is somewhat misleading because it lists all nodes on the machine that are unavailable to run jobs, including nodes on which your job could not run anyway because they are in a different partition.
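
To see which of these cases applies to your job, check the status field for your own jobs and the overall node states, for example:

squeue -l -u $USER        # your jobs, with the status field discussed above
sinfo                     # node states per partition: idle, alloc, resv, drain, down, ...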
on May 17, 2019

Many applications require environment variables to be set before they will run as you intend, or even at all. The PSC-supplied modules set many of the necessary environment variables for a package, but in some cases you need to set additional environment variables. The command to use depends on the shell type you are using.

If you are using a shell in the bash family of shells, set an environment variable with a command similar to

export VAR1=value1

If you are using a shell in the C-shell family of shells, set an environment variable with a command similar to

setenv VAR1 value1
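
For example, in a bash batch script you might set a threading variable before launching your program. OMP_NUM_THREADS and the executable name below are only illustrations, not something every package needs; check your application's documentation for the variables it honors.

export OMP_NUM_THREADS=28       # bash syntax: no spaces around the =
./my_openmp_program             # placeholder for your own executable

The equivalent first line in a C-shell script would be setenv OMP_NUM_THREADS 28.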
on May 17, 2019

There are two parameters to consider:

  • the SLURM account id, which determines which grant the SUs used by a job are deducted from, and
  • the Unix group, which determines which group owns any files created by the job.

See "Managing multiple grants" in the Account Administration section of the Bridges User Guide for information on determining your default SLURM account id and Unix group, and changing them either permanently or temporarily, for just one job or login session.
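
For example, to override the two parameters temporarily, you could use commands along these lines; the account id AB123456 and the group name mygroup are made up, and the guide describes the PSC-recommended ways to change your defaults.

projects                          # lists your grants, SLURM account ids, and Unix groups
sbatch -A AB123456 myjob.job      # deduct this job's SUs from the AB123456 grant
newgrp mygroup                    # standard Unix: switch your Unix group for the rest of this login session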


on May 17, 2019

Software

What software is available on Bridges; common errors with a given package.

To run Gromacs on both GPUs and CPUs, use the same number of CPU tasks as GPUs. Thus, no matter how many nodes you use, set the value of the SBATCH option ntasks-per-node to 4 if you are using the K80 GPU nodes (4 GPUs per node) and to 2 if you are using the P100 GPU nodes (2 GPUs per node).

You must also load the correct GROMACS module to ensure that you get a build that matches your hardware.

To use GPUs and CPUs, load a gromacs module with "gpu" in its name, similar to

module load gromacs/2018_gpu

To use just CPUs, load a gromacs module with "cpu" in its name, similar to

module load gromacs/2018_cpu

There are complete sample scripts for both cases in directory /opt/packages/examples/gromacs on Bridges.
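
As a rough illustration of these settings on the P100 nodes, a GPU batch script might begin like the sketch below. The gres string, walltime, and the GROMACS command line (the executable name and the run name) are assumptions; take the exact form from the sample scripts in /opt/packages/examples/gromacs.

#!/bin/bash
#SBATCH -p GPU                       # Bridges GPU partition
#SBATCH -N 1
#SBATCH --ntasks-per-node=2          # one CPU task per GPU on a P100 node
#SBATCH --gres=gpu:p100:2            # request both P100 GPUs on the node
#SBATCH -t 08:00:00

module load gromacs/2018_gpu         # GPU-enabled GROMACS build

mpirun -np $SLURM_NTASKS gmx_mpi mdrun -deffnm my_run    # my_run.tpr stands in for your input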

on May 17, 2019