Common error messages, improving turnaround
There are several techniques you can use to try to improve your turnaround, although, since the scheduler is FIFO with backfill, you may just have to wait your turn.
"Invalid qos specification" means that you asked for a resource that you do not have access to. For example:
Please check your grants to see what you have been allocated. The projects command will list your grants and the resources allocated to each one.
This error can also occur when you have more than one active grant. It's important to run jobs under the correct SLURM account id. Your jobs will run under your default SLURM account id unless you specify a different account id to use for a job. If the default grant does not have access to the resources you are requesting, you will get this error.
Use the projects command to find which grant is your default if you have more than one.
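As a minimal sketch of running a single job under a non-default grant: SLURM's sbatch command accepts -A (long form --account) to name the account a job is charged to. The account id abc123 and the script name myjob.sh below are hypothetical placeholders, and the command is only assembled and printed so that it can be reviewed before it is run.

```shell
# Hypothetical placeholders: substitute an account id listed by the
# projects command and the name of your own batch script.
account_id="abc123"
job_script="myjob.sh"

# -A (long form: --account) tells SLURM which account to charge for this job.
submit_cmd="sbatch -A ${account_id} ${job_script}"

# Print the command for review; run it yourself once it looks right.
echo "$submit_cmd"
```

The same option can be given inside a batch script as an #SBATCH -A line, which keeps the account choice with the job rather than on the command line.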
See
It is not possible to predict with any accuracy when your job will run.
The scheduler on Bridges is largely FIFO. The squeue command lists the running and queued jobs on Bridges in FIFO order. However, a job can move up in the queue if a slot becomes available on the machine, the job fits in the open slot, and the jobs ahead of it in the FIFO ordering do not fit. In addition, jobs can finish before their requested walltime for a variety of reasons.
For help running TensorFlow on Bridges:
In general, you must load a tensorflow module and then activate your Python virtual environment.
To see all the available tensorflow modules, type
module avail tensorflow
Load the appropriate module with
module load tensorflow/version
After the module is loaded, activate the Python virtual environment. Typically this is done with the source activate command.
Reservations on Bridges set aside a specific set of nodes for a specific grant for a specific time period. They may be granted under special circumstances, typically for jobs that require real-time or interactive processing (e.g., processing of streaming external data, or student exercises).
If you feel that you have such a use case, you may apply for a reservation using the Reservation Request form at least 48 hours before the proposed start time. A user consultant will contact you about your request.
Note that it may not be possible to honor all reservation requests. In addition, modifications to your request may be necessary, depending on the overall processing demands on the system.
If your reservation is accepted, your Bridges grant will be charged for all specified nodes for the entire specified time period, regardless of the actual usage your jobs incur.
Example: A reservation for 1024 RM cores that starts Monday 8:00 am EST and ends Wednesday 8:00 am EST will be charged 1024*48 SUs even if you run no jobs on Tuesday.
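The arithmetic in this example can be sketched in shell. The function name reservation_charge is hypothetical, and the sketch assumes, as the example above does, that one RM core reserved for one hour costs one SU.

```shell
# Hypothetical helper: SUs charged for a reservation of RM cores.
# Assumes 1 SU per core-hour, as in the example above.
reservation_charge() {
  cores=$1
  hours=$2
  echo $(( cores * hours ))
}

# Monday 8:00 am to Wednesday 8:00 am is 48 hours.
reservation_charge 1024 48   # prints 49152
```

The charge depends only on the reserved cores and the reserved hours, not on how many jobs actually run during the reservation.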
There can be many reasons that a job is waiting in the queue. The output of squeue -l or sinfo often shows which one applies:
Other jobs have a higher priority than yours. If squeue -l has 'Priority' in the status field, this is the case.
You have reached the per-user limit on CPUs for the QOS you requested. If the squeue -l output says 'QOSMaxCPUPerUserLimit', this is the case.
You have reached the per-user limit on generic resources (GRES), such as GPUs, for the QOS you requested. If the squeue -l output says 'QOSMaxGRESPerUser', this is the case.
The partition you requested is down. If the squeue -l output says 'PartitionDown', this is the case.
Nodes are held for a reservation. The sinfo command shows reserved nodes.
Nodes are down. The sinfo command shows down nodes.
Nodes that your job requires are not currently available. If the squeue -l output says 'ReqNodeNotAvailable', this is the case. The output is somewhat misleading because it lists all nodes on the machine that are unavailable to run jobs, even nodes on which your job could not run because they are in a different partition.

Many applications require environment variables to be set before they will run as you intend, or even at all. The PSC-supplied modules set many of the necessary environment variables for a package, but in some cases you need to set additional environment variables. The command to use depends on the shell type you are using.
If you are using a shell in the bash family of shells, set an environment variable with a command similar to
export VAR1=value1
If you are using a shell in the C-shell family of shells, set an environment variable with a command similar to
setenv VAR1 value1
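A minimal sketch of the bash-family case, showing why export matters: an exported variable is passed to programs started from the shell, while a variable that is only assigned is not. The names VAR1 and VAR2 and their values are placeholders. Bash-family shells do not allow spaces around the = sign in an assignment.

```shell
# Exported: visible to programs launched from this shell.
export VAR1=value1

# Assigned but not exported: visible only inside this shell.
VAR2=value2

# Start a child shell and report what it can see.
child_value1=$(sh -c 'printf "%s" "${VAR1:-unset}"')
child_value2=$(sh -c 'printf "%s" "${VAR2:-unset}"')

echo "child sees VAR1=$child_value1"
echo "child sees VAR2=$child_value2"
```

In a batch job, the application is a child of the job script's shell, so any variable it needs has to be exported in the script before the application is started.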
There are two parameters to consider:
See "Managing multiple grants" in the Account Administration section of the Bridges User Guide for information on determining your default SLURM account id and Unix group, and on changing them either permanently or temporarily, for just one job or login session.
What software is available on Bridges; common errors with a given package.
Check our software page at https://www.psc.edu/resources/software for the list of software installed on Bridges.
You must get permission before you can use Gaussian on PSC systems.
Complete the form at https://www.psc.edu/user-resources/software/gaussian/permission-form?view=form
We will notify you when you have been granted access to Gaussian.
To run GROMACS on both GPUs and CPUs, use the same number of CPU tasks as GPUs. Thus, no matter how many nodes you use, set the value of the SBATCH option ntasks-per-node to 4 if you are using the K80 GPU nodes and to 2 if you are using the P100 GPU nodes.
You must also use the correct GROMACS module to ensure that your compilation will work.
To use GPUs and CPUs, load a gromacs module with "gpu" in its name, similar to
module load gromacs/2018_gpu
To use just CPUs, load a gromacs module with "cpu" in its name, similar to
module load gromacs/2018_cpu
There are complete sample scripts for both cases in directory /opt/packages/examples/gromacs on Bridges.
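The ntasks-per-node rule above can be sketched as a small shell helper. The function name gromacs_tasks_per_node is hypothetical and is not part of GROMACS, SLURM, or the sample scripts; it only encodes the two values given above (4 for K80 nodes, 2 for P100 nodes).

```shell
# Hypothetical helper: tasks per node for a GPU run of GROMACS,
# one CPU task per GPU, using the values stated above.
gromacs_tasks_per_node() {
  case "$1" in
    k80|K80)   echo 4 ;;
    p100|P100) echo 2 ;;
    *)
      echo "unknown GPU type: $1" >&2
      return 1
      ;;
  esac
}

# Print the matching SBATCH directive for a P100 job.
echo "#SBATCH --ntasks-per-node=$(gromacs_tasks_per_node p100)"
```

The printed line can be copied into a batch script; the sample scripts in /opt/packages/examples/gromacs remain the reference for the complete set of options.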