Pittsburgh Supercomputing Center 

Advancing the state-of-the-art in high-performance computing,
communications and data analytics.

Tuning and Analysis Utilities:
Tackling Large Scientific Applications


This document describes the process used to apply TAU to the Advanced Regional Prediction System (ARPS) code. PSC's Blacklight system was used to perform the analysis. Users of other large codes can apply the techniques outlined here to their own optimization efforts.

ARPS is being used in support of the NSF-funded project Enabling Petascale Ensemble-based Data Assimilation for Numerical Analysis and Prediction of High-Impact Weather.


ARPS employs both MPI communication calls and OpenMP directives. Its makearps system can be configured to build the code four ways: serial, OpenMP, MPI, and hybrid.

Our objective was to automatically apply source-level instrumentation to the hybrid version of ARPS. However, both parsing and run-time errors were immediately encountered, even with the less complex MPI-only build. We therefore altered our approach, starting with the simpler dynamic instrumentation model. Once timing information had been obtained from dynamic instrumentation, compiler instrumentation was introduced. Finally, source code instrumentation was performed using the TAU Program Database Toolkit (PDT).

We found source-level profiling to be robust when applied to reduced sets of routines. By using a feature of PDT which allows files and functions to be specifically included or excluded, we were able to meet our initial objective of automatically profiling the MPI version of ARPS.

Usage instructions for all three instrumentation methods are included here.

ARPS compilation

In general, ARPS was compiled using its default optimization settings:

-O3 -fnoalias -ip -fp-model precise

When run-time errors occurred in instrumented source files, the makearps debug switch (-d) was applied. This allowed us to trace into these intermediate files and take corrective measures.

Viewing Profiling Data

In addition to the usual program output, TAU generates a set of profiling data files. For a run using four MPI processes on Blacklight, the file set looks like:

profile.0.0.0 profile.1.0.0 profile.2.0.0 profile.3.0.0

These files are small, and they may be examined with a text editor, but the data is formatted for pprof (text output) and ParaProf (GUI), companion pieces to TAU. Versions of pprof and ParaProf are installed on Blacklight. However, it may be more convenient to install the Windows version of ParaProf on a local machine. See the ParaProf User's Manual from the University of Oregon for more information.

To view profile data, download the set of profile.* files and run pprof from the directory that contains the profile data. Or, start ParaProf and open the directory that contains the data files from the File menu.
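Concretely, the sequence might look like the following sketch (the username and remote run directory are placeholders, not actual paths):

```shell
# Copy the profile files from Blacklight to a local directory
# (joeuser and arps-run are hypothetical)
scp joeuser@blacklight.psc.edu:arps-run/profile.\* ./profiles/

# Text summary, run from the directory holding the data
cd profiles
pprof

# Or browse the same data graphically
paraprof .
```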

Dynamic Instrumentation

Dynamic instrumentation, also known as library interposition, was tested first. It is the simplest method to use, and it is applied to existing binaries without rebuilding them.

To perform this type of instrumentation, configure the TAU environment:

module load tau/2.20

and then use the tau_exec script with one or more options to start the application:

mpirun -np 4 tau_exec [ -io | -memory | -T MPI ] arpsenkf_mpi ...

MPI functions are instrumented by default unless "-T SERIAL" is specified. Profile data was viewed with pprof and ParaProf. (See the section on Viewing Profiling Data.) This gave us some idea of where cycles were being spent. To get additional detail, however, it was necessary to move on to the next level.
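For instance, to add I/O tracking on top of the default MPI instrumentation, one might run:

```shell
# -io adds I/O wrappers; MPI is instrumented by default
mpirun -np 4 tau_exec -io arpsenkf_mpi
```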

Compiler Instrumentation

Compiler instrumentation is applied to the entire code and requires a compiler switch; with Intel's ifort, for example, TAU enables the -finstrument-functions flag. Compiler instrumentation was found to be reliable at all optimization levels.

On Blacklight, configure TAU:

module load tau/2.20

and select compiler instrumentation:

setenv TAU_OPTIONS '-optCompInst'

The choice of a TAU Makefile doesn't matter when compiler instrumentation is selected, but something relatively simple is suggested:

setenv TAU_MAKEFILE ${TAU_ROOT_DIR}/x86_64/lib/Makefile.tau-icpc-mpi

The overhead can be significant, but the compiler-instrumented code was run only occasionally, at different core counts. The results were used to define a short list of routines for subsequent analysis with the source-level instrumentation model.

The default settings yield minimal output from the TAU compiler scripts. Additional options are available for controlling output (-optVerbose) and for preserving intermediate source files (-optKeepFiles). Both are recommended, at least initially.

setenv TAU_OPTIONS ${TAU_OPTIONS}' -optVerbose -optKeepFiles'

The Verbose option may be removed once the instrumentation process has stabilized, but we've made a habit of keeping it.

KeepFiles retains copies of the intermediate source files. Clean-up operations can be modified to include TAU intermediate files, e.g., *.pdb, *.chk.f90, *.chk.pomp.*, *.inst.f90, and *.opari.inc.

After updating a few of the ARPS clean options, we decided to keep the intermediate files too.
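As a runnable sketch of the clean-up pattern (the demo file names are invented for illustration):

```shell
# Create sample TAU intermediate files to stand in for a build directory
touch demo.pdb demo.chk.f90 demo.inst.f90 demo.opari.inc

# The clean-up line added to the ARPS clean targets
rm -f *.pdb *.chk.f90 *.chk.pomp.* *.inst.f90 *.opari.inc
```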

TAU compiler scripts must be specified for building the application. For the ARPS code, three lines were added to the makearps script just before the makecmd variable was set:

set C_str   = "'CC  = tau_cc.sh'"
set FTN_str = "'FTN = tau_f90.sh'"
set LDR_str = "'LDR = tau_f90.sh'"

Multiple binaries were tested, including uninstrumented and compiler-instrumented versions, and source-level builds with and without masking. Multiple TAU versions were also used. We recommend that binaries be renamed to reflect what was used in building them. For example:

mv   arpsenkf_mpi    arpsenkf_mpi-opt-tau-2.20-compinst

This also allows additional binaries to be built while test runs are in progress.

Test jobs must load the TAU module that was used in building the code. This will ensure that the corresponding binaries and libraries are used.
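A minimal csh-style job fragment, assuming the binary and core count from the earlier examples:

```csh
# Load the same TAU version used at build time so that
# matching runtime libraries are picked up
module load tau/2.20

# Launch the instrumented binary
mpirun -np 4 ./arpsenkf_mpi
```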

A run-time verbose setting is also available, and its use is likewise recommended.

setenv TAU_VERBOSE 1

Verbose output may be eliminated after successful testing, but we usually left it in place, too.

Profile data was viewed with pprof and ParaProf. (See the section on Viewing Profiling Data.)

Source Instrumentation - Program Database Toolkit (PDT)

Source instrumentation (PDT) supports the finest level of profiling and is therefore highly desirable. The first step in using this model is to identify an appropriate TAU makefile.


All available makefiles on Blacklight match the pattern:

/usr/local/packages/TAU/usr/2.20/x86_64/lib/Makefile*

To work with MPI source using PDT, the selected makefile must include both "mpi" and "pdt" in its name. For example, Makefile.tau-icpc-mpi-pdt could be chosen.
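Putting the pieces together for PDT on Blacklight, the environment might be configured as follows (the makefile path follows the earlier compiler-instrumentation example):

```csh
module load tau/2.20
setenv TAU_MAKEFILE ${TAU_ROOT_DIR}/x86_64/lib/Makefile.tau-icpc-mpi-pdt
setenv TAU_OPTIONS '-optVerbose -optKeepFiles'
```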

An attempt was made to apply PDT instrumentation to the entire ARPS code. Pre-2.20 versions of TAU were unable to complete the parsing and instrumentation phases. In these cases, we limited TAU profiling to a few routines that had been identified by compiler instrumentation runs (see the section on compiler instrumentation).

TAU 2.20 finished both the parsing and instrumentation phases without any problems. However, the instrumented code failed at run time. For large production codes, particularly legacy sources like ARPS, such problems are common.

Using TAU 2.20, the entire code was recompiled and run with debugging options enabled. Subsequent failures produced tracebacks which allowed us to identify the culprit: an I/O routine with multiple entry points.

The PDT model supports inclusion or exclusion of files and functions. In this case, the problematic file was excluded by creating a masking file which contained:

BEGIN_FILE_EXCLUDE_LIST
arpsio3d.f90
END_FILE_EXCLUDE_LIST

The masking file was named "select.tau" and placed in the top-level application directory. To use it, the TAU_OPTIONS setting was changed from optCompInst to optTauSelectFile:

setenv TAU_OPTIONS '-optTauSelectFile=select.tau'

The exclusion workaround allowed the entire code, with the exception of file arpsio3d.f90, to be instrumented.

Similarly, one may include only certain files and/or functions:

BEGIN_INCLUDE_LIST
ENSRF_ANALYSIS
END_INCLUDE_LIST

The inclusion setting as shown above limited instrumentation to this single subroutine, ENSRF_ANALYSIS. Of course, more than one function could be included (or excluded).
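For example, a hypothetical include list naming two routines might look like this (the second routine name is invented for illustration):

```
BEGIN_INCLUDE_LIST
ENSRF_ANALYSIS
ENKF_UPDATE
END_INCLUDE_LIST
```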

We also applied loop-level instrumentation to a single routine. In this case, the select.tau file looked like:

BEGIN_INSTRUMENT_SECTION
loops file="enkfhelp.f90" routine="ENSRF_ANALYSIS"
END_INSTRUMENT_SECTION

Note that all inclusions and exclusions appear in a single file.
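A combined select.tau using the file and routine names from this document might therefore read:

```
BEGIN_FILE_EXCLUDE_LIST
arpsio3d.f90
END_FILE_EXCLUDE_LIST

BEGIN_INSTRUMENT_SECTION
loops file="enkfhelp.f90" routine="ENSRF_ANALYSIS"
END_INSTRUMENT_SECTION
```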

The optVerbose and optKeepFiles settings described in the section on compiler instrumentation may be removed or left in place. We maintained them.

Profile data was viewed with pprof and ParaProf. (See the section on Viewing Profiling Data.)