Debugging & Optimization
Cray Performance Analysis Tools on Phoenix
Profiling
Profiling with the Cray tools requires multiple steps, but it does not require you to recompile your code.
Robin is a computer system that is used as a cross-compiler system for Phoenix. Because it is much faster at scalar operations, such as compiling and performance analysis, you may want to perform as much work on Robin as possible. On Robin, you can do the following:
-
Compile for Phoenix.
-
Run pat_build to generate instrumented executables.
-
Submit batch jobs using qsub.
-
Run pat_report to generate performance reports.
-
Run app2 to visualize performance.
You cannot run aprun or pat_run directly on Robin, but you can submit jobs that in turn use these commands.
pat_build
Builds an instrumented version of an executable code.
> pat_build [options] <executable> <instrumented executable>
Supports
-
Fortran, C, C++, CAF, UPC
-
MPI, SHMEM, OpenMP, pThreads
Performance measurements
-
Sample-based (operating system [OS] interrupts or hardware [HW] counters overflow)
-
Trace-based
-
User functions
-
Application programming interface (API) for fine-grain instrumentation
-
Predefined function groups (mpi, shmem, io, etc.)
-
Source code mapping
-
Call stack
-
Line numbers
pat_run
User interface to simplify Cray PAT usage. Runs an instrumented executable and generates a report, all in one step.
The following executes a.out_instr and produces a report measuring the number of floating point operations per second (flops), calculating the mega floating point operations per second (mflops) rate, and determining the average number of results produced per vector operation for the traced functions.
pat_run -O flops,mflops,vl a.out_instr
The following produces a load-balance report showing average versus maximum time per processor (based on wall-clock time) for an MPI program:
pat_run -O balance mpirun -np 4 a.out_instr
The -O option takes a comma-delimited list of keywords that specify the following:
-
Data to be recorded: cycles, flops, mflops, vl, etc.
-
How to record it: sample, overflow, trace.
-
Show callers: callers, calltree.
-
Show source/line number: source, line.
-
Show load balance: balance[.$data][.$by].
-
$data can be samples or time (default), cycles, etc.
-
$by can be pe (default), thread, or ssp.
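The keyword syntax above can be sketched as a small helper that assembles the argument to -O. This is only an illustration: the function names balance_keyword and report_options are hypothetical and not part of Cray PAT, and the helper does not check that pat_run accepts a given keyword combination.

```python
def balance_keyword(data=None, by=None):
    """Build a load-balance keyword of the form balance[.$data][.$by].

    Hypothetical helper. To avoid ambiguity, $data must be given
    explicitly whenever $by is given.
    """
    parts = ["balance"]
    if data is not None:
        parts.append(data)
    if by is not None:
        if data is None:
            raise ValueError("give data explicitly when selecting by")
        parts.append(by)
    return ".".join(parts)


def report_options(*keywords):
    """Join keywords into the comma-delimited list that -O expects."""
    return ["-O", ",".join(keywords)]


# Example: the flops report shown earlier, plus a per-thread balance report.
print(" ".join(["pat_run"] + report_options("flops", "mflops", "vl") + ["a.out_instr"]))
print(" ".join(["pat_run"] + report_options(balance_keyword("cycles", "thread")) + ["a.out_instr"]))
```

The first printed line matches the flops example given earlier; the second is an assumed combination of the $data and $by values listed above.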
Examples
To get a basic profile, run
pat_run -b [pe,]function:source,line [-s percent=relative] aprun -n1 <instrumented executable>
In the output file:
100.0% | 100.0% | 965 |Total
|-------------------------------------
| 88.2% | 88.2% | 851 |kron_matmull@module_kron_
||------------------------------------
|| 40.4% | 40.4% | 344 |line.307
|| 37.0% | 77.4% | 315 |line.297
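The percentages in this sample output can be reproduced from the sample counts. The sketch below assumes the three columns are percent, cumulative percent, and sample count, and that percent=relative reports each row relative to its parent row rather than to the program total. That reading is inferred from the numbers (344 of 851 samples is 40.4%, whereas 344 of 965 would be 35.6%), not taken from the pat man page.

```python
# Sample counts copied from the profile output above.
total_samples = 965
function_samples = 851  # kron_matmull@module_kron_
line_samples = {"line.307": 344, "line.297": 315}


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# The function row is relative to the program total.
function_pct = percent(function_samples, total_samples)

# Each line row is relative to its parent function (assumed meaning of
# percent=relative); the second column accumulates within that parent.
rows = []
cumulative = 0.0
for name, count in line_samples.items():
    pct = percent(count, function_samples)
    cumulative = round(cumulative + pct, 1)
    rows.append((name, pct, cumulative, count))

print(f"{function_pct}% of {total_samples} samples in the function")
for name, pct, cum, count in rows:
    print(f"{pct:5.1f}% | {cum:5.1f}% | {count:4d} | {name}")
```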
To get a call tree, run
pat_run -b [pe,]function:source,calltree [-s percent=relative] aprun -n1 <instrumented executable>
pat_report
You can use aprun instead of pat_run on an instrumented executable, which will produce a performance-data file (ending in .xf). This file can then be processed into a human-readable text profile using the pat_report command.
Experiment Types
There are many types of performance experiments you can run. By default on Phoenix, the experiment type is profil, which has the lowest overhead. It samples the program counter by user and system CPU time. Another common experiment is samp_cs_time, which samples the call stack at a given time interval. This experiment returns the total program time and the absolute and relative times each call-stack counter was recorded but is otherwise identical to the samp_pc_time experiment. The samp_cs_time experiment is useful for obtaining profile reports with call-tree or caller information included.
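To see why call-stack sampling is needed for call-tree and caller reports, consider the toy model below. It is not how Cray PAT stores its data. The sample stacks and the function names (main, solve, setup, norm, kron_matmul) are made up for illustration. A program-counter sample keeps only the innermost routine, while a call-stack sample keeps the whole chain of callers.

```python
from collections import Counter

# Hypothetical samples, one per timer tick, outermost caller first.
stack_samples = [
    ("main", "solve", "kron_matmul"),
    ("main", "solve", "kron_matmul"),
    ("main", "setup", "kron_matmul"),
    ("main", "solve", "norm"),
]

# Program-counter sampling sees only the innermost frame, so the result
# is a flat profile with no information about who called what.
flat_profile = Counter(stack[-1] for stack in stack_samples)

# Call-stack sampling keeps every frame, so samples can also be
# attributed to (caller, callee) pairs ...
callers = Counter((stack[-2], stack[-1]) for stack in stack_samples)

# ... and to every prefix of the stack, which gives inclusive counts
# for each node of a call tree.
call_tree = Counter()
for stack in stack_samples:
    for depth in range(1, len(stack) + 1):
        call_tree[stack[:depth]] += 1

print(flat_profile)
print(callers)
for path, count in sorted(call_tree.items()):
    print(f"{count:3d}  {' -> '.join(path)}")
```

The flat profile shows only that kron_matmul took three samples. The call-stack data also shows that two of those came through solve and one through setup.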
See the pat man page for more information.
Run-Time Library
Use the PAT run-time library to get statistics on a specific region of code.
Example
program test_module_kron
  use pat_api
  …
  ! Begin region of interest
  call PAT_region_begin ( 1, 'kron_matmul_kernel' )  ! # and name must be unique to each region
  call kron_matmulL(…)
  ! End region of interest
  call PAT_region_end ( 1 )
end program
Compile
ftn *.f -o test.exe
Relink
pat_build -w test.exe test.exe.trace
Run and produce a report
[setenv PAT_RT_RECORD_SSP 0-3]
pat_run -g normal [-b function,ssp=HIDE] aprun -n1 test.exe.trace
Apprentice2 Visualizer
Apprentice2 is designed to help identify and correct the following:
-
Excessive communication
-
Network contention
-
Load imbalance
-
Excessive serialization
Supports
-
Call graph profile
-
Communication statistics
-
Timeline view
-
Communication
-
Input/output (I/O)
-
Activity view
-
Pairwise communication statistics
-
Text reports
-
Source code mapping
Apprentice2 (invoked with app2) takes as input an XML file. The input file is generated as follows:
pat_report -c records -f xml <perf.file>.xf > <perf.file>.xml
gzip <perf.file>.xml
The gzip part is optional but recommended because the XML files can get quite large.
Visualization is possible with both profiles and trace files, but Apprentice2 has less functionality with profiles. The following features are supported for profiles (run-time summaries):
-
Call graph view
-
Function statistics overview
-
Function report
-
PE breakdown
-
General information
Hardware Performance Counters
pat_hwpc
pat_hwpc collects hardware performance counter information for an application. No instrumentation is required. Usage is as follows:
pat_hwpc [options] <executable>
pat_hwpc accepts various hardware counter groups and produces a report with raw counts and derived metrics for the whole execution. The hardware counters are summed across all threads in each process.
As an example, to collect statistics on an entire execution with pat_hwpc:
-
No recompiling, relinking, or instrumentation is required.
-
You can simply run the application under pat_hwpc to produce a report:
pat_hwpc [-d P -d E] aprun -n1 test.exe
You may need to set PAT_HWPC_APPLACE_TIME to a value larger than the default of 5 (for example, 20 or 60) to be able to use pat_hwpc. Otherwise, you may get a message saying something to the effect of “not able to start application.” This environment variable is the number of seconds allotted for the application to be scheduled for execution before pat_hwpc terminates.
Totals for program
------------------------------------------------------------------------
Cycles                        9.219 secs       3687579361 cycles
Instructions graduated      625.832M/sec       5769509199 instr
Branches & Jumps              6.543M/sec         60322302 instr
Branches mispredicted         0.292M/sec          2692264 misses    4.5%
Correctly predicted           6.251M/sec         57630038          95.5%
Vector instructions         348.059M/sec       3208733836 instr    55.6%
Scalar instructions         277.773M/sec       2560775363 instr    44.4%
Vector ops                15528.961M/sec     143160693833 ops
Vector FP adds             7088.614M/sec      65349570400 ops
Vector FP multiplies       7106.030M/sec      65510120760 ops
Vector FP divides etc         0.001M/sec            12708 ops
Vector FP misc               10.220M/sec         94213419 ops
Vector FP ops             14204.865M/sec     130953917287 ops     100.0%
Scalar FP ops                 0.003M/sec            28177 ops       0.0%
Total FP ops              14204.868M/sec     130953945464 ops
FP ops per load                                     12.18
Scalar integer ops            4.053M/sec         37368261 ops
Scalar memory refs           82.084M/sec        756725277 refs      7.0%
Vector TLB misses               202 /sec             1867 misses
Scalar TLB misses                53 /sec              496 misses
Instr TLB misses                 37 /sec              347 misses
Total TLB misses                293 /sec             2710 misses
Dcache references            81.956M/sec        755550421 refs
Dcache bypass refs            0.127M/sec          1174856 refs
Dcache misses                 3.508M/sec         32337195 misses
Vector integer adds           0.074M/sec           681031 ops
Vector logical ops          172.972M/sec       1594618026 ops
Vector shifts                57.677M/sec        531720865 ops
Vector loads                966.865M/sec       8913480904 refs
Vector stores               117.717M/sec       1085230909 refs
Vector memory refs         1084.583M/sec       9998711813 refs     93.0%
Scalar memory refs           82.084M/sec        756725277 refs      7.0%
Total memory refs          1166.666M/sec      10755437090 refs
Average vector length                               44.62
A-reg Instr                 173.494M/sec       1599433367 instr
Scalar FP Instr               0.003M/sec            28177 instr
Syncs Instr                   0.725M/sec          6682128 instr
Stall VLSU                    7.705 secs       3082157516 clks
Stall VU                      6.220 secs       2487845272 clks
Vector Load Alloc           522.613M/sec       4817941865 refs
Vector Load Index             8.710M/sec         80301038 refs
Vector Load Stride            0.007M/sec            61414 refs
Vector Store Alloc           65.338M/sec        602349028 refs
Vector Store Stride             798 /sec             7359 refs
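Several of the derived values in this report can be recomputed from the raw counts, which helps when reading a report of your own. The sketch below is a cross-check, not a description of how pat_hwpc works internally. The formulas are inferred from the numbers in the sample report, and the variable names are chosen for illustration. In particular, the value labeled "FP ops per load" appears to match total FP ops divided by total memory refs.

```python
# Raw counts copied from the sample pat_hwpc report above.
cycles = 3687579361
elapsed_secs = 9.219
instructions = 5769509199
branches = 60322302
branch_misses = 2692264
vector_instructions = 3208733836
vector_ops = 143160693833
total_fp_ops = 130953945464
vector_memory_refs = 9998711813
total_memory_refs = 10755437090


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Inferred formulas (assumptions, checked only against this one report).
derived = {
    # Cycles divided by elapsed time gives the approximate clock rate.
    "clock_mhz": round(cycles / elapsed_secs / 1e6),
    "branch_mispredict_pct": percent(branch_misses, branches),
    "vector_instr_pct": percent(vector_instructions, instructions),
    "vector_mem_ref_pct": percent(vector_memory_refs, total_memory_refs),
    # Elements processed per vector instruction.
    "avg_vector_length": round(vector_ops / vector_instructions, 2),
    # Matches the "FP ops per load" line in the report.
    "fp_ops_per_mem_ref": round(total_fp_ops / total_memory_refs, 2),
}

for name, value in derived.items():
    print(f"{name:24s} {value}")
```

Because the elapsed time in the report is rounded to three decimals, rates recomputed from it (such as M/sec values) may differ slightly in the last digits from those printed by pat_hwpc.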