PerfSuite
PerfSuite is a collection of tools, utilities, and libraries for software performance analysis. It is relatively easy to use and is targeted to users at all levels of expertise.
Installed version
- 1.1.4 built with GCC and PAPI
PerfSuite is available as a module. To load PerfSuite, use the module load command as follows:
$ module load perf/perfsuite/1.1.4
Basic usage
- Run the program with the psrun command.
- View the statistics with the psprocess command.
To measure code performance using PerfSuite, use the psrun command as follows:
For serial codes:
$ psrun [-c configuration.xml] a.out
For OpenMP codes:
$ export OMP_NUM_THREADS=x
$ psrun -p [-c configuration.xml] a.out
For MPI codes:
$ module load mpi/openmpi
$ mpirun -np xx psrun -f [-c configuration.xml] a.out
An XML file is generated at the end of the execution.
The psrun command uses an XML configuration file to determine the mode of operation (counting or profiling). If you do not specify an XML configuration file, the default file itimer.xml will be used.
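The three launch styles above differ only in the psrun flag and in the optional mpirun wrapper. As a rough sketch, a small Python helper could assemble the corresponding argument list. It uses only the flags shown above (-p, -f, -c); the function name build_psrun_command and the mode labels "serial", "openmp" and "mpi" are hypothetical and not part of PerfSuite.

```python
from typing import List, Optional


def build_psrun_command(
    program: str,
    mode: str = "serial",
    config: Optional[str] = None,
    nprocs: Optional[int] = None,
) -> List[str]:
    """Return an argument list for one of the launch styles shown above.

    Hypothetical helper for illustration only; it is not a PerfSuite API.
    """
    psrun = ["psrun"]
    if mode == "openmp":
        psrun.append("-p")  # flag shown above for OpenMP codes
    elif mode == "mpi":
        psrun.append("-f")  # flag shown above for MPI codes
    elif mode != "serial":
        raise ValueError(f"unknown mode: {mode!r}")

    if config is not None:
        # Without -c, psrun falls back to the default itimer.xml.
        psrun += ["-c", config]
    psrun.append(program)

    if mode == "mpi":
        if nprocs is None:
            raise ValueError("nprocs is required for MPI runs")
        return ["mpirun", "-np", str(nprocs)] + psrun
    return psrun


print(" ".join(build_psrun_command("a.out")))
print(" ".join(build_psrun_command("a.out", mode="openmp", config="configuration.xml")))
print(" ".join(build_psrun_command("a.out", mode="mpi", nprocs=4)))
```

The returned list could be passed to subprocess.run. For OpenMP runs, OMP_NUM_THREADS still has to be set in the environment, and for MPI runs the MPI module still has to be loaded beforehand.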
For post-processing the output, use the psprocess command as follows:
$ psprocess a.out.pid.hostname.xml
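The output file name follows the pattern program.pid.hostname.xml. When a run leaves several such files behind, it can help to split the name back into its parts. The following is a minimal sketch; the names PsrunOutput and parse_output_name are hypothetical, and the code assumes that the hostname contains no dots.

```python
from typing import NamedTuple


class PsrunOutput(NamedTuple):
    program: str
    pid: int
    hostname: str


def parse_output_name(filename: str) -> PsrunOutput:
    """Split 'program.pid.hostname.xml' into its three components.

    Splitting from the right keeps dots inside the program name intact
    (for example 'omp_matmul_03.exe'). Assumes a dot-free hostname.
    """
    parts = filename.rsplit(".", 3)
    if len(parts) != 4:
        raise ValueError(f"unexpected file name: {filename!r}")
    program, pid, hostname, ext = parts
    if ext != "xml" or not pid.isdigit():
        raise ValueError(f"unexpected file name: {filename!r}")
    return PsrunOutput(program, int(pid), hostname)


# File name taken from the example on this page.
info = parse_output_name("omp_matmul_03.exe.49333.mesoshared.xml")
print(info.program, info.pid, info.hostname)
```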
Example
Run the Matmul program (omp_matmul_03.exe) with N=5000.
Execute:
$ export OMP_NUM_THREADS=8
$ psrun ./omp_matmul_03.exe 5000
The output file is omp_matmul_03.exe.49333.mesoshared.xml
Postprocessing:
$ psprocess omp_matmul_03.exe.49333.mesoshared.xml
Statistics output:
PerfSuite Hardware Performance Summary Report

Version    : 1.0
Created    : Tue Dec 06 16:39:13 CET 2016
Generator  : psprocess 0.5
XML Source : omp_matmul_03.exe.49333.mesoshared.xml

Execution Information
============================================================================================
Collector  : libpshwpc
Date       : Tue Dec 6 16:37:47 2016
Host       : mesoshared
Process ID : 49333
Thread     : 0
User       : kmazouzi
Command    : omp_matmul_03.exe

Processor and System Information
============================================================================================
Node CPUs     : 32
Vendor        : Intel
Brand         : Intel(R) Xeon(R) CPU X7550 @ 2.00GHz
CPUID         : family: 6, model: 46, stepping: 6
CPU Revision  : 6
Clock (MHz)   : 2000.180
Memory (MB)   : 64409.21
Pagesize (KB) : 4

Cache Information
============================================================================================
Cache levels : 3
--------------------------------
Level 1
Type         : instruction
Size (KB)    : 32
Linesize (B) : 64
Assoc        : 4
Type         : data
Size (KB)    : 32
Linesize (B) : 64
Assoc        : 8
--------------------------------
Level 2
Type         : unified
Size (KB)    : 256
Linesize (B) : 64
Assoc        : 8
--------------------------------
Level 3
Type         : unified
Size (KB)    : 18432
Linesize (B) : 64
Assoc        : 24

Index Description                                                       Counter Value
============================================================================================
 1 Conditional branch instructions.................................. 1000412152
 2 Branch instructions.............................................. 1058656734
 3 Conditional branch instructions mispredicted..................... 1436660
 4 Conditional branch instructions taken............................ 867951726
 5 Floating point operations........................................ 4377146474
 6 Level 1 data cache accesses...................................... 9466165484
 7 Level 1 data cache misses........................................ 603760004
 8 Level 1 instruction cache accesses............................... 4358373809
 9 Level 1 instruction cache misses................................. 3199281
10 Level 2 instruction cache accesses............................... 3064801
11 Level 2 instruction cache misses................................. 2001209
12 Level 2 total cache accesses..................................... 956543561
13 Level 2 cache misses............................................. 372782206
14 Load instructions................................................ 4994753947
15 Cycles stalled on any resource................................... 5075909775
16 Store instructions............................................... 2278401353
17 Data translation lookaside buffer misses......................... 1037980
18 Instruction translation lookaside buffer misses.................. 4486
19 Total cycles..................................................... 10721046042
20 Instructions issued.............................................. 11870790930
21 Instructions completed........................................... 12961269121

Event Index
============================================================================================
 1: PAPI_BR_CN     2: PAPI_BR_INS    3: PAPI_BR_MSP    4: PAPI_BR_TKN
 5: PAPI_FP_OPS    6: PAPI_L1_DCA    7: PAPI_L1_DCM    8: PAPI_L1_ICA
 9: PAPI_L1_ICM   10: PAPI_L2_ICA   11: PAPI_L2_ICM   12: PAPI_L2_TCA
13: PAPI_L2_TCM   14: PAPI_LD_INS   15: PAPI_RES_STL  16: PAPI_SR_INS
17: PAPI_TLB_DM   18: PAPI_TLB_IM   19: PAPI_TOT_CYC  20: PAPI_TOT_IIS
21: PAPI_TOT_INS

Statistics
============================================================================================
Counting domain........................................................ user
Multiplexed............................................................ yes
Floating point operations per cycle.................................... 0.408
Floating point operations per graduated instruction.................... 0.338
Graduated instructions per cycle....................................... 1.209
Issued instructions per cycle.......................................... 1.107
Graduated instructions per issued instruction.......................... 1.092
Issued instructions per level 1 instruction cache miss................. 3710.456
Graduated instructions per level 1 instruction cache miss.............. 4051.307
Level 1 data cache accesses per graduated instruction.................. 0.730
% cycles stalled on any resource....................................... 47.345
Graduated loads and stores per floating point operation................ 1.662
Level 1 instruction cache misses per issued instruction................ 0.000
Level 1 cache miss ratio (data)........................................ 0.064
Level 1 cache miss ratio (instruction)................................. 0.001
Level 2 cache miss ratio (data), data cache miss and access counts derived 0.389
Level 2 cache miss ratio (instruction)................................. 0.653
Bandwidth used to level 2 cache (MB/s)................................. 4451.097
MFLOPS (cycles)........................................................ 816.626
MFLOPS (wall clock).................................................... 659.210
MIPS (cycles).......................................................... 2418.129
MIPS (wall clock)...................................................... 1952.000
CPU time (seconds)..................................................... 5.360
Wall clock time (seconds).............................................. 6.640
% CPU utilization...................................................... 80.724
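Several entries in the Statistics section appear to be simple ratios of the raw counters. The sketch below recomputes a few of them from the counter values, clock rate and wall clock time in the report. The formulas are inferred from the metric names and are an assumption, not taken from the psprocess source; the results should agree with the report to within rounding.

```python
# Raw values copied from the report above.
COUNTERS = {
    "PAPI_FP_OPS": 4377146474,
    "PAPI_L1_DCA": 9466165484,
    "PAPI_L1_DCM": 603760004,
    "PAPI_RES_STL": 5075909775,
    "PAPI_TOT_CYC": 10721046042,
    "PAPI_TOT_IIS": 11870790930,
    "PAPI_TOT_INS": 12961269121,
}
CLOCK_MHZ = 2000.180
WALL_CLOCK_S = 6.640


def derived_metrics(c, clock_mhz, wall_clock_s):
    """Recompute some report statistics using assumed (inferred) formulas."""
    cycles = c["PAPI_TOT_CYC"]
    # CPU time is taken as total cycles divided by the clock rate in Hz.
    cpu_time_s = cycles / (clock_mhz * 1e6)
    return {
        "fp_ops_per_cycle": c["PAPI_FP_OPS"] / cycles,
        "graduated_instr_per_cycle": c["PAPI_TOT_INS"] / cycles,
        "issued_instr_per_cycle": c["PAPI_TOT_IIS"] / cycles,
        "pct_cycles_stalled": 100.0 * c["PAPI_RES_STL"] / cycles,
        "l1_data_miss_ratio": c["PAPI_L1_DCM"] / c["PAPI_L1_DCA"],
        "cpu_time_s": cpu_time_s,
        "mflops_cycles": c["PAPI_FP_OPS"] / cpu_time_s / 1e6,
        "pct_cpu_utilization": 100.0 * cpu_time_s / wall_clock_s,
    }


metrics = derived_metrics(COUNTERS, CLOCK_MHZ, WALL_CLOCK_S)
for name, value in metrics.items():
    print(f"{name:28s} {value:10.3f}")
```

Read this way, the run completes about 1.2 instructions per cycle and stalls on some resource for nearly half of all cycles.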
PerfSuite Command-Line Tools
Command-Line Utilities
- psinv: a utility that provides access to information about the characteristics of a machine (e.g., processor type, processor features, cache information, available performance counters)
- psprocess: a utility that assists with a number of common tasks related to pre- and post-processing of performance measurements
- psrun: a utility for hardware performance event counting and profiling of single-threaded, POSIX threads-based, and MPI applications. Performance counter multiplexing is supported. Optionally, psrun can also report information about the resource usage of your application such as memory usage, page faults, etc. psrun requires no source code changes to or relinking of your application.
- psconfigs: a tool for easy management of PerfSuite configuration files (needs an X session)
No XML file is currently available to gather overall performance characteristics for Haswell or Broadwell processors.