PerfSuite

PerfSuite is a collection of tools, utilities, and libraries for software performance analysis. It is relatively easy to use and is targeted to users at all levels of expertise.

  • Version 1.1.4, built with GCC and PAPI

PerfSuite is available as a module. To load PerfSuite, use the module load command as follows:

$ module load perf/perfsuite/1.1.4

A measurement consists of two steps:

  1. Run the program with the psrun command.
  2. View the statistics with the psprocess command.

To measure code performance using PerfSuite, use the psrun command as follows:

For serial codes:

$ psrun [-c configuration.xml] a.out

For OpenMP codes:

$ export OMP_NUM_THREADS=x
$ psrun -p [-c configuration.xml] a.out

For MPI codes:

$ module load mpi/openmpi
 
$ mpirun -np xx psrun -f [-c configuration.xml] a.out

An XML file is generated at the end of execution.

The psrun command uses an XML configuration file to determine the mode of operation (counting or profiling). If you do not specify an XML configuration file, the default file itimer.xml will be used.

For post-processing the output, use the psprocess command as follows:

$ psprocess a.out.pid.hostname.xml
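
A parallel run typically leaves several XML files in the working directory, since each file name carries a process ID and a host name. The following is a minimal sketch for post-processing them in one pass. It is not part of PerfSuite: the function name process_reports and the PSPROCESS override variable are hypothetical, and it assumes that psprocess writes its report to standard output.

```shell
# Post-process every psrun XML report that belongs to one program.
# process_reports and PSPROCESS are illustrative names, not part of PerfSuite.
process_reports() {
    prog=$1
    for xml in "$prog".*.xml; do
        # If nothing matches, the pattern is left unexpanded; skip it.
        [ -e "$xml" ] || continue
        # Write the text report next to the XML file.
        "${PSPROCESS:-psprocess}" "$xml" > "${xml%.xml}.txt"
    done
}

process_reports a.out
```

Each report ends up in a .txt file next to its XML source, for example a.out.pid.hostname.txt.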

Example

Run the Matmul program (omp_matmul_03.exe) with N=5000 using 8 OpenMP threads.

Execute:

$ export OMP_NUM_THREADS=8
$ psrun ./omp_matmul_03.exe 5000

The output file is omp_matmul_03.exe.49333.mesoshared.xml (program name, process ID, host name).

Post-processing:

$ psprocess omp_matmul_03.exe.49333.mesoshared.xml

Statistics output:

PerfSuite Hardware Performance Summary Report
 
Version                      : 1.0
Created                      : Tue Dec 06 16:39:13 CET 2016
Generator                    : psprocess 0.5
XML Source                   : omp_matmul_03.exe.49333.mesoshared.xml
 
Execution Information
============================================================================================
Collector                    : libpshwpc
Date                         : Tue Dec  6 16:37:47 2016
Host                         : mesoshared
Process ID                   : 49333
Thread                       : 0
User                         : kmazouzi
Command                      : omp_matmul_03.exe
 
Processor and System Information
============================================================================================
Node CPUs                    : 32
Vendor                       : Intel
Brand                        : Intel(R) Xeon(R) CPU           X7550  @ 2.00GHz
CPUID                        : family: 6, model: 46, stepping: 6
CPU Revision                 : 6
Clock (MHz)                  : 2000.180
Memory (MB)                  : 64409.21
Pagesize (KB)                : 4
 
Cache Information
============================================================================================
Cache levels                 : 3
--------------------------------
Level 1
Type                         : instruction
Size (KB)                    : 32
Linesize (B)                 : 64
Assoc                        : 4
Type                         : data
Size (KB)                    : 32
Linesize (B)                 : 64
Assoc                        : 8
--------------------------------
Level 2
Type                         : unified
Size (KB)                    : 256
Linesize (B)                 : 64
Assoc                        : 8
--------------------------------
Level 3
Type                         : unified
Size (KB)                    : 18432
Linesize (B)                 : 64
Assoc                        : 24
 
Index Description                                                              Counter Value
============================================================================================
    1 Conditional branch instructions..................................           1000412152
    2 Branch instructions..............................................           1058656734
    3 Conditional branch instructions mispredicted.....................              1436660
    4 Conditional branch instructions taken............................            867951726
    5 Floating point operations........................................           4377146474
    6 Level 1 data cache accesses......................................           9466165484
    7 Level 1 data cache misses........................................            603760004
    8 Level 1 instruction cache accesses...............................           4358373809
    9 Level 1 instruction cache misses.................................              3199281
   10 Level 2 instruction cache accesses...............................              3064801
   11 Level 2 instruction cache misses.................................              2001209
   12 Level 2 total cache accesses.....................................            956543561
   13 Level 2 cache misses.............................................            372782206
   14 Load instructions................................................           4994753947
   15 Cycles stalled on any resource...................................           5075909775
   16 Store instructions...............................................           2278401353
   17 Data translation lookaside buffer misses.........................              1037980
   18 Instruction translation lookaside buffer misses..................                 4486
   19 Total cycles.....................................................          10721046042
   20 Instructions issued..............................................          11870790930
   21 Instructions completed...........................................          12961269121
 
Event Index
============================================================================================
    1: PAPI_BR_CN          2: PAPI_BR_INS         3: PAPI_BR_MSP         4: PAPI_BR_TKN     
    5: PAPI_FP_OPS         6: PAPI_L1_DCA         7: PAPI_L1_DCM         8: PAPI_L1_ICA     
    9: PAPI_L1_ICM        10: PAPI_L2_ICA        11: PAPI_L2_ICM        12: PAPI_L2_TCA     
   13: PAPI_L2_TCM        14: PAPI_LD_INS        15: PAPI_RES_STL       16: PAPI_SR_INS     
   17: PAPI_TLB_DM        18: PAPI_TLB_IM        19: PAPI_TOT_CYC       20: PAPI_TOT_IIS    
   21: PAPI_TOT_INS    
 
Statistics
============================================================================================
Counting domain........................................................                 user
Multiplexed............................................................                  yes
Floating point operations per cycle....................................                0.408
Floating point operations per graduated instruction....................                0.338
Graduated instructions per cycle.......................................                1.209
Issued instructions per cycle..........................................                1.107
Graduated instructions per issued instruction..........................                1.092
Issued instructions per level 1 instruction cache miss.................             3710.456
Graduated instructions per level 1 instruction cache miss..............             4051.307
Level 1 data cache accesses per graduated instruction..................                0.730
% cycles stalled on any resource.......................................               47.345
Graduated loads and stores per floating point operation................                1.662
Level 1 instruction cache misses per issued instruction................                0.000
Level 1 cache miss ratio (data)........................................                0.064
Level 1 cache miss ratio (instruction).................................                0.001
Level 2 cache miss ratio (data), data cache miss and access counts derived                0.389
Level 2 cache miss ratio (instruction).................................                0.653
Bandwidth used to level 2 cache (MB/s).................................             4451.097
MFLOPS (cycles)........................................................              816.626
MFLOPS (wall clock)....................................................              659.210
MIPS (cycles)..........................................................             2418.129
MIPS (wall clock)......................................................             1952.000
CPU time (seconds).....................................................                5.360
Wall clock time (seconds)..............................................                6.640
% CPU utilization......................................................               80.724
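
Several of the derived statistics can be cross-checked against the raw counter values. The sketch below assumes they are plain ratios of the counters listed in the report; this is inferred from the numbers above, not taken from the PerfSuite documentation. The ratio helper is hypothetical.

```shell
# ratio NUM DEN [SCALE]: print NUM / DEN * SCALE with three decimals.
# Hypothetical helper for cross-checking; not a PerfSuite command.
ratio() {
    awk -v n="$1" -v d="$2" -v s="${3:-1}" 'BEGIN { printf "%.3f\n", n / d * s }'
}

ratio 5075909775 10721046042 100   # % cycles stalled: PAPI_RES_STL / PAPI_TOT_CYC
ratio 4377146474 10721046042       # FP operations per cycle: PAPI_FP_OPS / PAPI_TOT_CYC
ratio 12961269121 10721046042      # graduated instructions per cycle: PAPI_TOT_INS / PAPI_TOT_CYC
ratio 603760004 9466165484         # L1 data cache miss ratio: PAPI_L1_DCM / PAPI_L1_DCA
```

The four results (47.345, 0.408, 1.209, 0.064) agree with the corresponding lines of the Statistics section.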

PerfSuite Command-Line Tools

Command-Line Utilities

  • psinv: a utility that provides access to information about the characteristics of a machine (e.g., processor type, processor features, cache information, available performance counters)
  • psprocess: a utility that assists with a number of common tasks related to pre- and post-processing of performance measurements
  • psrun: a utility for hardware performance event counting and profiling of single-threaded, POSIX threads-based, and MPI applications. Performance counter multiplexing is supported. Optionally, psrun can also report information about the resource usage of your application such as memory usage, page faults, etc. psrun requires no source code changes to or relinking of your application.
  • psconfigs: a tool for easy management of PerfSuite configuration files (requires an X session)

No XML configuration file is currently available for gathering overall performance characteristics on Haswell or Broadwell processors.
