Intel Cluster Studio XE 2013 (upgrade 15.0.3)

Intel Cluster Studio XE 2013 meets the challenges facing HPC developers by providing, for the first time, a comprehensive suite of tools that enables developers to boost HPC application performance and reliability. It combines Intel’s proven cluster tools with Intel’s advanced threading/memory correctness analysis and performance profiling tools to enable scaling application development for today’s and tomorrow’s HPC cluster systems. It contains:

  • Intel® C++ Compiler XE 14.0.2
  • Intel® Fortran Compiler XE 14.0.2
  • Intel® Debugger 13.0 Update 1 (for Linux* OS only)
  • GNU* Project Debugger (GDB*) 7.5
  • Intel® Integrated Performance Primitives 8.0 Update 1
  • Intel® Threading Building Blocks (TBB) 4.2
  • Intel® Math Kernel Library (MKL) 11.1
  • Intel® MPI Library 4.1 Update 1
  • Intel® Trace Analyzer and Collector 8.1 Update 3

Compiling Applications

The Intel cluster suite is installed only on the Lumière cluster.

Before you start compiling your application, you need to load the corresponding module:

$ module load intel/14.0.2

Sequential programs : ifort, icc, icpc

Parallel programs (MPI) : mpiifort, mpiicc, mpiicpc

Compiler command line:

$ icc [options] myprogram.c -o myprogram 

Selected Intel compiler options:

Option             Description
-O0                Disables all optimizations. Recommended for program development and debugging.
-O1                Enables optimizations for speed while being aware of code size (e.g., no loop unrolling).
-O2                Default optimization level. Optimizations for speed, including global code scheduling, software pipelining, predication, and speculation.
-fast              Shorthand for -O3 -ipo -static -xHOST -no-prec-div. Note that -xHOST, which optimizes for the processor the code is compiled on, is the only flag of -fast that may be overridden.
-O3                -O2 optimizations plus more aggressive optimizations such as prefetching, scalar replacement, and loop transformations. Enables optimizations for technical computing applications (loop-intensive code): loop optimizations and data prefetching.
-axCORE-AVX2       Tells the compiler to generate multiple, processor-specific auto-dispatch code paths for Intel processors if there is a performance benefit, plus a baseline code path that can run on non-AVX processors. The processor-specific auto-dispatch path is usually more optimized than the baseline path.
-p                 Compiles and links for function profiling with gprof.
-g                 Produces a symbol table, i.e., line numbers are available for profiling.
-openmp            Enables the parallelizer to generate multithreaded code based on OpenMP directives.
-parallel          Tells the auto-parallelizer to generate multithreaded code for loops that can be safely executed in parallel. To use this option, you must also specify -O2 or -O3.
-opt_report        Generates an optimization report to stderr.
-fp-model precise  Allows users to trade floating-point optimizations for accuracy and consistency. Intel recommends specifying -fp-model precise. The default is -fp-model fast=1, which means the compiler uses more aggressive optimizations on floating-point calculations.
-pc 64             Forces the compiler to use double precision (53-bit mantissa) to ensure the portability of results.

Because the processors of the Lumière cluster are Sandy Bridge, we strongly recommend the following options to achieve the best performance:

-axCORE-AVX2 -O2

If your application is sensitive to the accuracy or precision of floating-point results (the fractional part of the value), you should add: -fp-model precise -pc 64. These options may degrade performance.

Examples :

Compile a C program

$ icc -axCORE-AVX2  -O3 -fp-model precise -pc 64 pgm.c -o pgm

Compile a Fortran program

$ ifort -axCORE-AVX2  -O3 -fp-model precise -pc 64 pgm.f -o pgm

Compile a C++ program

$ icpc -axCORE-AVX2  -O3 -fp-model precise -pc 64 pgm.cpp -o pgm

Compile an OpenMP program written in C

$ icc -openmp -axCORE-AVX2 -O3 -fp-model precise -pc 64 pgm.c -o pgm

Compile an OpenMP program written in Fortran

$ ifort -openmp -axCORE-AVX2  -O3 -fp-model precise -pc 64 pgm.f -o pgm

Compiler Auto Vectorization

The Intel compiler has several options for vectorization. One option is the -x flag, which tells the compiler to generate specific vectorization instructions. The -x flag takes a mandatory option, which can be AVX (i.e., -xAVX), SSE4.2, SSE4.1, SSE3, SSE2, etc.

Using the -xHost flag enables the highest level of vectorization supported on the processor on which the user compiles. Note that the Intel compiler will try to vectorize a code with SSE2 instructions at optimizations of -O2 or higher. Disable this by specifying -no-vec.

The Intel compiler can generate a single executable with multiple levels of vectorization with the -ax flag, which takes the same options as the -x flag (i.e., AVX2, AVX, ..., SSE2). This flag generates run-time checks to determine the level of vectorization supported by the processor and will then choose the optimal execution path for that processor. It also generates a baseline execution path that is taken if the -ax level of vectorization specified is not supported. Example: -axCORE-AVX2.

Another useful option for the Intel compiler is the -vec-report flag, which generates diagnostic information regarding vectorization to stdout. The -vec-report flag takes an optional parameter that can be a number between 0 and 5 (e.g., -vec-report0), with 0 disabling diagnostics and 5 providing the most detailed diagnostics about what loops were optimized, what loops were not optimized, and why those loops were not optimized. The output can be useful to identify possible strategies to get a loop to vectorize.

Comparing a matrix multiply program (sequential version) compiled with options -O3 and -O3 -xCORE-AVX2, for N=6000:

Option            Time (s)   MFlops
-O3               145        4079
-O3 -xCORE-AVX2   49         4405

Links