Intel Cluster Studio XE 2013 (upgrade 15.0.3)
Intel Cluster Studio XE 2013 meets the challenges facing HPC developers by providing, for the first time, a comprehensive suite of tools that enables developers to boost HPC application performance and reliability. It combines Intel’s proven cluster tools with Intel’s advanced threading/memory correctness analysis and performance profiling tools to enable scaling application development for today’s and tomorrow’s HPC cluster systems. It contains:
- Intel® C++ Compiler XE 14.0.2
- Intel® Fortran Compiler XE 14.0.2
- Intel® Debugger 13.0 Update 1 (for Linux* OS only)
- GNU* Project Debugger (GDB*) 7.5
- Intel® Integrated Performance Primitives 8.0 Update 1
- Intel® Threading Building Blocks (TBB) 4.2
- Intel® Math Kernel Library (MKL) 11.1
- Intel® MPI Library 4.1 Update 1
- Intel® Trace Analyzer and Collector 8.1 Update 3
Compiling Applications
Before you start compiling your application, you need to load the corresponding module:
$ module load intel/14.0.2
Sequential programs: ifort (Fortran), icc (C), icpc (C++)
Parallel programs (MPI): mpiifort, mpiicc, mpiicpc
Compiler command line:
$ icc [options] myprogram.c -o myprogram
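For MPI programs, the same pattern applies with the corresponding wrapper; an illustrative example:
$ mpiicc [options] myprogram.c -o myprogram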
Compiler Options
Selected Intel compiler options:
Option | Description |
---|---|
-O0 | Disables all optimizations. Recommended for program development and debugging |
-O1 | Enables optimization for speed, while being aware of code size (e.g., no loop unrolling). |
-O2 | Default optimization. Optimizations for speed, including global code scheduling, software pipelining, predication, and speculation. |
-fast | Shorthand for -O3 -ipo -static -xHOST -no-prec-div. Note that the processor-specific optimization flag (-xHOST), which optimizes for the processor the code is compiled on, is the only flag of -fast that may be overridden. |
-O3 | -O2 optimizations plus more aggressive optimizations such as prefetching, scalar replacement, and loop transformations. Enables optimizations for technical computing applications (loop-intensive code): loop optimizations and data prefetch. |
-axCORE-AVX2 | This option tells the compiler to generate multiple, processor-specific auto-dispatch code paths for Intel processors if there is a performance benefit. It also generates a baseline code path which can run on non-AVX processors. The Intel processor-specific auto-dispatch path is usually more optimized than the baseline path. |
-p | Compiles and links for function profiling with gprof. |
-g | Produces a symbol table, i.e. line numbers for profiling are available. |
-openmp | Enables the parallelizer to generate multithreaded code based on OpenMP directives. |
-parallel | Tells the auto-parallelizer to generate multithreaded code for loops that can be safely executed in parallel. To use this option, you must also specify -O2 or -O3. |
-opt_report | Generates an optimization report to stderr. |
-fp-model precise | Allows users to trade off floating-point optimizations for accuracy and consistency. Intel recommends specifying -fp-model precise. The default value of the option is -fp-model fast=1, which means that the compiler uses more aggressive optimizations on floating-point calculations. |
-pc 64 | Forces the compiler to use double-precision floating point (53-bit mantissa) to ensure the portability of results. |
Because the processors of the Lumière cluster are Sandy Bridge, we strongly recommend the following options to achieve the best performance:
-axCORE-AVX2 -O2
If your application is sensitive to floating-point accuracy (the precision of the fractional part of floating-point values), you should add: -fp-model precise -pc 64. These options may degrade performance.
Examples:
Compile a C program
$ icc -axCORE-AVX2 -O3 -fp-model precise -pc 64 pgm.c -o pgm
Compile a Fortran program
$ ifort -axCORE-AVX2 -O3 -fp-model precise -pc 64 pgm.f -o pgm
Compile a C++ program
$ icpc -axCORE-AVX2 -O3 -fp-model precise -pc 64 pgm.cpp -o pgm
Compile an OpenMP program written in C
$ icc -openmp -axCORE-AVX2 -O3 -fp-model precise -pc 64 pgm.c -o pgm
Compile an OpenMP program written in Fortran
$ ifort -openmp -axCORE-AVX2 -O3 -fp-model precise -pc 64 pgm.f -o pgm
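As a minimal sketch of the kind of source the OpenMP commands above expect (this listing is only an illustration, not software provided on the cluster), a pgm.c could look like:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        /* Each thread prints its identifier; the thread count is set
           by the OMP_NUM_THREADS environment variable at run time. */
        #pragma omp parallel
        {
            printf("Hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }

Compiled with the icc -openmp command shown above, it can be run with, for example, OMP_NUM_THREADS=4 ./pgm.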
Compiler Auto Vectorization
The Intel compiler has several options for vectorization. One option is the -x flag, which tells the compiler to generate specific vectorization instructions. The -x flag takes a mandatory argument, which can be AVX (i.e., -xAVX), SSE4.2, SSE4.1, SSE3, SSE2, etc.
Using the -xHost flag enables the highest level of vectorization supported on the processor on which the user compiles. Note that the Intel compiler will try to vectorize code with SSE2 instructions at optimization levels -O2 or higher. Disable this by specifying -no-vec.
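As a sketch of the kind of loop the auto-vectorizer targets (the file name saxpy.c and the routine are purely illustrative), a loop with independent iterations and unit-stride memory access is a typical candidate:

    /* y = a*x + y over n elements: independent iterations with
       unit-stride access, a good auto-vectorization candidate. */
    void saxpy(int n, float a, const float *x, float *y)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

Compiling it with $ icc -O3 -xHost -c saxpy.c lets the compiler use the best instruction set of the build host, while adding -no-vec disables vectorization for comparison.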
The Intel compiler can generate a single executable with multiple levels of vectorization with the -ax flag, which takes the same arguments as the -x flag (i.e., CORE-AVX2, AVX, …, SSE2). This flag generates run-time checks to determine the level of vectorization supported by the processor and will then choose the optimal execution path for that processor. It also generates a baseline execution path that is taken if the level of vectorization specified with -ax is not supported. Example: -axCORE-AVX2.
Another useful option for the Intel compiler is the -vec-report flag, which generates diagnostic information regarding vectorization to stdout. The -vec-report flag takes an optional parameter that can be a number between 0 and 5 (e.g., -vec-report0), with 0 disabling diagnostics and 5 providing the most detailed diagnostics about which loops were optimized, which loops were not optimized, and why those loops were not optimized. The output can be useful for identifying possible strategies to get a loop to vectorize.
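For example, assuming the saxpy.c sketch above, vectorization diagnostics could be requested with:
$ icc -O3 -xHost -vec-report2 -c saxpy.c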
Example
Comparing a Matrix Multiply program (sequential version) compiled with the options -O3 and -O3 -xCORE-AVX2, for N = 6000:
Option | Time (s) | MFlops |
---|---|---|
-O3 | 145 | 4079 |
-O3 -xCORE-AVX2 | 49 | 4405 |
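For reference, the timings above correspond to a classic triple-loop matrix multiply; a minimal sketch of such a kernel (the names, flat arrays, and row-major layout are illustrative choices, not the exact benchmark source) is:

    /* C = A * B for n x n matrices stored row-major in flat arrays. */
    void matmul(int n, const double *a, const double *b, double *c)
    {
        int i, j, k;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++) {
                double sum = 0.0;
                for (k = 0; k < n; k++)
                    sum += a[i*n + k] * b[k*n + j];
                c[i*n + j] = sum;
            }
    }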