
This documentation is obsolete. Xeon Phi cards are disabled on this node.

XeonPHI

Intel® Xeon Phi™ coprocessors are designed to extend the reach of applications that have demonstrated the ability to fully utilize the scaling capabilities of Intel Xeon processor-based systems and fully exploit available processor vector capabilities or memory bandwidth. For such applications, the Intel Xeon Phi coprocessors offer additional power-efficient scaling, vector support, and local memory bandwidth, while maintaining the programmability and support associated with Intel Xeon processors.

MesocentreFC provides one host with 4 Xeon Phi devices:

1 Device (mic1)

Family: 7120P
Number of cores: 61
Frequency of cores: 1.28 GHz
GDDR5 memory size: 16 GB
Number of hardware threads per core: 4
SIMD vector registers: 32 (512-bit wide) per thread context
Level 2 cache size: 61 x 512 KB, 8-way associative, inclusive
Theoretical peak performance: 1.2 TFlop/s (DP)

3 Devices (mic0,mic2,mic3)

Family: 5110P
Number of cores: 60
Frequency of cores: 1.053 GHz
GDDR5 memory size: 8 GB
Number of hardware threads per core: 4
Level 2 cache size: 30 MB (60 x 512 KB)
Theoretical peak performance: 1.0 TFlop/s (DP)

Host

Number of cores: 8
Processor family: Intel(R) Xeon(R) IvyBridge
Frequency: 2.6 GHz
Memory size: 128 GB

Is the Intel® Xeon Phi™ coprocessor right for me?

You can use Intel Xeon processors and Intel Xeon Phi coprocessors together to optimize performance for almost any workload. To take full advantage of Intel Xeon Phi coprocessors, an application must scale well to over one hundred threads, and either make extensive use of vectors or efficiently use more local memory bandwidth than is available on an Intel Xeon processor. Learn more at https://software.intel.com/en-us/articles/is-the-intel-xeon-phi-coprocessor-right-for-me

Access

The node hosting the Xeon Phi cards is configured in an SGE queue called xphi.q on the lumiere cluster.

Access to the node is exclusive, i.e. only one user can use it at a time.

By default, access to the xphi.q queue is disabled. Contact us if you need to use a Xeon Phi card.

First we connect to the xphi.q queue (since access is exclusive, we can request the full memory, i.e. 128 GB).

From the lumiere cluster (mesologin1.univ-fcomte.fr):

[user@mesologin1 ~]$ qlogin -q xphi.q -l h_vmem=128G

A remote shell is opened on node1-51:

Your job 79206 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 79206 has been successfully scheduled.
Establishing builtin session to host node1-51.cluster ...
 
[user@node1-51 ~]$

Change directory to WORK (cd $WORK), because HOME is read-only on all compute nodes.

You can check the state of the MIC cards by running:

[user@node1-51 ~]$ micinfo

To compile and execute on the MIC cards you need to load the intel/mic module:

 
[user@node1-51 ~]$ module load intel/mic

Check the next section to learn how to compile and execute parallel programs on Xeon PHI.

How to?

In this section we briefly describe how to use the Xeon Phi effectively.

To benefit from the MIC card's performance, your application MUST scale to hundreds of threads and make heavy use of vectorization.

Applications exploiting this parallelism are often based on OpenMP, MKL or MPI.

We describe the two ways to execute OpenMP code on the MIC card:

  • Native mode: compile on the host, run on the MIC
  • Offload mode: compile on the host, run on the host and offload regions to the MIC

Let's use a classic OpenMP hello world example.

hello.c
 
#include <omp.h>
#include <stdio.h>
 
int main()
{
    int nthreads, tid;
 
    /* Fork a team of threads with each thread having a private tid variable */
    #pragma omp parallel private(tid)
    {
        /* Obtain and print the thread id */
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);
 
        /* Only the master thread does this */
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    }  /* All threads join the master thread and terminate */
 
    return 0;
}

Native mode

The simplest model for running applications on the Intel Xeon Phi coprocessor is native mode. In native mode an application is compiled on the host using the compiler switch -mmic to generate code for the MIC architecture. The binary can then be copied to the coprocessor and started there, or launched from the host with micnativeloadex, which automatically uploads and executes the code on the coprocessor.

To compile for mic:

$ icc -openmp -mmic hello.c -o hello.mic

To execute on mic0 (dynamic load):

$ micnativeloadex ./hello.mic

By default mic0 is used; use -d 1 to run on mic1, for example.

...
Hello World from thread = 207
Hello World from thread = 109
Hello World from thread = 219
Hello World from thread = 107
Hello World from thread = 95
Hello World from thread = 124
Hello World from thread = 110
Hello World from thread = 189
..
..

This execution uses all the threads available on the card. We can limit the number of threads with:

$ micnativeloadex ./hello.mic -e "OMP_NUM_THREADS=40"

In general we use:

$ micnativeloadex ./application -a "arg1 arg2 ..." -e "env1 env2 ..." -d mic3
  • -a gives the list of application arguments
  • -e passes environment variables
  • -d gives the (zero-based) index of the Intel(R) Xeon Phi(TM) coprocessor to run the application on

Offload mode

Data transfers and remote execution are explicitly controlled using compiler offload pragmas/directives.

Let's modify our example:

hello_offload.c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#define NAMELENGTH 256
 
int main()
{
    char hostname[NAMELENGTH];
    int nthreads, tid;
 
    /* Get the coprocessor hostname */
    #pragma offload target(mic)
    gethostname(hostname, NAMELENGTH);
 
    /* Fork a team of threads on the coprocessor, each with a private tid */
    #pragma offload target(mic)
    #pragma omp parallel private(tid)
    {
        /* Obtain and print the thread id */
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);
 
        /* Only the master thread does this */
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d, running on %s\n", nthreads, hostname);
        }
    }  /* All threads join the master thread and terminate */
 
    return 0;
}

To compile for offload mode:

$ icc -openmp hello_offload.c -o hello_off

Once the program is compiled, we can simply run it on the host; the offload portions will automatically be sent to a MIC card.

To execute on the MIC using 4 threads, for example, just run:

$ export MIC_ENV_PREFIX=MIC
$ MIC_OMP_NUM_THREADS=4 ./hello_off
Hello World from thread = 0
Number of threads = 4, running on node1-51-mic0
Hello World from thread = 2
Hello World from thread = 3
Hello World from thread = 1

Manage multiple devices

  1. Explicitly choose the device in the code, for example to offload to mic1:
     #pragma offload target (mic:1)
  2. Use the OFFLOAD_DEVICES environment variable, for example
    export OFFLOAD_DEVICES=0,1

    to offload to mic0 and mic1 (see the sketch below).
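
As an illustration, here is a minimal sketch (not taken from the original documentation) that combines both approaches: it queries the number of coprocessors at runtime with _Offload_number_of_devices() from the Intel compiler's offload.h and then offloads to each card explicitly. The file name multi_mic.c and the use of gethostname() are only illustrative choices.

multi_mic.c
#include <stdio.h>
#include <unistd.h>
#include <offload.h>   /* Intel compiler offload API: _Offload_number_of_devices() */
 
int main()
{
    int i, num_devices = _Offload_number_of_devices();
    printf("Found %d coprocessor(s)\n", num_devices);
 
    for (i = 0; i < num_devices; i++) {
        char name[256];
 
        /* target(mic:i) selects the i-th coprocessor explicitly */
        #pragma offload target(mic:i) out(name)
        gethostname(name, sizeof(name));
 
        printf("Offload region %d ran on %s\n", i, name);
    }
    return 0;
}

It is compiled like the other offload examples ($ icc multi_mic.c -o multi_mic); if OFFLOAD_DEVICES is set, the indices refer to the restricted set of devices.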

Hybrid mode

In hybrid OpenMP we can execute some program sections on the MIC and others on the host.

hybridOMP.c
#include <omp.h>
#include <stdio.h>
 
int main()
{
    int i;
 
    /* Run on the host */
    #pragma omp parallel for
    for (i = 0; i < 5; i++) {
        printf("x %d\n", i);
    }
 
    /* Run on the MIC */
    #pragma offload target (mic)
    #pragma omp parallel for
    for (i = 0; i < 5; i++) {
        printf(". %d\n", i);
    }
 
    return 0;
}

Q: How to control the number of threads on the MIC and on the host?

A: We use environment variables. For example, 240 threads on the MIC and 12 on the host:

export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=240
export OMP_NUM_THREADS=12

and then run the application on the host:

./hybridOMP

MKL

  • Support for the Intel® Xeon Phi™ coprocessors was introduced starting with Intel® MKL 11.0
  • Heterogeneous computing: takes advantage of both the multicore host and the many-core coprocessors
  • All Intel MKL functions are supported, but optimized at different levels

Highly Optimized Functions:

  • BLAS Level 3, and much of Level 1 & 2
  • Sparse BLAS: ?CSRMV, ?CSRMM
  • Some important LAPACK routines (LU, QR, Cholesky)
  • Fast Fourier transforms
  • Vector Math Library
  • Random number generators in the Vector Statistical Library

The following 3 usage models of MKL are available for the Xeon Phi:

  1. Automatic Offload
  2. Compiler Assisted Offload
  3. Native Execution

Automatic Offload (AO)

In the case of automatic offload the user does not have to change the code at all. For automatic-offload-enabled functions the runtime may automatically transfer data to the Xeon Phi coprocessor and execute (all or part of) the computations there. The data transfer and the execution management are completely automatic and transparent for the user.

As of Intel MKL 11.0.2 only the following functions are enabled for automatic offload:

  • Level-3 BLAS functions
    • GEMM (for m,n > 2048, k > 256)
    • ?TRSM (for M,N > 3072)
    • ?TRMM (for M,N > 3072)
    • ?SYMM (for M,N > 2048)
  • LAPACK functions
    • LU (M,N > 8192)
    • QR
    • Cholesky

The AO is enabled either by using the function mkl_mic_enable() or by setting the following environment variable:

$ export MKL_MIC_ENABLE=1

To see more information about Automatic Offload and for debugging, set the environment variable:

$ export OFFLOAD_REPORT=2

This value can be set from 0 to 3, where a higher number means more debugging information.

Since the host has several devices, use the OFFLOAD_DEVICES environment variable to choose the offload device(s). For example, export OFFLOAD_DEVICES=1,2 offloads to mic1 and mic2.

To build a program for automatic offload, the code is built the same way as for the Xeon host:

$ icc -O3 -mkl file.c -o file

By default, the MKL library decides when to offload and also tries to determine the optimal work division between the host and the targets (MKL can take advantage of multiple coprocessors). In case of the BLAS routines the user can specify the work division between the host and the coprocessor by calling the routine:

mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5)

or by setting the environment variable

$ export MKL_MIC_0_WORKDIVISION=0.5

Both examples specify that 50% of the computation should be offloaded only to the 1st card (card #0).
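
For reference, here is a minimal C sketch (our own illustration, not part of the original page) that does the same thing from the code itself using the MKL support functions mkl_mic_enable() and mkl_mic_set_workdivision(). The file name ao_dgemm.c and the matrix size 4096 (chosen to be above the m,n > 2048 AO threshold for GEMM) are arbitrary:

ao_dgemm.c
#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"
 
int main()
{
    MKL_INT n = 4096;                  /* above the m,n > 2048 AO threshold for GEMM */
    size_t i, nn = (size_t)n * (size_t)n;
 
    double *A = (double *)mkl_malloc(nn * sizeof(double), 64);
    double *B = (double *)mkl_malloc(nn * sizeof(double), 64);
    double *C = (double *)mkl_malloc(nn * sizeof(double), 64);
    if (!A || !B || !C) { fprintf(stderr, "allocation failed\n"); return 1; }
 
    for (i = 0; i < nn; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }
 
    mkl_mic_enable();                                   /* same effect as MKL_MIC_ENABLE=1 */
    mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5);   /* send 50% of the work to card #0 */
 
    /* DGEMM is AO-enabled: MKL may transparently run part of it on the coprocessor */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
 
    printf("C[0] = %f\n", C[0]);
    mkl_free(A); mkl_free(B); mkl_free(C);
    return 0;
}

It is built like any host MKL program ($ icc -O3 -mkl ao_dgemm.c -o ao_dgemm) and can be run with OFFLOAD_REPORT set to verify that the offload actually took place.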

Threading control in Automatic Offload

Host               Coprocessor
OMP_NUM_THREADS    MIC_OMP_NUM_THREADS
KMP_AFFINITY       MIC_KMP_AFFINITY

For example, 2 threads on the host and 200 on the card:

$ export OMP_NUM_THREADS=2
 
$ export MIC_OMP_NUM_THREADS=200

Compiler assisted Offload (CAO)

In this mode of MKL the offloading is explicitly controlled by compiler pragmas or directives. In contrast to the automatic offload mode, all MKL functions can be offloaded in CAO mode.

A big advantage of this mode is that it allows for data persistence on the device.
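
For example, the following sketch (our own illustration, not from the original page; the file name, the matrix size and the ALLOC/REUSE/FREE shorthand macros are arbitrary) keeps the matrix A resident on mic0 across several offloaded sgemm calls, so it is transferred only once:

cao_persist.c
#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"
 
/* Shorthand for controlling buffer lifetime on the coprocessor */
#define ALLOC  alloc_if(1) free_if(0)   /* allocate on the card and keep it afterwards */
#define REUSE  alloc_if(0) free_if(0)   /* reuse the existing allocation, keep it      */
#define FREE   alloc_if(0) free_if(1)   /* reuse the existing allocation, then free it */
 
int main()
{
    int N = 2048, iter;
    char transa = 'N', transb = 'N';
    float alpha = 1.0f, beta = 0.0f;
    size_t i;
    float *A = (float *)malloc((size_t)N * N * sizeof(float));
    float *B = (float *)malloc((size_t)N * N * sizeof(float));
    float *C = (float *)malloc((size_t)N * N * sizeof(float));
    for (i = 0; i < (size_t)N * N; i++) { A[i] = 1.0f; B[i] = 1.0f; C[i] = 0.0f; }
 
    /* Transfer A once and leave it resident on mic0 */
    #pragma offload_transfer target(mic:0) in(A : length(N*N) ALLOC)
 
    for (iter = 0; iter < 3; iter++) {
        /* Each offload reuses the resident copy of A instead of re-transferring it */
        #pragma offload target(mic:0) nocopy(A : length(N*N) REUSE) \
                in(transa, transb, N, alpha, beta) \
                in(B : length(N*N)) inout(C : length(N*N))
        {
            sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
        }
    }
 
    /* Release the resident copy when it is no longer needed */
    #pragma offload target(mic:0) nocopy(A : length(N*N) FREE)
    { }
 
    printf("C[0] = %f\n", C[0]);
    free(A); free(B); free(C);
    return 0;
}

The program is built with the compiler assisted offload command shown below in this section.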

For Intel compilers it is possible to use AO and CAO in the same program, however the work division must be explicitly set for AO in this case. Otherwise, all MKL AO calls are executed on the host.

MKL functions are offloaded in the same way as any other offloaded function. An example for offloading MKL’s sgemm routine looks as follows:

#pragma offload target(mic) \
  in(transa, transb, N, alpha, beta) \
  in(A:length(N*N)) \
  in(B:length(N*N)) \
  in(C:length(N*N)) \
  out(C:length(N*N) alloc_if(0))
{
  sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
}

To build a program for compiler assisted offload, the following command is recommended by Intel:

$ icc -O3 -openmp -mkl \
  -offload-option,mic,ld,"-L$MKLROOT/lib/mic -Wl,\
  --start-group -lmkl_intel_lp64 -lmkl_intel_thread \
  -lmkl_core -Wl,--end-group" file.c -o file

To avoid using the OS core, it is recommended to use the following environment setting (in case of a 61-core coprocessor):

MIC_KMP_AFFINITY=explicit,granularity=fine,proclist=[1-240:1]

Setting larger pages with the environment setting

MIC_USE_2MB_BUFFERS=16K 

usually increases performance.

Native Execution

In this mode of MKL the Intel Xeon Phi coprocessor is used as an independent compute node.

To build a program for native mode, the following compiler settings should be used:

$ icc -O3 -mkl -mmic file.c -o file.mic

To execute the program on the Xeon Phi from the host:

 
$ micnativeloadex ./file.mic -e "environment variables" -a "args"

The program will be automatically uploaded and executed on the MIC; the -e and -a options provide the environment variables and program arguments.

MPI

Before starting to use Intel MPI on the MIC you need to set up the environment (passwordless ssh to mic0):

  • From mesologin1 type:
$ ssh-keygen -f $HOME/.ssh/id_rsa
  • Don't enter a passphrase if you want to connect without a password
  • Once the key is generated:
$ mkdir $WORK/.ssh
$ cp $HOME/.ssh/id_rsa.pub $WORK/.ssh/authorized_keys
  • Log out from mesologin1 and reconnect
  • To test if it works:
$ qlogin -q xphi.q
$ ssh mic0
  • The WORK directory is mounted read-only on the MIC
  • For MPI we use only Intel MPI:
    $ module load intel/mic

We use this example for testing:

hello_mpi.c
#include <mpi.h>
#include <stdio.h>
 
int main(int argc, char** argv) {
 
MPI_Init(&argc, &argv);
 
 // Get the number of processes
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
 
// Get the rank of the process
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
 
// Get the name of the processor
char processor_name[MPI_MAX_PROCESSOR_NAME];
int name_len;
MPI_Get_processor_name(processor_name, &name_len);
 
// Print off a hello world message
printf("Hello world from processor %s, rank %d out of %d processors\n",
             processor_name, world_rank, world_size);
 
 // Finalize the MPI environment.
 MPI_Finalize();
}

Native Intel MPI

All MPI ranks reside only on the coprocessors.

To build and run an application in coprocessor-only mode, the following commands have to be executed:

  1. Compile the program for the coprocessor (-mmic)
    $ mpiicc -mmic hello_mpi.c -o hello_mpi.mic
  2. Set the I_MPI_MIC variable
    export I_MPI_MIC=1
  3. Launch MPI jobs on the coprocessor mic0 from the host. You MUST give the full path to the program (because WORK is shared with the MIC):
    mpirun -host mic0 -n 244 /Work/Users/kmazouzi/hello_mpi.mic

Alternatively, one can log in to mic0 and run mpirun there.

Symmetric Intel MPI

The MPI ranks reside on both the host and the coprocessor. This is the most general MPI case.

  1. Compile the program for the coprocessor (-mmic)
    $ mpiicc -mmic hello_mpi.c -o hello_mpi.mic 
  2. Compile the program for the host
    $ mpiicc hello_mpi.c -o hello_mpi
  3. Set the I_MPI_MIC variable
    export I_MPI_MIC=1
  4. Launch MPI jobs on the host node1-51 and on the coprocessor mic0
    $ mpirun -host node1-51 -n 1 ./hello_mpi : -n 244  -host mic0 /Work/Users/kmazouzi/hello_mpi.mic

    this will run 1 process on the host (node1-51) and 244 processes on mic0.

For the MIC you need to give the full path to the executable program.

Offload Intel MPI

This mode is also known as hybrid MPI/OpenMP.

All MPI ranks reside on the host; some sections are offloaded to the MIC using offload/OpenMP pragmas.

Let's modify our example:

hello_mpi_omp.c
#include <mpi.h>
#include <stdio.h>
#include <omp.h> 
#include <unistd.h>
int main(int argc, char** argv) {
 
MPI_Init(&argc, &argv);
 
  // Get the number of processes
  int world_size;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);
 
  // Get the rank of the process
  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
 
  // Get the name of the processor
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int name_len;
  MPI_Get_processor_name(processor_name, &name_len);
 
 char processor_name_target[MPI_MAX_PROCESSOR_NAME];
 int th_id;
// Only process 0 can offload to mic
if(world_rank==0)
{
#pragma offload target (mic)
#pragma omp parallel private(th_id) 
 {
    gethostname(processor_name_target,MPI_MAX_PROCESSOR_NAME) ;
    th_id=omp_get_thread_num();
    printf("Hello world from thread %d on processor %s, from mpi process %d\n",th_id,processor_name_target,world_rank);
 }
}
  // Print off a hello world message
  printf("Hello world from processor %s, rank %d"
         " out of %d processors\n",
         processor_name, world_rank, world_size);
 
  // Finalize the MPI environment.
  MPI_Finalize();
}
  • Compile
$ mpiicc -openmp hello_mpi_omp.c -o  hello_mpi_omp
  • Execute (with MIC_ENV_PREFIX=MIC set, as above)
$ MIC_OMP_NUM_THREADS=10 mpirun -np 3 ./hello_mpi_omp

This runs 3 MPI processes on the host; process 0 spawns 10 threads on mic0:

$ MIC_OMP_NUM_THREADS=10 mpirun -np 3 ./hello_mpi_omp
Hello world from processor node1-51, rank 1 out of 3 processors
Hello world from processor node1-51, rank 2 out of 3 processors
Hello world from processor node1-51, rank 0 out of 3 processors
Hello world from thread 0 on processor node1-51-mic0, from mpi process 0
Hello world from thread 2 on processor node1-51-mic0, from mpi process 0
Hello world from thread 4 on processor node1-51-mic0, from mpi process 0
Hello world from thread 6 on processor node1-51-mic0, from mpi process 0
Hello world from thread 3 on processor node1-51-mic0, from mpi process 0
Hello world from thread 7 on processor node1-51-mic0, from mpi process 0
Hello world from thread 8 on processor node1-51-mic0, from mpi process 0
Hello world from thread 9 on processor node1-51-mic0, from mpi process 0
Hello world from thread 5 on processor node1-51-mic0, from mpi process 0
Hello world from thread 1 on processor node1-51-mic0, from mpi process 0

Batch mode (SGE)

Once you have compiled your code for the Xeon Phi (see the sections above), you can submit your job using SGE.

Here is an example of an SGE script running an OpenMP program on the MIC in native mode:

xphi.q
#!/bin/bash 
 
#$ -q xphi.q
 
#$ -N myPhi_program
 
module load intel/mic
 
### You only need to modify the following line
 
micnativeloadex ./myprogram.mic -a "my program argument" -e "MIC_KMP_AFFINITY=explicit,granularity=fine,proclist=[1-240:1]"

Performance

  • It is better to leave one core for the OS running on the MIC, so use at most 60 x 4 = 240 threads (this matches the proclist=[1-240:1] affinity setting above)
  • Try different thread affinities using
   $ export MIC_ENV_PREFIX=MIC
   $ export MIC_KMP_AFFINITY='...your affinity settings of choice...'
 

MIC_KMP_AFFINITY can take several values; those often used are:

  1. granularity=fine,balanced
  2. granularity=fine,compact

TIP: use the VERBOSE modifier on MIC_KMP_AFFINITY to get a detailed list of bindings.
Example:

export MIC_KMP_AFFINITY="verbose,granularity=fine,balanced"

Variable                 Possible values   Description
MKL_MIC_ENABLE           0, 1              enable or disable automatic MKL offload
OFFLOAD_REPORT           0, 1, 2, 3        offload debug information
OFFLOAD_DEVICES          {0,1,2,3}         choose the offload device(s)
MKL_MIC_X_WORKDIVISION   0.0 to 1.0        work division between the host and the coprocessor (X is the device id)

Benchmarks

We used a naive matmul.c for the benchmark, with matrix size 10000x10000.

Execution mode                    Time (s)   Gflop/s
12 threads on host                257        7.5
Native execution (244 threads)    15         133.5
Offload execution (244 threads)   19.3       103.8

We used the BLAS level-3 DGEMM subroutine, with matrix size 10000x10000.

Execution mode                    Time (s)   Gflop/s
12 threads on host                13         152
Automatic Offload (244 threads)   3          640
Native execution (244 threads)    5.8        364

Links