XeonPHI
Intel® Xeon Phi™ coprocessors are designed to extend the reach of applications that have demonstrated the ability to fully utilize the scaling capabilities of Intel Xeon processor-based systems and fully exploit available processor vector capabilities or memory bandwidth. For such applications, the Intel Xeon Phi coprocessors offer additional power-efficient scaling, vector support, and local memory bandwidth, while maintaining the programmability and support associated with Intel Xeon processors.
MesocentreFC provides one host with 4 Xeon Phi devices:
1 Device (mic1)
| Family | 7120P | 
| Number of cores | 61 | 
| Frequency of cores | 1.238 GHz | 
| GDDR5 memory size | 16 GB | 
| Number of hardware threads per core | 4 | 
| SIMD vector registers | 32 (512-bit wide) per thread context | 
| Level 2 cache size | 61 x 512 KB 8-way associative inclusive caches | 
| Theoretical peak performance | 1.2 TFlop/s (DP) | 
3 Devices (mic0,mic2,mic3)
| Family | 5110P | 
| Number of cores | 60 | 
| Frequency of cores | 1.053 GHz | 
| GDDR5 memory size | 8 GB | 
| Number of hardware threads per core | 4 | 
| Level 2 cache size | 30 MB | 
| Theoretical peak performance | 1.0 TFlop/s (DP) | 
Host
| Number of cores | 8 | 
| Processor family | Intel(R) Xeon(R) IvyBridge | 
| Frequency | 2.6 GHz | 
| Memory size | 128 GB | 
Is the Intel® Xeon Phi™ coprocessor right for me?
You can use Intel Xeon processors and Intel Xeon Phi coprocessors together to optimize performance for almost any workload. To take full advantage of Intel Xeon Phi coprocessors, an application must scale well to over one hundred threads, and either make extensive use of vectors or efficiently use more local memory bandwidth than is available on an Intel Xeon processor. Learn more at https://software.intel.com/en-us/articles/is-the-intel-xeon-phi-coprocessor-right-for-me
Access
The node hosting the Xeon Phi cards is configured in a new SGE queue called xphi.q on the lumiere cluster.
Access to the node is exclusive, i.e. only one user can use it at a time.
By default, access to the xphi.q queue is disabled. Contact us if you need to use the Xeon Phi cards.
Interactive mode
First, request access to the xphi.q queue (since access is exclusive, we can ask for the full memory, i.e. 128 GB).
From the lumiere cluster (mesologin1.univ-fcomte.fr):
[user@mesologin1 ~]$ qlogin -q xphi.q -l h_vmem=128G
A remote shell is opened on node1-51:
Your job 79206 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 79206 has been successfully scheduled.
Establishing builtin session to host node1-51.cluster ...
[user@node1-51 ~]$
Move to your WORK directory (cd $WORK), because HOME is read-only on all computing nodes.
You can check the state of the MIC cards with:
[user@node1-51 ~]$ micinfo
To compile and execute on a MIC card, you need to load the intel/mic module:
[user@node1-51 ~]$ module load intel/mic
See the next section to learn how to compile and execute parallel programs on the Xeon Phi.
How to?
This section briefly describes how to use the Xeon Phi effectively.
To benefit from the performance of the MIC card, your application MUST scale to hundreds of threads and make heavy use of vectorization.
Applications exploiting this parallelism are often based on OpenMP, MKL or MPI.
OpenMP
There are two ways to execute OpenMP code on a MIC card:
- Native mode: compile on the host, run on the MIC
- Offload mode: compile on the host, run on the host and the MIC
Let's use a classical OpenMP hello world example.
- hello.c
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        int nthreads, tid;

        /* Fork a team of threads with each thread having a private tid variable */
        #pragma omp parallel private(tid)
        {
            /* Obtain and print thread id */
            tid = omp_get_thread_num();
            printf("Hello World from thread = %d\n", tid);

            /* Only master thread does this */
            if (tid == 0) {
                nthreads = omp_get_num_threads();
                printf("Number of threads = %d\n", nthreads);
            }
        } /* All threads join master thread and terminate */

        return 0;
    }
Native mode
The simplest model of running applications on the Intel Xeon Phi coprocessor is native mode. 
In native mode, an application is compiled on the host with the compiler switch -mmic to generate code for the MIC architecture. The binary can then either be copied to the coprocessor and started there, or be uploaded and executed automatically on the coprocessor with micnativeloadex.
To compile for mic:
$ icc -openmp -mmic hello.c -o hello.mic
To execute on mic0 (dynamic load):
$ micnativeloadex ./hello.mic
By default mic0 is used. Use -d 1 to use mic1, for example.
...
Hello World from thread = 207
Hello World from thread = 109
Hello World from thread = 219
Hello World from thread = 107
Hello World from thread = 95
Hello World from thread = 124
Hello World from thread = 110
Hello World from thread = 189
...
This execution uses all the threads available on the card. We can specify the number of threads with:
$ micnativeloadex ./hello.mic -e "OMP_NUM_THREADS=40"
In general we use:
$ micnativeloadex ./application -a "arg1 arg2 ..." -e "env1 env2 ..." -d 3
- -a: the list of application arguments
- -e: the environment variables to pass
- -d: the (zero-based) index of the Intel(R) Xeon Phi(TM) coprocessor to run the application on
Offload mode
Data transfer and remote execution are controlled explicitly using compiler offload pragmas/directives.
Let's modify our example:
- hello_off.c
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NAMELENGTH 256

    int main(void)
    {
        char hostname[NAMELENGTH];
        int nthreads, tid;

        /* Get the name of the coprocessor; hostname is copied back to the host */
        #pragma offload target(mic)
        gethostname(hostname, NAMELENGTH);

        /* Fork a team of threads on the coprocessor, each with a private tid variable */
        #pragma offload target(mic)
        #pragma omp parallel private(tid)
        {
            /* Obtain and print thread id */
            tid = omp_get_thread_num();
            printf("Hello World from thread = %d\n", tid);

            /* Only master thread does this */
            if (tid == 0) {
                nthreads = omp_get_num_threads();
                printf("Number of threads = %d, running on %s\n", nthreads, hostname);
            }
        } /* All threads join master thread and terminate */

        return 0;
    }
To compile for offload mode:
$ icc -openmp hello_off.c -o hello_off
Once the program has been compiled, we can simply run it on the host; the offloaded portions are automatically sent to a MIC card.
To execute with 4 threads on the MIC, for example, just run:
$ export MIC_ENV_PREFIX=MIC
$ MIC_OMP_NUM_THREADS=4 ./hello_off
Hello World from thread = 0
Number of threads = 4, running on node1-51-mic0
Hello World from thread = 2
Hello World from thread = 3
Hello World from thread = 1
Manage multiple devices
- Choose the device explicitly in the code, for example to offload to mic1: #pragma offload target(mic:1)
- Use an environment variable: export OFFLOAD_DEVICES=0,1 to offload to mic0 and mic1, for example
Hybrid mode
With hybrid OpenMP, some program sections are executed on the MIC and others on the host.
- hybridOMP.c
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        // run on host
        #pragma omp parallel for
        for (int i = 0; i < 5; i++) {
            printf("x %d\n", i);
        }

        // run on MIC
        #pragma offload target(mic)
        #pragma omp parallel for
        for (int i = 0; i < 5; i++) {
            printf(". %d\n", i);
        }

        return 0;
    }
Q: How to control the number of threads on the MIC and on the host?
A: Use environment variables. For example, 240 threads on the MIC and 12 on the host:
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=240
export OMP_NUM_THREADS=12
and then run the application on the host:
./hybridOMP
MKL
- Support for the Intel® Xeon Phi™ coprocessors was introduced in Intel® MKL 11.0
- Heterogeneous computing: takes advantage of both the multicore host and the many-core coprocessors
- All Intel MKL functions are supported, but they are optimized at different levels.
Highly Optimized Functions:
- BLAS Level 3, and much of Level 1 & 2
- Sparse BLAS: ?CSRMV, ?CSRMM
- Some important LAPACK routines (LU, QR, Cholesky)
- Fast Fourier transforms
- Vector Math Library
- Random number generators in the Vector Statistical Library
The following 3 usage models of MKL are available for the Xeon Phi:
- Automatic Offload
- Compiler Assisted Offload
- Native Execution
Automatic Offload (AO)
With automatic offload, the user does not have to change the code at all. For functions enabled for automatic offload, the runtime may automatically transfer data to the Xeon Phi coprocessor and execute all or part of the computation there. The data transfer and the execution management are completely automatic and transparent to the user.
As of Intel MKL 11.0.2 only the following functions are enabled for automatic offload:
- Level-3 BLAS functions
  - ?GEMM (for M,N > 2048, K > 256)
  - ?TRSM (for M,N > 3072)
  - ?TRMM (for M,N > 3072)
  - ?SYMM (for M,N > 2048)
- LAPACK functions
  - LU (M,N > 8192)
  - QR
  - Cholesky
 
The AO is enabled either by using the function mkl_mic_enable() or by setting the following environment variable: 
$ export MKL_MIC_ENABLE=1
To see more information about the Automatic Offload and debugging please enable the environment variable:
$ export OFFLOAD_REPORT=2
This value can be set from 0 to 3, where a higher number means more debugging information.
Since the host has several devices, use the OFFLOAD_DEVICES environment variable to choose the offload device(s).
For example: export OFFLOAD_DEVICES=1,2 to offload to mic1 and mic2
To build a program for automatic offload, the same way of building code as on the Xeon host is used:
$ icc -O3 -mkl file.c -o file
By default, the MKL library decides when to offload and also tries to determine the optimal work division between the host and the targets (MKL can take advantage of multiple coprocessors). In case of the BLAS routines the user can specify the work division between the host and the coprocessor by calling the routine:
mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5);
or by setting the environment variable
$ export MKL_MIC_0_WORKDIVISION=0.5
Both examples offload 50% of the computation to the first card (card #0).
Threading control in Automatic Offload
| Host | Coprocessor | 
|---|---|
| OMP_NUM_THREADS | MIC_OMP_NUM_THREADS | 
| KMP_AFFINITY | MIC_KMP_AFFINITY | 
For example, 2 threads on the host and 200 on the card:
$ export OMP_NUM_THREADS=2
$ export MIC_OMP_NUM_THREADS=200
Compiler assisted Offload (CAO)
In this mode of MKL, offloading is explicitly controlled by compiler pragmas or directives. In contrast to the automatic offload mode, all MKL functions can be offloaded in CAO mode.
A big advantage of this mode is that it allows for data persistence on the device.
For Intel compilers it is possible to use AO and CAO in the same program, however the work division must be explicitly set for AO in this case. Otherwise, all MKL AO calls are executed on the host.
MKL functions are offloaded in the same way as any other offloaded function. An example for offloading MKL’s sgemm routine looks as follows:
    #pragma offload target(mic) \
        in(transa, transb, N, alpha, beta) \
        in(A:length(N*N)) \
        in(B:length(N*N)) \
        in(C:length(N*N)) \
        out(C:length(N*N) alloc_if(0))
    {
        sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
    }
To build a program for compiler assisted offload, the following command is recommended by Intel:
$ icc -O3 -openmp -mkl \
  -offload-option,mic,ld,"-L$MKLROOT/lib/mic -Wl,--start-group \
  -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group" \
  file.c -o file
To avoid using the OS core, it is recommended to use the following environment setting (in case of a 61-core coprocessor):
MIC_KMP_AFFINITY=explicit,granularity=fine,proclist=[1-240:1]
Using larger (2 MB) pages for offload buffers, with the environment setting
MIC_USE_2MB_BUFFERS=16K 
usually increases performance.
Native Execution
In this mode of MKL the Intel Xeon Phi coprocessor is used as an independent compute node.
To build a program for native mode, the following compiler settings should be used:
$ icc -O3 -mkl -mmic file.c -o file.mic
To execute the program on the Xeon Phi from the host:
$ micnativeloadex ./file.mic -e "environment variables" -a "args"
The program is automatically uploaded to the MIC and executed there, with the environment variables provided through -e.
Intel MPI
Before using Intel MPI on the MIC, you need to set up the environment (passwordless ssh to mic0):
- From mesologin1 type:
$ ssh-keygen -f $HOME/.ssh/id_rsa
- Do not enter a passphrase if you want to connect without a password
- Once the key has been generated:
$ mkdir $WORK/.ssh
$ cp $HOME/.ssh/id_rsa.pub $WORK/.ssh/authorized_keys
- Log out from mesologin1 and reconnect to mesologin1
- To test if it works:
$ qlogin -q xphi.q
$ ssh mic0
- The WORK directory is mounted read-only on the MIC
- For MPI we use only Intel MPI:
$ module load intel/mic
We use this example for testing:
- hello_mpi.c
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);

        // Get the number of processes
        int world_size;
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        // Get the rank of the process
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        // Get the name of the processor
        char processor_name[MPI_MAX_PROCESSOR_NAME];
        int name_len;
        MPI_Get_processor_name(processor_name, &name_len);

        // Print off a hello world message
        printf("Hello world from processor %s, rank %d out of %d processors\n",
               processor_name, world_rank, world_size);

        // Finalize the MPI environment.
        MPI_Finalize();
        return 0;
    }
Native Intel MPI
All MPI ranks reside only on the coprocessors.
To build and run an application in coprocessor-only mode, the following commands have to be executed:
- Compile the program for the coprocessor (-mmic):
$ mpiicc -mmic hello_mpi.c -o hello_mpi.mic
- Set the I_MPI_MIC variable:
$ export I_MPI_MIC=1
- Launch the MPI job on the coprocessor mic0 from the host. You MUST give the full path to the program (because WORK is shared with the MIC):
$ mpirun -host mic0 -n 244 /Work/Users/kmazouzi/hello_mpi.mic
Alternatively, one can log in to mic0 and run mpirun there.
Symmetric Intel MPI
The MPI ranks reside on both the host and the coprocessor. This is the most general MPI case.
- Compile the program for the coprocessor (-mmic):
$ mpiicc -mmic hello_mpi.c -o hello_mpi.mic
- Compile the program for the host (without -mmic):
$ mpiicc hello_mpi.c -o hello_mpi
- Set the I_MPI_MIC variable:
$ export I_MPI_MIC=1
- Launch the MPI job on the host node1-51 and on the coprocessor mic0:
$ mpirun -host node1-51 -n 1 ./hello_mpi : -n 244 -host mic0 /Work/Users/kmazouzi/hello_mpi.mic
This runs 1 process on the host (node1-51) and 244 on mic0.
For the MIC, you need to give the full path to the executable.
Offload Intel MPI
This mode is also known as hybrid MPI/OpenMP.
All MPI ranks reside on the host only; some sections are offloaded to the MIC using OpenMP pragmas.
Let's modify our example:
- hello_mpi_omp.c
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);

        // Get the number of processes
        int world_size;
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        // Get the rank of the process
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        // Get the name of the processor
        char processor_name[MPI_MAX_PROCESSOR_NAME];
        int name_len;
        MPI_Get_processor_name(processor_name, &name_len);

        char processor_name_target[MPI_MAX_PROCESSOR_NAME];
        int th_id;

        // Only process 0 can offload to mic
        if (world_rank == 0) {
            #pragma offload target(mic)
            #pragma omp parallel private(th_id)
            {
                gethostname(processor_name_target, MPI_MAX_PROCESSOR_NAME);
                th_id = omp_get_thread_num();
                printf("Hello world from thread %d on processor %s, from mpi process %d\n",
                       th_id, processor_name_target, world_rank);
            }
        }

        // Print off a hello world message
        printf("Hello world from processor %s, rank %d"
               " out of %d processors\n",
               processor_name, world_rank, world_size);

        // Finalize the MPI environment.
        MPI_Finalize();
        return 0;
    }
- Compile
$ mpiicc -openmp hello_mpi_omp.c -o hello_mpi_omp
- Execute
$ export MIC_ENV_PREFIX=MIC
$ MIC_OMP_NUM_THREADS=10 mpirun -np 3 ./hello_mpi_omp
This runs 3 MPI processes on the host; process 0 spawns 10 threads on mic0:
Hello world from processor node1-51, rank 1 out of 3 processors
Hello world from processor node1-51, rank 2 out of 3 processors
Hello world from processor node1-51, rank 0 out of 3 processors
Hello world from thread 0 on processor node1-51-mic0, from mpi process 0
Hello world from thread 2 on processor node1-51-mic0, from mpi process 0
Hello world from thread 4 on processor node1-51-mic0, from mpi process 0
Hello world from thread 6 on processor node1-51-mic0, from mpi process 0
Hello world from thread 3 on processor node1-51-mic0, from mpi process 0
Hello world from thread 7 on processor node1-51-mic0, from mpi process 0
Hello world from thread 8 on processor node1-51-mic0, from mpi process 0
Hello world from thread 9 on processor node1-51-mic0, from mpi process 0
Hello world from thread 5 on processor node1-51-mic0, from mpi process 0
Hello world from thread 1 on processor node1-51-mic0, from mpi process 0
Batch mode (SGE)
Once you have compiled your code for the Xeon Phi (see the section above), you can submit your job using SGE.
Here is an example of an SGE script running an OpenMP program on the MIC in native mode:
- xphi.q
    #!/bin/bash
    #$ -q xphi.q
    #$ -N myPhi_program
    module load intel/mic
    ### You only need to modify the line below
    micnativeloadex ./myprogram.mic -a "my program argument" -e "MIC_KMP_AFFINITY=explicit,granularity=fine,proclist=[1-240:1]"
Performance
- It is better to leave one core to the OS running on the MIC, so the maximum number of threads to use on a 61-core card is (61 - 1) * 4 = 240
- Try different thread affinities using
$ export MIC_ENV_PREFIX=MIC
$ export MIC_KMP_AFFINITY='...your affinity settings of choice...'
MIC_KMP_AFFINITY can take several values; the most commonly used are:
- granularity=fine,balanced
- granularity=fine,compact
Add verbose to MIC_KMP_AFFINITY to get a detailed list of bindings. Example:
export MIC_KMP_AFFINITY="verbose,granularity=fine,balanced"
Useful environment variables
| Variable | Possible values | Description | 
|---|---|---|
| MKL_MIC_ENABLE | 0, 1 | disable or enable MKL Automatic Offload | 
| OFFLOAD_REPORT | 0, 1, 2, 3 | level of offload debug information | 
| OFFLOAD_DEVICES | {0,1,2,3} | choose the offload device(s) | 
| MKL_MIC_X_WORKDIVISION | (X is the device id) | work division between the host and the coprocessor | 
Benchmarks
OpenMP
We used a naive matmul.c as the benchmark, with a matrix size of 10000x10000.
| Execution mode | Time (s) | GFlop/s | 
|---|---|---|
| 12 threads on host | 25 | 77.5 | 
| native execution 244 threads | 15 | 133.5 | 
| offload execution 244 threads | 19.3 | 103.8 | 
MKL
We used the BLAS Level 3 DGEMM subroutine, with a matrix size of 10000x10000.
| Execution mode | Time (s) | GFlop/s | 
|---|---|---|
| 12 threads on host | 13 | 152 | 
| Automatic Offload (244 threads) | 3 | 640 | 
| Native execution (244 threads) | 5.8 | 364 | 
Applications
- Using XeonPhi with Python



