
Matlab Parallel Computing Toolbox: PCT

Matlab Parallel Computing Toolbox (PCT) is now available at the Mesocentre as part of Matlab r2011b.

We currently support only the local parallel mode, i.e. running on a single hardware node. The recommended best practice is to run on the cluster interactively or to use the Matlab Compiler.

  • Parallel for-Loops (parfor): Run loop iterations in parallel on a MATLAB® pool using parfor language construct
  • Distributed Arrays and SPMD: Partition arrays across multiple MATLAB® workers for data-parallel computing and simultaneous execution
  • Batch Processing: Offload execution of a function or script to run in a cluster or desktop background (see the sketch after this list)
  • GPU Computing: Transfer data between MATLAB® and a graphics processing unit (GPU); run code on a GPU
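
As a quick illustration of the batch mode listed above, here is a minimal sketch (myScript.m is a placeholder for your own script; batch, wait, and load are PCT functions):

j = batch('myScript');    % run myScript.m on a worker in the background
wait(j);                  % block until the job finishes
load(j);                  % load the job's workspace into the current session
destroy(j);               % release the job's resources (delete(j) in newer Matlab releases)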

Usage examples

parfor is a parallel for loop: the loop iterations are automatically split into chunks and distributed to the workers of the matlabpool.

Because the iteration order is not specified, loop iterations must be independent of each other, and execution of parfor does not guarantee a deterministic ordering of results.

Let's start with a simple program that uses a parfor loop to compute the values of a vector.

myParforTest.m
function myParforTest
 
% we first open a pool of 4 workers
matlabpool open 4
 
tic
parfor i = 1:1024
    A(i) = sin(i*2*pi/1024);   % iterations are distributed over the workers
end
toc
 
% we close the worker pool
matlabpool close
 
end

The program above uses 4 cores on the same host to run the code between matlabpool open 4 and matlabpool close in parallel.

Note that matlabpool open 4 opens a pool of 4 workers. The maximum number of workers that can be created is 12 (8 on older versions). matlabpool close shuts them down.
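
The current pool size can be queried with matlabpool('size'), which returns 0 when no pool is open (a minimal check; the same call is used in the spmd example below):

matlabpool open 4
n = matlabpool('size');   % 4 while the pool is open, 0 otherwise
fprintf('Pool size: %d\n', n)
matlabpool close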

The spmd statement executes code in parallel on the MATLAB pool. Inside the body of the spmd statement, each MATLAB worker has a unique value of labindex, while numlabs gives the total number of workers executing the block in parallel.

Example: compute the arithmetic series $s = 1 + 2 + ... + n$

spmdExample.m
function spmdExample
 
matlabpool open 4
% This program computes the arithmetic series s = 1 + 2 + ... + n
 
n = 10;
% compute the parallel range beginning and end indices for each worker
[nl,nb,ne] = prange(1, n, matlabpool('size'));
 
spmd
 
s = 0; % initialize local sum
for i=nb(labindex):ne(labindex)
  s = s + i; % local sum on each worker
end
 
disp(['Local sum on worker ' num2str(labindex) ' is ' num2str(s)])
 
s = gplus(s); % compute global sum with gplus
 
end % spmd
 
disp(['Global sum is ' num2str(s{1})])
 
matlabpool close

Execution result:

>> spmdExample
Lab 1: 
Local sum on worker 1 is 6 
Lab 2: 
Local sum on worker 2 is 15 
Lab 3:
Local sum on worker 3 is 15
Lab 4:
Local sum on worker 4 is 19
Global sum is 55
Index ranges computed by prange for n = 10 and 4 workers:

Worker  nl  nb  ne
1       3   1   3
2       3   4   6
3       2   7   8
4       2   9   10
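
prange is not a built-in Matlab function but a local helper. Here is a minimal sketch of what it could look like, assuming it splits the range first:last into contiguous per-worker chunks, consistent with the table above:

prange.m
function [nl, nb, ne] = prange(first, last, nprocs)
% Split the range first:last into nprocs contiguous chunks.
% nl(k): chunk size of worker k; nb(k), ne(k): its begin and end indices.
n  = last - first + 1;
nl = floor(n / nprocs) * ones(1, nprocs);
r  = mod(n, nprocs);
nl(1:r) = nl(1:r) + 1;        % spread the remainder over the first workers
ne = first - 1 + cumsum(nl);  % chunk end indices
nb = ne - nl + 1;             % chunk begin indices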

GPU Background

Originally used to accelerate graphics rendering, GPUs are now increasingly applied to scientific calculations. Unlike a traditional CPU, which includes no more than a handful of cores, a GPU has a massively parallel array of integer and floating-point processors, as well as dedicated, high-speed memory. A typical GPU comprises hundreds of these smaller processors, which can greatly speed up particular types of applications.

A good rule of thumb is that your problem may be a good fit for the GPU if it is:

  • Massively parallel: The computations can be broken down into hundreds or thousands of independent units of work. You will see the best performance when all of the cores are kept busy, exploiting the inherent parallel nature of the GPU. Seemingly simple, vectorized MATLAB calculations on arrays with hundreds of thousands of elements often can fit into this category.
  • Computationally intensive: The time spent on computation significantly exceeds the time spent on transferring data to and from GPU memory. Because a GPU is attached to the host CPU via the PCI Express bus, the memory access is slower than with a traditional CPU. This means that your overall computational speedup is limited by the amount of data transfer that occurs in your algorithm.

Applications that do not satisfy these criteria might actually run more slowly on a GPU than on a CPU.

With that background, we can now start working with the GPU in MATLAB.
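
Before running anything, we can check which device Matlab sees; gpuDevice is part of PCT and returns the properties of the currently selected GPU:

g = gpuDevice;        % select and query the current GPU
disp(g.Name)          % device name, e.g. a Tesla card
disp(g.TotalMemory)   % total device memory in bytes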

Let's create a GPUArray and perform an FFT using the GPU. However, let's first do this on the CPU so that we can see the difference in code and performance.

A1 = rand(3000,3000);
tic;
B1 = fft(A1);
toc;

To perform the same operation on the GPU, we first use gpuArray to transfer data from the MATLAB workspace to device memory. Then we can run fft, which is one of the overloaded functions on that data:

A2 = gpuArray(A1);
tic;
B2 = fft(A2);
toc;

To bring the data back to the CPU, we run gather.

B2 = gather(B2);

Let's test our program on the Tesla device.

We first need to connect to the tesla machine:

$ qlogin -q tesla -l h_vmem=2G

and then load the needed modules: cuda and matlab.

$ module load cuda matlab
$ matlab -nodesktop -nodisplay
 
>> fft_bench                     
Elapsed time is 0.074823 seconds. % CPU
Elapsed time is 0.022858 seconds. % GPU without data transfer

Speedup = time1/time2 = 3.27
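
fft_bench is not shown above; here is a minimal sketch of what it could contain, assuming it simply chains the CPU and GPU timings from the previous snippets:

fft_bench.m
function fft_bench
 
A1 = rand(3000, 3000);
 
tic;
B1 = fft(A1);        % FFT on the CPU
toc;
 
A2 = gpuArray(A1);    % copy the data to the GPU (transfer not timed)
tic;
B2 = fft(A2);         % FFT on the GPU
toc;
 
B2 = gather(B2);      % bring the result back to the CPU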

We now test the parfor program myParforTest on mesoshared.

This mode of execution holds the PCT licence token until the program exits. For better usage of PCT we use the Matlab Compiler; see the next section.

$ module load matlab/r2011b
$ matlab -nodesktop -nodisplay
                                                      < M A T L A B (R) >
                                                Copyright 1984-2011 The MathWorks, Inc.
                                                 R2011b (7.13.0.564) 64-bit (glnxa64)
                                                            August 13, 2011
 
 
To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.
 
>> myParforTest
 
Starting matlabpool using the 'local' configuration ... connected to 4 labs.
Sending a stop signal to all the labs ... stopped.
Elapsed time is 9.964229 seconds.
 
>>

Using Matlab PCT on the Lumière cluster

Since we have only one licence for Matlab PCT, we recommend using the Matlab Compiler (documentation) to avoid consuming Matlab licences during execution.

Let's return to the parfor loop example: myParforTest.m

We compile the function using the Matlab compiler (mcc).

$ module load matlab
$ mcc -m myParforTest.m

The compilation may take a few minutes.

The result is a standalone executable that does not depend on a Matlab licence file, so we can run it multiple times.

This program uses 4 cores for the workers + 1 core for the master program. To use SGE we just need to define a parallel environment, as we do with OpenMP programs.

Here is an example SGE script:

#!/bin/bash -l
#$ -V
#$ -N test_matlab_PTC
#$ -cwd
#$ -o $JOB_NAME.$JOB_ID.out
#$ -pe openmp 5
 
module load matlab/r2011b # We need to load matlab to define the $MATLAB_HOME variable 
 
./run_myParforTest.sh $MATLAB_HOME 

In general, when starting N Matlab workers we need to request -pe openmp N+1 (N workers + 1 master).

We use the Matlab Compiler to generate an executable program which will be executed on the Tesla GPU node.

Here is an SGE example:

#!/bin/bash -l
#$ -V
#$ -N test_matlab_PTC
#$ -cwd
#$ -o $JOB_NAME.$JOB_ID.out
#$ -q tesla.q  ## request tesla
#$ -l h_vmem=10G
 
module load gpu/cuda/8.0    
module load matlab/r2015a 
 
## first we compile the program
 
mcc -m myGPUTest.m
 
## then we run the program
 
./run_myGPUTest.sh $MATLAB_HOME

Documentation