Dell servers with VOLTA V100 GPUs
How to connect
We use Grid Engine (SGE) to access the nodes.
Interactive mode
Objective: open a shell or run commands directly on the server.
- Login node: mesologin1.univ-fcomte.fr or mesoshared.univ-fcomte.fr
- Queue: volta.q
- Request GPUs: `-l gpu=N`, where 0 < N <= 2; default value: `gpu=1`
- The default SGE value for `h_vmem` is 4 GB per core. Use `-l h_vmem` to request more memory; we suggest requesting more than 20 GB. For example, `-l h_vmem=20G` requests 20 GB of memory. Warning: the job will be killed if it exceeds its memory request.
- The HOME folder is mounted read-only. Use the WORK folder (`cd $WORK`) as your working space.
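The options above can be combined in a single interactive request. A sketch (this command only works on the cluster's login nodes; the resource values here are illustrative, not recommendations):

```shell
# Request an interactive session: 2 GPUs, 32 GB of memory, 4-hour runtime limit
qlogin -q volta.q -l gpu=2 -l h_vmem=32G -l h_rt=4:00:00
```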
Examples
Shell session
- Request a shell session with 1 GPU and 20 GB of memory for 2 hours:
[user@mesologin1 ~]$ qlogin -q volta.q -l h_vmem=20G -l h_rt=2:00:00
Once connected, change directory to $WORK:
[user@node2-69 ~]$ cd $WORK
Check the allocated GPU(s):
```
[user@node2-69 ~]$ nvidia-smi
Thu Feb 27 15:29:50 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   26C    P0    25W / 250W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
- Request a shell session with 2 GPUs and 64 GB of memory:
[user@mesologin1~]$ qlogin -q volta.q -l gpu=2 -l h_vmem=64G
All software is installed in the default path; you can invoke `nvcc` directly, for example:
[user@node2-69 ~]$ nvcc program.cu -o program
Batch Mode
Objective: run programs in batch mode using SGE.
Example script to adapt to your needs:
- gpu.sge

```shell
#!/bin/bash -l
#$ -q volta.q
#$ -l gpu=1       ## adapt as needed
#$ -l h_vmem=30G  ## the job will be cancelled if memory is exceeded
#$ -N Test_GPU

#################
## Please adapt this script to your needs
#################

## 1 - vanilla python
python GPU_Program.py

## 2 - anaconda
export PATH="/opt/anaconda3/bin:$PATH"
python GPU_Program.py

## 2-bis - anaconda with env
conda activate $WORK/conda/meso && python GPU_Program.py

## 3 - singularity
singularity exec --nv tensorflow-gpu.simg python mytensorflow.py
```
Submit SGE job:
[user@mesologin1 ~]$ qsub gpu.sge
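After submitting, you will usually want to track the job. These are standard SGE commands (run on the login node; `JOB_ID` is a placeholder for the ID printed by `qsub`):

```shell
# List your queued (qw) and running (r) jobs
qstat -u $USER
# Cancel a job if needed
qdel JOB_ID
# After the job finishes, review its resource usage (maxvmem, wallclock)
qacct -j JOB_ID
```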
Installing and Running software
This host has its own software installed; it does not share software with the rest of the cluster (module system). Only Python-based deep learning software is installed.
Using default installed software
Vanilla Python
Python 2 (`python2.*`) and Python 3 (`python3.*`) are installed by default with several packages. Use the `pip list` or `pip3 list` command to view installed packages:
```
$ pip list
gpustat              0.5.0
grpcio               1.18.0
h5py                 2.9.0
Keras                2.2.4
Keras-Applications   1.0.6
Keras-Preprocessing  1.0.5
Markdown             3.0.1
numpy                1.16.0
nvidia-ml-py         7.352.0
pbr                  5.1.1
protobuf             3.6.1
scipy                1.2.0
setuptools           20.7.0
six                  1.10.0
tensorboard          1.12.2
tensorflow-gpu       1.12.0
```
Run a Python program:

[user@node2-69 ~]$ python my_python_program.py

Install additional Python packages in $WORK, because $HOME is read-only. The idea is to set WORK as the HOME directory (`export HOME=$WORK`) and use the `--user` option of `pip` or `pip3`. For example, let's install `panda`:

$ export HOME=$WORK
$ pip install --user panda
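The HOME redirection can be sketched outside the cluster too. A minimal illustration, assuming `$WORK` is set by the cluster profile (a local fallback is used here so the snippet is self-contained):

```shell
# Point HOME at the workspace so per-user installs land there
WORK="${WORK:-$PWD/work}"   # fallback for illustration; the cluster sets $WORK for you
mkdir -p "$WORK"
export HOME="$WORK"
# pip derives its --user install root from HOME, i.e. $HOME/.local
echo "user packages will land under: $HOME/.local"
```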
That's all!
Anaconda [Recommended]
For better performance, Anaconda Python 3 is installed in /opt/. Several optimized scientific packages are available.
Using anaconda with conda
Conda is a powerful package and environment management system.
To use conda with Anaconda, load the anaconda module (on these nodes, the `module` command manages only the locally installed software):

$ module load anaconda
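If you are unsure what is installed locally, Environment Modules can list the available modules first (cluster-only commands; the exact output varies per node):

```shell
# List modules installed locally on the node; anaconda should appear in the list
module avail
# Then load it
module load anaconda
```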
Check for available packages:
```
$ conda list
mkl                       2019.1                      144
mkl-service               1.1.2            py37he904b0f_5
mkl_fft                   1.0.6            py37hd81dba3_0
mkl_random                1.0.2            py37hd81dba3_0
more-itertools            4.3.0                    py37_0
mpc                       1.1.0                h10f8cd9_1
mpfr                      4.0.1                hdf1c602_3
mpmath                    1.1.0                    py37_0
msgpack-python            0.5.6            py37h6bb024c_1
multipledispatch          0.6.0                    py37_0
navigator-updater         0.2.1                    py37_0
nbconvert                 5.4.0                    py37_1
nbformat                  4.4.0                    py37_0
ncurses                   6.1                  he6710b0_1
networkx                  2.2                      py37_1
nltk                      3.4                      py37_1
nose                      1.3.7                    py37_2
notebook                  5.7.4                    py37_0
```
You may need to install additional packages without having root access. We recommend building a virtual environment locally and installing packages there.
But first, we need to tell conda to use WORK to store its packages:

export HOME=$WORK
- Create a conda environment:

  $ conda create -y --prefix $WORK/conda-env

  Replace conda-env with a valid name.
- Activate the environment you just created:

  $ source activate $WORK/conda-env

- Download and install a conda package in the environment:

  $ conda install package

  Replace package with the package you want to install, for example panda.
- Run your analysis from within the environment.
- Once you are done with your analysis, deactivate the conda environment by running:

  $ conda deactivate
Example: installing TensorFlow GPU in a local environment
We will create an environment locally (in our WORK) and install all the needed packages in it.
$ conda create -y --prefix $WORK/conda/meso
$ conda activate $WORK/conda/meso
```
(tensorflowGPU) user@node2-69:~$ conda install tensorflow-gpu
(tensorflowGPU) user@node2-69:~$ python
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf  # this may take a while
>>> tf.test.gpu_device_name()
...
2020-02-27 15:18:44.636689: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56308c48ef00 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-27 15:18:44.636746: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla V100-PCIE-16GB, Compute Capability 7.0
'/device:GPU:0'
```
- To list your conda environments, use: conda env list
- To remove an environment: conda remove --name myenv --all
Using Singularity v3
Use the pull sub-command to import a container image directly from Docker Hub, without needing root or superuser privileges (or Docker) on the host system.
For example, to import the latest TensorFlow image (GPU version) into the current working directory:
$ singularity pull tensorflow-gpu.simg docker://tensorflow/tensorflow:latest-gpu
Notice that we prepend the original image name with the `docker://` prefix.
This will download and build a Singularity image named `tensorflow-gpu.simg`.
We can spawn an interactive shell within the container with the `shell` sub-command.
$ singularity shell --nv tensorflow-gpu.simg
Or execute TensorFlow directly within the container:
$ singularity exec --nv tensorflow-gpu.simg python mytensorflow.py
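Since $HOME is read-only on these nodes, it can help to work from $WORK and make it visible inside the container. A sketch (cluster-only commands; the image name matches the example above):

```shell
# Run from the writable workspace
cd $WORK
# --bind with a single path mounts the same path at the same location inside the container
singularity exec --nv --bind "$WORK" tensorflow-gpu.simg python mytensorflow.py
```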
The `--nv` parameter is mandatory for GPU applications.
Monitoring
Command line
From the login or mesoshared nodes, you can access the stats of all GPUs using the `voltastat` command. For example:
```
user@mesologin1:~# voltastat
node2-69  Wed Mar  4 23:47:01 2020  440.33.01
[0] Tesla V100-PCIE-16GB | 36'C,  80 % | 15553 / 16160 MB | plop(15541M)
[1] Tesla V100-PCIE-16GB | 28'C,   0 % |     0 / 16160 MB |
[2] Tesla V100-PCIE-16GB | 28'C,   0 % |     0 / 16160 MB |
node3-70  Wed Mar  4 23:46:01 2020  440.33.01
[0] Tesla V100-SXM2-32GB | 34'C,   0 % |     0 / 32510 MB |
[1] Tesla V100-SXM2-32GB | 32'C,   0 % |     0 / 32510 MB |
[2] Tesla V100-SXM2-32GB | 30'C,   0 % |     0 / 32510 MB |
[3] Tesla V100-SXM2-32GB | 34'C,   0 % |     0 / 32510 MB |
```
voltastat has 10-second granularity, i.e. it requests GPU stats at 10-second intervals.
Ganglia Graph
FAQ
Use this SGE option: -l dgx=1
On the submit host, use the voltastat command; it polls the GPUs every 10 seconds.
Once connected to the node, create or enter your environment. For example:
$ conda activate $WORK/conda/meso
and install caffe-gpu with conda:
$ conda install caffe-gpu
To test your installation:

```
import caffe
caffe.set_mode_gpu()
caffe.set_device(0)
```
Please follow this tutorial: Running Jupyter notebook with V100 and SGE.