Using the DGX station
- This node is shared between users and is not managed by a resource manager (GridEngine). Please do not monopolize resources.
- The network folders HOME and WORK are not available.
- User with root access: raphael.couturier@univ-fcomte.fr
NVIDIA DGX Station
- Processor model: Intel Xeon E5-2698 v4, 2.2 GHz
- Number of cores: 20
- Memory: 256 GB DDR4
- GPUs: 4x Tesla V100
- Total GPU memory: 128 GB (32 GB per GPU)
- Total NVIDIA Tensor Cores: 2,560
- Total NVIDIA CUDA cores: 20,480
- Storage: data: 3x 1.92 TB SSD in RAID 0; system: 1x 1.92 TB SSD
- Interconnect: NVLink, 300 GB/s
- Deep learning (Tensor Core) performance: 112 TFLOPS x 4
- Double-precision (FP64) performance: 7.8 TFLOPS x 4
- Power consumption: 250 W
How to connect
SSH is used to access the DGX, using the host name mesodgx.univ-fcomte.fr:
$ ssh login@mesodgx.univ-fcomte.fr
Once connected, a local HOME directory is created.
- For a graphical session, use x2goclient and set the session type to MATE.
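Since the network HOME and WORK folders are not mounted on this node, data must be copied in and out over SSH. For example (a minimal sketch; my_dataset is a placeholder):
$ scp -r my_dataset/ login@mesodgx.univ-fcomte.fr:~/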
Running software
SGE is not available on this host; applications are executed directly from the terminal.
- Applications are killed when the SSH session is closed. Use the screen command to keep sessions open (see the sketch below).
- Since the host is shared between users, please check which GPU card to use with the nvidia-smi command.
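A typical session might look like this (a minimal sketch; my_python_program.py is a placeholder, and the GPU index depends on what nvidia-smi reports as free):
$ screen -S training    # start a named screen session that survives SSH disconnects
$ nvidia-smi            # check which GPUs are currently free
$ CUDA_VISIBLE_DEVICES=2 python my_python_program.py    # restrict the job to GPU 2
# detach with Ctrl-a d, and later reattach with:
$ screen -r training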
Using default installed software
Vanilla Python
Several Python versions are installed by default (python2.* and python3.*).
Use the pip list command to view installed packages:
$ pip list
gpustat 0.5.0
grpcio 1.18.0
h5py 2.9.0
Keras 2.2.4
Keras-Applications 1.0.6
Keras-Preprocessing 1.0.5
Markdown 3.0.1
numpy 1.16.0
nvidia-ml-py 7.352.0
pbr 5.1.1
protobuf 3.6.1
scipy 1.2.0
setuptools 20.7.0
six 1.10.0
tensorboard 1.12.2
tensorflow-gpu 1.12.0
Run a Python program:
[user@mesodgx~]$ python my_python_program.py
You can install new packages without root access. For example:
$ pip install --user package
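For instance, to install the pandas package into your user site directory (pandas is just an example; ~/.local/bin must be on your PATH for user-installed tools):
$ pip install --user pandas
$ export PATH="$HOME/.local/bin:$PATH"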
Anaconda
For better performance, Anaconda Python 3 is installed in /opt/. Several optimized scientific packages are available.
To use Anaconda, please add these lines to your .bashrc file:
export PATH="/opt/anaconda3/bin:$PATH"
. /opt/anaconda3/etc/profile.d/conda.sh
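Then reload your shell configuration and check that Anaconda's Python is picked up (a quick sanity check):
$ source ~/.bashrc
$ which python    # should print /opt/anaconda3/bin/python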
Check for available packages:
$ conda list
mkl                       2019.1                  144
mkl-service               1.1.2           py37he904b0f_5
mkl_fft                   1.0.6           py37hd81dba3_0
mkl_random                1.0.2           py37hd81dba3_0
more-itertools            4.3.0                   py37_0
mpc                       1.1.0               h10f8cd9_1
mpfr                      4.0.1               hdf1c602_3
mpmath                    1.1.0                   py37_0
msgpack-python            0.5.6           py37h6bb024c_1
multipledispatch          0.6.0                   py37_0
navigator-updater         0.2.1                   py37_0
nbconvert                 5.4.0                   py37_1
nbformat                  4.4.0                   py37_0
ncurses                   6.1                 he6710b0_1
networkx                  2.2                     py37_1
nltk                      3.4                     py37_1
nose                      1.3.7                   py37_2
notebook                  5.7.4                   py37_0
You may need to install additional packages without having root access. We recommend building a virtual environment locally and installing packages there.
- Create a conda environment (replace conda-env with a valid name):
$ conda create -y -n conda-env
- Activate the environment you just created:
$ source activate conda-env
- Download and install the conda package in the environment (replace package with the package you want to install, for example pandas):
$ conda install package
- Run your analysis from within the environment.
- Once you are done with your analysis, deactivate the conda environment:
$ source deactivate
Example: installing TensorFlow GPU
$ conda create -y -n tensorflowGPU
$ conda activate tensorflowGPU
(tensorflowGPU) user@mesodgx:~$ conda install -c anaconda tensorflow-gpu
(tensorflowGPU) user@mesodgx:~$ python
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
To view all your environments:
$ conda info --envs
Using Docker
Before pulling an image, check the images already downloaded:
$ docker images
REPOSITORY                  TAG          IMAGE ID       CREATED         SIZE
tensorflow/tensorflow       latest-gpu   58a8e83b7dbf   2 months ago    3.36GB
tensorflow/tensorflow       latest       2054925f3b43   2 months ago    1.34GB
alpine                      latest       196d12cf6ab1   3 months ago    4.41MB
nvcr.io/nvidia/caffe2       18.08-py3    e82334d03b18   5 months ago    3.02GB
nvcr.io/nvidia/tensorflow   18.01-py2    377b46c75bfc   12 months ago   2.88GB
Pull the image from Docker Registry
- Pull the latest TensorFlow image:
$ docker pull tensorflow/tensorflow:latest-gpu
- Check the image
$ docker images
- Run the container:
$ nvidia-docker run -it --rm tensorflow/tensorflow:latest-gpu
[I 11:08:25.330 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 11:08:25.354 NotebookApp] Serving notebooks from local directory: /notebooks
[I 11:08:25.354 NotebookApp] The Jupyter Notebook is running at:
[I 11:08:25.354 NotebookApp] http://(6575ea53c8ff or 127.0.0.1):8888/?token=1769f40c3571e3b66c83ee22da1b5dab5b40a325c5682642
[I 11:08:25.354 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 11:08:25.354 NotebookApp] Copy/paste this URL into your browser when you connect for the first time, to login with a token: http://(6575ea53c8ff or 127.0.0.1):8888/?token=1769f40c3571e3b66c83ee22da1b5dab5b40a325c5682642
This will start a Jupyter Notebook application.
- Connect to it using the URL above, replacing 127.0.0.1 with mesodgx.univ-fcomte.fr.
- To restrict the container to specific GPU cards, set the NV_GPU variable:
$ NV_GPU=0,1 nvidia-docker run [...]
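You can also run a script directly inside the container instead of the notebook. A sketch, assuming my_python_program.py sits in your home directory (mounted into the container with -v):
$ nvidia-docker run -it --rm -v $HOME:/workspace tensorflow/tensorflow:latest-gpu python /workspace/my_python_program.py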
Pull the image from the NVIDIA registry
Log on to the NVIDIA GPU CLOUD and follow the instructions.
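In practice this means authenticating Docker against nvcr.io and then pulling. A sketch (the username is literally $oauthtoken; the password is your NGC API key):
$ docker login nvcr.io
Username: $oauthtoken
Password: <your NGC API key>
$ docker pull nvcr.io/nvidia/tensorflow:18.01-py2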
Using Singularity
Singularity enables users to have full control of their environment. Singularity containers can be used to package entire scientific workflows, software, libraries, and even data. Singularity can pull and use images from Docker.
Pull the image from the Docker registry
The Docker registry hub (https://hub.docker.com) is the world's largest library and community for container images; you can browse and download (pull) Docker images from it.
For example, we pull the latest TensorFlow image from the Docker registry:
[user@mesologin1~]$ singularity pull tensorflow-gpu.simg docker://tensorflow/tensorflow:latest-gpu
The image (tensorflow-gpu.simg) will be downloaded and stored in the current working directory.
And then we run our program within the container:
[user@mesologin1~]$ singularity exec --nv tensorflow-gpu.simg python my_python_program.py
Notice that we prepend the image name with the docker:// prefix.
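By default, Singularity mounts your home directory inside the container; other paths can be bound explicitly with -B. A sketch, assuming a hypothetical /data/my_dataset directory:
$ singularity exec --nv -B /data/my_dataset:/data tensorflow-gpu.simg python my_python_program.py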
Pull the image from the NVIDIA registry
Check the NVIDIA GPU CLOUD and proceed as for Docker containers.
For example, to pull a TensorFlow image:
$ singularity pull tensorflow-18.12-py3.simg docker://nvcr.io/nvidia/tensorflow:18.12-py3
This will download the image into the current directory.
And to run an interactive shell inside the container:
$ singularity shell --nv tensorflow-18.12-py3.simg
Monitoring
- Using nvidia-smi, you can show GPU usage statistics:
$ nvidia-smi
Thu Jan 10 15:27:43 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.145                Driver Version: 384.145                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-DGXS...  On   | 00000000:07:00.0  On |                    0 |
| N/A   39C    P0    38W / 300W |     42MiB / 32496MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-DGXS...  On   | 00000000:08:00.0 Off |                    0 |
| N/A   38C    P0    39W / 300W |     10MiB / 32499MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-DGXS...  On   | 00000000:0E:00.0 Off |                    0 |
| N/A   39C    P0    39W / 300W |     10MiB / 32499MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-DGXS...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   39C    P0    39W / 300W |     10MiB / 32499MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1612      G   /usr/lib/xorg/Xorg                            38MiB |
+-----------------------------------------------------------------------------+
- Using gpustat:
$ gpustat
mesodgx  Thu Jan 10 15:28:40 2019
[0] Tesla V100-DGXS-32GB | 39'C,  0 % |    42 / 32496 MB | root(38M)
[1] Tesla V100-DGXS-32GB | 38'C,  0 % |    10 / 32499 MB |
[2] Tesla V100-DGXS-32GB | 39'C,  0 % |    10 / 32499 MB |
[3] Tesla V100-DGXS-32GB | 39'C,  0 % |    10 / 32499 MB |
- Using Ganglia: you can bookmark this link in your browser.
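To keep any of these views refreshing continuously, you can wrap them in the standard watch command, for example:
$ watch -n 1 gpustat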
Miscellaneous
To monitor overall GPU usage with 1-second update intervals:
$ nvidia-smi dmon
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    41    52    50     0     0     0     0   877   135
    1    42    50    52     0     0     0     0   877   135
    2    43    51    50     0     0     0     0   877   135
    3    42    51    51     0     0     0     0   877   135
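The refresh interval can be changed with the -d flag, for example every 5 seconds:
$ nvidia-smi dmon -d 5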
To monitor per-process GPU usage with 1-second update intervals:
$ nvidia-smi pmon
# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0       1604     G     0     0     0     0   Xorg
    1          -     -     -     -     -     -   -
    2          -     -     -     -     -     -   -
    3          -     -     -     -     -     -   -
To view system topology and connectivity:
$ nvidia-smi topo --matrix
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      NV1     NV1     NV2     0-39
GPU1    NV1      X      NV2     NV1     0-39
GPU2    NV1     NV2      X      NV1     0-39
GPU3    NV2     NV1     NV1      X      0-39
To view the NVLink status:
$ nvidia-smi nvlink --status
GPU 0: Tesla V100-DGXS-32GB (UUID: GPU-977ee796-0365-88e1-7ffe-46dc1ad9762b)
         Link 0: 25.781 GB/s
         Link 1: 25.781 GB/s
         Link 2: 25.781 GB/s
         Link 3: 25.781 GB/s
GPU 1: Tesla V100-DGXS-32GB (UUID: GPU-8bd233a4-5282-2737-c598-533b3904dc93)
         Link 0: 25.781 GB/s
         Link 1: 25.781 GB/s
         Link 2: 25.781 GB/s
         Link 3: 25.781 GB/s
GPU 2: Tesla V100-DGXS-32GB (UUID: GPU-be326514-bb08-9a38-8ea5-b58f7f30058b)
         Link 0: 25.781 GB/s
         Link 1: 25.781 GB/s
         Link 2: 25.781 GB/s
         Link 3: 25.781 GB/s
GPU 3: Tesla V100-DGXS-32GB (UUID: GPU-a1bfa66e-38fe-d37b-f394-b7bcf36dec81)
         Link 0: 25.781 GB/s
         Link 1: 25.781 GB/s
         Link 2: 25.781 GB/s
         Link 3: 25.781 GB/s