Using the DGX station

The DGX station is open only to authorized users (I-SITE ADVANCES). For security reasons it is isolated from the rest of the cluster.

  • This node is shared between users and is not managed by a resource manager (GridEngine). Please do not monopolize resources.
  • The network HOME and WORK folders are not available.
  • User with root access: raphael.couturier@univ-fcomte.fr

NVIDIA DGX Station

CPU

  • Processor model: Intel Xeon E5-2698 v4, 2.2 GHz
  • Number of cores: 20
  • Memory: 256 GB DDR4

4x GPU

  • Tesla V100 (32 GB)
  • Total GPU memory: 128 GB (4x 32 GB)
  • Total NVIDIA Tensor Cores: 2,560
  • Total NVIDIA CUDA® Cores: 20,480
  • Storage: data: 3x 1.92 TB SSD in RAID 0; system: 1x 1.92 TB SSD
  • NVLink interconnect: 300 GB/s
  • Deep learning performance: 112 TFLOPS per GPU
  • Double-precision performance: 7.8 TFLOPS per GPU
  • Power consumption: 300 W per GPU

How to connect

  • SSH is used to access the DGX using the host name:

mesodgx.univ-fcomte.fr

$ ssh login@mesodgx.univ-fcomte.fr

Once connected, a local HOME directory is created.

  • For a graphical session, use x2goclient and set the session type to MATE.

Users from outside the University of Franche-Comté need a VPN to access the DGX. Please contact us if you encounter any problem.

Running software

SGE is not available on this host; applications are executed directly from the terminal.

  • Applications are killed when the SSH session is closed. Use the screen command to keep sessions open (see the example after this list).
  • Since the host is shared between users, please check which GPU cards are in use with the nvidia-smi command before starting a job.
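
A minimal screen workflow (the session name "training" is illustrative), followed by the standard CUDA_VISIBLE_DEVICES variable to restrict a job to a free GPU:

$ screen -S training                                   # start a named session
$ python my_python_program.py                          # launch the job inside it
# detach with Ctrl-a d; the program keeps running after logout
$ screen -r training                                   # reattach later

$ nvidia-smi                                           # check which GPUs are busy
$ CUDA_VISIBLE_DEVICES=2 python my_python_program.py   # run on GPU 2 only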

Vanilla Python

Python 2.* and Python 3.* are installed by default, along with several common packages.

Use the pip list command to view installed packages:

$ pip list
gpustat             0.5.0    
grpcio              1.18.0   
h5py                2.9.0    
Keras               2.2.4    
Keras-Applications  1.0.6    
Keras-Preprocessing 1.0.5    
Markdown            3.0.1    
numpy               1.16.0   
nvidia-ml-py        7.352.0  
pbr                 5.1.1    
protobuf            3.6.1    
scipy               1.2.0    
setuptools          20.7.0   
six                 1.10.0   
tensorboard         1.12.2 
tensorflow-gpu      1.12.0

Run a Python program:

[user@mesodgx~]$ python my_python_program.py     

You can install new packages without root access using the --user flag. For example:

$ pip install --user package
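
Packages installed this way land in ~/.local; if a package provides command-line tools, make sure ~/.local/bin is on your PATH. A minimal sketch (tqdm is just an example package):

$ pip install --user tqdm
$ export PATH="$HOME/.local/bin:$PATH"   # add this line to ~/.bashrc to make it permanent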

Anaconda

For better performance, Anaconda Python 3 is installed in /opt/anaconda3. Several optimized scientific packages are available.

To use Anaconda, add these lines to your .bashrc file:

export PATH="/opt/anaconda3/bin:$PATH"
. /opt/anaconda3/etc/profile.d/conda.sh
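
After reloading your shell, check that the Anaconda interpreter is the one picked up (the path shown is indicative):

$ source ~/.bashrc
$ which python
/opt/anaconda3/bin/python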

Check for available packages:

$ conda list
mkl                       2019.1                      144  
mkl-service               1.1.2            py37he904b0f_5  
mkl_fft                   1.0.6            py37hd81dba3_0  
mkl_random                1.0.2            py37hd81dba3_0  
more-itertools            4.3.0                    py37_0  
mpc                       1.1.0                h10f8cd9_1  
mpfr                      4.0.1                hdf1c602_3  
mpmath                    1.1.0                    py37_0  
msgpack-python            0.5.6            py37h6bb024c_1  
multipledispatch          0.6.0                    py37_0  
navigator-updater         0.2.1                    py37_0  
nbconvert                 5.4.0                    py37_1  
nbformat                  4.4.0                    py37_0  
ncurses                   6.1                  he6710b0_1  
networkx                  2.2                      py37_1  
nltk                      3.4                      py37_1  
nose                      1.3.7                    py37_2  
notebook                  5.7.4                    py37_0  

You may need to install additional packages without having root access. We recommend building a virtual environment locally and installing packages there.

  1. Create a conda environment
    $ conda create -y -n conda-env

    replace conda-env with a name of your choice.

  2. Activate the environment you just created
    $ source activate conda-env 
  3. Download and install a conda package in the environment
     $ conda install package

    replace package with the package you want to install, for example pandas.

  4. Run your analysis from within the environment
  5. Once you are done with your analysis, deactivate the conda environment by running:
    $ source deactivate

Example: installing Tensorflow GPU

$ conda create -y -n tensorflowGPU
$ conda activate tensorflowGPU
(tensorflowGPU) user@mesodgx:~$ conda install -c anaconda tensorflow-gpu
(tensorflowGPU) user@mesodgx:~$ python
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
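
To quickly confirm that the environment actually sees the GPUs, the TF 1.x helper below can be used (it returns True when a usable CUDA device is found):

>>> tf.test.is_gpu_available()
True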

Use

conda info --envs

to view all your environments.

Docker

Before pulling an image, check the images already downloaded:

$ docker images 
REPOSITORY                  TAG                 IMAGE ID            CREATED             SIZE
tensorflow/tensorflow       latest-gpu          58a8e83b7dbf        2 months ago        3.36GB
tensorflow/tensorflow       latest              2054925f3b43        2 months ago        1.34GB
alpine                      latest              196d12cf6ab1        3 months ago        4.41MB
nvcr.io/nvidia/caffe2       18.08-py3           e82334d03b18        5 months ago        3.02GB
nvcr.io/nvidia/tensorflow   18.01-py2           377b46c75bfc        12 months ago       2.88GB

Pull the image from the Docker registry

  1. Pull the latest TensorFlow image
    $ docker pull tensorflow/tensorflow:latest-gpu 
  2. Check the image
    $ docker images
  3. Run the container
    $ nvidia-docker run -it --rm    tensorflow/tensorflow:latest-gpu
    [I 11:08:25.330 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
    [I 11:08:25.354 NotebookApp] Serving notebooks from local directory: /notebooks
    [I 11:08:25.354 NotebookApp] The Jupyter Notebook is running at:
    [I 11:08:25.354 NotebookApp] http://(6575ea53c8ff or 127.0.0.1):8888/?token=1769f40c3571e3b66c83ee22da1b5dab5b40a325c5682642
    [I 11:08:25.354 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
    [C 11:08:25.354 NotebookApp] 
     
        Copy/paste this URL into your browser when you connect for the first time,
        to login with a token:
            http://(6575ea53c8ff or 127.0.0.1):8888/?token=1769f40c3571e3b66c83ee22da1b5dab5b40a325c5682642

    This starts a Jupyter Notebook server inside the container.

  4. Connect to the notebook using the URL above, replacing 127.0.0.1 with mesodgx.
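
Depending on the Docker network configuration of the host, you may need to publish the notebook port explicitly for it to be reachable from outside (a sketch; 8888 is the Jupyter default):

$ nvidia-docker run -it --rm -p 8888:8888 tensorflow/tensorflow:latest-gpu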

Bind your containers to specific GPUs:

$ NV_GPU=0,1 nvidia-docker run [...]
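
For example, to expose only GPU 2 inside the container (NV_GPU takes a comma-separated list of device indices):

$ NV_GPU=2 nvidia-docker run -it --rm tensorflow/tensorflow:latest-gpu nvidia-smi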

Pull the image from the NVIDIA registry

Log on to the NVIDIA GPU CLOUD and follow the instructions.

Singularity

Singularity enables users to have full control of their environment. Singularity containers can be used to package entire scientific workflows, software, libraries, and even data. Singularity can pull and use images from Docker.

Pull the image from the Docker registry

Docker Hub (https://hub.docker.com) is the world's largest library and community for container images. You can browse and download (pull) Docker images from it.

For example, to pull the latest TensorFlow image from the Docker registry:

[user@mesodgx ~]$ singularity pull tensorflow-gpu.simg docker://tensorflow/tensorflow:latest-gpu

The image (tensorflow-gpu.simg) will be downloaded and stored in the current working directory.
And then we run our program within the container:

[user@mesodgx ~]$ singularity exec --nv tensorflow-gpu.simg python my_python_program.py

Notice that we prepend the original URL with the docker:// prefix.

Pull the image from the NVIDIA registry

See the NVIDIA GPU CLOUD and proceed as for Docker containers.
For example, to pull Tensorflow image:

$ singularity pull tensorflow-18.12-py3.simg docker://nvcr.io/nvidia/tensorflow:18.12-py3

This will download the image in the current directory.
And to run interactive shell inside the container:

$ singularity shell --nv tensorflow-18.12-py3.simg
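
As with the Docker Hub image, a program can also be run non-interactively inside this container:

$ singularity exec --nv tensorflow-18.12-py3.simg python my_python_program.py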

Monitoring

  • Using nvidia-smi, one can show GPU usage statistics:
$ nvidia-smi 
Thu Jan 10 15:27:43 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.145                Driver Version: 384.145                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-DGXS...  On   | 00000000:07:00.0  On |                    0 |
| N/A   39C    P0    38W / 300W |     42MiB / 32496MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-DGXS...  On   | 00000000:08:00.0 Off |                    0 |
| N/A   38C    P0    39W / 300W |     10MiB / 32499MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-DGXS...  On   | 00000000:0E:00.0 Off |                    0 |
| N/A   39C    P0    39W / 300W |     10MiB / 32499MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-DGXS...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   39C    P0    39W / 300W |     10MiB / 32499MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
 
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1612      G   /usr/lib/xorg/Xorg                            38MiB |
+-----------------------------------------------------------------------------+
  • Using gpustat:
$  gpustat 
mesodgx  Thu Jan 10 15:28:40 2019
[0] Tesla V100-DGXS-32GB | 39'C,   0 % |    42 / 32496 MB | root(38M)
[1] Tesla V100-DGXS-32GB | 38'C,   0 % |    10 / 32499 MB |
[2] Tesla V100-DGXS-32GB | 39'C,   0 % |    10 / 32499 MB |
[3] Tesla V100-DGXS-32GB | 39'C,   0 % |    10 / 32499 MB |
  • Using Ganglia: you can bookmark the monitoring page in your browser.

To monitor overall GPU usage with 1-second update intervals:

$ nvidia-smi dmon
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    41    52    50     0     0     0     0   877   135
    1    42    50    52     0     0     0     0   877   135
    2    43    51    50     0     0     0     0   877   135
    3    42    51    51     0     0     0     0   877   135

To monitor per-process GPU usage with 1-second update intervals:

$ nvidia-smi pmon
# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0       1604     G     0     0     0     0   Xorg           
    1          -     -     -     -     -     -   -              
    2          -     -     -     -     -     -   -              
    3          -     -     -     -     -     -   -      

To view the system topology and connectivity (NV1/NV2 indicate a connection over one or two NVLink links):

$ nvidia-smi topo --matrix
	GPU0	GPU1	GPU2	GPU3	CPU Affinity
GPU0	 X 	NV1	NV1	NV2	0-39
GPU1	NV1	 X 	NV2	NV1	0-39
GPU2	NV1	NV2	 X 	NV1	0-39
GPU3	NV2	NV1	NV1	 X 	0-39

To view the NVLink status and per-link bandwidth:

$  nvidia-smi nvlink --status
GPU 0: Tesla V100-DGXS-32GB (UUID: GPU-977ee796-0365-88e1-7ffe-46dc1ad9762b)
	 Link 0: 25.781 GB/s
	 Link 1: 25.781 GB/s
	 Link 2: 25.781 GB/s
	 Link 3: 25.781 GB/s
GPU 1: Tesla V100-DGXS-32GB (UUID: GPU-8bd233a4-5282-2737-c598-533b3904dc93)
	 Link 0: 25.781 GB/s
	 Link 1: 25.781 GB/s
	 Link 2: 25.781 GB/s
	 Link 3: 25.781 GB/s
GPU 2: Tesla V100-DGXS-32GB (UUID: GPU-be326514-bb08-9a38-8ea5-b58f7f30058b)
	 Link 0: 25.781 GB/s
	 Link 1: 25.781 GB/s
	 Link 2: 25.781 GB/s
	 Link 3: 25.781 GB/s
GPU 3: Tesla V100-DGXS-32GB (UUID: GPU-a1bfa66e-38fe-d37b-f394-b7bcf36dec81)
	 Link 0: 25.781 GB/s
	 Link 1: 25.781 GB/s
	 Link 2: 25.781 GB/s
	 Link 3: 25.781 GB/s

Tutorial and Labs

Links