Testing Dell C4140 Node

# rpm -q centos-release 
centos-release-8.0-0.1905.0.9.el8.x86_64
# uname -a
Linux node2-70 4.18.0-80.11.2.el8_0.x86_64 #1 SMP Tue Sep 24 11:32:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
# lspci -nn | grep -i nvidia
1a:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)
1c:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)
1d:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)
1e:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)
# nvidia-smi 
Mon Jan 13 14:07:30 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:1A:00.0 Off |                    0 |
| N/A   37C    P0    44W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:1C:00.0 Off |                    0 |
| N/A   34C    P0    42W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:1D:00.0 Off |                    0 |
| N/A   33C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:1E:00.0 Off |                    0 |
| N/A   36C    P0    45W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
 
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Activate persistence mode:

# nvidia-smi -pm 1
Enabled persistence mode for GPU 00000000:1A:00.0.
Enabled persistence mode for GPU 00000000:1C:00.0.
Enabled persistence mode for GPU 00000000:1D:00.0.
Enabled persistence mode for GPU 00000000:1E:00.0.
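
Persistence mode can then be verified, for instance with:

# nvidia-smi --query-gpu=index,name,persistence_mode --format=csv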

Default execution

By default, gpu_burn will choose all available GPUs.
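
A run length in seconds can also be given as the first argument; e.g., for a 60-second burn (value chosen arbitrarily):

$ ./gpu_burn 60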

$ ./gpu_burn 
Run length not specified in the command line.  Burning for 10 secs
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-ca51cbd1-4eb0-f265-9cb6-613f848d9ebd)
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-d02431f6-3ad3-e257-2858-8e830e06fa56)
GPU 2: Tesla V100-SXM2-32GB (UUID: GPU-2c34530c-3db6-3cd2-df6b-39f65d3448c2)
GPU 3: Tesla V100-SXM2-32GB (UUID: GPU-c8740a89-14af-b1f3-d17c-629cb2454737)
 
Couldn't init a GPU test: Error: 

The computation keeps running on the other 3 GPUs. GPU 3 seems to be in error, according to dmesg:

[ 2432.287866] NVRM: Xid (PCI:0000:1e:00): 48, pid=2442, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at partition 5, subpartition 1.
[ 2432.288198] NVRM: Xid (PCI:0000:1e:00): 64, pid=2442, Dynamic Page Retirement: Blacklisting failed.  Page should have been blacklisted on boot(0x00000000007fffa9).
[ 2432.714101] NVRM: Xid (PCI:0000:1e:00): 64, pid=2442, Dynamic Page Retirement: Blacklisting failed.  Page should have been blacklisted on boot(0x00000000007fffb8).
[ 2434.693780] NVRM: Xid (PCI:0000:1e:00): 45, pid=2442, Ch 00000000
[ 2434.694046] NVRM: Xid (PCI:0000:1e:00): 45, pid=2442, Ch 00000001
[ 2434.694263] NVRM: Xid (PCI:0000:1e:00): 45, pid=2442, Ch 00000002
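
For reference, Xid 48 is an uncorrectable double-bit ECC error, Xid 64 reports a dynamic page retirement (blacklisting) failure, and Xid 45 is the preemptive teardown of the channels after those errors. While the driver still responds, the retired-page counters of the suspect card can be checked with something like:

# nvidia-smi -q -d PAGE_RETIREMENT -i 3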

After this run, we could not run nvidia-smi anymore:

# nvidia-smi 
Unable to determine the device handle for GPU 0000:1E:00.0: Unknown Error
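
When the driver has lost the device handle like this, a GPU reset can be attempted, but it may well fail in this state; reloading the nvidia kernel modules or rebooting the node is then the usual remedy:

# nvidia-smi --gpu-reset -i 3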

It looks like GPU 3 is not working. Let's try to run on this GPU only:

# CUDA_VISIBLE_DEVICES=3 ./gpu_burn

The Xid 45 error shows up again in dmesg:

NVRM: Xid (PCI:0000:1e:00): 45, pid=2769, Ch 00000001
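
Note that CUDA's default device enumeration is not guaranteed to match the nvidia-smi/PCI ordering, so the index passed to CUDA_VISIBLE_DEVICES may not map to the same physical card; pinning the enumeration to PCI bus order avoids the ambiguity. A sketch for burning only the three remaining GPUs (run length chosen arbitrarily):

# CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2 ./gpu_burn 60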