Testing Dell C4140 Node
# rpm -q centos-release
centos-release-8.0-0.1905.0.9.el8.x86_64
# uname -a
Linux node2-70 4.18.0-80.11.2.el8_0.x86_64 #1 SMP Tue Sep 24 11:32:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
# lspci -nn | grep -i nvidia
1a:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)
1c:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)
1d:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)
1e:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)
# nvidia-smi
Mon Jan 13 14:07:30 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:1A:00.0 Off |                    0 |
| N/A   37C    P0    44W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:1C:00.0 Off |                    0 |
| N/A   34C    P0    42W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:1D:00.0 Off |                    0 |
| N/A   33C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:1E:00.0 Off |                    0 |
| N/A   36C    P0    45W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Using gpu-burn tools
Activate persistence mode:
# nvidia-smi -pm 1
Enabled persistence mode for GPU 00000000:1A:00.0.
Enabled persistence mode for GPU 00000000:1C:00.0.
Enabled persistence mode for GPU 00000000:1D:00.0.
Enabled persistence mode for GPU 00000000:1E:00.0.
Default execution
By default, gpu_burn runs on all available GPUs.
$ ./gpu_burn
Run length not specified in the command line. Burning for 10 secs
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-ca51cbd1-4eb0-f265-9cb6-613f848d9ebd)
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-d02431f6-3ad3-e257-2858-8e830e06fa56)
GPU 2: Tesla V100-SXM2-32GB (UUID: GPU-2c34530c-3db6-3cd2-df6b-39f65d3448c2)
GPU 3: Tesla V100-SXM2-32GB (UUID: GPU-c8740a89-14af-b1f3-d17c-629cb2454737)
Couldn't init a GPU test: Error:
The computation kept running on the other three GPUs. GPU 3 (PCI address 1e:00) appears to be in an error state, according to dmesg:
[ 2432.287866] NVRM: Xid (PCI:0000:1e:00): 48, pid=2442, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at partition 5, subpartition 1.
[ 2432.288198] NVRM: Xid (PCI:0000:1e:00): 64, pid=2442, Dynamic Page Retirement: Blacklisting failed. Page should have been blacklisted on boot(0x00000000007fffa9).
[ 2432.714101] NVRM: Xid (PCI:0000:1e:00): 64, pid=2442, Dynamic Page Retirement: Blacklisting failed. Page should have been blacklisted on boot(0x00000000007fffb8).
[ 2434.693780] NVRM: Xid (PCI:0000:1e:00): 45, pid=2442, Ch 00000000
[ 2434.694046] NVRM: Xid (PCI:0000:1e:00): 45, pid=2442, Ch 00000001
[ 2434.694263] NVRM: Xid (PCI:0000:1e:00): 45, pid=2442, Ch 00000002
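To see at a glance which GPU and which Xid codes are involved, the kernel log can be summarized per PCI address. The following is a minimal sketch, assuming a POSIX shell with sed, sort, uniq and awk available; the function name xid_summary is hypothetical and is not part of gpu-burn or the NVIDIA driver.

```shell
# xid_summary (hypothetical helper): read kernel log lines on stdin and
# print one line per (PCI address, Xid code) pair with its occurrence count.
xid_summary() {
    # Keep only NVRM Xid lines and reduce each to "<pci-address> <xid-code>".
    sed -n 's/.*NVRM: Xid (PCI:\([0-9a-fA-F:.]*\)): \([0-9]*\),.*/\1 \2/p' |
        sort |
        uniq -c |
        awk '{ print $2, "xid=" $3, "count=" $1 }'
}
```

It would be used as "dmesg | xid_summary". For the log above, every entry belongs to 0000:1e:00: one Xid 48 (the double bit error), two Xid 64 (the failed page blacklisting) and three Xid 45.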
After this run, nvidia-smi no longer worked:
# nvidia-smi
Unable to determine the device handle for GPU 0000:1E:00.0: Unknown Error
It looks like GPU 3 is faulty. Let's run the test on this GPU only:
# CUDA_VISIBLE_DEVICES=3 ./gpu_burn
Error message in the kernel log:
NVRM: Xid (PCI:0000:1e:00): 45, pid=2769, Ch 00000001
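The single-GPU run above selects the device by index. By default CUDA enumerates devices fastest-first rather than in PCI bus order, so index 3 is not guaranteed to be the card nvidia-smi lists as GPU 3 unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set. Selecting the card by UUID avoids the ambiguity. The following is a minimal sketch, assuming nvidia-smi still responds (which was not the case on this node after the failure); the helper name pick_uuid is hypothetical.

```shell
# pick_uuid BUS_ID (hypothetical helper): read "pci.bus_id, uuid" CSV rows on
# stdin and print the UUID of the row whose bus ID matches BUS_ID,
# ignoring letter case.
pick_uuid() {
    awk -F', *' -v want="$1" 'tolower($1) == tolower(want) { print $2 }'
}

# Intended use on a node with a working driver (not executed here):
#   uuid=$(nvidia-smi --query-gpu=pci.bus_id,uuid --format=csv,noheader |
#          pick_uuid 00000000:1E:00.0)
#   CUDA_VISIBLE_DEVICES="$uuid" ./gpu_burn
```

An alternative that keeps the numeric index is to prefix the command with CUDA_DEVICE_ORDER=PCI_BUS_ID, so that CUDA indices follow bus order.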