Nvidia DGX-2 SYSTEM User Manual page 84

after its state has been marked as 'bad' by the system, the VM will fail to start and an
appropriate error message is returned. Restarting an existing VM after a GPU fails will
result in the same failure and error message.
The following example shows an attempt to launch a VM when GPUs 12 and 13 have been
marked as degraded or failed.
nvidia-vm create --gpu-count 8 --gpu-index 8
ERROR: GPU 12 is in unexpected state "missing", can't use it - BDF:e0:00.0 SXMID:13 UUID:GPU-b7187786-d894-2266-d11d-21124dc61dd3
ERROR: GPU 13 is in unexpected state "missing", can't use it - BDF:e2:00.0 SXMID:16 UUID:GPU-9a6a6a52-c6b6-79c3-086b-fcf2d5b1c87e
ERROR: 2 GPU's are unavailable, unable to start this VM "dgx2vm-labMon1559-8g8-15"
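The error lines in this example follow a regular pattern, so a script can extract which GPUs were rejected. The following is a minimal sketch in Python, assuming the format shown here is representative of what the tool prints. The function name `parse_unavailable_gpus` and the regular expression are illustrative and are not part of nvidia-vm.

```python
import re

# Pattern inferred from the sample output in this section (an assumption);
# other software releases may format these lines differently.
GPU_ERROR = re.compile(
    r'ERROR: GPU (?P<index>\d+) is in unexpected state "(?P<state>[^"]+)"'
    r".*?BDF:(?P<bdf>\S+)\s+SXMID:(?P<sxmid>\d+)\s+UUID:(?P<uuid>\S+)",
    re.DOTALL,  # lets the match continue when a message wraps onto a second line
)


def parse_unavailable_gpus(output):
    """Return one dict per GPU that the launch output reports as unusable."""
    return [match.groupdict() for match in GPU_ERROR.finditer(output)]


# Sample text copied from the example in this section.
SAMPLE = '''ERROR: GPU 12 is in unexpected state "missing", can't use it -
BDF:e0:00.0 SXMID:13 UUID:GPU-b7187786-d894-2266-d11d-21124dc61dd3
ERROR: GPU 13 is in unexpected state "missing", can't use it -
BDF:e2:00.0 SXMID:16 UUID:GPU-9a6a6a52-c6b6-79c3-086b-fcf2d5b1c87e
ERROR: 2 GPU's are unavailable, unable to start this VM "dgx2vm-labMon1559-8g8-15"
'''

for gpu in parse_unavailable_gpus(SAMPLE):
    print(gpu["index"], gpu["state"], gpu["bdf"], gpu["uuid"])
```

The summary line ("2 GPU's are unavailable") is intentionally not matched, because it carries no per-GPU details.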
Note: If you attempt to launch a VM with a failed GPU before the system has
identified its failed state, the VM will fail to launch but without an error
message. If this happens, keep trying to launch the VM until the message
appears.
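The retry advice in this note can be automated. The following is a minimal sketch in Python, not a supported tool. It assumes, without confirmation from this guide, that a successful launch exits with status 0 and that the GPU error message contains the text "ERROR:". The helper names, the attempt limit, and the delay are hypothetical. The launch command is the one used in the example in this section.

```python
import subprocess
import time

# Command copied from the example in this section.
LAUNCH_CMD = ["nvidia-vm", "create", "--gpu-count", "8", "--gpu-index", "8"]


def run_launch():
    """Run the launch command once and return (exit_code, combined_output)."""
    proc = subprocess.run(LAUNCH_CMD, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def launch_with_retry(run=run_launch, attempts=5, delay_s=10.0, sleep=time.sleep):
    """Retry the launch until it succeeds or an error message explains the failure.

    Assumptions (not confirmed by the guide): exit status 0 means the VM
    started, and a reported GPU failure includes the text "ERROR:".
    """
    output = ""
    for attempt in range(1, attempts + 1):
        code, output = run()
        if code == 0:
            return "started", output
        if "ERROR:" in output:
            # The system has identified the failed GPU; retrying will not help.
            return "gpu-error", output
        if attempt < attempts:
            # Silent failure: give the system time to mark the GPU state.
            sleep(delay_s)
    return "gave-up", output
```

Calling `status, output = launch_with_retry()` on the host returns `"gpu-error"` together with the captured message once the system has marked the GPU. The `run` and `sleep` parameters exist only so that the loop can be exercised without a DGX-2.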
Restarting a VM After the System or VM Crashes
Some GPU errors may cause the VM or the system to crash.
If the system crashes, you can attempt to restart the VM.
If the VM crashes (but not the system), you can attempt to restart the VM.
Your VM should restart successfully if none of the associated GPUs failed. However, if
one or more of the GPUs associated with your VM failed, then the response depends on
whether the system has had a chance to identify the GPU as unavailable.
Failed GPU identified as unavailable
The system will return an error indicating that the GPU is missing or unavailable and
that the VM is unable to start.
Failed GPU not yet identified as unavailable
The VM crashes upon being restarted.
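The restart cases described in this section can be summarized as a small decision function. This is an illustrative sketch in Python only: the function name, its parameters, and the returned strings are hypothetical labels for the behavior described here, not output produced by the system.

```python
def expected_restart_outcome(gpu_failed, identified_as_unavailable):
    """Map the cases described in this section to the expected restart result.

    gpu_failed: True if one or more GPUs associated with the VM failed.
    identified_as_unavailable: True if the system has already marked the
        failed GPU as unavailable (ignored when no GPU failed).
    """
    if not gpu_failed:
        return "VM restarts successfully"
    if identified_as_unavailable:
        return "error returned: GPU missing or unavailable, VM unable to start"
    return "VM crashes upon being restarted"


# Print the three cases side by side.
for failed, identified in [(False, False), (True, True), (True, False)]:
    print(failed, identified, "->", expected_restart_outcome(failed, identified))
```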
Restoring a System from Degraded Mode
All GPUs need to be replaced to restore the DGX-2 from degraded mode.
DGX-2 System User Guide
Using DGX-2 System in KVM Mode
