Nvidia DGX-2 SYSTEM User Manual page 83

NVSwitch
NVSwitch assignments are optimized for NVLink peer-to-peer performance.
NVLink
An NVLink connection links a GPU to the NVSwitch fabric.
Each NVLink connection provides up to 25 GB/s of bandwidth in each direction.
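As a rough illustration of what the per-link figure implies, the sketch below multiplies it out to a per-GPU aggregate. The 25 GB/s value comes from the text above; the count of six links per GPU is an assumption that this section does not state, and the function name is made up for the example.

```python
# Back-of-the-envelope NVLink bandwidth per GPU.
PER_LINK_GB_S_ONE_WAY = 25  # GB/s per link in one direction (from the text above)
LINKS_PER_GPU = 6           # ASSUMPTION: link count is not given in this section


def nvlink_bandwidth(links=LINKS_PER_GPU, per_link=PER_LINK_GB_S_ONE_WAY):
    """Return (one-direction, both-directions) aggregate bandwidth in GB/s."""
    one_way = links * per_link
    return one_way, 2 * one_way


if __name__ == "__main__":
    one_way, both_ways = nvlink_bandwidth()
    print(f"per GPU: {one_way} GB/s each way, {both_ways} GB/s total")
```

These are peak figures; measured peer-to-peer throughput will generally be lower.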
11.9.3 NVIDIA KVM Security Considerations
Consult the security policies of your organization to determine firewall needs and
settings.
11.9.4 Launching VMs in Degraded Mode
On DGX-2 KVM systems, degraded mode is a mechanism that allows one or more GPUs
to fail without affecting the operation or creation of other VMs on the server. This allows
the DGX-2 System to run GPU VMs with fewer than 16 GPUs present. System
administrators can then keep a subset of GPU VMs available for use while waiting to
replace GPUs that may have failed.
When the DGX-2 is Put in Degraded Mode
The following types of GPU errors put the system in degraded mode:
GPU double-bit ECC errors
GPU failure to enumerate on the PCIe bus
GPU-side NVLink training error
GPU-side unexpected XID error
To identify failed GPUs, the KVM host automatically polls the state of all GPUs in the
system at the following times:
When the DGX-2 System boots, to capture the initial state of the GPUs
On a nightly basis
Upon launching a VM
When a failed GPU is identified by the software, the DGX-2 System is marked as
'degraded' and operates in degraded mode until all bad GPUs are replaced.
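The bookkeeping described above can be modeled in a few lines. The following is a minimal sketch, not NVIDIA's implementation: the class, enum, and method names are hypothetical, and the only facts taken from the text are the four fault types, the 16-GPU count, and the rule that the system stays degraded until every failed GPU is replaced.

```python
from dataclasses import dataclass, field
from enum import Enum


class GpuFault(Enum):
    """Fault types that the text lists as triggering degraded mode."""
    DOUBLE_BIT_ECC = "double-bit ECC error"
    PCIE_ENUMERATION = "failure to enumerate on the PCIe bus"
    NVLINK_TRAINING = "GPU-side NVLink training error"
    UNEXPECTED_XID = "GPU-side unexpected XID error"


@dataclass
class SystemState:
    """Hypothetical model of the host's view of GPU health."""
    total_gpus: int = 16
    failed: dict = field(default_factory=dict)  # GPU index -> GpuFault

    def record_poll(self, faults):
        """Merge the faults found by one poll (boot, nightly, or VM launch)."""
        self.failed.update(faults)

    def replace_gpu(self, index):
        """Clear a GPU's failed status once it has been replaced."""
        self.failed.pop(index, None)

    @property
    def degraded(self):
        """True while at least one failed GPU has not been replaced."""
        return bool(self.failed)

    def healthy_gpus(self):
        """Indices of GPUs that have no recorded fault."""
        return [i for i in range(self.total_gpus) if i not in self.failed]


if __name__ == "__main__":
    state = SystemState()
    state.record_poll({3: GpuFault.DOUBLE_BIT_ECC})
    print(state.degraded, len(state.healthy_gpus()))
    state.replace_gpu(3)
    print(state.degraded)
```

Keeping the failed set as the single source of truth means the degraded flag cannot drift out of sync with the list of GPUs awaiting replacement.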
Creating VMs with the DGX-2 System in Degraded Mode
You can still create guest GPU VMs on a DGX-2 System in degraded mode as long as
you do not try to assign a failed GPU. If you attempt to create a VM with a failed GPU
DGX-2 System User Guide
Using DGX-2 System in KVM Mode