Page 10
Chapter 7: Special Features and Configuration Chapters 8-10: Software and firmware update instructions Chapter 11: How to use the BMC Chapter 12: How to configure and use the DGX-2 System as a Kernel Virtual Machine host...
Page 12
Note: The DGX-2 will not operate with less than three PSUs. WARNING: To avoid electric shock or fire, do not connect other power cords to the DGX-2. For more details, see B.6. Electrical Precautions.
Page 20
Note: Some of the documentation listed below is not available at the time of publication. See https://docs.nvidia.com/dgx/ for the latest status. ...
Page 21
NVIDIA Enterprise Support portal enterprisesupport@nvidia.com. NVIDIA Enterprise Support Phone Numbers...
Page 22
DGX OS Server software installs Docker CE which uses the 172.17.xx.xx subnet by default for Docker containers. If the DGX-2 System is on the same subnet, you will not be able to establish a network connection to the DGX-2 System.
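A quick way to spot this conflict ahead of time is to test whether the address planned for the system falls inside Docker's default 172.17.0.0/16 range. The following is a minimal sketch, not part of DGX OS: the function name `in_docker_default_subnet` is hypothetical, and it only compares the first two octets of an IPv4 address.

```shell
# Hypothetical helper (not shipped with DGX OS): succeeds when the given
# IPv4 address starts with 172.17., i.e. lies in Docker's default bridge range.
in_docker_default_subnet() {
  case "$1" in
    172.17.*.*) return 0 ;;   # inside 172.17.0.0/16
    *)          return 1 ;;   # anywhere else
  esac
}

# Example: warn about a planned address.
if in_docker_default_subnet "172.17.4.20"; then
  echo "address conflicts with the default Docker subnet"
fi
```

If the check succeeds for the address you plan to use, the Docker bridge range would likely need to be changed before the system can be reached over the network.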
Page 28
CAUTION: Once you create your login credentials, the default admin/admin login will no longer work. Note: The BMC software will not accept "sysadmin" for a user name. If you create this user name for the system login, "sysadmin" will not be available for logging in to the BMC.
Page 29
Note: After you select the primary network interface, the system attempts to configure the interface for DHCP and then asks you to enter a hostname for the system. If DHCP is not available, you will have the option to configure the network manually. If you need to configure a static IP address on a network interface connected to a DHCP Cancel Network configuration –...
Page 30
NVIDIA GPU Cloud for DGX https://www.nvidia.com/en-us/support/enterprise/...
Page 31
IMPORTANT: It is mandatory that your DGX-2 System be installed by NVIDIA service personnel or a trained Advanced Technology Program (ATP) installation partner. If not performed by NVIDIA or an ATP partner, your DGX-2 hardware warranty will be voided. https://docs.nvidia.com/dgx/ngc-registry-for-dgx-user-guide/...
Page 32
WARNING: Risk of Danger - Removing power cables or using Power Distribution Units (PDUs) to shut off the system while the Operating System is running may cause damage to sensitive components in the DGX-2 server. $ sudo nvsm show health $ sudo docker --version Docker version 18.03-ce...
Page 33
nvidia-docker2 Note: If Docker is updated to 19.03 on a system which already has the nvidia-docker2 package installed, then the instructions for using the NVIDIA Container Runtime for Docker can still be used. docker run --gpus •...
Page 34
$ docker run ... CAUTION: If you build Docker images while nvidia is set as the default runtime, make sure the build scripts executed by the Dockerfile specify the GPU architectures that the container will need. Failure to do so may result in the...
Page 35
Instructions for specifying the GPU architecture depend on the application and are beyond the scope of this document. Consult the specific application build process for guidance.
Page 37
http_proxy="http://myproxy.server.com:8080/" ftp_proxy="ftp://myproxy.server.com:8080/"; https_proxy="https://myproxy.server.com:8080/"; /etc/apt/apt.conf.d/myproxy Acquire::http::proxy "http://<username>:<password>@<host>:<port>/"; Acquire::ftp::proxy "ftp://<username>:<password>@<host>:<port>/"; Acquire::https::proxy "https://<username>:<password>@<host>:<port>/"; Acquire::http::proxy "http://myproxy.server.com:8080/"; Acquire::ftp::proxy "ftp://myproxy.server.com:8080/"; Acquire::https::proxy "https://myproxy.server.com:8080/"; https://docs.docker.com/engine/admin/systemd/#http-proxy 172.17.0.0/16 If your network does not conflict with the default Docker IP address range, then no changes are needed and you can skip this section.
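The three Acquire entries for /etc/apt/apt.conf.d/myproxy listed on this page can be generated rather than typed by hand. The sketch below covers only the unauthenticated form; the function name `make_apt_proxy_conf` is hypothetical, and the host and port are the placeholder values from the page.

```shell
# Hypothetical helper: print the apt proxy entries for a proxy host and port,
# in the same form as the unauthenticated entries shown on this page.
make_apt_proxy_conf() {
  host="$1"
  port="$2"
  printf 'Acquire::http::proxy "http://%s:%s/";\n'   "$host" "$port"
  printf 'Acquire::ftp::proxy "ftp://%s:%s/";\n'     "$host" "$port"
  printf 'Acquire::https::proxy "https://%s:%s/";\n' "$host" "$port"
}

# Preview the file contents without writing anything.
make_apt_proxy_conf myproxy.server.com 8080
```

To apply the result, the output could be piped to `sudo tee /etc/apt/apt.conf.d/myproxy`. Previewing first makes a stray character in a URL easier to catch.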
Page 40
Note: If you cannot access the DGX-2 System remotely, then connect a display (1440x900 or lower resolution) and keyboard directly to the DGX-2 System. sudo ipmitool lan print 1 Set the IP address source to static. sudo ipmitool lan set 1 ipsrc static Set the appropriate address information.
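The address step can be sketched as a dry run that prints the ipmitool commands for review before anything is changed. This is an illustration under assumptions: the function name `bmc_static_ip_cmds` is hypothetical, LAN channel 1 is taken from the commands above, and the `ipaddr`, `netmask`, and `defgw ipaddr` parameters are standard `ipmitool lan set` options rather than text quoted from this page.

```shell
# Hypothetical dry-run helper: print (do not execute) the ipmitool commands
# that switch BMC LAN channel 1 to a static address.
bmc_static_ip_cmds() {
  ip="$1"
  mask="$2"
  gw="$3"
  echo "sudo ipmitool lan set 1 ipsrc static"
  echo "sudo ipmitool lan set 1 ipaddr $ip"
  echo "sudo ipmitool lan set 1 netmask $mask"
  echo "sudo ipmitool lan set 1 defgw ipaddr $gw"
}

# Example addresses are placeholders only.
bmc_static_ip_cmds 192.0.2.10 255.255.255.0 192.0.2.1
```

After running the printed commands, `sudo ipmitool lan print 1` can be used again to confirm the new settings.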
Page 44
Note: If you cannot access the DGX-2 System remotely, then connect a display (1440x900 or lower resolution) and keyboard directly to the DGX-2 System.
Page 46
Note: If you are not returned to the command line prompt after a minute, then reboot the system. https://help.ubuntu.com/lts/serverguide/network- configuration.html.en sudo mst start sudo mst status • MST modules: ------------ MST PCI module is not loaded MST PCI configuration module is not loaded •...
Page 52
Configure an NFS mount for the DGX-2 System. Edit the filesystem tables configuration. sudo vi /etc/fstab Add a new line for the NFS mount, using the local mount point of /mnt.
Page 53
Verify the NFS server is reachable. ping <nfs_server> Mount the NFS export. sudo mount /mnt /mnt Verify caching is enabled. cat /proc/fs/nfsfs/volumes...
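The /etc/fstab entry referred to in this procedure is not reproduced in the excerpt, so the following is only a sketch of what such a line might look like. The function name `nfs_fstab_line` and the option list `rw,fsc,nofail` are assumptions; `fsc` is the standard NFS mount option that enables local caching through cachefilesd, and the guide's recommended options may differ.

```shell
# Hypothetical helper: print an /etc/fstab line that mounts an NFS export
# at /mnt. The option list is an assumption, not the guide's exact text.
nfs_fstab_line() {
  server="$1"
  export_path="$2"
  printf '%s:%s /mnt nfs rw,fsc,nofail 0 0\n' "$server" "$export_path"
}

# Placeholder server name and export path.
nfs_fstab_line nfs.example.com /export/data
```

Once a line like this is in /etc/fstab, `sudo mount /mnt` mounts the export, and the FSC column in /proc/fs/nfsfs/volumes should indicate whether caching is active.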
Page 54
Notes: MaxQ is supported on DGX-2 systems with BMC firmware version 1.04.03 or later. MaxQ is not supported on DGX-2H systems. Commands to switch to MaxP or MaxQ, or to see the current power state, are not supported on DGX-2H systems.
Page 55
$ sudo nvsm set powermode=maxp $ sudo nvsm show chassis/localhost...
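A small wrapper can guard against typos in the mode name before the command is issued. In this sketch the function name `powermode_cmd` is hypothetical, and `maxq` as a value is assumed by analogy with the `maxp` command shown above. The function only prints the command; it does not run nvsm.

```shell
# Hypothetical helper: print the nvsm command for a power mode, rejecting
# anything other than maxp or maxq (maxq is assumed by analogy with maxp).
powermode_cmd() {
  case "$1" in
    maxp|maxq)
      printf 'sudo nvsm set powermode=%s\n' "$1"
      ;;
    *)
      echo "unknown power mode: $1" >&2
      return 1
      ;;
  esac
}

powermode_cmd maxp
```

Keep in mind that, per the notes in this chapter, these commands are not supported on DGX-2H systems.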
Page 61
Obtaining the DGX-2 Software ISO Image and Checksum File DGX-2 • Re-Imaging the System Remotely • Creating a Bootable Installation Medium Re-Imaging the System From a USB Flash Drive Note: The DGX OS Server software is restored on one of the two NVMe M.2 drives.
Page 62
DGX-2 NVIDIA Enterprise Support Re-Imaging the System from a USB Flash Drive DGX-2 Obtaining the DGX-2 Software ISO Image and Checksum File Install DGX Server without formatting RAID. Retaining the RAID Partition While Installing the OS...
Page 63
Note: The Mellanox InfiniBand driver installation may take up to 10 minutes. Setting Up the DGX-2 System Note: If you are restoring the software image remotely through the BMC, you do not need a bootable installation medium and you can omit this task.
Page 64
Note: To ensure that the resulting flash drive is bootable, use the dd command to perform a device bit copy of the image. If you use other commands to perform a simple file copy of the image, the resulting flash drive may not be bootable.
Page 65
Akeo Reliable USB Formatting Utility (Rufus) DGX-2 DGX-2 Akeo Reliable USB Formatting Utility (Rufus) Start.
Page 66
Write in DD Image mode Re-Imaging the System Remotely DGX-2 Install DGX Server without formatting RAID. Retaining the RAID Partition While Installing the OS Enter. Note: The Mellanox InfiniBand driver installation may take up to 10 minutes.
Page 67
Setting Up the DGX-2 System Install DGX Software Install DGX Server without formatting RAID RUN=yes /etc/default/cachefilesd /raid /etc/fstab /raid cachefilesd /etc/default/cachefilesd. /etc/fstab. sudo mount /raid systemctl start cachefilesd...
Page 69
wget CAUTION: These instructions update all software for which updates are available from your configured software sources, including applications that you installed yourself. If you want to prevent an application from being updated, you can instruct the Ubuntu package manager to keep the current version. For more information, see Introduction to Holding Packages on the Ubuntu Community Help Wiki.
Page 70
IMPORTANT: DGX-2H is supported only with firmware container nvfw-dgx2:19.03.1 or later. Do not update the DGX-2H firmware using an earlier container as this will result in version mismatch with the DGX-2H. DGX-2 System Firmware Update Container Release Notes • •...
Page 71
progress. A high workload can disrupt the firmware update process and result in an unusable component. When initiating an update, the update software assists in determining the activity state of the DGX system and provides a warning if it detects that activity levels are above a predetermined threshold.
Page 72
<package-name>.tar.gz <run-file-name>.run Using the .run File sudo docker load -i <package-name>.tar.gz sudo docker images nvfw-dgx2_18.09.3.tar.gz REPOSITORY IMAGE ID CREATED SIZE nvfw-dgx2_18.09.3 latest aa681a4ae600 1 hours ago 278MB nvfw_dgx2_version nvfw_dgx2:tag, nvfw-dgx2_19.03.1.tar.gz REPOSITORY IMAGE ID CREATED SIZE nvfw-dgx2 19.03.1 fec80ce658ef 1 hours ago 532MB sudo docker run --rm --privileged -ti -v /:/hostfs <image-name>...
sudo docker run --privileged -v /:/hostfs <image-name> show_version sudo docker run --rm [-e auto=1] --privileged -ti -v /:/hostfs <image-name> update_fw [-f] <target> <target> SBIOS Note: Other components may be supported beyond those listed here. Query the firmware manifest to see all the components supported by the container. Additional Options [-e auto=1 [-f]...
Page 74
Following components will be updated with new firmware version: SBIOS IMPORTANT: Firmware update is disruptive and may require system reboot. Stop system activities before performing the update. Ok to proceed with firmware update? <Y/N> Note: While the progress output shows the current and manifest firmware versions, the versions may be truncated due to space limitations.
Page 75
Note: Be sure to consult the NVIDIA DGX-2 Firmware Update Container release notes for special instructions applicable to specific firmware versions. nvfw-dgx2_19.03.1 $ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:19.03.1 update_fw SBIOS Following components will be updated with new firmware version: IMPORTANT: Firmware update is disruptive and may require system reboot.
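The update command follows a fixed pattern, so it can be assembled from the image name and target and reviewed before it is run. The sketch below is a dry run only; the function name `fw_update_cmd` and its optional third argument `auto` are hypothetical conveniences, while the docker options themselves are the ones shown in this chapter.

```shell
# Hypothetical dry-run helper: print (do not execute) the firmware update
# command for a container image and target. Passing "auto" as the third
# argument adds the -e auto=1 option shown in the command syntax.
fw_update_cmd() {
  image="$1"
  target="$2"
  auto_opt=""
  if [ "${3:-}" = "auto" ]; then
    auto_opt=" -e auto=1"
  fi
  printf 'sudo docker run --rm%s --privileged -ti -v /:/hostfs %s update_fw %s\n' \
    "$auto_opt" "$image" "$target"
}

fw_update_cmd nvfw-dgx2:19.03.1 SBIOS
```

Printing the command first gives a chance to confirm the container version against the release notes, which matters for DGX-2H systems.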
Page 78
update-backup-bmc Note: The --update-backup-bmc option is available only with firmware update container version 19.12.1 and later. $ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:19.12.1 update_fw BMC --update-backup-bmc Note: The ability to update the secondary SBIOS using the firmware update container is available only with firmware update container version 19.12.1 and later.
Page 81
Note: Depending on the BMC firmware version, the following quick links may appear: • Maintenance->Firmware Update • Settings->NvMeManagement->NvMe P3700Vpd Info Do not access these tasks using the Quick Links dropdown menu, as the resulting pages are not fully functional.
Page 83
IMPORTANT: While you can update the BMC firmware from this page, NVIDIA recommends using the NVIDIA Firmware Update Container instead (see section Updating Firmware for instructions). Do not update from versions earlier than 01.04.02 using the BMC UI, as the sensor data record (SDR) is erroneously preserved which can result in the BMC UI reporting a critical 3V Battery sensor error.
Page 84
It is strongly recommended that you create a unique password as soon as possible. https://<bmc-ip-address>/.
Page 87
Note: NVIDIA KVM is also supported on the NVIDIA DGX-2H. References to DGX-2 in this chapter also apply to DGX-2H.
Page 88
Note: Unlike the bare-metal DGX-2 system or the KVM host OS, the guest VM OS is configured for English only, with no option to switch to languages such as Chinese. To set up a guest VM for a different language, install the appropriate language pack onto the guest VM.
Page 89
Performance Tuning section of the DGX Best Practices guide • • • https://linux.die.net/man/1/virsh nvidia-vm nvidia-vm nvidia-vm sudo nvidia-vm --help...
Page 90
Note: Using nvidia-vm requires root or sudo privilege. This includes deleting VMs, running health-check, or other operations. sudo apt-get update sudo apt install -y dgx-bionic-updates-repo sudo apt update sudo apt full-upgrade -y sudo apt-cache policy dgx-kvm-image* dgx-kvm-sw sudo apt-get install dgx-kvm-sw <dgx-kvm-image-x-y-z>...
Page 91
CAUTION: Reverting the server back to a bare metal system destroys all guest GPU VMs that were created as well as any data. Be sure to save your data before removing the KVM software. sudo apt-get purge --auto-remove dgx-kvm-sw sudo reboot nvidia-vm...
Page 92
my-lab-vm2-1g0 : This VM is assigned 1 GPU from index 0 my-lab-vm3-4g8-11 : This VM is assigned 4 GPUs from index 8 through 11 About nvidia-vm nvidia-vm Syntax sudo nvidia-vm create --gpu-count N --gpu-index X [--image] [options] where --gpu-count --gpu-...
Managing the Images --user-data Using cloud-init to Initialize the Guest VM --meta-data Using cloud-init to Initialize the Guest VM [options] Command Help: [sudo] nvidia-vm create --help sudo sudo Command Examples: sudo nvidia-vm create --gpu-count 4 --gpu-index 12...
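The VM names on this page end in a suffix of the form <count>g<first>[-<last>], which encodes the GPU assignment. The sketch below decodes that suffix. It is an illustration based only on the two example names shown here; the function name `vm_gpu_assignment` is hypothetical and is not an nvidia-vm subcommand.

```shell
# Hypothetical helper: decode the trailing "<count>g<first>[-<last>]" part
# of a VM name into the GPU count and index range.
vm_gpu_assignment() {
  # Keep only the suffix after the last dash that precedes the GPU spec.
  spec=$(printf '%s\n' "$1" |
    sed -n 's/.*-\([0-9][0-9]*g[0-9][0-9]*\(-[0-9][0-9]*\)\{0,1\}\)$/\1/p')
  [ -n "$spec" ] || return 1      # name does not follow the convention
  count=${spec%%g*}               # digits before "g"
  range=${spec#*g}                # "<first>" or "<first>-<last>"
  first=${range%%-*}
  last=${range##*-}               # equals first when there is no dash
  printf 'count=%s first=%s last=%s\n' "$count" "$first" "$last"
}

vm_gpu_assignment my-lab-vm2-1g0      # prints: count=1 first=0 last=0
vm_gpu_assignment my-lab-vm3-4g8-11   # prints: count=4 first=8 last=11
```

A name that lacks the suffix makes the function return a non-zero status, which is convenient when filtering a list of VM names.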
Page 94
2g8-9 IMPORTANT: A value of 0x0 for the domain name is not supported. sudo nvidia-vm create --gpu-count 2 --gpu-index 2 --image dgx-kvm-image-4-1-1 rootTue1308-2g2-3 dgx-kvm-image-4-1-1 Note: If you encounter the following message when creating a VM, Error setting up logfile: No write access to directory /home/$USER/.cache/virt-manager...
Page 95
--user-data <cloud-config> --meta-data <meta-data file> $ nvidia-vm create --gpu-count <#> --verbose --user-data /home/lab/cloud-config --meta-data /home/lab/instance-data.json Using Cloud-init Releases the CPUs, memory, GPUs, and NVLink Retains allocation of the OS and data disks Note: Since allocation of the OS and data disks are retained, the creation of other VMs is still impacted by the shut-down VM.
Page 98
Creating a new user account sudo useradd -m <new-username> -p <new-password> Deleting the nvidia user account deluser -r nvidia sudo usermod -a -G libvirt <new-username> sudo usermod -a -G libvirt-qemu <new-username> Using cloud-init to Initialize the Guest VM...
Page 99
VM may not work properly. To keep guest VMs running uninterrupted, save the KVM source image to another location before uninstalling it. About nvidia-vm. nvidia-vm Syntax sudo nvidia-vm image [options] Command Help sudo nvidia-vm image --help apt-cache policy dgx-kvm-image*...
Page 100
Syntax apt show <kvm-image> Example apt show dgx-kvm-image-4-1-1 <snip> Description: NVIDIA DGX bionic KVM hard disk image DGX BaseOS image for KVM OS Version: Ubuntu 18.04 Kernel Version: 4.15.0-47.50 Nvidia Driver Version: 418.67 Nvidia Docker Version: 2.0.3+docker18.09.4-1 Nvidia Container Runtime Version: 2.0.0+docker18.09.4-1 Libnvidia Container Version: 1.0.2-1...
Page 101
Ok to remove image package "dgx-kvm-image-x-y-z"? (y/N) : x-y-z IMPORTANT: If you uninstall KVM images without converting the system back to bare metal (for example, to recover space on the hypervisor or to upgrade to a newer image), then you should make a copy of the image first.
Page 102
/dev/vda1 /dev/vdb1 /raid Resource Allocation Show storage pool $ virsh pool-list Name State Autostart ------------------------------------------- dgx-kvm-pool active Create a VM: $ sudo nvidia-vm create --gpu-count 1 --gpu-index 0...
Page 103
52:54:00:16:b9:ff 10.120.28.219/24 Viewing the Volume from the DGX-2 KVM Host $ virsh vol-list dgx-kvm-pool --details Name Path Type Capacity Allocation ----------------------------------------------------------------------------------------------- vol-dgx2vm-rootTue1616-1g0 /raid/dgx-kvm/vol-dgx2vm-labTue1616-1g0 file 1.74 TiB 3.71 GiB Viewing the Data Volume from the Guest VM...
Page 105
IMPORTANT: Updating the DGX OS software may overwrite the associated KVM image. Guest VMs created from this older image will no longer be available. To keep guest VMs, save the older KVM image to another location and then restore the image after updating the DGX OS.
Page 106
IMPORTANT: A KVM guest VM runs a thin-provisioned copy of the source image. If the source image is ever uninstalled, the guest VM may not work properly. To keep guest VMs running uninterrupted, save the KVM source image to another location before uninstalling it.
Page 110
$ dcgmi health --host <vm-ip-address> --check +--------------------------+--------------------------------------+ | Health Monitor Report +==========================+======================================+ | Overall Health | Healthy +--------------------------+--------------------------------------+ sudo nvidia-vm create --gpu-count 8 --gpu-index 8 ERROR: GPU 12 is in unexpected state "missing", can't use it - BDF:e0:00.0 SXMID:13 UUID:GPU-b7187786-d894-2266-d11d-21124dc61dd3...
Page 111
ERROR: GPU 13 is in unexpected state "missing", can't use it - BDF:e2:00.0 SXMID:16 UUID:GPU-9a6a6a52-c6b6-79c3-086b-fcf2d5b1c87e ERROR: 2 GPU's are unavailable, unable to start this VM "dgx2vm-labMon1559-8g8-15" If you attempt to launch a VM with a failed GPU before the system has ...
Page 113
$ virsh net-list Name State Autostart Persistent ---------------------------------------------------------- macvtap-net active private-net active $ virsh domifaddr <vm-name> --source agent $ virsh domifaddr 1gpu-vm-1g2 --source agent Name MAC address Protocol Address ----------------------------------------------------------------- 00:00:00:00:00:00 ipv4 127.0.0.1/8 ipv6 ::1/128 enp1s0 52:54:00:3c:07:62 ipv4 10.120.28.227/24 ipv6 fe80::5054:ff:fe3c:762/64 docker0 02:42:9f:5c:39:da...
Page 114
Getting GPU Health Information from Within the VM :~$ sudo nvme list Node Model Namespace Usage Format FW Rev ------------ -------------- -------------------------- -- -------------------- ---------- -------- /dev/nvme0n1 S2X6NX0K501953 SAMSUNG MZ1LW960HMJP-00003 1 61.79 GB / 960.20 GB 512 B + 0 B CXV8601Q <snip>...
Page 115
Smart Log for NVME device:nvme9n1 namespace-id:ffffffff critical_warning <snip> ... critical_warning $ sudo mdadm -S -D /dev/md0 /dev/md0: Version : 1.2 Creation Time : Tue Aug 13 08:23:52 2019 Raid Level : raid1 Array Size : 937034752 (893.63 GiB 959.52 GB) Used Dev Size : 937034752 (893.63 GiB 959.52 GB) Raid Devices : 2 Total Devices : 2...
Page 116
$ virsh pool-list --details Name State Autostart Persistent Capacity Allocation Available ---------------------------------------------------------------------------------- dgx-kvm-pool running 27.83 TiB 171.71 GiB 27.66 TiB images running 878.57 GiB 19.62 GiB 858.95 GiB From within the guest VM, run the following command. :~# lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT...
Page 121
nvmecli Do not use a root file system Execute a shell in the installer environment $ sudo nvme list Node Model Namespace Usage Format FW Rev...
Page 122
------------ -------- -------------- --------- -------------------- ------------ ----------- /dev/nvme0n1 Sxxxxxxx Samsung MZxxxx 88.99 GB / 960.20 GB B + 0 B CXV8501Q /dev/nvme1n1 Sxxxxxxx Samsung MZxxxx 90.11 GB / 960.20 GB B + 0 B CXV8501Q /dev/nvme2n1 18xxxxxx Micron_9200_xx 3.84 TB / 3.84 TB B + 0 B 101008R0...
Page 123
CAUTION: This process destroys all data and software customizations that you have made on the DGX-2 System. Be sure to back up any data that you want to preserve and push any Docker images that you want to keep to a trusted registry.
Page 124
Download the image ISO file. Restoring the DGX-2 Software Image https://docs.nvidia.com/dgx/dgx-os-server-release-notes/index.html#dgx-os-release-number-scheme https://docs.nvidia.com/dgx/pdf/DGX-OS-server-4.1-relnotes-update-guide.pdf Note: These procedures apply only to upgrades within the same major release, such as 4.x → 4.y. They do not support upgrades across major releases, such as 3.x → 4.x.
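The same-major-release rule can be checked mechanically before starting an upgrade. This is a minimal sketch: the function name `same_major` is hypothetical, and it assumes release numbers are dot-separated with the major number first, as in the 4.x and 3.x examples above.

```shell
# Hypothetical helper: succeeds when two dotted release numbers share the
# same major component (the part before the first dot).
same_major() {
  [ "${1%%.*}" = "${2%%.*}" ]
}

# Example: decide whether an in-place upgrade applies.
if same_major "4.0.5" "4.1.0"; then
  echo "upgrade within the same major release"
else
  echo "crossing a major release: these procedures do not apply"
fi
```

For the example values the first branch is taken, since both releases begin with 4.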
Page 125
# DGX specific repositories: deb http://international.download.nvidia.com/dgx/repos/bionic bionic main restricted universe multiverse...
Page 126
- updates main restricted universe multiverse deb http://international.download.nvidia.com/dgx/repos/bionic bionic-r418+cuda10.1 main multiverse restricted universe deb-i386 http://international.download.nvidia.com/dgx/repos/bionic bionic main restricted universe multiverse deb-i386 http://international.download.nvidia.com/dgx/repos/bionic bionic-updates main restricted universe multiverse # Only for DGX OS 4.1.0 deb-i386 http://international.download.nvidia.com/dgx/repos/bionic bionic-r418+cuda10.1 main multiverse restricted universe # Clean unused items clean http://archive.ubuntu.com/ubuntu...
Page 129
/etc/apt/sources.list.d/dgx-bionic-r450-cuda11-0-repo.list /etc/apt/sources.list.d/dgx-bionic-r450-cuda11-0-repo.list $ sudo apt install cuda-toolkit-11-0 Note: If you did not configure apt to use the NVIDIA DGX OS packages in the file /etc/apt/sources.list.d/dgx-bionic-r450-cuda11-0-repo.list, omit this step. If you try to install CUDA Toolkit 11.0, the attempt fails...
Page 131
https://cloudinit.readthedocs.io/en/latest/topics/examples.html name The user’s login name. The default file contains a dummy value which must be replaced with your own. primary_group Define the primary group. Defaults to a new group created named after the user. The default file contains a dummy value which must be replaced with your own. groups Optional.
Page 135
Clean, dry, and free of airborne particles (other than normal room dust). Well-ventilated and away from sources of heat including direct sunlight and radiators. Away from sources of vibration or physical shock. In regions that are susceptible to electrical storms, we recommend you plug your system into a surge suppressor and disconnect telecommunication lines to your modem during an electrical storm.
Page 136
Do not attempt to modify or use the AC power cord(s) if they are not the exact type required to fit into the grounded electrical outlets. The power cord(s) must meet the following criteria:...
Page 137
Turn off all peripheral devices connected to this product. Turn off the system by pressing the power button to off. Disconnect the AC power by unplugging all AC power cords from the system or wall outlet. Disconnect all cables and telecommunication lines that...
Page 140
Check first to make sure you have not left loose tools or parts inside the system. Check that cables, add-in cards, and other components are properly installed. Attach the covers to the chassis according to the product instructions.
Page 141
Federal Communications Commission (FCC) FCC Marking (Class A) NOTE: This equipment has been tested and found to comply with the limits for a Class A digital device, pursuant to part 15 of the FCC Rules. These limits are designed to provide reasonable protection against harmful interference when the equipment is operated in a commercial environment.
CAN ICES-3(A)/NMB-3(A) The Class A digital apparatus meets all requirements of the Canadian Interference-Causing Equipment Regulation. Cet appareil numerique de la class A respecte toutes les exigences du Reglement sur le materiel brouilleur du Canada. European Conformity; Conformité Européenne (CE) This is a Class A product.
Page 143
A Japanese regulatory requirement, defined by specification JIS C 0950, 2008, mandates that manufacturers provide Material Content Declarations for certain categories of electronic products offered for sale after July 1, 2006. To view the JIS C 0950 material declaration for this product, visit www.nvidia.com...
Page 145
A Japanese regulatory requirement, defined by specification JIS C 0950: 2008, mandates that manufacturers provide Material Content Declarations for certain categories of electronic products offered for sale after July 1, 2006. Product Model Number: DGX-2 Symbols of Specified Chemical Substance...
Page 146
China RoHS Material Content Declaration 产品中有害物质的名称及含量 The Table of Hazardous Substances and their Content 根据中国《电器电子产品有害物质限制使用管理办法》 as required by China’s Management Methods for Restricted of Hazardous Substances Used in Electrical and Electronic Products 有害物质 Hazardous Substances 部件名称 Parts 汞 镉 六价铬 多溴联苯...