Nvidia DGX-2 User Manual

Summary of Contents for Nvidia DGX-2

  • Page 9: Hardware Overview
  • Page 10 Chapter 7: Special Features and Configuration; Chapters 8-10: Software and firmware update instructions; Chapter 11: How to use the BMC; Chapter 12: How to configure and use the DGX-2 System as a Kernel Virtual Machine host...
  • Page 12 Note: The DGX-2 will not operate with fewer than three PSUs. WARNING: To avoid electric shock or fire, do not connect other power cords to the DGX-2. For more details, see B.6. Electrical Precautions.
  • Page 14 IMPORTANT: See the section Turning the DGX-2 On and Off for instructions on how to properly turn the system on or off.
  • Page 16 enp134s0f0 enp134s0f1...
  • Page 17 enp6s0 enp134s0f0 enp134s0f1...
  • Page 18 Mellanox ConnectX-5 Firmware Download...
  • Page 19 enp134s0f0 enp134s0f1...
  • Page 20 Note: Some of the documentation listed below is not available at the time of publication. See https://docs.nvidia.com/dgx/ for the latest status...
  • Page 21 NVIDIA Enterprise Support portal enterprisesupport@nvidia.com. NVIDIA Enterprise Support Phone Numbers...
  • Page 22 DGX OS Server software installs Docker CE which uses the 172.17.xx.xx subnet by default for Docker containers. If the DGX-2 System is on the same subnet, you will not be able to establish a network connection to the DGX-2 System.
  • Page 23 DGX-2 Server Front DGX-2 Server Back...
  • Page 24 Configuring Static IP Address for the BMC https://<bmc-ip-address>/...
  • Page 26 Network Ports Configuring Static IP Addresses for the Network Ports...
  • Page 27 Connecting to the DGX-2 Console...
  • Page 28 CAUTION: Once you create your login credentials, the default admin/admin login will no longer work. Note: The BMC software will not accept "sysadmin" for a user name. If you create this user name for the system login, "sysadmin" will not be available for logging in to the BMC.
  • Page 29 Note: After you select the primary network interface, the system attempts to configure the interface for DHCP and then asks you to enter a hostname for the system. If DHCP is not available, you will have the option to configure the network manually. If you need to configure a static IP address on a network interface connected to a DHCP Cancel Network configuration –...
  • Page 30 NVIDIA GPU Cloud for DGX https://www.nvidia.com/en-us/support/enterprise/...
  • Page 31 IMPORTANT: It is mandatory that your DGX-2 System be installed by NVIDIA service personnel or a trained Advanced Technology Program (ATP) installation partner. If not performed by NVIDIA or an ATP partner, your DGX-2 hardware warranty will be voided. https://docs.nvidia.com/dgx/ngc-registry-for-dgx-user-guide/...
  • Page 32 WARNING: Risk of Danger - Removing power cables or using Power Distribution Units (PDUs) to shut off the system while the Operating System is running may cause damage to sensitive components in the DGX-2 server. $ sudo nvsm show health $ sudo docker --version Docker version 18.03-ce...
  • Page 33   nvidia-docker2  Note: If Docker is updated to 19.03 on a system which already has the nvidia-docker2 package installed, then the instructions for using the NVIDIA Container Runtime for Docker can still be used.  docker run --gpus •...
  • Page 34 $ docker run ... CAUTION: If you build Docker images while nvidia is set as the default runtime, make sure the build scripts executed by the Dockerfile specify the GPU architectures that the container will need. Failure to do so may result in the...
  • Page 35 Instructions for specifying the GPU architecture depend on the application and are beyond the scope of this document. Consult the specific application build process for guidance.
  • Page 36    /etc/environment http_proxy="http://<username>:<password>@<host>:<port>/" ftp_proxy="ftp://<username>:<password>@<host>:<port>/"; https_proxy="https://<username>:<password>@<host>:<port>/"; no_proxy="localhost,127.0.0.1,localaddress,.localdomain.com" HTTP_PROXY="http://<username>:<password>@<host>:<port>/" FTP_PROXY="ftp://<username>:<password>@<host>:<port>/"; HTTPS_PROXY="https://<username>:<password>@<host>:<port>/"; NO_PROXY="localhost,127.0.0.1,localaddress,.localdomain.com"...
  • Page 37 http_proxy="http://myproxy.server.com:8080/" ftp_proxy="ftp://myproxy.server.com:8080/"; https_proxy="https://myproxy.server.com:8080/"; /etc/apt/apt.conf.d/myproxy Acquire::http::proxy "http://<username>:<password>@<host>:<port>/"; Acquire::ftp::proxy "ftp://<username>:<password>@<host>:<port>/"; Acquire::https::proxy "https://<username>:<password>@<host>:<port>/"; Acquire::http::proxy "http://myproxy.server.com:8080/"; Acquire::ftp::proxy "ftp://myproxy.server.com:8080/"; Acquire::https::proxy "https://myproxy.server.com:8080/"; https://docs.docker.com/engine/admin/systemd/#http-proxy 172.17.0.0/16 If your network does not conflict with the default Docker IP address range, then no changes are needed and you can skip this section.
  • Page 38 sudo vi /etc/systemd/system/docker.service.d/docker-override.conf [Service] ExecStart= ExecStart=/usr/bin/dockerd -H fd:// -s overlay2 LimitMEMLOCK=infinity LimitSTACK=67108864 [Service] ExecStart= ExecStart=/usr/bin/dockerd -H fd:// -s overlay2 --bip=192.168.127.1/24 --fixed-cidr=192.168.127.128/25 LimitMEMLOCK=infinity LimitSTACK=67108864 /etc/systemd/system/docker.service.d/docker-override.conf sudo systemctl daemon-reload sudo systemctl restart docker...
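Before editing the Docker override, it can help to confirm that the host network really overlaps Docker's default 172.17.0.0/16 range and that the replacement values are self-consistent. The following is a minimal sketch using only Python's standard `ipaddress` module; the function names are hypothetical helpers, not part of DGX OS or Docker, and the sample addresses are the ones quoted on these pages.

```python
import ipaddress

# Default Docker bridge range quoted in the manual.
DOCKER_DEFAULT = ipaddress.ip_network("172.17.0.0/16")


def conflicts_with_docker(host_cidr: str) -> bool:
    """Return True if the host's network overlaps Docker's default range.

    host_cidr is the host address with prefix length, e.g. "10.10.10.2/24".
    """
    host_net = ipaddress.ip_interface(host_cidr).network
    return host_net.overlaps(DOCKER_DEFAULT)


def override_is_consistent(bip: str, fixed_cidr: str) -> bool:
    """Return True if the --fixed-cidr range lies inside the --bip network."""
    bridge_net = ipaddress.ip_interface(bip).network
    return ipaddress.ip_network(fixed_cidr).subnet_of(bridge_net)


if __name__ == "__main__":
    print(conflicts_with_docker("172.17.4.20/24"))
    print(conflicts_with_docker("10.10.10.2/24"))
    print(override_is_consistent("192.168.127.1/24", "192.168.127.128/25"))
```

If the first check reports no overlap, the override file is not needed, which matches the manual's advice to skip the section in that case.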
  • Page 39       $ wget https://nvcr.io/v2 --2018-08-01 19:42:58-- https://nvcr.io/v2 Resolving nvcr.io (nvcr.io)... 52.8.131.152, 52.9.8.8 Connecting to nvcr.io (nvcr.io)|52.8.131.152|:443... connected. HTTP request sent, awaiting response... 401 Unauthorized   ...
  • Page 40  Note: If you cannot access the DGX-2 System remotely, then connect a display (1440x900 or lower resolution) and keyboard directly to the DGX-2 System. sudo ipmitool lan print 1 Set the IP address source to static. sudo ipmitool lan set 1 ipsrc static Set the appropriate address information.
  • Page 44  Note: If you cannot access the DGX-2 System remotely, then connect a display (1440x900 or lower resolution) and keyboard directly to the DGX-2 System.
  • Page 45 enp134s0f0 enp134s0f1 enp6s0

    $ sudo vi /etc/netplan/01-netcfg.yaml

    network:
      version: 2
      renderer: networkd
      ethernets:
        <port-designation>:
          dhcp4: no
          dhcp6: no
          addresses: [10.10.10.2/24]
          gateway4: 10.10.10.1
          nameservers:
            search: [<mydomain>, <other-domain>]
            addresses: [10.10.10.1, 1.1.1.1]

    $ sudo netplan apply...
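Because netplan files are indentation-sensitive YAML, generating the text from a few parameters can reduce typing mistakes. The sketch below is illustrative only: `render_netplan` is a hypothetical helper, it emits the same keys shown in the manual's example for a single port, and the subnet check on the gateway is an added sanity test rather than something netplan itself requires.

```python
import ipaddress
from typing import List


def render_netplan(port: str, address: str, gateway: str,
                   search: List[str], dns: List[str]) -> str:
    """Return netplan YAML text for one statically addressed port."""
    iface = ipaddress.ip_interface(address)
    # A gateway outside the port's own subnet is almost always a typo.
    if ipaddress.ip_address(gateway) not in iface.network:
        raise ValueError(f"gateway {gateway} is not in {iface.network}")
    lines = [
        "network:",
        "  version: 2",
        "  renderer: networkd",
        "  ethernets:",
        f"    {port}:",
        "      dhcp4: no",
        "      dhcp6: no",
        f"      addresses: [{iface.with_prefixlen}]",
        f"      gateway4: {gateway}",
        "      nameservers:",
        f"        search: [{', '.join(search)}]",
        f"        addresses: [{', '.join(dns)}]",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Values taken from the manual's example; the search domain is a placeholder.
    print(render_netplan("enp6s0", "10.10.10.2/24", "10.10.10.1",
                         ["example.com"], ["10.10.10.1", "1.1.1.1"]))
```

The printed text could be reviewed and then saved as /etc/netplan/01-netcfg.yaml before running `sudo netplan apply`.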
  • Page 46 Note: If you are not returned to the command line prompt after a minute, then reboot the system. https://help.ubuntu.com/lts/serverguide/network-configuration.html.en sudo mst start sudo mst status • MST modules: ------------ MST PCI module is not loaded MST PCI configuration module is not loaded •...
  • Page 47 MST devices: ------------ /dev/mst/mt4119_pciconf0 - PCI configuration cycles access. domain:bus:dev.fn=0000:35:00.0 addr.reg=88 data.reg=92 Chip revision is: 00 /dev/mst/mt4119_pciconf1 - PCI configuration cycles access. domain:bus:dev.fn=0000:3a:00.0 addr.reg=88 data.reg=92 Chip revision is: 00 /dev/mst/mt4119_pciconf2 - PCI configuration cycles access. domain:bus:dev.fn=0000:58:00.0 addr.reg=88 data.reg=92 Chip revision is: 00 /dev/mst/mt4119_pciconf3 - PCI configuration cycles access.
  • Page 48 $ sudo mlxconfig query | egrep -e Device\|LINK_TYPE Device #1: Device type: ConnectX5 Device: 0000:bd:00.0 LINK_TYPE_P1 IB(1) Device #2: Device type: ConnectX5 Device: 0000:b8:00.0 LINK_TYPE_P1 IB(1) Device #3: Device type: ConnectX5 Device: 0000:3a:00.0 LINK_TYPE_P1 IB(1) Device #4: Device type: ConnectX5 Device: 0000:e1:00.0 LINK_TYPE_P1...
  • Page 49 mst status /dev/mst/mt4119_pciconf5. Starting the Mellanox Software Tools ~$ sudo mlxconfig -y -d /dev/mst/mt4119_pciconf0 set LINK_TYPE_P1=2 ~$ sudo mlxconfig -y -d /dev/mst/mt4119_pciconf1 set LINK_TYPE_P1=2 ~$ sudo mlxconfig -y -d /dev/mst/mt4119_pciconf2 set LINK_TYPE_P1=2 ~$ sudo mlxconfig -y -d /dev/mst/mt4119_pciconf3 set LINK_TYPE_P1=2 ~$ sudo mlxconfig -y -d /dev/mst/mt4119_pciconf4 set LINK_TYPE_P1=2 ~$ sudo mlxconfig -y -d /dev/mst/mt4119_pciconf5 set LINK_TYPE_P1=2 ~$ sudo mlxconfig -y -d /dev/mst/mt4119_pciconf6 set LINK_TYPE_P1=2...
  • Page 50 Device: 0000:35:00.0 LINK_TYPE_P1 ETH (1) Device #6: Device type: ConnectX5 Device: 0000:5d:00.0 LINK_TYPE_P1 ETH (1) Device #7: Device type: ConnectX5 Device: 0000:e6:00.0 LINK_TYPE_P1 ETH (1) Device #8: Device type: ConnectX5 Device: 0000:58:00.0 LINK_TYPE_P1 ETH (1) Device #9: Device type: ConnectX5 Device: 0000:86:00.0 LINK_TYPE_P1...
  • Page 51 Device: 0000:b8:00.0 LINK_TYPE_P1 IB(1) Device #3: Device type: ConnectX5 Device: 0000:3a:00.0 LINK_TYPE_P1 IB(1) Device #4: Device type: ConnectX5 Device: 0000:e1:00.0 LINK_TYPE_P1 IB(1) Device #5: Device type: ConnectX5 Device: 0000:35:00.0 LINK_TYPE_P1 IB(1) Device #6: Device type: ConnectX5 Device: 0000:5d:00.0 LINK_TYPE_P1 IB(1) Device #7: Device type: ConnectX5...
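With several ConnectX-5 ports, it is easy to lose track of which ones still need their link type switched. As a rough sketch, the filtered `mlxconfig query` output shown on these pages can be parsed into a mapping from PCI address to link type. The regular expression below is an assumption based only on the sample output reproduced here (it tolerates both `IB(1)` and `ETH (1)` spellings), and both function names are hypothetical.

```python
import re
from typing import Dict, List

# Matches "Device: <pci-address>" followed by "LINK_TYPE_P1 <NAME>(<n>)".
# "Device #1:" and "Device type:" lines do not match because of the colon position.
_PORT_RE = re.compile(r"Device:\s+(\S+)\s+LINK_TYPE_P1\s+([A-Z]+)\s*\((\d+)\)")


def parse_link_types(query_output: str) -> Dict[str, str]:
    """Map each PCI address to the link type name reported for port 1."""
    return {addr: name for addr, name, _num in _PORT_RE.findall(query_output)}


def ports_not_set_to(links: Dict[str, str], wanted: str) -> List[str]:
    """Return the PCI addresses whose link type differs from `wanted`."""
    return [addr for addr, name in links.items() if name != wanted]


if __name__ == "__main__":
    sample = (
        "Device #1: Device type: ConnectX5 Device: 0000:bd:00.0 LINK_TYPE_P1 IB(1) "
        "Device #5: Device type: ConnectX5 Device: 0000:35:00.0 LINK_TYPE_P1 ETH (1)"
    )
    links = parse_link_types(sample)
    print(links)
    print(ports_not_set_to(links, "ETH"))
```

The sample string is stitched together from lines quoted in the manual; real output is spread over several lines, which the whitespace-tolerant pattern also accepts.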
  • Page 52 Configure an NFS mount for the DGX-2 System. Edit the filesystem tables configuration. sudo vi /etc/fstab Add a new line for the NFS mount, using the local mount point of /mnt.
  • Page 53 Verify the NFS server is reachable. ping <nfs_server> Mount the NFS export. sudo mount /mnt Verify caching is enabled. cat /proc/fs/nfsfs/volumes...
  • Page 54  Notes: MaxQ is supported on DGX-2 systems with BMC firmware version 1.04.03 or later. MaxQ is not supported on DGX-2H systems. Commands to switch to MaxP or MaxQ, or to see the current power state, are not supported on DGX-2H systems.
  • Page 55     $ sudo nvsm set powermode=maxp $ sudo nvsm show chassis/localhost...
  • Page 56 /usr/sbin/dgx-kdump-config enable-dmesg-dump /usr/sbin/dgx-kdump-config enable-vmcore-dump /usr/sbin/dgx-kdump-config disable $ ipmitool -I lanplus -H <bmc-ip-address> -U <BMC-USERNAME> -P <BMC-PASSWORD> sol activate...
  • Page 57    Managing CPU Mitigations ~$ cat /sys/devices/system/cpu/vulnerabilities/*  Mitigation...
  • Page 58 KVM: Mitigation: Split huge pages Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable Mitigation: Clear CPU buffers; SMT vulnerable Mitigation: PTI Mitigation: Speculative Store Bypass disabled via prctl and seccomp Mitigation: usercopy/swapgs barriers and __user pointer sanitization Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling Mitigation: Clear CPU buffers;...
  • Page 59 $ sudo apt purge nv-mitigations-off $ cat /sys/devices/system/cpu/vulnerabilities/* $ nvidia-smi -q |egrep "GPU 00000000|GPU UUID" GPU 00000000:34:00.0 GPU UUID : GPU-8196613d-54af-3ef5-60e7-046d9a4783cf GPU 00000000:36:00.0 GPU UUID : GPU-be8c9757-6874-2926-c5ac-366dc32147a4 ..$ nvidia-smi -i 3 -q | grep UUID /etc/modprobe.d/nvidia.conf options nvidia NVreg_GpuBlacklist=<gpu-uuid>...
  • Page 60 $ sudo update-initramfs -u $ dracut --force /boot/initramfs-$(uname -r).img $(uname -r) $ sudo reboot...
  • Page 61 Obtaining the DGX-2 Software ISO Image and Checksum File • Re-Imaging the System Remotely • Creating a Bootable Installation Medium Re-Imaging the System From a USB Flash Drive Note: The DGX OS Server software is restored on one of the two NVMe M.2 drives.
  • Page 62 DGX-2 NVIDIA Enterprise Support Re-Imaging the System from a USB Flash Drive DGX-2 Obtaining the DGX-2 Software ISO Image and Checksum File Install DGX Server without formatting RAID. Retaining the RAID Partition While Installing the OS...
  • Page 63  Note: The Mellanox InfiniBand driver installation may take up to 10 minutes. Setting Up the DGX-2 System  Note: If you are restoring the software image remotely through the BMC, you do not need a bootable installation medium and you can omit this task.
  • Page 64 Note: To ensure that the resulting flash drive is bootable, use the dd command to perform a device bit copy of the image. If you use other commands to perform a simple file copy of the image, the resulting flash drive may not be bootable.
  • Page 65 Akeo Reliable USB Formatting Utility (Rufus) DGX-2  DGX-2  Akeo Reliable USB Formatting Utility (Rufus) Start.
  • Page 66 Write in DD Image mode Re-Imaging the System Remotely DGX-2 Install DGX Server without formatting RAID. Retaining the RAID Partition While Installing the OS Enter.  Note: The Mellanox InfiniBand driver installation may take up to 10 minutes.
  • Page 67 Setting Up the DGX-2 System Install DGX Software Install DGX Server without formatting RAID RUN=yes /etc/default/cachefilesd /raid /etc/fstab /raid cachefilesd /etc/default/cachefilesd. /etc/fstab. sudo mount /raid systemctl start cachefilesd...
  • Page 68 $ wget -O f1-changelogs http://changelogs.ubuntu.com/meta-release-lts $ wget -O f2-archive http://archive.ubuntu.com/ubuntu/dists/bionic/Release $ wget -O f3-usarchive http://us.archive.ubuntu.com/ubuntu/dists/bionic/Release $ wget -O f4-security http://security.ubuntu.com/ubuntu/dists/bionic/Release $ wget -O f5-download http://download.docker.com/linux/ubuntu/dists/bionic/Release $ wget -O f6-international http://international.download.nvidia.com/dgx/repos/bionic/dists/bionic/Release...
  • Page 69 wget CAUTION: These instructions update all software for which updates are available from your configured software sources, including applications that you installed yourself. If you want to prevent an application from being updated, you can instruct the Ubuntu package manager to keep the current version. For more information, see Introduction to Holding Packages on the Ubuntu Community Help Wiki.
  • Page 70 IMPORTANT: DGX-2H is supported only with firmware container nvfw-dgx2:19.03.1 or later. Do not update the DGX-2H firmware using an earlier container as this will result in version mismatch with the DGX-2H. DGX-2 System Firmware Update Container Release Notes  • •...
  • Page 71 progress. A high workload can disrupt the firmware update process and result in an unusable component. When initiating an update, the update software assists in determining the activity state of the DGX system and provides a warning if it detects that activity levels are above a predetermined threshold.
  • Page 72 <package-name>.tar.gz <run-file-name>.run Using the .run File sudo docker load -i <package-name>.tar.gz sudo docker images nvfw-dgx2_18.09.3.tar.gz REPOSITORY IMAGE ID CREATED SIZE nvfw-dgx2_18.09.3 latest aa681a4ae600 1 hours ago 278MB nvfw_dgx2_version nvfw_dgx2:tag, nvfw-dgx2_19.03.1.tar.gz REPOSITORY IMAGE ID CREATED SIZE nvfw-dgx2 19.03.1 fec80ce658ef 1 hours ago 532MB sudo docker run --rm --privileged -ti -v /:/hostfs <image-name>...
  • Page 73: Additional Options

    sudo docker run --privileged -v /:/hostfs <image-name> show_version sudo docker run --rm [-e auto=1] --privileged -ti -v /:/hostfs <image-name> update_fw [-f] <target> <target> SBIOS Note: Other components may be supported beyond those listed here. Query the firmware manifest to see all the components supported by the container. Additional Options [-e auto=1 [-f]...
  • Page 74 Following components will be updated with new firmware version: SBIOS IMPORTANT: Firmware update is disruptive and may require system reboot. Stop system activities before performing the update. Ok to proceed with firmware update? <Y/N>  Note: While the progress output shows the current and manifest firmware versions, the versions may be truncated due to space limitations.
  • Page 75 Note: Be sure to consult the NVIDIA DGX-2 Firmware Update Container release notes for special instructions applicable to specific firmware versions. nvfw-dgx2_19.03.1 $ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:19.03.1 update_fw SBIOS Following components will be updated with new firmware version: IMPORTANT: Firmware update is disruptive and may require system reboot.
  • Page 76 sudo docker run --rm --privileged -ti -v /:/hostfs <image-name> update_fw -f <target> (-ti -e auto=1 $ sudo docker run -e auto=1 --rm --privileged -t -v /:/hostfs <image-name> update_fw <target> $ sudo docker run --rm --privileged -v /:/hostfs <image-name> show_fw_manifest...
  • Page 77 $ sudo docker rmi -f <image-name> $ chmod +x /<run-file-name>.run $ sudo ./<run-file-name>.run update_fw all  $ sudo ./<run-file-name>.run show_fw_manifest  $ sudo ./<run-file-name>.run show_version  $ sudo ./<run-file-name>.run update_fw <target>  $ sudo ./<run-file-name>.run update_fw -f <target>...
  • Page 78 update-backup-bmc Note: The --update-backup-bmc option is available only with firmware update container version 19.12.1 and later. $ sudo docker run --rm --privileged -ti -v /:/hostfs nvfw-dgx2:19.12.1 update_fw BMC --update-backup-bmc Note: The ability to update the secondary SBIOS using the firmware update container is available only with firmware update container version 19.12.1 and later.
  • Page 79 https://nvid.nvidia.com/dashboard/...
  • Page 80 https://<bmc-ip-address>/...
  • Page 81  Note: Depending on the BMC firmware version, the following quick links may appear: • Maintenance->Firmware Update • Settings->NbMeManagement->NvMe P3700Vpd Info Do not access these tasks using the Quick Links dropdown menu, as the resulting pages are not fully functional.
  • Page 83  IMPORTANT: While you can update the BMC firmware from this page, NVIDIA recommends using the NVIDIA Firmware Update Container instead (see section Updating Firmware for instructions). Do not update from versions earlier than 01.04.02 using the BMC UI, as the sensor data record (SDR) is erroneously preserved which can result in the BMC UI reporting a critical 3V Battery sensor error.
  • Page 84 It is strongly recommended that you create a unique password as soon as possible. https://<bmc-ip-address>/.
  • Page 85   .hpm .hpm...
  • Page 86 $ telinit 1 $ umount /raid $ sync $ ipmitool chassis power off...
  • Page 87  Note: NVIDIA KVM is also supported on the NVIDIA DGX-2H. References to DGX-2 in this chapter also apply to DGX-2H.
  • Page 88 Note: Unlike the bare-metal DGX-2 system or the KVM host OS, the guest VM OS is configured for English only, with no option to switch to languages such as Chinese. To set up a guest VM for a different language, install the appropriate language pack onto the guest VM.
  • Page 89   Performance Tuning section of the DGX Best Practices guide        • • • https://linux.die.net/man/1/virsh nvidia-vm nvidia-vm nvidia-vm sudo nvidia-vm --help...
  • Page 90  Note: Using nvidia-vm requires root or sudo privilege. This includes deleting VMs, running health-check, or other operations. sudo apt-get update sudo apt install -y dgx-bionic-updates-repo sudo apt update sudo apt full-upgrade -y sudo apt-cache policy dgx-kvm-image* dgx-kvm-sw sudo apt-get install dgx-kvm-sw <dgx-kvm-image-x-y-z>...
  • Page 91 CAUTION: Reverting the server back to a bare metal system destroys all guest GPU VMs that were created as well as any data. Be sure to save your data before removing the KVM software. sudo apt-get purge --auto-remove dgx-kvm-sw sudo reboot nvidia-vm...
  • Page 92 my-lab-vm2-1g0 : This VM is assigned 1 GPU from index 0 my-lab-vm3-4g8-11 : This VM is assigned 4 GPUs from index 8 through 11 About nvidia-vm nvidia-vm Syntax sudo nvidia-vm create --gpu-count N --gpu-index X [--image] [options] where --gpu-count --gpu-...
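The VM names above encode the GPU assignment in their suffix: `1g0` means one GPU at index 0, and `4g8-11` means four GPUs at indices 8 through 11. A small sketch of decoding that suffix follows. The pattern is inferred only from the example names on these pages, so treat it as an assumption rather than a documented nvidia-vm naming contract; `gpus_from_vm_name` is a hypothetical helper.

```python
import re
from typing import List, Tuple

# Suffix form inferred from the manual's examples: -<count>g<first>[-<last>]
_SUFFIX_RE = re.compile(r"-(\d+)g(\d+)(?:-(\d+))?$")


def gpus_from_vm_name(name: str) -> Tuple[int, List[int]]:
    """Return (gpu_count, gpu_indices) decoded from a VM name suffix."""
    match = _SUFFIX_RE.search(name)
    if match is None:
        raise ValueError(f"no GPU suffix found in {name!r}")
    count = int(match.group(1))
    first = int(match.group(2))
    # A single-GPU name has no "-<last>" part, so the range is one index wide.
    last = int(match.group(3)) if match.group(3) is not None else first
    indices = list(range(first, last + 1))
    if len(indices) != count:
        raise ValueError(f"count {count} does not match range {first}-{last}")
    return count, indices


if __name__ == "__main__":
    for vm in ("my-lab-vm2-1g0", "my-lab-vm3-4g8-11"):
        print(vm, gpus_from_vm_name(vm))
```

A helper like this could be used to cross-check a list of VM names against the GPU indices an operator expects to be free.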
  • Page 93: Command Help

    Managing the Images --user-data Using cloud-init to Initialize the Guest VM --meta-data Using cloud-init to Initialize the Guest VM [options] Command Help: [sudo] nvidia-vm create --help sudo sudo Command Examples: sudo nvidia-vm create --gpu-count 4 --gpu-index 12...
  • Page 94 2g8-9 IMPORTANT: A value of 0x0 for the domain name is not supported. sudo nvidia-vm create --gpu-count 2 --gpu-index 2 --image dgx-kvm-image-4-1-1 rootTue1308-2g2-3 dgx-kvm-image-4-1-1 Note: If you encounter the following message when creating a VM, Error setting up logfile: No write access to directory /home/$USER/.cache/virt-manager...
  • Page 95 --user-data <cloud-config> --meta-data <meta-data file> $ sudo nvidia-vm create --gpu-count <#> --verbose --user-data /home/lab/cloud-config --meta-data /home/lab/instance-data.json Using Cloud-init Releases the CPUs, memory, GPUs, and NVLink Retains allocation of the OS and data disks Note: Since allocation of the OS and data disks is retained, the creation of other VMs is still impacted by the shut-down VM.
  • Page 96 About nvidia-vm. Syntax sudo nvidia-vm delete --domain <vm-domain> Command Help sudo nvidia-vm delete --help Command Examples  sudo nvidia-vm delete --domain dgx2vm-labTue1308-4g12-15  sudo nvidia-vm delete --domain ALL sudo nvidia-vm destroy --domain <vm-domain> --graceful sudo nvidia-vm destroy --domain <vm-domain> --graceful...
  • Page 97 --domain <vm-domain> --mode <shutdown-mode> acpi agent initctl signal paravirt virsh domifaddr <vm-domain> --source agent Network Configuration $ virsh domifaddr 1gpu-vm-1g1 --source agent Name MAC address Protocol Address ----------------------------------------------------------------------- 00:00:00:00:00:00 ipv4 127.0.0.1/8 ipv6 ::1/128 enp1s0 52:54:00:1e:23:2b ipv4 10.120.28.219/24...
  • Page 98  Creating a new user account sudo useradd -m <new-username> -p <new-password>  Deleting the nvidia user account deluser -r nvidia sudo usermod -a -G libvirt <new-username> sudo usermod -a -G libvirt-qemu <new-username> Using cloud-init to Initialize the Guest VM...
  • Page 99 VM may not work properly. To keep guest VMs running uninterrupted, save the KVM source image to another location before uninstalling it. About nvidia-vm. nvidia-vm Syntax sudo nvidia-vm image [options] Command Help sudo nvidia-vm image --help  apt-cache policy dgx-kvm-image*...
  • Page 100 Syntax apt show <kvm-image> Example apt show dgx-kvm-image-4-1-1 <snip> Description: NVIDIA DGX bionic KVM hard disk image DGX BaseOS image for KVM OS Version: Ubuntu 18.04 Kernel Version: 4.15.0-47.50 Nvidia Driver Version: 418.67 Nvidia Docker Version: 2.0.3+docker18.09.4-1 Nvidia Container Runtime Version: 2.0.0+docker18.09.4-1 Libnvidia Container Version: 1.0.2-1...
  • Page 101 Ok to remove image package "dgx-kvm-image-x-y-z"? (y/N) : x-y-z IMPORTANT: If you uninstall KVM images without converting the system back to bare metal (for example, to recover space on the Hypervisor or to upgrade to a newer image), then you should make a copy of the image first.
  • Page 102 /dev/vda1 /dev/vdb1 /raid Resource Allocation Show storage pool $ virsh pool-list Name State Autostart ------------------------------------------- dgx-kvm-pool active Create a VM: $ sudo nvidia-vm create --gpu-count 1 --gpu-index 0...
  • Page 103 52:54:00:16:b9:ff 10.120.28.219/24 Viewing the Volume from the DGX-2 KVM Host $ virsh vol-list dgx-kvm-pool --details Name Path Type Capacity Allocation ----------------------------------------------------------------------------------------------- vol-dgx2vm-rootTue1616-1g0 /raid/dgx-kvm/vol-dgx2vm-labTue1616-1g0 file 1.74 TiB 3.71 GiB Viewing the Data Volume from the Guest VM...
  • Page 104 Configuration Host<->VM VM<->VM VM<->External macvtap macvtap Private KVM Networking Best Practices Guide --privateIP sudo nvidia-vm create --gpu-count 4 --gpu-index 12 --privateIP...
  • Page 105 IMPORTANT: Updating the DGX OS software may overwrite the associated KVM image. Guest VMs created from this older image will no longer be available. To keep guest VMs, save the older KVM image to another location and then restore the image after updating the DGX OS.
  • Page 106 IMPORTANT: A KVM guest VM runs a thin-provisioned copy of the source image. If the source image is ever uninstalled, the guest VM may not work properly. To keep guest VMs running uninterrupted, save the KVM source image to another location before uninstalling it.
  • Page 107 vCPU/HT Memory (GB) 1446 InfiniBand OS Drive (GB) Data Drive (TB) 1.92 3.84 7.68 15.36 31.72 Ethernet macvtap macvtap macvtap macvtap macvtap NVLink    virtio-net virtio-blk    ...
  • Page 108       $ sudo nvidia-vm health-check [options]...
  • Page 109 --force --help --fulltest --timelimit $ sudo nvidia-vm health-check $ sudo nvidia-vm health-check --force --fulltest $ sudo nvidia-vm health-check show $ virsh shutdown <vm-name> $ sudo virt-edit <vm-name> /lib/systemd/system/nvidia-fabricmanager.service ExecStart=/usr/bin/nv-hostengine -l -g --log-level 4 --log-rotate --log-filename /var/log/fabricmanager.log ExecStart=/usr/bin/nv-hostengine -l -g --log-level 4 -b ALL --log-...
  • Page 110 $ dcgmi health --host <vm-ip-address> --check +--------------------------+--------------------------------------+ | Health Monitor Report +==========================+======================================+ | Overall Health | Healthy +--------------------------+--------------------------------------+ sudo nvidia-vm create --gpu-count 8 --gpu-index 8 ERROR: GPU 12 is in unexpected state "missing", can't use it - BDF:e0:00.0 SXMID:13 UUID:GPU-b7187786-d894-2266-d11d-21124dc61dd3...
  • Page 111 ERROR: GPU 13 is in unexpected state "missing", can't use it - BDF:e2:00.0 SXMID:16 UUID:GPU-9a6a6a52-c6b6-79c3-086b-fcf2d5b1c87e ERROR: 2 GPU's are unavailable, unable to start this VM "dgx2vm-labMon1559-8g8-15" If you attempt to launch a VM with a failed GPU before the system has ...
  • Page 112 nvsysinfo $ grep -Ei 'error|fail' $HOME/.cache/virt-manager/virt-install.log $ sudo egrep -i 'error|fail' /var/log/libvirt/qemu/<vm-name>.log $ virsh console <vm-name>...
  • Page 113 $ virsh net-list Name State Autostart Persistent ---------------------------------------------------------- macvtap-net active private-net active $ virsh domifaddr <vm-name> --source agent $ virsh domifaddr 1gpu-vm-1g2 --source agent Name MAC address Protocol Address ----------------------------------------------------------------- 00:00:00:00:00:00 ipv4 127.0.0.1/8 ipv6 ::1/128 enp1s0 52:54:00:3c:07:62 ipv4 10.120.28.227/24 ipv6 fe80::5054:ff:fe3c:762/64 docker0 02:42:9f:5c:39:da...
  • Page 114 Getting GPU Health Information from Within the VM :~$ sudo nvme list Node Model Namespace Usage Format FW Rev ------------ -------------- -------------------------- -- -------------------- ---------- -------- /dev/nvme0n1 S2X6NX0K501953 SAMSUNG MZ1LW960HMJP-00003 1 61.79 GB / 960.20 GB 512 B + 0 B CXV8601Q <snip>...
  • Page 115 Smart Log for NVME device:nvme9n1 namespace-id:ffffffff critical_warning <snip> ... critical_warning $ sudo mdadm -S -D /dev/md0 /dev/md0: Version : 1.2 Creation Time : Tue Aug 13 08:23:52 2019 Raid Level : raid1 Array Size : 937034752 (893.63 GiB 959.52 GB) Used Dev Size : 937034752 (893.63 GiB 959.52 GB) Raid Devices : 2 Total Devices : 2...
  • Page 116 $ virsh pool-list --details Name State Autostart Persistent Capacity Allocation Available ---------------------------------------------------------------------------------- dgx-kvm-pool running 27.83 TiB 171.71 GiB 27.66 TiB images running 878.57 GiB 19.62 GiB 858.95 GiB  From within the guest VM, run the following command. :~# lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT...
  • Page 117             $ sudo virt-log -d <vm-name>  $ sudo virt-cat -d <vm-name> /var/log/syslog $ sudo virt-edit -d <vm-name> /var/log/syslog ...
  • Page 118 $ sudo virt-df -d <vm-name> $ sudo virt-df -d 1gpu-vm-1g0 Filesystem 1K-blocks Used Available Use% 1gpu-vm-1g0:/dev/sda1 51341792 3313160 45390912  $ sudo virt-filesystems -d <vm-name> DGX-2 Server Software Release Notes  Linux KVM: Guest OS debugging...
  • Page 119 NVIDIA DGX-2 Service Manual...
  • Page 120 Creating a Unique BMC Password...
  • Page 121 nvmecli Do not use a root file system Execute a shell in the installer environment $ sudo nvme list Node Model Namespace Usage Format FW Rev...
  • Page 122 ------------ -------- -------------- --------- -------------------- ------------ ----------- /dev/nvme0n1 Sxxxxxxx Samsung MZxxxx 88.99 GB / 960.20 GB B + 0 B CXV8501Q /dev/nvme1n1 Sxxxxxxx Samsung MZxxxx 90.11 GB / 960.20 GB B + 0 B CXV8501Q /dev/nvme2n1 18xxxxxx Micron_9200_xx 3.84 TB / 3.84 TB B + 0 B 101008R0...
  • Page 123 CAUTION: This process destroys all data and software customizations that you have made on the DGX-2 System. Be sure to back up any data that you want to preserve and push any Docker images that you want to keep to a trusted registry.
  • Page 124 Download the image ISO file. Restoring the DGX-2 Software Image https://docs.nvidia.com/dgx/dgx-os-server-release-notes/index.html#dgx-os-release-number-scheme https://docs.nvidia.com/dgx/pdf/DGX-OS-server-4.1-relnotes-update-guide.pdf Note: These procedures apply only to upgrades within the same major release, such as 4.x → 4.y. They do not support upgrades across major releases, such as 3.x → 4.x.
  • Page 125 # DGX specific repositories: deb http://international.download.nvidia.com/dgx/repos/bionic bionic main restricted universe multiverse...
  • Page 126 - updates main restricted universe multiverse deb http://international.download.nvidia.com/dgx/repos/bionic bionic- r418+cuda10.1 main multiverse restricted universe deb-i386 http://international.download.nvidia.com/dgx/repos/bionic bionic main restricted universe multiverse deb-i386 http://international.download.nvidia.com/dgx/repos/bionic bionic-updates main restricted universe multiverse # Only for DGX OS 4.1.0 deb-i386 http://international.download.nvidia.com/dgx/repos/bionic bionic-r418+cuda10.1 main multiverse restricted universe # Clean unused items clean http://archive.ubuntu.com/ubuntu...
  • Page 127 / bionic-r418+cuda10.1 main multiverse restricted universe /etc/apt/sources.list.d/dgx-bionic-r450-cuda10-1-repo.list file:///media/usb/repository/mirror/international.download.nvidia.com/dgx/repos/bionic/ bionic-r450+cuda11.0 main multiverse restricted universe /etc/apt/preferences.d/nvidia Package: * #Pin: origin international.download.nvidia.com Pin: release o=DGX Server Pin-Priority: 600 $ sudo apt update Get:1 file:/media/usb/repository/mirror/security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB] Get:1 file:/media/usb/repository/mirror/security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]...
  • Page 128 Get:2 file:/media/usb/repository/mirror/archive.ubuntu.com/ubuntu bionic InRelease [242 kB] Get:2 file:/media/usb/repository/mirror/archive.ubuntu.com/ubuntu bionic InRelease [242 kB] Get:3 file:/media/usb/repository/mirror/archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB] Get:4 file:/media/usb/repository/mirror/international.download.nvidia.com/dgx/repos/bionic bionic-r418+cuda10.1 InRelease [13.0 kB] Get:5 file:/media/usb/repository/mirror/international.download.nvidia.com/dgx/repos/bionic bionic InRelease [13.1 kB] Get:3 file:/media/usb/repository/mirror/archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB] Get:4 file:/media/usb/repository/mirror/international.download.nvidia.com/dgx/repos/bionic bionic-r418+cuda10.1 InRelease [13.0 kB]...
  • Page 129 /etc/apt/sources.list.d/dgx-bionic-r450-cuda11-0-repo.list /etc/apt/sources.list.d/dgx-bionic-r450-cuda11-0-repo.list $ sudo apt install cuda-toolkit-11-0 Note: If you did not configure apt to use the NVIDIA DGX OS packages in the file /etc/apt/sources.list.d/dgx-bionic-r450-cuda11-0-repo.list, omit this step. If you try to install CUDA Toolkit 11.0, the attempt fails...
  • Page 130 > framework.tar docker load -i framework.tar docker images...
  • Page 131 https://cloudinit.readthedocs.io/en/latest/topics/examples.html name The user’s login name. The default file contains a dummy value which must be replaced with your own. primary_group Define the primary group. Defaults to a new group created named after the user. The default file contains a dummy value which must be replaced with your own. groups Optional.
  • Page 132 instance-data.json https://cloudinit.readthedocs.io/en/latest/topics/instancedata.html#format-of-instance-data-json...
  • Page 135  Clean, dry, and free of airborne particles (other than normal room dust).  Well-ventilated and away from sources of heat including direct sunlight and radiators.  Away from sources of vibration or physical shock.  In regions that are susceptible to electrical storms, we recommend you plug your system into a surge suppressor and disconnect telecommunication lines to your modem during an electrical storm.
  • Page 136  Do not attempt to modify or use the AC power cord(s) if they are not the exact type required to fit into the grounded electrical outlets.  The power cord(s) must meet the following criteria:...
  • Page 137 Turn off all peripheral devices connected to this product. Turn off the system by pressing the power button to off. Disconnect the AC power by unplugging all AC power cords from the system or wall outlet. Disconnect all cables and telecommunication lines that...
  • Page 139 www.dtsc.ca.gov/perchlorate...
  • Page 140  Check first to make sure you have not left loose tools or parts inside the system.  Check that cables, add-in cards, and other components are properly installed.  Attach the covers to the chassis according to the product instructions.
  • Page 141 Federal Communications Commission (FCC) FCC Marking (Class A) NOTE: This equipment has been tested and found to comply with the limits for a Class A digital device, pursuant to part 15 of the FCC Rules. These limits are designed to provide reasonable protection against harmful interference when the equipment is operated in a commercial environment.
  • Page 142: CAN ICES-3(A)/NMB-3(A)

    CAN ICES-3(A)/NMB-3(A) The Class A digital apparatus meets all requirements of the Canadian Interference-Causing Equipment Regulation. Cet appareil numerique de la class A respecte toutes les exigences du Reglement sur le materiel brouilleur du Canada. European Conformity; Conformité Européenne (CE) This is a Class A product.
  • Page 143 A Japanese regulatory requirement, defined by specification JIS C 0950, 2008, mandates that manufacturers provide Material Content Declarations for certain categories of electronic products offered for sale after July 1, 2006. To view the JIS C 0950 material declaration for this product, visit www.nvidia.com...
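  • Page 144 Japan RoHS Material Content Declaration. Japanese Industrial Standard JIS C 0950:2008 requires manufacturers to declare the material content of certain categories of electrical and electronic equipment sold on or after July 1, 2006. Product name: DGX-2. Symbols of specified chemical substances, by major classification: Cr(VI), PBDE. Chassis: exempt. Printed circuit board: exempt. Processor: exempt. Motherboard: exempt. Power supply: exempt. System memory: exempt. Hard disk drive: exempt. Mechanical parts (fan, heat sink, bezel..): exempt. Cables/connectors: exempt. Soldering material. Flux, solder paste, labels, and other consumables. Notes: 1. "0" indicates that the content of the specified chemical substance is below the reference value given in Japanese Industrial Standard JIS C 0950:2008...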
  • Page 145 A Japanese regulatory requirement, defined by specification JIS C 0950: 2008, mandates that manufacturers provide Material Content Declarations for certain categories of electronic products offered for sale after July 1, 2006. Product Model Number: DGX-2 Symbols of Specified Chemical Substance...
  • Page 146 China RoHS Material Content Declaration. The Table of Hazardous Substances and their Content, as required by China's Management Methods for Restricted Use of Hazardous Substances in Electrical and Electronic Products. Hazardous Substances. Parts. Mercury, Cadmium, Hexavalent chromium, Polybrominated biphenyls...
  • Page 147 Notice...

This manual is also suitable for:

DGX-2H