Nvidia DGX A100 User Manual

DGX A100 System
User Guide
DU-09821-001_v06 | May 2022

Summary of Contents for Nvidia DGX A100

  • Page 2: Table Of Contents

    Chapter 1. Introduction, p. 1; Hardware Overview, p. 2; 1.1.1 DGX A100 Models and Component Descriptions, p. 2; 1.1.2 Mechanical Specifications, p. 3; 1.1.3 Power Specifications, p. 4; 1.1.3.1 Support for N+N Redundancy, p. 4; 1.1.3.2 DGX A100 Locking Power Cord Specification, p. 4; 1.1.3.3...
  • Page 3 Chapter 4. Quick Start and Basic Operation, p. 23; Installation and Configuration, p. 23; Registration, p. 23; Obtaining an NGC Account, p. 24; Turning DGX A100 On and Off, p. 24; 4.4.1 Startup Considerations, p. 24; 4.4.2 Shutdown Considerations, p. 24; Verifying Functionality –...
  • Page 4 9.1.2 Update Instructions, p. 53; Restoring the DGX A100 Software Image, p. 53; 9.2.1 Obtaining the DGX A100 Software ISO Image and Checksum File, p. 54; 9.2.2 Remotely Reimaging the System, p. 54; 9.2.3 Creating a Bootable Installation Medium, p. 55; 9.2.3.1...
  • Page 5 13.1 User Security Measures, p. 75; 13.1.1 Securing the BMC Port, p. 75; 13.2 System Security Measures, p. 75; 13.2.1 Secure Flash of DGX A100 Firmware, p. 75; 13.2.1.1 Encryption, p. 75; 13.2.1.2 Signing, p. 76; 13.2.1.3 NVSM Security, p. 76; 13.3 Secure Data Deletion...
  • Page 6: Chapter 1. Introduction

    Chapter 1. Introduction The NVIDIA DGX™ A100 system is the universal system purpose-built for all AI infrastructure and workloads, from analytics to training to inference. The system is built on eight NVIDIA A100 Tensor Core GPUs. This document is for users and administrators of the DGX A100 system.
  • Page 7: Hardware Overview

    Hardware Overview
    1.1.1 DGX A100 Models and Component Descriptions
    There are two models of the NVIDIA DGX A100 system: the NVIDIA DGX A100 640GB system and the NVIDIA DGX A100 320GB system.
    Table 1-1. Model Differentiation Component NVIDIA DGX A100 640GB...
  • Page 8: Mechanical Specifications

    1.1.2 Mechanical Specifications
    Table 1-3. Mechanical Specifications
      Form Factor: 6U Rackmount
      Height: 10.4" (264 mm)
      Width: 19" (482.3 mm) max
      Depth: 35.3" (897.1 mm) max
      System Weight: 271.5 lbs (123.16 kg) max
    DGX A100 System DU-09821-001_v06 | 3...
  • Page 9: Power Specifications

    DGX A100 Locking Power Cord Specification
    The DGX A100 is shipped with a set of six (6) locking power cords that have been qualified for use with the DGX A100 to ensure regulatory compliance. Two locking power cord types are approved: switch-locking for the PDU side and twist-locking for the PSU side.
  • Page 10: Using The Locking Power Cords

    To UNLOCK the power cord, move the switch to the unlocked position (the indicator shows GREEN).
    To LOCK the power cord, move the switch to the locked position (the indicator should show only RED).
  • Page 11: Locking/Unlocking The PSU Side (Cords With Twist-Lock Mechanism)

    Environmental Specifications
      Operating Temperature: 5 °C to 30 °C (41 °F to 86 °F)
      Relative Humidity: 20% to 80% non-condensing
      Airflow: 840 CFM @ 80% fan PWM
      Heat Output: 22,179 BTU/hr
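    For cooling and power planning it can help to restate the heat output in watts and to cross-check the temperature limits. A minimal Bash sketch using awk, assuming the standard conversion factor 1 W ≈ 3.412142 BTU/hr (this factor is not given in the guide):

```shell
# Convert a heat load in BTU/hr to watts (1 W ~= 3.412142 BTU/hr, an assumed standard factor).
btu_per_hr_to_watts() {
    awk -v btu="$1" 'BEGIN { printf "%.0f\n", btu / 3.412142 }'
}

# Convert degrees Celsius to Fahrenheit, rounded to whole degrees.
c_to_f() {
    awk -v c="$1" 'BEGIN { printf "%.0f\n", c * 9 / 5 + 32 }'
}

btu_per_hr_to_watts 22179   # heat output from the table: about 6500 W
c_to_f 5                    # lower operating limit: 41
c_to_f 30                   # upper operating limit: 86
```

    The result is rounded to whole watts; it is a planning figure derived from the table, not an additional specification.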
  • Page 12: Front Panel Connections And Controls

    With a Bezel
    Table 1-7. Front Panel Controls
      Power Button: Press to turn the DGX A100 system On or Off.
        Green flashing (1 Hz): Standby (BMC booted)
        Green flashing (4 Hz): POST in progress
        Green solid On: Power On...
  • Page 13: With The Bezel Removed

    1.1.5.2 With the Bezel Removed
    Important: See “Turning DGX A100 On and Off” for instructions on how to properly turn the system on or off.
    1.1.6 Rear Panel Modules
  • Page 14: Motherboard Connections And Controls

    BMC Reset button: Press to manually reset the BMC.
    See “Network Connections, Cables, and Adaptors” for details on the network connections.
    1.1.8 Motherboard Tray Components
  • Page 15: GPU Tray Components

    1.1.9 GPU Tray Components
    Network Connections, Cables, and Adaptors
    1.2.1 Network Ports
  • Page 16 When switching from the default Ethernet to InfiniBand, the InfiniBand port designations will vary depending on changes made to the other ports.
    Based on systems updated with DGX A100 Firmware Update Container 20.10.9 or later.
    Based on systems updated with DGX A100 Firmware Update Container 20.05.12.3 or earlier.
  • Page 17: Supported Network Cables And Adaptors

    1.2.2 Supported Network Cables and Adaptors
    The DGX A100 system is not shipped with network cables or adaptors. You will need to purchase supported cables or adaptors for your network. The ConnectX-6 firmware determines which cables and adaptors are supported. For a list of cables and adaptors compatible with the Mellanox ConnectX-6 VPI cards installed in the DGX A100 system, see...
  • Page 18: Additional Documentation

    Customer Support Contact NVIDIA Enterprise Support for assistance in reporting, troubleshooting, or diagnosing problems with your DGX A100 system. Also contact NVIDIA Enterprise Support for assistance in moving the DGX A100 system. For contracted Enterprise Support questions, you can send an email to ...
  • Page 19: Chapter 2. Connecting To The DGX A100

    2.1.1 Direct Connection At either the front or the back of the DGX A100 system, connect a display to the VGA connector, and a keyboard to any of the USB ports. Note: The display resolution must be 1440x900 or lower.
  • Page 20 Figure 2-1. DGX A100 Server Front View
    Figure 2-2. DGX A100 Server Rear View
  • Page 21: Remote Connection Through The BMC

    Username: <administrator-username>
    Password: <bmc-password>
    Make sure you have connected the BMC port on the DGX A100 system to your LAN. Open a browser within your LAN and go to: https://<bmc-ip-address>/
    Make sure pop-ups are allowed for the BMC address.
  • Page 22 From the left-side navigation menu, click Remote Control. The Remote Control page allows you to open a virtual Keyboard/Video/Mouse (KVM) on the DGX A100 system, as if you were using a physical monitor and keyboard connected to the front of the system.
  • Page 23: SSH Connection To The OS

    SSH Connection to the OS
    After the system has been configured, you can also establish an SSH connection to the DGX A100 OS through the network port. See “Network Ports” to identify the port to use, and “Configuring a BMC Static IP Address for the Network Ports”...
  • Page 24: Chapter 3. First Boot Setup

    System Setup These instructions describe the setup process that occurs the first time the DGX A100 system is powered on after delivery or after the server is re-imaged.
  • Page 25 If the DGX OS was installed with an encrypted root filesystem, you will be prompted to unlock the drive. Enter “nvidia3d” at the crypt: prompt. You are presented with end user license agreements (EULAs) for the NVIDIA software. Accept the EULA to proceed with the installation. Perform the steps to configure the DGX A100 software.
  • Page 26: Post Setup Tasks

    This step appears only if you installed the system with an encrypted root filesystem during DGX OS installation. Choose a primary network interface for the DGX A100 system; for example, enp226s0. This should typically be the interface that you will use for subsequent system configuration or in-band management.
  • Page 27: Obtain Software Updates

    Obtain Software Updates Update the software to ensure you are running the latest version. Updating the software ensures your DGX A100 system contains important updates, including security updates. The Ubuntu Security Notice site (https://usn.ubuntu.com/) lists known Common Vulnerabilities and Exposures (CVEs), including those that can be resolved by updating the DGX OS software.
  • Page 28: Chapter 4. Quick Start And Basic Operation

    NGC for DGX account. If you did not receive the information, open a case with the NVIDIA Enterprise Support Team by going to the NVIDIA Enterprise Support Portal. The site provides ways of contacting the NVIDIA Enterprise Services team for support without requiring an NVIDIA Enterprise Support account.
  • Page 29: Obtaining An NGC Account

    Observe the following startup and shutdown instructions. 4.4.1 Startup Considerations To keep your DGX A100 running smoothly, allow up to a minute of idle time after reaching the login prompt. This ensures that all components can complete their initialization. 4.4.2 Shutdown Considerations...
  • Page 30: Running A Preflight Stress Test

    The following are the steps for performing a health check on the DGX A100 system and verifying the Docker and NVIDIA driver installation.
    Establish an SSH connection to the DGX A100 system.
    Run a basic system check.
  • Page 31: Running The NGC Containers With GPU Support

    Running the NGC Containers with GPU Support To obtain the best performance when running NGC containers on DGX A100 systems, two methods of providing GPU support for Docker containers have been developed: Native GPU support (included in Docker 19.03 and later) ...
  • Page 32: Using The NVIDIA Container Runtime For Docker

    Use docker run with nvidia as the default runtime. You can set nvidia as the default runtime, for example, by adding the following line to the /etc/docker/daemon.json configuration file as the first entry:
    "default-runtime": "nvidia",
    The following is an example of how the added line appears in the JSON file.
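    For a host where nothing else is configured in daemon.json, the complete file can look like the following. This Bash sketch only prints it; the runtime path /usr/bin/nvidia-container-runtime and the `runtimes` block are assumptions based on a typical NVIDIA Container Runtime installation, so compare with the existing file and merge by hand if it already contains other settings:

```shell
# Print a complete /etc/docker/daemon.json that makes "nvidia" the default runtime.
# Assumptions: the NVIDIA Container Runtime is installed at
# /usr/bin/nvidia-container-runtime and the file holds no other settings.
nvidia_daemon_json() {
    cat <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
}

# Usage: back up the current file, write the new one, then restart Docker.
#   sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.bak
#   nvidia_daemon_json | sudo tee /etc/docker/daemon.json
#   sudo systemctl restart docker
```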
  • Page 33: Managing CPU Mitigations

    CPU mitigations are disabled if the output consists of multiple lines prefixed with Vulnerable. Example:
    KVM: Vulnerable
    Mitigation: PTE Inversion; VMX: vulnerable
    Vulnerable; SMT vulnerable
    Vulnerable
    Vulnerable
    Vulnerable: user pointer sanitization and usercopy barriers only; no swapgs barriers
    Vulnerable, IBPB: disabled, STIBP: disabled
    Vulnerable
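    The check described above can be scripted. A minimal Bash sketch that counts lines beginning with `Vulnerable` across the kernel's vulnerability files and reports `disabled` when there are two or more; that threshold is an assumption that follows the "multiple lines" wording above. The directory is a parameter so the function can be tried on a copy:

```shell
# Report "disabled" or "enabled" for the CPU mitigations, based on how many
# lines in the vulnerability files start with "Vulnerable".
# Defaults to the sysfs directory used in this guide; pass another directory to test.
cpu_mitigation_state() {
    local dir="${1:-/sys/devices/system/cpu/vulnerabilities}"
    local vulnerable
    # grep -c prints 0 (and exits 1) when nothing matches; "|| true" keeps that harmless.
    vulnerable=$(cat "$dir"/* 2>/dev/null | grep -c '^Vulnerable' || true)
    if [ "${vulnerable:-0}" -ge 2 ]; then
        echo "disabled"
    else
        echo "enabled"
    fi
}

# Usage on the DGX A100:
#   cpu_mitigation_state
```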
  • Page 34: Disabling CPU Mitigations

    $ sudo apt purge nv-mitigations-off
    Reboot the system. Verify CPU mitigations are enabled.
    $ cat /sys/devices/system/cpu/vulnerabilities/*
    The output should include several Mitigation lines. See “Determining the CPU Mitigation State of the DGX System” for example output.
  • Page 35: Chapter 5. Additional Features And Instructions

    Chapter 5. Additional Features and Instructions This chapter describes specific features of the DGX A100 server to consider during setup and operation. Managing the DGX Crash Dump Feature The DGX OS includes a script to manage this feature. 5.1.1 Using the Script To enable only dmesg crash dumps, enter the following command: ...
  • Page 36: Connecting To Serial Over Lan To View The Console

    While dumping vmcore, the BMC screen console goes blank approximately 11 minutes after the crash dump is started. To view the console output during the crash dump, connect to serial over LAN as follows:
    $ ipmitool -I lanplus -H <bmc-ip-address> -U <BMC-USERNAME> -P <BMC-PASSWORD> sol activate
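    To keep the BMC password out of the shell history and the process list, ipmitool can read it from the IPMI_PASSWORD environment variable when given `-E` instead of `-P`. A Bash sketch of the same command under that approach; the host and user in the usage comment are placeholders, and the DRY_RUN variable is this sketch's own switch for printing the command instead of running it:

```shell
# Start a Serial-over-LAN session with the BMC, reading the password from
# IPMI_PASSWORD (ipmitool -E) rather than passing it with -P.
# Set DRY_RUN=1 to print the command instead of running it.
sol_activate() {
    local host="$1" user="$2"
    local cmd=(ipmitool -I lanplus -H "$host" -U "$user" -E sol activate)
    if [ -n "${DRY_RUN:-}" ]; then
        echo "${cmd[*]}"
        return 0
    fi
    if [ -z "${IPMI_PASSWORD:-}" ]; then
        echo "set IPMI_PASSWORD to the BMC password first" >&2
        return 1
    fi
    "${cmd[@]}"
}

# Usage (192.0.2.10 and admin are placeholders):
#   read -rs IPMI_PASSWORD && export IPMI_PASSWORD
#   sol_activate 192.0.2.10 admin
# Type ~. to end the session.
```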
  • Page 37: Chapter 6. Managing The DGX A100 Self-Encrypting Drives

    The NVIDIA DGX OS software supports the ability to manage self-encrypting drives (SEDs), including setting an Authentication Key for locking and unlocking the drives on NVIDIA DGX™ A100 systems. You can manage only the SED data drives. The software cannot be used to manage OS drives even if they are SED-capable.
  • Page 38: Installing The Software

    Trusted Platform Module  The NVIDIA DGX A100 incorporates Trusted Platform Module 2.0 (TPM 2.0) which can be enabled from the system BIOS and used in conjunction with the nv-disk-encrypt tool. Once enabled, the nv-disk-encrypt tool uses the TPM for encryption and then stores the vault and SED authentication keys on the TPM instead of on the file system.
  • Page 39: How To Tell If Drives Support Block SID

    Block SID request - but you can select which task to perform as each task is independent of the other. Reboot the DGX A100, then press [Del] or [F2] at the NVIDIA splash screen to enter the BIOS Setup.
  • Page 40: Initializing The System For Drive Encryption

    Press F10 at the prompt. After the system boots, you can proceed to initialize drive encryption. Initializing the System for Drive Encryption Note: Before initializing drive encryption, review the information in “Configuring Trusted Computing” and follow the configuration instructions if needed.
  • Page 41: Enabling Drive Locking

    NVIDIA strongly recommends using this option for best security; otherwise, the software uses a default salt value instead of a randomly generated one. -r: Generates random passwords for each drive. This avoids the need to create a JSON file ...
  • Page 42: Creating The Drive/Password Mapping JSON Files And Using It To Initialize The System

    Lock Enabled: Are locks enabled on this drive? It will be in this state after initialization (nv-disk-encrypt init). MBR done: This setting is only relevant for drives that support MBR shadowing. On drives ...
  • Page 43: Example 2: Generating Random Passwords

    Initialize the system and then enable locking. The following commands assume you have placed the JSON file in the /tmp directory.
    $ sudo nv-disk-encrypt init -f /tmp/<your-file>.json -g
    $ sudo nv-disk-encrypt lock
    Provide a password for the vault when prompted.
  • Page 44: Exporting The Vault

    Erasing your Data
    CAUTION: Be aware that executing this erases all data. On DGX A100 systems, these drives generally form a RAID 0 array; this array is also destroyed when performing an erase. After initializing the system for SED management, use the nv-disk-encrypt command to erase data on your drives after stopping cachefilesd and unmounting the RAID array as follows.
  • Page 45: Changing Disk Passwords, Adding Disks, Or Replacing Disks

    Recovering From Lost Keys NVIDIA recommends backing up your keys and storing them in a secure location. If you’ve lost the key used to initialize and lock your drives, you will not be able to unlock the drive again. If this happens, the only way to recover is to perform a factory-reset, which will result in a loss of data.
  • Page 46: Chapter 7. Network Configuration

    Chapter 7. Network Configuration This chapter describes key network considerations and instructions for the DGX A100 System. Configuring Network Proxies If your network requires use of a proxy server, you will need to set up configuration files to ensure the DGX A100 System communicates through the proxy.
  • Page 47: For Docker

    IP addresses are used by your network. If your network does not conflict with the default Docker IP address range, no changes are needed, and you can skip this section. However, if your network uses the addresses within this range for the DGX A100 system, you should change the default Docker network addresses.
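    To decide whether this section applies, you can test whether addresses used on your network fall inside Docker's bridge range. A minimal Bash sketch; the range 172.17.0.0/16 is Docker's usual default for the docker0 bridge and is an assumption here (this excerpt does not state it), so confirm it on the system with `ip addr show docker0`:

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
# 10# forces base 10 so octets such as "08" are not read as octal.
ip_to_int() {
    local IFS=.
    local a b c d
    read -r a b c d <<< "$1"
    echo $(( (10#$a << 24) | (10#$b << 16) | (10#$c << 8) | 10#$d ))
}

# Succeed if address $1 lies inside the CIDR block $2, for example 172.17.0.0/16.
ip_in_cidr() {
    local addr="$1" net="${2%/*}" bits="${2#*/}"
    local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    (( ($(ip_to_int "$addr") & mask) == ($(ip_to_int "$net") & mask) ))
}

# Example: does an address on your LAN overlap Docker's assumed default range?
if ip_in_cidr 172.17.5.20 172.17.0.0/16; then
    echo "conflict: consider moving Docker's bridge (the \"bip\" setting in daemon.json)"
fi
```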
  • Page 48: Open Ports

    If port 443 is proxied through a corporate firewall, WebSocket protocol traffic must be supported.
    Connectivity Requirements for NGC Containers
    To run NVIDIA NGC containers from the NGC container registry, your network must be able to access the following URLs:
    http://archive.ubuntu.com/ubuntu/
    http://security.ubuntu.com/ubuntu/
    ...
  • Page 49: Configuring A Static IP Address For The BMC

    This section describes how to set a static IP address for the BMC from the Ubuntu command line. Note: If you cannot access the DGX A100 System remotely, connect a display (1440x900 or lower resolution) and keyboard directly to the DGX A100 system. To view the current settings, enter the following command.
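    The BMC network settings can be inspected and changed with the standard ipmitool `lan` subcommands. A Bash sketch that validates the addresses and prints the `lan set` commands for review rather than running them; LAN channel 1 is an assumption (confirm it with `sudo ipmitool lan print 1`), and the addresses in the usage comment are placeholders:

```shell
# Succeed if $1 is a dotted-quad IPv4 address with every octet in 0-255.
valid_ipv4() {
    [[ "$1" =~ ^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$ ]] || return 1
    local i
    for i in 1 2 3 4; do
        (( 10#${BASH_REMATCH[i]} <= 255 )) || return 1
    done
}

# Print the ipmitool commands that switch the BMC to a static address.
# Arguments: address, netmask, gateway, and an optional LAN channel (default 1, an assumption).
bmc_static_ip_commands() {
    local ip="$1" mask="$2" gw="$3" ch="${4:-1}" a
    for a in "$ip" "$mask" "$gw"; do
        valid_ipv4 "$a" || { echo "invalid IPv4 address: $a" >&2; return 1; }
    done
    echo "sudo ipmitool lan set $ch ipsrc static"
    echo "sudo ipmitool lan set $ch ipaddr $ip"
    echo "sudo ipmitool lan set $ch netmask $mask"
    echo "sudo ipmitool lan set $ch defgw ipaddr $gw"
}

# Usage (placeholder addresses): review the output, then run each line.
#   sudo ipmitool lan print 1
#   bmc_static_ip_commands 192.0.2.10 255.255.255.0 192.0.2.1
```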
  • Page 50: Configuring A BMC Static IP Address Using The System BIOS

    Configuring a BMC Static IP Address for the Network Ports During the initial boot setup process for the DGX A100 System, you had an opportunity to configure static IP addresses for a single network interface. If you did not set this up at that time, you can configure the static IP addresses from the Ubuntu command line using the following instructions.
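    The command-line steps themselves are not reproduced in this excerpt. On DGX OS 5 (based on Ubuntu 20.04), static addresses are commonly described with netplan, so the following Bash sketch prints a netplan stanza for one interface. The netplan layout, the `gateway4` key (accepted by Ubuntu 20.04's netplan; newer releases prefer `routes`), the file name in the usage comment, and all addresses are assumptions, not values taken from this guide:

```shell
# Print a netplan stanza that gives one interface a static IPv4 address.
# Arguments: interface name, address in CIDR form, gateway, DNS server.
# Assumes the networkd renderer and Ubuntu 20.04-era netplan syntax.
netplan_static() {
    local iface="$1" cidr="$2" gw="$3" dns="$4"
    cat <<EOF
network:
  version: 2
  renderer: networkd
  ethernets:
    ${iface}:
      dhcp4: no
      addresses:
        - ${cidr}
      gateway4: ${gw}
      nameservers:
        addresses: [${dns}]
EOF
}

# Usage (placeholder values; check /etc/netplan/ for the file your system already uses):
#   netplan_static enp226s0 192.0.2.20/24 192.0.2.1 192.0.2.53 | sudo tee /etc/netplan/01-netcfg.yaml
#   sudo netplan try      # rolls back automatically unless you confirm
#   sudo netplan apply
```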
  • Page 51: Switching Between Infiniband And Ethernet

    Switching Between InfiniBand and Ethernet The NVIDIA DGX A100 System is equipped with eight Mellanox ConnectX-6 single-port network cards on the I/O board, typically used for cluster communications. By default, these are configured as InfiniBand ports, but you have the option to convert these to Ethernet ports.
  • Page 52: Starting The Mellanox Software Tools And Determining The Current Port Configuration

    Likewise, if the port configuration is set to Ethernet, then the switch should also be Ethernet. The DGX A100 is also equipped with one (and optionally two) dual-port connections typically used for network storage and configured by default for Ethernet. These can also be configured for InfiniBand.
  • Page 53 <config-number> is ‘1’ for InfiniBand and ‘2’ for Ethernet.
    Example setting slot 0 to Ethernet:
    $ sudo mlxconfig -y -d /dev/mst/mt4123_pciconf2 set LINK_TYPE_P1=2
    Example setting slot 1 to InfiniBand:
    $ sudo mlxconfig -y -d /dev/mst/mt4123_pciconf3 set LINK_TYPE_P1=1
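    The link-type values above can be wrapped in a small helper that maps a readable name to the numeric setting. A Bash sketch that only prints the command; device names such as /dev/mst/mt4123_pciconf2 come from `sudo mst status` after `sudo mst start`, and a new LINK_TYPE_P1 value generally takes effect only after the system is rebooted:

```shell
# Map a readable link type to the LINK_TYPE_P1 value (1 = InfiniBand, 2 = Ethernet)
# and print the matching mlxconfig command for the given /dev/mst device.
link_type_command() {
    local dev="$1" type="$2" value
    case "$type" in
        ib|infiniband) value=1 ;;
        eth|ethernet)  value=2 ;;
        *) echo "unknown link type: $type (use ib or eth)" >&2; return 1 ;;
    esac
    echo "sudo mlxconfig -y -d $dev set LINK_TYPE_P1=$value"
}

# Usage:
#   sudo mst start
#   sudo mst status                                   # lists /dev/mst/... device names
#   sudo mlxconfig -d /dev/mst/mt4123_pciconf2 query  # shows the current LINK_TYPE_P1
#   link_type_command /dev/mst/mt4123_pciconf2 eth
```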
  • Page 54: Chapter 8. Configuring Storage

    System, and how to cache the NFS using the DGX A100 SSDs for improved performance. Disabling cachefilesd The DGX A100 system uses cachefilesd to manage caching of the NFS. If you do not want cachefilesd enabled, you can disable it as follows.
  • Page 55: Setting Filesystem Quotas

    Switching Between RAID 0 and RAID 5
    As supplied from the factory, the RAID level of the DGX A100 RAID array is RAID 0. RAID 0 provides the maximum storage capacity but does not provide any redundancy. If a single SSD in the array fails, all data stored on the array is lost.
  • Page 56: Configuring Support For Custom Drive Partitioning

    Configuring Support for Custom Drive Partitioning DGX A100 systems incorporate data drives configured as RAID 0 by default. You can alter the default configuration by adding or removing drives, or by switching between a RAID 0 configuration and a RAID 5 configuration. If you alter the default configuration, you must let NVSM know so that the utility does not flag the configuration as an error, and so that NVSM can continue to monitor the health of the drives.
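    The trade-off between the two RAID levels can be made concrete with a capacity calculation. A minimal Bash sketch for N identical drives: RAID 0 stripes across all drives, while RAID 5 gives up one drive's worth of space to parity in exchange for surviving a single drive failure. The drive count and size in the examples are illustrative, and filesystem overhead is ignored:

```shell
# Usable capacity in GB for N identical drives of SIZE_GB each.
#   RAID 0: N * SIZE        (no redundancy)
#   RAID 5: (N - 1) * SIZE  (one drive's worth of parity; needs at least 3 drives)
raid_usable_gb() {
    local level="$1" n="$2" size="$3"
    case "$level" in
        0) echo $(( n * size )) ;;
        5)
            if (( n < 3 )); then
                echo "RAID 5 needs at least 3 drives" >&2
                return 1
            fi
            echo $(( (n - 1) * size )) ;;
        *) echo "unsupported RAID level: $level" >&2; return 1 ;;
    esac
}

# Illustrative numbers: eight 3840 GB drives.
raid_usable_gb 0 8 3840   # 30720
raid_usable_gb 5 8 3840   # 26880
```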
  • Page 57: Chapter 9. Updating And Restoring The Software

    These instructions explain how to update the DGX A100 software through an internet connection to the NVIDIA public repository. The process updates a DGX A100 system image to the latest released versions of the entire DGX A100 software stack, including the drivers, for the latest version within a specific release.
  • Page 58: Update Instructions

    Image If the DGX A100 software image becomes corrupted (or the OS NVMe drives are replaced), restore the DGX A100 software image to its original factory condition from a pristine copy of the image. The process for restoring the DGX A100 software image is as follows: Obtain an ISO file that contains the image from NVIDIA Enterprise Support as explained in “Obtaining the DGX A100 Software ISO Image and Checksum File”...
  • Page 59: Obtaining The DGX A100 Software ISO Image And Checksum File

    Obtaining the DGX A100 Software ISO Image and Checksum File To ensure that you restore the latest available version of the DGX A100 software image, obtain the current ISO image file from NVIDIA Enterprise Support. A checksum file is provided for the image to enable you to verify the bootable installation medium that you create from the image file.
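    Once you have the image and its checksum, the comparison can be scripted. A minimal Bash sketch; which hash algorithm NVIDIA's checksum file uses is not stated here, so the tool is a parameter (default `sha256sum`, an assumption) that you should match to the checksum file you received:

```shell
# Compare an image file against an expected checksum value.
# Arguments: image path, expected checksum, checksum tool (default sha256sum).
verify_checksum() {
    local file="$1" expected tool="${3:-sha256sum}" actual
    # Normalize the expected value to lowercase hex, as the *sum tools print it.
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    actual=$("$tool" "$file" | awk '{print $1}')
    if [ -n "$actual" ] && [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got ${actual:-nothing})" >&2
        return 1
    fi
}

# Usage (placeholder file name and value):
#   verify_checksum DGXOS-<version>.iso <expected-checksum> md5sum
```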
  • Page 60: Creating A Bootable Installation Medium

    After the installation is completed, the system ejects the virtual CD and then reboots into the OS. Refer to “First Boot Setup” for the steps to take when booting up the DGX A100 system for the first time after a fresh installation.
  • Page 61: Creating A Bootable USB Flash Drive By Using The dd Command

    Ensure that the following prerequisites are met: The correct DGX A100 software image is saved to your local disk. For more information, see “Obtaining the DGX A100 Software ISO Image and Checksum File”.
  • Page 62 The correct DGX A100 software image is saved to your local disk. For more information, see “Obtaining the DGX A100 Software ISO Image and Checksum File”. The USB flash drive has a capacity of at least 8 GB.
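    Before writing the image, it is worth checking the target device name and that the image fits. A Bash sketch that refuses partition names such as /dev/sdb1, compares the image size with the device size, and prints a `dd` command for you to review. The `dd` options shown (`bs=4M conv=fsync status=progress`, GNU dd) are one reasonable choice and not necessarily the options NVIDIA documents; the device name and image name in the usage comment are placeholders:

```shell
# Print a dd command that writes an ISO to a whole USB device, after checks.
# Arguments: ISO path, device (e.g. /dev/sdb), device size in bytes.
usb_dd_command() {
    local iso="$1" dev="$2" dev_bytes="$3" iso_bytes
    # Refuse partition names such as /dev/sdb1: the image must go to the whole device.
    if [[ ! "$dev" =~ ^/dev/sd[a-z]+$ ]]; then
        echo "expected a whole-disk device such as /dev/sdb, got: $dev" >&2
        return 1
    fi
    iso_bytes=$(wc -c < "$iso") || return 1
    if (( iso_bytes > dev_bytes )); then
        echo "image ($iso_bytes bytes) is larger than $dev ($dev_bytes bytes)" >&2
        return 1
    fi
    echo "sudo dd if=$iso of=$dev bs=4M conv=fsync status=progress"
}

# Usage (placeholders): read the device size, then review the printed command.
#   size=$(lsblk -b -dn -o SIZE /dev/sdb)
#   usb_dd_command DGXOS-<version>.iso /dev/sdb "$size"
```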
  • Page 63: Re-Imaging The System From A USB Flash Drive

    After the installation is completed, the system then reboots into the OS. Refer to “First Boot Setup” for the steps to take when booting up the DGX A100 system for the first time after a fresh installation.
  • Page 64: Advanced Installation Options (Encrypted Root - DGX OS 5 Or Later)

    This overwrites any data or file systems that may exist on the OS disk as well as the RAID disks. Since the RAID array on the DGX A100 system is intended to be used as a cache and not for long-term data storage, this should not be disruptive. However, if you are an advanced user...
  • Page 65: Boot Into Live Environment (DGX OS 5 Or Later)

    It is time-consuming, and the installation media generally is not the real source of the problem. In normal operation, this option should not be selected.
  • Page 66: Chapter 10. Using The BMC

    10.1.1 Connecting to the BMC Make sure you have connected the BMC port on the DGX A100 system to your LAN. Open a browser within your LAN and go to: https://<bmc-ip-address>/ The BMC is supported on the following browsers: Internet Explorer 11 and later •...
  • Page 67: Overview Of BMC Controls

    Provides status and readings for system sensors, such as SSD, PSUs, voltages, CPU temperatures, DIMM temperatures, and fan speeds.
    System Inventory: Displays inventory information of system modules: System, Processor, Memory Controller, BaseBoard, Power, FRU Information, Thermal, PCIE Device, PCIE Function, and Storage.
  • Page 68: Common BMC Tasks

    Order Settings, Platform Event Filter, Services, SMTP Settings, SSL Settings, System Firewall, User Management, Video Recording
    Remote Control: Opens the KVM Launch page for accessing the DGX A100 console remotely.
    Power Control: Perform the following power actions: Power On, Power Off, Power Cycle, Hard Reset, ACPI/Shutdown...
  • Page 69: Using The Remote Console

    Log out and then log back in with the new credentials.
    10.3.2 Using the Remote Console
    Click Remote Control from the left-side navigation menu. Click Launch KVM to start the remote KVM and access the DGX A100 console.
  • Page 70: Setting Up Active Directory Or LDAP/E-Directory

    To view available configured and unconfigured slots, click All in the upper-left corner of the page. To view available configured slots, click Configured in the upper-left corner of the page.
  • Page 71: Uploading Or Generating SSL Certificates

    The View SSL Certificate page displays the basic information about the uploaded SSL certificate:
    Certificate Version, Serial Number, Algorithm, and Public Key
    Issuer information
    Valid Date range
    Issued to information
  • Page 72: Generating The SSL Certificate

    Special characters are not allowed.
    Email Address: Email address of the organization (mandatory).
    Valid for: Validity of the certificate. Enter a value from 1 to 3650 (days).
    Key Length
    Click Save to generate the new certificate.
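    The same fields can be checked locally before you enter or upload a certificate. A minimal Bash sketch, assuming the `openssl` command-line tool is available on your workstation: `valid_cert_days` checks a value against the 1 to 3650 day range above, and `show_cert` prints the serial number, issuer, subject, and validity dates of a PEM file (roughly the information the View SSL Certificate page lists, though the BMC's formatting may differ):

```shell
# Succeed if $1 is a whole number of days within the BMC's 1-3650 range.
valid_cert_days() {
    [[ "$1" =~ ^[0-9]+$ ]] && (( 10#$1 >= 1 && 10#$1 <= 3650 ))
}

# Print identifying fields and validity dates of a PEM certificate.
show_cert() {
    openssl x509 -in "$1" -noout -serial -issuer -subject -dates
}

# Usage (placeholder file name):
#   valid_cert_days 365 && echo "365 days is accepted"
#   show_cert bmc-cert.pem
```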
  • Page 73: Uploading The SSL Certificate

    Obtain the CA certificate from the signing authority that was used to sign the SSL certificate. Copy the CA certificate onto a USB thumb drive or to /boot/efi on the DGX A100 OS. Access the DGX A100 console from a locally connected keyboard and mouse or through the BMC remote console. Reboot the server. To enter the BIOS setup menu, press DEL when prompted.
  • Page 74 Select Server CA Configuration. Select Enroll Cert.
  • Page 75 Select Enroll Cert Using File. Select the device where you stored the certificate. Navigate the file structure and select the certificate.
  • Page 77: Chapter 11. SBIOS Settings

    Instructions for these use cases are provided in this document. Do not change settings in the SBIOS other than those described in this or other DGX A100 user documents. Contact NVIDIA Enterprise Services before making other changes. 11.1...
  • Page 78: Configuring Boot Order

    The following instructions describe how to set the boot order at boot time. You can also set the boot order from the SBIOS setup > Boot screen. Access the DGX A100 console, either from a locally connected keyboard and mouse or through the BMC remote console.
  • Page 79: Chapter 12. Multi-Instance GPU

    Chapter 12. Multi-Instance GPU Multi-Instance GPU (MIG) is a new capability of the NVIDIA A100 GPU. MIG uses spatial partitioning to carve the physical resources of a single A100 GPU into as many as seven independent GPU instances. These instances run simultaneously, each with its own memory, cache, and compute streaming multiprocessors.
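    A hedged sketch of the usual nvidia-smi sequence for partitioning one GPU: enable MIG mode, list the available GPU instance profiles, then create GPU instances together with their default compute instances (`-C`). The function only prints the commands; the GPU index and profile names in the usage comment are illustrative, since valid profiles depend on the A100's memory size, and enabling MIG mode may require that no processes are using the GPU:

```shell
# Print the nvidia-smi steps for splitting one A100 into MIG GPU instances.
# Arguments: GPU index and a comma-separated list of GPU instance profiles.
mig_plan() {
    local gpu="$1" profiles="$2" count
    count=$(awk -F, '{print NF}' <<< "$profiles")
    # Upper bound only (at most seven instances per A100); nvidia-smi itself
    # rejects profile combinations that do not fit on the GPU.
    if (( count < 1 || count > 7 )); then
        echo "expected 1 to 7 GPU instance profiles, got $count" >&2
        return 1
    fi
    echo "sudo nvidia-smi -i $gpu -mig 1"
    echo "sudo nvidia-smi mig -i $gpu -lgip"
    echo "sudo nvidia-smi mig -i $gpu -cgi $profiles -C"
    echo "nvidia-smi -L"
}

# Usage (illustrative profile names; check the -lgip output first):
#   mig_plan 0 3g.20gb,3g.20gb
```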
  • Page 80: Chapter 13. Security

    13.1 User Security Measures The NVIDIA DGX A100 system is a specialized server designed to be deployed in a data center. It must be configured to protect the hardware from unauthorized access and unapproved use. The DGX A100 system is designed with a dedicated BMC Management Port and multiple Ethernet network ports.
  • Page 81: Signing

    13.3 Secure Data Deletion This section explains how to securely delete data from the NVIDIA DGX A100 system SSDs to permanently destroy all the data that was stored there. This performs a more secure SSD data deletion than merely deleting files or reformatting the SSDs.
  • Page 82 $ dpkg -i /usr/lib/live/mount/rootfs/filesystem.squashfs/curtin/repo/nvme-cli_1.9-1ubuntu0.1_amd64.deb
    Run nvme format -s1 on all storage devices listed.
    Syntax:
    $ nvme format -s1 <device-path>
    where <device-path> is the specific storage node as listed in the previous step; for example, /dev/nvme0n1.
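    Because `nvme format -s1` destroys all data on the target, it helps to generate the command list from the device list and review it before running anything. A minimal Bash sketch that reads device paths on stdin, keeps only whole-namespace nodes of the form /dev/nvmeXnY (so partitions such as /dev/nvme0n1p1 are skipped), and prints one command per device; it executes nothing:

```shell
# Print one "nvme format -s1" command per NVMe namespace path read on stdin.
# Lines that are not /dev/nvmeXnY (for example partitions) are ignored.
nvme_format_plan() {
    local dev
    while read -r dev; do
        [[ "$dev" =~ ^/dev/nvme[0-9]+n[0-9]+$ ]] || continue
        echo "nvme format -s1 $dev"
    done
}

# Usage: feed it the device nodes from the previous step (for example the
# Node column of "nvme list"), one per line, and review the result.
#   printf '%s\n' /dev/nvme0n1 /dev/nvme1n1 | nvme_format_plan
```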
  • Page 83: Chapter 14. Redfish APIs Support

    Redfish is a web-based management protocol, and the Redfish server is integrated into the DGX A100 BMC firmware. By default, Redfish support is enabled in the DGX A100 BMC and the BIOS. By using the Redfish interface, administrator-privileged users can browse physical resources at the chassis and system level through a web-based user interface.
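    The Redfish interface can also be queried from a script. A minimal Bash sketch using `curl` against `/redfish/v1/`, the service root defined by the DMTF Redfish standard; `Systems` and `Chassis` are standard Redfish collections. The BMC address and user name are placeholders, and `-k` (skip certificate verification) is appropriate only while the BMC still uses a self-signed certificate:

```shell
# Build a Redfish URL under the standard service root /redfish/v1/.
redfish_url() {
    local host="$1" path="${2:-}"
    echo "https://${host}/redfish/v1/${path}"
}

# GET a Redfish resource. curl prompts for the password because only the user
# name is given with -u. -k skips TLS verification (self-signed certificate only).
redfish_get() {
    local host="$1" user="$2" path="${3:-}"
    curl -sk -u "$user" "$(redfish_url "$host" "$path")"
}

# Usage (placeholder BMC address and user):
#   redfish_get 192.0.2.10 admin            # service root
#   redfish_get 192.0.2.10 admin Systems    # computer systems collection
#   redfish_get 192.0.2.10 admin Chassis    # chassis collection
```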
  • Page 84 Redfish Schema 2019.1
    For a list of the known issues and limitations with Redfish support that are specific to the firmware version you are running, refer to the DGX A100 System Firmware Update Container Release Notes.
  • Page 85: Appendix A. Installing Software On Air-Gapped DGX A100 Systems

    CAUTION: This process destroys all data and software customizations that you have made on the DGX A100 System. Be sure to back up any data that you want to preserve and push any Docker images that you want to keep to a trusted registry.
  • Page 86 The procedure below describes how to download all the necessary packages to create a mirror of the repositories that are needed to update NVIDIA DGX A100 systems. For more information on DGX OS versions and the release notes available, go to DGX OS Server Releases.
  • Page 87 The target directory must be owned by the user apt-mirror or the replication will not work. Configure the path of the destination directory in /etc/apt/mirror.list and use the included list of repositories below to retrieve the packages for both Ubuntu base OS and the NVIDIA DGX OS packages:...
  • Page 88 The instructions in this section are to be performed on the target air-gapped DGX system. Prerequisites: The target DGX A100 system is installed, has gone through the first boot process, and is ready to be updated with the latest packages.
  • Page 89 If present, remove the file /etc/apt/sources.list.d/docker.list; it is no longer needed, and removing it eliminates error messages during the update process. Configure apt to use the NVIDIA DGX OS packages in the file /etc/apt/sources.list.d/dgx-bionic-r450-cuda11-0-repo.list.
    $ echo "deb file:///media/usb/repository/mirror/international.download.nvidia.com/dgx/...
  • Page 90 250GB. An efficient way to move large amounts of data; for example, shared storage in a DMZ, or portable USB drives that can be brought into the air-gapped area.
  • Page 91 Configure the path of the destination directory in /etc/apt/mirror.list and use the included list of repositories below to retrieve the packages for both the Ubuntu base OS and the NVIDIA DGX OS packages:
    ############# config ##################
    # set base_path /media/usb/repository...
  • Page 92 The instructions in this section are to be performed on the target air-gapped DGX system. Prerequisites: The target DGX A100 system is installed, has gone through the first boot process, and is ready to be updated with the latest packages.
  • Page 93 Configure apt to use the NVIDIA DGX OS packages in the file /etc/apt/sources.list.d/dgx.list:
    deb file:///media/usb/repository/mirror/repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/ focal main dgx
    deb file:///media/usb/repository/mirror/repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/ focal-updates main dgx
    Configure apt to use the NVIDIA CUDA packages in the /etc/apt/sources.list.d/cuda-compute-repo.list file.
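    The DGX OS entries can be generated from the mirror location instead of typed by hand. A minimal Bash sketch; the default base path is the /media/usb/repository/mirror location used above, and each entry starts with `deb`, as apt's one-line sources format requires:

```shell
# Print the apt entries for the DGX OS packages in a local mirror.
# Default base path: the /media/usb/repository/mirror location used in this guide.
dgx_list_entries() {
    local base="${1:-/media/usb/repository/mirror}"
    local repo="file://${base}/repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/"
    echo "deb ${repo} focal main dgx"
    echo "deb ${repo} focal-updates main dgx"
}

# Usage on the air-gapped system:
#   dgx_list_entries | sudo tee /etc/apt/sources.list.d/dgx.list
#   sudo apt update
```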
  • Page 94 Installing Docker Containers
    This method applies to Docker containers hosted on the NVIDIA NGC Container Registry, and requires that you have an active NGC account. On a system with internet access, log in to the NGC Container Registry by entering the following command and credentials.
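    Container images are typically moved into an air-gapped environment as tar archives. A minimal Bash sketch using the standard `docker pull`, `docker save -o`, and `docker load -i` commands; the tar naming helper is hypothetical, and the image name in the usage comment is illustrative. For NGC, the `docker login nvcr.io` user name is the literal string `$oauthtoken` and the password is your NGC API key:

```shell
# Hypothetical helper: turn an image reference into a tar file name,
# replacing "/" and ":" with "_".
image_tarball() {
    local ref="$1"
    ref="${ref//\//_}"
    echo "${ref//:/_}.tar"
}

# On the system with internet access (after "docker login nvcr.io"):
save_image() {
    docker pull "$1" && docker save -o "$(image_tarball "$1")" "$1"
}

# On the air-gapped DGX system, after copying the tar file into the current directory:
load_image() {
    docker load -i "$(image_tarball "$1")"
}

# Usage (illustrative image name):
#   save_image nvcr.io/nvidia/pytorch:22.04-py3
#   load_image nvcr.io/nvidia/pytorch:22.04-py3
```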
  • Page 95: Appendix B. Safety

    Indicates the presence of a hazard that may result in serious personal injury if the WARNING is ignored.
    Indicates a potential hazard if the indicated information is ignored.
    Indicates shock hazards that result in serious injury or death if safety instructions are not followed.
  • Page 96 Provided with a properly grounded wall outlet.
    Provided with sufficient space to access the power supply cord(s), because they serve as the product's main power disconnect.
  • Page 97 To avoid risk of electric shock, turn off the server and disconnect the power cords, telecommunications systems, networks, and modems attached to the server before opening it.
  • Page 98 Power down the server and disconnect all power cords before adding or replacing any non-hot-plug component. When replacing a hot-plug power supply, unplug the power cord to the power supply being replaced before removing the power supply from the server. ...
  • Page 99 Circuit Overloading: Consideration should be given to the connection of the equipment to the supply circuit and the effect that overloading of the circuits might have on overcurrent protection and supply wiring. Appropriate consideration of equipment nameplate ratings should be used when addressing this concern. ...
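The nameplate check described above is simple arithmetic: add the rated input current of every device on the circuit, then compare the total against the current the overcurrent protection can carry continuously. Below is a minimal Python sketch. It assumes an 80% continuous-load limit, a common derating convention (for example, under the US National Electrical Code); local codes may differ. The current values are hypothetical, not DGX A100 ratings.

```python
def circuit_headroom(nameplate_amps: list[float], breaker_amps: float,
                     continuous_factor: float = 0.8) -> float:
    """Return remaining current capacity in amps; a negative value means overload.

    nameplate_amps: rated input current of each device on the circuit.
    breaker_amps: rating of the circuit's overcurrent protection.
    continuous_factor: assumed fraction of the breaker rating usable for
    continuous loads (0.8 is an assumption; check the local electrical code).
    """
    limit = breaker_amps * continuous_factor
    return limit - sum(nameplate_amps)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    loads = [10.0, 4.0]
    headroom = circuit_headroom(loads, breaker_amps=20.0)
    print(f"headroom: {headroom:.1f} A", "OK" if headroom >= 0 else "OVERLOADED")
```

With these hypothetical loads, a 20 A circuit leaves 2.0 A of headroom under the assumed 80% limit. Adding another 5 A device would push it into overload.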
  • Page 100 NICKEL NVIDIA Bezel. The bezel’s decorative metal foam contains some nickel. The metal foam is not intended for direct and prolonged skin contact. Please use the handles to remove, attach or carry the bezel. While nickel exposure is unlikely to be a problem, you should be aware of the possibility in case you’re susceptible to nickel-related reactions.
  • Page 101 Access is through the use of a TOOL or lock and key, or other means of security, and is controlled by the authority responsible for the location. ...
  • Page 102: Appendix C. Compliance

    Appendix C. Compliance The NVIDIA DGX A100 Server is compliant with the regulations listed in this section. United States Federal Communications Commission (FCC) FCC Marking (Class A) This device complies with part 15 of the FCC Rules. Operation is subject to the following two...
  • Page 103 The full text of the EU declaration of conformity is available at the following internet address: www.nvidia.com/support. A copy of the Declaration of Conformity to the essential requirements may be obtained directly from NVIDIA GmbH (Bavaria Towers – Blue Tower, Einsteinstrasse 172, D-81677 Munich, Germany). Australia and New Zealand Australian Communications and Media Authority This product meets the applicable EMC requirements for Class A, I.T.E equipment...
  • Page 104 A Japanese regulatory requirement, defined by specification JIS C 0950:2008, mandates that manufacturers provide Material Content Declarations for certain categories of electronic products offered for sale after July 1, 2006. To view the JIS C 0950 material declaration for this product, visit ...
  • Page 105 Declarations for certain categories of electronic products offered for sale after July 1, 2006. Product Model Number: P3687 Server Symbols of Specified Chemical Substance Major Classification Cr(VI) PBDE Chassis Exempt Exempt Processor Exempt Motherboard Exempt Power supply Exempt ...
  • Page 106 Class A Equipment (Industrial Broadcasting & Communication Equipment). This equipment is Industrial (Class A) electromagnetic wave suitability equipment, and the seller or user should take notice of this. This equipment is intended for use in places other than the home. ...
  • Page 107 Korea RoHS Material Content Declaration ...
  • Page 109 China China Compulsory Certificate No certification is needed for China, because the NVIDIA DGX A100 is a server with power consumption greater than 1.3 kW. China RoHS Material Content Declaration 产品中有害物质的名称及含量 The Table of Hazardous Substances and their Content 根据中国《电器电子产品有害物质限制使用管理办法》 (In accordance with China's Administrative Measures for the Restriction of the Use of Hazardous Substances in Electrical and Electronic Products)...
  • Page 110 All parts named in this table with an “X” are in compliance with the European Union’s RoHS Legislation. Note: The referenced Environmental Protection Use Period Marking was determined according to normal operating use conditions of the product such as temperature and humidity. ...
  • Page 111 C.10 Taiwan Bureau of Standards, Metrology & Inspection (BSMI) Taiwan RoHS Material Content Declaration ...
  • Page 112 Federal Agency of Communications (FAC) This device complies with the rules set forth by the Federal Agency of Communications and the Ministry of Communications and Mass Media. Federal Security Service notification has been filed. C.12 Israel ...
  • Page 113 SANS 2332:2017/CISPR 32:2015 SANS 2335:2018/CISPR 35:2016 National Regulator for Compulsory Specifications (NRCS) This device complies with the following standard under VC 8055: SANS IEC 60950-1 C.15 Great Britain (England, Wales, and Scotland) UK Conformity Assessed ...
  • Page 114 Electronic Equipment (As Amended) A copy of the Declaration of Conformity to the essential requirements may be obtained directly from NVIDIA Ltd. (100 Brook Drive, 3rd Floor Green Park, Reading RG2 6UJ, United Kingdom). ...
  • Page 115 NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.
