Nvidia DGX A100 User Manual
NVIDIA DGX A100 User Guide
DU-09821-001 _v01 | November 2022


Questions and answers

Стретьев
February 28, 2025

A 3D fashion designer wants to install this NVIDIA A100 80GB PCIe card in a regular home PC. After reviewing the documentation and images, we found no way to connect the card to a monitor. Please send instructions or indicate where to download them.


Summary of Contents for Nvidia DGX A100

  • Page 1 NVIDIA DGX A100 User Guide DU-09821-001 _v01   |   November   2022...
  • Page 2: Table Of Contents

    Table of Contents Chapter 1. Introduction to the NVIDIA DGX A100 System........... 1 1.1.  Hardware Overview........................1 1.1.1. DGX A100 Models and Component Descriptions..............2 1.1.2. Mechanical Specifications....................3 1.1.3. Power Specifications......................4 1.1.3.1. Support for N+N Redundancy..................4 1.1.4. DGX A100 Locking Power Cord Specification..............4 1.1.5. Using the Locking Power Cords..................5 1.1.6. Environmental Specifications....................
  • Page 3 Chapter 5. Additional Features and Instructions.............. 30 5.1. Managing the DGX Crash Dump Feature................30 5.1.1.  Using the Script........................30 5.1.2. Connecting to Serial Over LAN to View the Console............30 Chapter 6. Managing the DGX A100 Self-Encrypting Drives..........31 6.1.  Overview........................... 31 6.2. Installing the Software......................32 6.3. Configuring Trusted Computing.....................32 6.3.1. Determining Whether Drives Support SID..............
  • Page 4 9.1.1. Connectivity Requirements for Software Updates............50 9.1.2. Update Instructions......................51 9.2. Restoring the DGX A100 Software Image................51 9.2.1. Obtaining the DGX A100 Software ISO Image and Checksum File........ 51 9.2.2. Remotely Reimaging the System..................52 9.2.3. Creating a Bootable Installation Medium............... 53 9.2.3.1. Creating a Bootable USB Flash Drive by Using the dd Command......54 9.2.3.2. Creating a Bootable USB Flash Drive by Using Akeo Rufus........
  • Page 5 13.2.1.1.  Encryption.........................77 13.2.1.2.  Signing........................78 13.2.1.3. NVSM Security......................78 13.3. Secure Data Deletion......................78 13.3.1.  Prerequisites........................78 13.3.2.  Instructions........................78 Chapter 14. Redfish APIs Support..................80 14.1. Supported Redfish Features....................80 Appendix A. Installing Software on Air-Gapped DGX A100 Systems.........82 A.1. Installing NVIDIA DGX A100 Software..................82
  • Page 6 A.2. Reimaging the System......................82 A.3. Creating a Local Mirror of the NVIDIA and Canonical Repositories........83 A.3.1. Creating the Mirror in a DGX OS 4 System..............83 A.3.2. Configuring the Target Air-Gapped DGX OS 4 System...........85 A.3.3. Configuring the Target Air-Gapped DGX OS 5 System...........87 A.4. Installing Docker Containers....................
  • Page 7 Table 3. Mechanical Specifications ....................3 Table 4. Power Specifications ......................4 Table 5. Motherboard Controls ......................9 Table 6. Network Port Mapping ....................... 12 Table 7. Open Ports .......................... 42 Table 8. BMC Main Controls ......................61 Table 9. BMC Main Controls ......................66
  • Page 9: Chapter 1. Introduction To The Nvidia Dgx A100 System

    The NVIDIA DGX A100 System is the universal system purpose-built for all AI infrastructure and workloads, from analytics to training to inference. The system is built on eight NVIDIA A100 Tensor Core GPUs. This document is for users and administrators of the DGX A100 system.
  • Page 10: Dgx A100 Models And Component Descriptions

    1.1.1.  DGX A100 Models and Component Descriptions There are two models of the NVIDIA DGX A100 system: the NVIDIA DGX A100 640GB system and the NVIDIA DGX A100 320GB system. Model Differentiation Table 1.
  • Page 11: Mechanical Specifications

    Introduction to the NVIDIA DGX A100 System Component Description Table 2. Component Description Component Description NVIDIA A100 GPU 2x AMD EPYC 7742 CPU w/64 cores NVSwitch 600 GB/s GPU-to-GPU bandwidth Storage (OS) 1.92 TB NVMe M.2 SSD (ea) in RAID 1 array Storage (Data Cache) 3.84 TB NVMe U.2 SED (ea) in RAID 0 array...
  • Page 12: Power Specifications

    1.1.4.  DGX A100 Locking Power Cord Specification The DGX A100 is shipped with a set of six (6) locking power cords that have been qualified for use with the DGX A100 to ensure regulatory compliance. The following locking power cord types are approved: ‣...
  • Page 13: Using The Locking Power Cords

    Power Cord
    Feature: Specification
    Plug Standard: C19/C20
    Dimension: 1200mm length
    Compliance: Cord: UL62, IEC60227; Connector/Plug: IEC60320-1
    1.1.5.  Using the Locking Power Cords This section provides information about how to use the locking power cords.
  • Page 14 Locking/Unlocking the PSU Side (Cords with Twist-Lock Mechanism) Power Supply (System) side - Twist locking ‣ To INSERT or REMOVE make sure the cable is UNLOCKED and push/pull into/out of the socket.
  • Page 15: Environmental Specifications

    Front Panel Connections and Controls This section provides information about the front panel, connections, and controls of the DGX A100 system. 1.1.7.1.  With a Bezel Here is an image of the DGX A100 system with a bezel.     Control...
  • Page 16: With The Bezel Removed

    Turning DGX A100 On and Off for instructions on how to properly turn the system on or off. 1.1.8.  Rear Panel Modules Here is an image that shows the rear panel modules on DGX A100.
  • Page 17: Motherboard Connections And Controls

    1.1.9.  Motherboard Connections and Controls Here is an image that shows the motherboard connections and controls in a DGX A100 system. Table 5. Motherboard Controls Control Description Power Button Press to turn the system On or Off.
  • Page 18: Motherboard Tray Components

    1.1.10.  Motherboard Tray Components Here is an image that shows the motherboard tray components in DGX A100. 1.1.11.  GPU Tray Components Here is an image of the GPU tray components in a DGX A100 system.
  • Page 19: Network Connections, Cables, And Adaptors

    1.2.  Network Connections, Cables, and Adaptors This section provides information about network connections, cables, and adaptors. 1.2.1.  Network Ports Here is an image that shows the network ports on a DGX A100 system.
  • Page 20: Supported Network Cables And Adaptors

    1.2.2.  Supported Network Cables and Adaptors The DGX A100 system is not shipped with network cables or adaptors. You will need to purchase supported cables or adaptors for your network. The ConnectX-6 or ConnectX-7 firmware determines which cables and adaptors are supported.
  • Page 21: Dgx A100 System Topology

    Here is an image of the DGX A100 system topology.     1.4.  DGX OS Software The DGX A100 system comes pre-installed with a DGX software stack incorporating the following components: ‣ An Ubuntu server distribution with supporting packages. ‣...
  • Page 22: Additional Documentation

    Provides active health monitoring and system alerts for NVIDIA DGX nodes in a data center. It also provides simple commands for checking the health of the DGX A100 system from the command line.
  • Page 23: Chapter 2. Connecting To The Dgx A100

    DGX OS Server software installs Docker Engine which uses the 172.17.xx.xx sub-net by default for Docker containers. If the DGX A100 system is on the same sub-net, you will not be able to establish a network connection to the DGX A100 system.
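The subnet conflict described above can be checked before connecting. The following is a minimal sketch, assuming the Docker bridge still uses the default 172.17.0.0/16 range; `in_docker_default_subnet` is a hypothetical helper and not part of DGX OS.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 when the IPv4 address lies in 172.17.0.0/16,
# the default Docker bridge range named in the text; returns 1 otherwise.
in_docker_default_subnet() {
    case "$1" in
        172.17.*.*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Placeholder address; replace it with the address planned for the system.
if in_docker_default_subnet "172.17.4.20"; then
    echo "conflict: address overlaps the default Docker bridge range"
else
    echo "ok: no overlap with 172.17.0.0/16"
fi
```

The check is a plain prefix match, so it only applies while the Docker range is left at its default.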
  • Page 25: Remote Connection Through The Bmc

    2.1.2.  Remote Connection through the BMC Here is some information about how you can remotely connect to DGX A100 through the BMC. Note: BMC Security NVIDIA recommends that customers follow best security practices for BMC management (IPMI port). These include, but are not limited to, such measures as: ‣...
  • Page 26 ‣ Password: <bmc-password> 1. Make sure you have connected the BMC port on the DGX A100 system to your LAN. 2. Open a browser within your LAN and go to https://<bmc-ip-address>/ 3. Make sure popups are allowed for the BMC address.
  • Page 27: Ssh Connection To The Os

    6. Click Launch KVM. The DGX A100 console appears in your browser. 2.2.  SSH Connection to the OS Here is some information about how you can connect to the OS by using SSH.
  • Page 28: Chapter 3. First Boot Setup

    Chapter 3. First Boot Setup This section provides information about the setup process after you first boot the DGX A100 system. While NVIDIA partner network personnel or NVIDIA field service engineers will install the DGX A100 system at the site and perform the first boot setup, the first boot setup instructions are provided here for reference and to support any reimaging of the server.
  • Page 29 3. If the DGX OS was installed with an encrypted root filesystem, you will be prompted to unlock the drive. 4. Enter “nvidia3d” at the crypt: prompt. 5. You are presented with end user license agreements (EULAs) for the NVIDIA software. 6. Accept the EULA to proceed with the installation.
  • Page 30 This step appears only if you installed the system with an encrypted root filesystem during DGX OS installation. i). Choose a primary network interface for the DGX A100 system; for example, enp226s0. This should typically be the interface that you will use for subsequent system configuration or in-band management.
  • Page 31: Post Setup Tasks

    If prompted, fill in requested networking information, such as name server or domain name. k). Choose a host name for the DGX A100 system. After completing the setup process, the DGX A100 system reboots automatically and then presents the login prompt. 3.2. ...
  • Page 32: Chapter 4. Quick Start And Basic Operation

    4.1.  Installation and Configuration Before you install DGX A100, ensure you have given all relevant site information to your Installation Partner. Important: Your DGX A100 System must be installed by NVIDIA partner network personnel or NVIDIA field service engineers. If not performed accordingly, your DGX A100 hardware warranty will be voided.
  • Page 33: Startup Considerations

    4.4.1.  Startup Considerations To keep your DGX A100 running smoothly, allow up to a minute of idle time after reaching the login prompt. This ensures that all components can complete their initialization. 4.4.2.  Shutdown Considerations...
  • Page 34: Running The Pre-Flight Test

    4.7.  Running NGC Containers with GPU Support To obtain the best performance when running NGC containers on DGX A100 systems, the following methods of providing GPU support for Docker containers are available: ‣ Native GPU support (included in Docker 19.03 and later) ‣...
  • Page 35: Using Native Gpu Support

    ‣ Use docker run with nvidia as the default runtime. You can set nvidia as the default runtime, for example, by adding the following line to the /etc/docker/daemon.json configuration file as the first entry. "default-runtime": "nvidia", Here is an example of how the added line appears in the JSON file.
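To confirm that the entry is in place, a quick check along these lines may help. This is a sketch under assumptions: `has_nvidia_default_runtime` is a hypothetical helper, and the `runtimes` section in the scratch file is an assumption about a typical configuration, not text taken from the guide.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the given Docker daemon config file
# contains the "default-runtime": "nvidia" entry quoted in the text.
# A plain grep is used for brevity; it does not validate the JSON.
has_nvidia_default_runtime() {
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$1"
}

# Demonstrate against a scratch file rather than /etc/docker/daemon.json.
# The "runtimes" entry below is an assumption about a typical file.
demo_file=$(mktemp)
cat > "$demo_file" <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF

if has_nvidia_default_runtime "$demo_file"; then
    echo "nvidia is the default runtime"
fi
rm -f "$demo_file"
```

On a real system the helper would be pointed at /etc/docker/daemon.json, and Docker would need a restart after any change to that file.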
  • Page 36: Managing Cpu Mitigations

    CPU mitigations are disabled if the output consists of multiple lines prefixed with Vulnerable.
    Example:
    KVM: Vulnerable
    Mitigation: PTE Inversion; VMX: vulnerable
    Vulnerable; SMT vulnerable
    Vulnerable
    Vulnerable
    Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
    Vulnerable, IBPB: disabled, STIBP: disabled
    Vulnerable
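The rule above can be turned into a small check. The sketch below is a heuristic: `mitigation_state` is a hypothetical helper, and treating "two or more lines that begin with Vulnerable" as the meaning of "multiple lines" is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: prints "disabled" when at least two lines across the
# files in the given directory begin with the word Vulnerable, otherwise
# prints "enabled". The threshold of two is an assumption.
mitigation_state() {
    count=$(cat "$1"/* 2>/dev/null | grep -c '^Vulnerable' || true)
    if [ "$count" -ge 2 ]; then
        echo "disabled"
    else
        echo "enabled"
    fi
}

# The directory is the one read by the cat command shown in the guide.
mitigation_state /sys/devices/system/cpu/vulnerabilities
```

Passing the directory as an argument keeps the helper easy to try against a scratch directory of sample files.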
  • Page 37: Disabling Cpu Mitigations

    2. Reboot the system. 3. Verify CPU mitigations are enabled. cat /sys/devices/system/cpu/vulnerabilities/* The output should include several lines. See Determining the CPU Mitigation State of the DGX System for example output.
  • Page 38: Chapter 5. Additional Features And Instructions

    Chapter 5. Additional Features and Instructions This chapter describes specific features of the DGX A100 server to consider during setup and operation. 5.1.  Managing the DGX Crash Dump Feature The DGX OS includes a script to manage this feature. 5.1.1.  Using the Script This section provides information about how to use the script to manage DGX crash dumps.
  • Page 39: Chapter 6. Managing The Dgx A100 Self-Encrypting Drives

    The NVIDIA DGX OS software supports the ability to manage self-encrypting drives (SEDs), including setting an Authentication Key for locking and unlocking the drives on NVIDIA DGX A100 systems. You can manage only the SED data drives. The software cannot be used to manage OS drives even if they are SED-capable.
  • Page 40: Installing The Software

    Configuring Trusted Computing Here is some information about the controls that are required to configure Trusted Computing (TC). The DGX A100 system BIOS provides setup controls for configuring the following TC features: ‣ Trusted Platform Module The NVIDIA DGX A100 incorporates Trusted Platform Module 2.0 (TPM 2.0) which can be enabled from the system BIOS and used in conjunction with the nv-disk-encrypt tool.
  • Page 41: Enabling The Tpm And Preventing The Bios From Sending Block Sid Requests

    This section provides instructions to enable the TPM and prevent the SBIOS from sending Block SID requests. Each task is independent, so you can select which task to complete. 1. Reboot the DGX A100, then press [Del] or [F2] at the NVIDIA splash screen to enter the BIOS Setup.
  • Page 42: Enabling Drive Locking

    /etc/nv-disk-encrypt/.dgxenc.salt each drive password. Salt values are characters added to a password for enhanced password security. NVIDIA strongly recommends using this option for best security, otherwise the software will use a default salt value instead of a randomly generated one.
  • Page 43: Determining Which Drives Can Be Managed As Self-Encrypting

    6.6.1.1.  Determining Which Drives Can be Managed as Self-Encrypting Here is some information about how you can determine which drives can be managed as self-encrypting. Review the storage layout of the DGX system to determine which drives are eligible to be managed as SEDs.
  • Page 44: Creating The Drive/Password Mapping Json Files And Using It To Initialize The System

    Alternatively, you can specify that the output be presented in JSON format by using the -j option. $ sudo nv-disk-encrypt info -j In this case, drives that can be used for encryption are indicated by the following: "sed_capable": true "used_for_boot": false...
  • Page 45: Example 3: Specifying Passwords One At A Time When Prompted

    Here is some information about how you can erase your data. WARNING: Be aware when executing this that all data will be lost. On DGX A100 systems, these drives generally form a RAID 0 array, and this array will also be destroyed when you perform an erase.
  • Page 46: Clearing The Tpm

    TPM is to clear the TPM's contents. After clearing the TPM, you will need to re-initialize the vault and SED authentication keys. 1. Reboot the DGX A100, then press [Del] or [F2] at the NVIDIA splash screen to enter the BIOS Setup.
  • Page 47: Recovering From Lost Keys

    6.12.  Recovering From Lost Keys NVIDIA recommends backing up your keys and storing them in a secure location. If you’ve lost the key used to initialize and lock your drives, you will not be able to unlock the drive again. If this happens, the only way to recover is to perform a factory-reset, which will result in a loss of data.
  • Page 48: Chapter 7. Network Configuration

    Chapter 7. Network Configuration This chapter describes key network considerations and instructions for the DGX A100 System. 7.1.  Configuring Network Proxies If your network requires use of a proxy server, you will need to set up configuration files to ensure the DGX A100 System communicates through the proxy.
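As one illustration of such a configuration file, the sketch below prints proxy directives for apt. `Acquire::http::Proxy` and `Acquire::https::Proxy` are standard apt options; the helper name `apt_proxy_lines`, the placeholder proxy URL, and the choice of destination file are assumptions rather than values from the guide.

```shell
#!/bin/sh
# Hypothetical helper: prints apt proxy directives for the given proxy URL.
# The destination file under /etc/apt/apt.conf.d/ is left to the administrator.
apt_proxy_lines() {
    echo "Acquire::http::Proxy \"$1\";"
    echo "Acquire::https::Proxy \"$1\";"
}

# proxy.example.com:3128 is a placeholder, not a value from the guide.
apt_proxy_lines "http://proxy.example.com:3128/"
```

The output can be reviewed first and then redirected into a file of the administrator's choosing.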
  • Page 49: For Docker

    IP addresses are used by your network. If your network does not conflict with the default Docker IP address range, no changes are needed, and you can skip this section. However, if your network uses the addresses within this range for the DGX A100 system, you should change the default Docker network addresses.
  • Page 50: Open Ports

    If port 443 is proxied through a corporate firewall, WebSocket protocol traffic must be supported. 7.4.  Connectivity Requirements for NGC Containers To run NVIDIA NGC containers from the NGC container registry, your network must be able to access the following URLs: ‣ http://archive.ubuntu.com/ubuntu/ ‣ http://security.ubuntu.com/ubuntu/ ‣...
  • Page 51: Configuring A Static Ip Address For The Bmc

    This section describes how to set a static IP address for the BMC from the Ubuntu command line. Note: If you cannot access the DGX A100 System remotely, then connect a display (1440x900 or lower resolution) and keyboard directly to the DGX A100 system.
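A command-line sequence for this task might look like the sketch below. It only prints the commands so they can be reviewed first; `bmc_static_ip_plan` is a hypothetical helper, LAN channel 1 is an assumption, and the addresses are documentation placeholders.

```shell
#!/bin/sh
# Hypothetical helper: prints the ipmitool commands that would assign a
# static address to the BMC. LAN channel 1 is an assumption; nothing is run.
bmc_static_ip_plan() {
    ip=$1
    netmask=$2
    gateway=$3
    echo "sudo ipmitool lan set 1 ipsrc static"
    echo "sudo ipmitool lan set 1 ipaddr $ip"
    echo "sudo ipmitool lan set 1 netmask $netmask"
    echo "sudo ipmitool lan set 1 defgw ipaddr $gateway"
}

# Documentation-range addresses used purely as placeholders.
bmc_static_ip_plan 192.0.2.10 255.255.255.0 192.0.2.1
```

Printing instead of executing avoids cutting off remote access to the BMC with a mistyped address.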
  • Page 52: Configuring A Bmc Static Ip Address By Using The System Bios

    DGX A100 System remotely, and this process involves setting the BMC IP address during system boot. 1. Connect a keyboard and display (1440 x 900 maximum resolution) to the DGX A100 System and turn on the DGX A100 System.
  • Page 53: Switching Between Infiniband And Ethernet

    Switching Between InfiniBand and Ethernet The NVIDIA DGX A100 System is equipped with up to eight NVIDIA ConnectX-6 or ConnectX-7 single-port network cards on the I/O board, typically used for cluster communications. By default, these are configured as InfiniBand ports, but you have the option to convert these to Ethernet ports.
  • Page 54: Starting The Mellanox Software Tools And Determining The Current Port Configuration

    Here is an example to set slot 0 to Ethernet:
    $ sudo mlxconfig -y -d /dev/mst/mt4123_pciconf2 set LINK_TYPE_P1=2
    Here is an example that sets slot 1 to InfiniBand:
    $ sudo mlxconfig -y -d /dev/mst/mt4123_pciconf3 set LINK_TYPE_P1=1
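The two examples differ only in the device path and the LINK_TYPE_P1 value. A small wrapper such as the following sketch can make that mapping explicit; `link_type_cmd` is a hypothetical helper that prints the command without running it, and the mapping (1 for InfiniBand, 2 for Ethernet) is taken from the examples above.

```shell
#!/bin/sh
# Hypothetical helper: prints the mlxconfig command that would switch a
# port to the requested link type, without running it.
# Mapping taken from the examples in the text: 1 = InfiniBand, 2 = Ethernet.
link_type_cmd() {
    device=$1
    case "$2" in
        ib|infiniband) value=1 ;;
        eth|ethernet)  value=2 ;;
        *) echo "unknown link type: $2" >&2; return 1 ;;
    esac
    echo "sudo mlxconfig -y -d $device set LINK_TYPE_P1=$value"
}

link_type_cmd /dev/mst/mt4123_pciconf2 eth
link_type_cmd /dev/mst/mt4123_pciconf3 ib
```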
  • Page 55: Chapter 8. Configuring Storage

    NFS storage for long-term data storage. The instructions in this section describe how to mount the NFS on the DGX A100 System and how to cache the NFS using the DGX A100 SSDs for improved performance.
  • Page 56: Setting Filesystem Quotas

    5, the total storage capacity of the RAID array is reduced. Before you change the RAID level of the DGX A100 RAID array, back up all data on the array that you want to preserve. Changing the RAID level of the DGX A100 RAID array erases all data stored on the array.
  • Page 57: Configuring Support For Custom Drive Partitioning

    Configuring Support for Custom Drive Partitioning DGX A100 systems incorporate data drives configured as RAID 0 by default. You can alter the default configuration by adding or removing drives, or by switching between a RAID 0 configuration and a RAID 5 configuration.
  • Page 58: Chapter 9. Updating And Restoring The Software

    Chapter 9. Updating and Restoring the Software This section provides information about how to update or restore software on your DGX A100 system. 9.1.  Updating the DGX A100 Software You must register your DGX A100 system to receive email notification whenever a new software update is available.
  • Page 59: Update Instructions

    Restoring the DGX A100 Software Image If the DGX A100 software image becomes corrupted or the OS SSD was replaced after a failure, restore the DGX A100 software image to its original factory condition from a pristine copy of the image.
  • Page 60: Remotely Reimaging The System

    Click OK at the Power Control dialogs, then wait for the system to power down and then come back online. c). As the system boots, press [F11] when the NVIDIA logo appears to get to the boot menu. d). Browse to locate the Virtual CD that corresponds to the inserted ISO, then boot the system from it.
  • Page 61: Creating A Bootable Installation Medium

    f). Press Enter. The DGX A100 system will reboot from ISO image and proceed to install the image. This can take approximately 15 minutes. Note: The Mellanox InfiniBand driver installation can take up to 30 minutes, depending on how many cards undergo a firmware update.
  • Page 62: Creating A Bootable Usb Flash Drive By Using The Dd Command

    The USB flash drive has a capacity of at least 16 GB. ‣ This requirement applies only to DGX A100: The partition scheme on the USB flash drive is a GPT partition scheme for UEFI. 1. Plug the USB flash drive into one of the USB ports of your Linux system.
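The dd-based write can be rehearsed on scratch files before touching a real device. The following is a minimal sketch: `write_image` is a hypothetical helper, the 2048-byte block size is an assumption, and the demonstration copies a placeholder file rather than a DGX ISO.

```shell
#!/bin/sh
# Hypothetical helper: copies an image to a target with dd and flushes
# pending writes. On a real system the target would be the USB device node
# and the command would need root privileges.
write_image() {
    dd if="$1" of="$2" bs=2048 2>/dev/null && sync
}

# Rehearse with scratch files so that no device is overwritten.
scratch=$(mktemp -d)
printf 'placeholder image contents\n' > "$scratch/image.iso"
write_image "$scratch/image.iso" "$scratch/usb.img" && echo "image written"
rm -rf "$scratch"
```

Because dd overwrites its target without confirmation, the device name should be double-checked (for example with lsblk) before running the real command.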
  • Page 63: Creating A Bootable Usb Flash Drive By Using Akeo Rufus

    In Cluster Size, select 4096 bytes (Default). 5. Click Start. Because the image is a hybrid ISO file, you are prompted to select whether to write the image in ISO Image (file copy) mode or DD Image (disk image) mode.
  • Page 64: Reimaging The System From A Usb Flash Drive

    RAID disks. Since the RAID array on the DGX A100 system is intended to be used as a cache and not for long-term data storage, this should not be disruptive. However, if you are an advanced user and have set up the disks for a non-cache purpose and want to keep the data on those drives, then select the Install DGX Server without formatting RAID option at the boot menu during the boot installation.
  • Page 65: Advanced Installation Option (Encrypted Root - Dgx Os 5 Or Later)

    When booting into the live environment, log in as root (a password is not needed). In a normal operation, this option should not be selected.
  • Page 66: Check Disc For Defects (Dgx Os 5 Or Later)

    It is time consuming, and the installation media generally is not the real source of the problem. In a normal operation, this option should not be selected.
  • Page 67: Chapter 10. Using The Bmc

    10.1.  Connecting to the BMC Here are the steps to connect to the BMC on a DGX A100 system. Before you begin, ensure that you have connected the BMC port on the DGX A100 system to your LAN. https://<bmc-ip-address>/ 1.
  • Page 68: Overview Of Bmc Controls

    10.2.  Overview of BMC Controls The left-side navigation menu bar on the BMC main page contains the primary controls.
  • Page 69 Displays inventory information of system modules. FRU Information System, Processor, Memory Controller, BaseBoard, Power, Thermal, PCIE Device, PCIE Function, and Storage. GPU Information Provides basic information on all the GPUs in the systems, including GUID, VBIOS version, InfoROM...
  • Page 70: Common Bmc Tasks

    SSL Settings, System Firewall, User Management, and Video Recording Remote Control Opens the KVM Launch page to remotely access the DGX A100 console. Power Control Perform the following power actions: Power On, Power Off, Power Cycle, Hard Reset, and ACPI Shutdown...
  • Page 71: Using The Remote Console

    10.3.2.  Using the Remote Console Here is some information about how to log in to the remote console. 1. Click Remote Control from the left-side navigation menu. 2. Click Launch KVM to start the remote KVM and access the DGX A100 console.    ...
  • Page 72: Setting Up Active Directory Or Ldap/E-Directory

    1. From the side navigation menu, click Settings > External User Services. 2. Click Active Directory Settings or LDAP/E-Directory Settings and follow the instructions. 10.3.4.  Configuring Platform Event Filters From the side navigation menu, click Settings and click Platform Event Filters.
  • Page 73: Uploading Or Generating Ssl Certificates

    You can set up a new certificate by generating a (self-signed) SSL certificate or by uploading an SSL certificate (for example, to use a Trusted CA-signed certificate). From the side navigation menu, click Settings > External User Services. Refer to the following sections for more information:
  • Page 74: Viewing The Ssl Certificate

    1. From the SSL Setting page, select Generate SSL Certificate. 2. Enter the information as described in the following table. Table 9. BMC Main Controls Items Description/Requirements Common Name (CN) The common name for which the certificate is to be generated.
  • Page 75 Special characters are not allowed. Email Address Email address of the organization (mandatory) Valid for Validity of the certificate. Key Length Enter a range from 1 to 3650 (days) 3. To generate the new certificate, click Save.
  • Page 76: Uploading The Ssl Certificate

    2. Copy the CA certificate onto a USB thumb drive or to /boot/efi on the A100 OS. 3. Access the DGX A100 console from a locally connected keyboard and mouse or through the BMC remote console. 4. Reboot the server 5.
  • Page 77 7. Select Server CA Configuration. 8. Select Enroll Cert.
  • Page 78 9. Select Enroll Cert Using File. 10. Select the device where you stored the certificate. 11. Navigate the file structure and select the certificate.
  • Page 80: Chapter 11. Sbios Settings

    Chapter 11. SBIOS Settings The NVIDIA DGX A100 system comes with a system BIOS with optimized settings for the DGX system. There might be situations where the settings need to be changed, such as changes in the boot order, changes to enable PXE booting, or changes in the BMC network settings.
  • Page 81: Configuring The Boot Order

    The following instructions describe how to set the boot order at boot time. You can also set the boot order from the SBIOS setup > Boot screen. 1. Access the DGX A100 console, either from a locally connected keyboard and mouse or through the BMC remote console.
  • Page 82: Configuring The Local Terminal To Access The Sbios Settings Screen

    Keyboard and Monitor, and the other is through Serial-over-LAN (SOL) protocol using the IPMI tools. Below are the instructions on how to configure a terminal with the correct settings to access the SBIOS configuration screens using SOL.
  • Page 83: If Using The Ipmi Sol Protocol

    3. Type xterm to launch the terminal with the set locale. 4. From within the new xterm, use ipmitool to connect to the DGX A100 SOL console: ipmitool -I lanplus -H {IP Address} -U admin -P dgxluna.admin sol activate For Windows or Macintosh users...
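Passing the password with -P leaves it visible in the shell history and process list. One alternative, sketched below, relies on the ipmitool -E option, which reads the password from the IPMI_PASSWORD environment variable. `sol_cmd` is a hypothetical helper that only prints the command, and the address is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: prints an ipmitool Serial-over-LAN command for the
# given BMC address and user. With -E, ipmitool reads the password from
# the IPMI_PASSWORD environment variable instead of the command line.
sol_cmd() {
    echo "ipmitool -I lanplus -H $1 -U $2 -E sol activate"
}

# 192.0.2.10 is a documentation-range placeholder for the BMC address.
sol_cmd 192.0.2.10 admin
```

Before running the printed command, IPMI_PASSWORD would need to be exported in the same shell session.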
  • Page 84: Chapter 12. Multi-Instance Gpu

    Chapter 12. Multi-Instance GPU Multi-Instance GPU (MIG) is a new capability of the NVIDIA A100 GPU. MIG uses spatial partitioning to carve the physical resources of an A100 GPU into up to seven independent GPU instances. These instances run simultaneously, each with its own memory, cache, and compute streaming multiprocessors.
  • Page 85: Chapter 13. Security

    This section provides information about security measures in the DGX A100 system. 13.1.  User Security Measures The NVIDIA DGX A100 system is a specialized server designed to be deployed in a data center. It must be configured to protect the hardware from unauthorized access and unapproved use.
  • Page 86: Signing

    Security. 13.3.  Secure Data Deletion This section explains how to securely delete data from the DGX A100 system SSDs to permanently destroy all the data that was stored there. This process performs a more secure SSD data deletion than merely deleting files or reformatting the SSDs.
  • Page 87
    $ dpkg -i /usr/lib/live/mount/rootfs/filesystem.squashfs/curtin/repo/nvme-cli_1.9-1ubuntu0.1_amd64.deb
    6. Run nvme format -s1 on all storage devices listed.
    $ nvme format -s1 <device-path>
    where <device-path> is the specific storage node as listed in the previous step. For example, /dev/nvme0n1.
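When several drives are listed, it can help to generate the full set of commands and review it before erasing anything. The sketch below does only that; `secure_erase_plan` is a hypothetical helper and the device paths are examples.

```shell
#!/bin/sh
# Hypothetical helper: prints one "nvme format -s1" command per device path
# given as an argument. Nothing is executed, because the real command
# irreversibly erases the drive.
secure_erase_plan() {
    for dev in "$@"; do
        echo "nvme format -s1 $dev"
    done
}

# Example device paths; use the ones listed on the actual system.
secure_erase_plan /dev/nvme0n1 /dev/nvme1n1
```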
  • Page 88: Chapter 14. Redfish Apis Support

    Redfish is a web-based management protocol, and the Redfish server is integrated into the DGX A100 BMC firmware. By default, Redfish support is enabled in the DGX A100 BMC and the BIOS. By using the Redfish interface, administrator-privileged users can browse physical resources at the chassis and system level through a web-based user interface.
  • Page 89 Redfish Schema 2019.1 Now Available For a list of the known issues and limitations with Redfish support that are specific to the firmware version you are running, refer to the DGX A100 System Firmware Update Container Release Notes.
  • Page 90: Appendix A. Installing Software On Air-Gapped Dgx A100 Systems

    CAUTION: This process destroys all data and software customizations that you have made on the DGX A100 System. Be sure to back up any data that you want to preserve and push any Docker images that you want to keep to a trusted registry.
  • Page 91: A.3. Creating A Local Mirror Of The Nvidia And Canonical Repositories

    The procedure below describes how to download all the necessary packages to create a mirror of the repositories that are needed to update NVIDIA DGX systems. The steps are specific to versions 4.0.X and 4.1.X, but they can be edited to work with other versions.
  • Page 92 4. Configure the path of the destination directory in /etc/apt/mirror.list and use the included list of repositories below to retrieve the packages for both Ubuntu base OS and the NVIDIA DGX OS packages.
  • Page 93: A.3.2. Configuring The Target Air-Gapped Dgx Os 4 System

    | sudo tee /etc/apt/sources.list.d/dgx-bionic-r418-cuda10-1-repo.list 6. Optional: (For DGX OS Release 4.5 and later only) If you want to use the R450 NVIDIA graphics driver and CUDA Toolkit 11.0, configure apt to use the NVIDIA DGX OS packages in the file /etc/apt/sources.list.d/dgx-bionic-r450-cuda11-0-repo.list...
  • Page 94 "deb file:///media/usb/repository/mirror/international.download.nvidia.com/dgx/repos/bionic/ bionic-r450+cuda11.0 main multiverse restricted universe" | sudo tee /etc/apt/sources.list.d/dgx-bionic-r450-cuda11-0-repo.list Note: If you want to continue using earlier releases, for example the R418 NVIDIA graphics driver and CUDA Toolkit 10.1, omit this step. 7. Edit the file to update the Pin parameter as follows.
  • Page 95: A.3.3. Configuring The Target Air-Gapped Dgx Os 5 System

    249 packages can be upgraded. Run 'apt list --upgradable' to see them. 9. Upgrade the system using the newly configured local repositories. sudo apt full-upgrade If you configured apt to use the NVIDIA DGX OS packages in the file /etc/apt/sources.list.d/dgx-bionic-r450-cuda11-0-repo.list, the NVIDIA graphics driver is upgraded to the R450 driver and the package sources are updated to obtain future updates from the R450 driver repositories.
  • Page 96: A.4. Installing Docker Containers

    A.4.  Installing Docker Containers This method applies to Docker containers hosted on the NVIDIA NGC Container Registry, and requires that you have an active NGC account. 1. On a system with internet access, log in to the NGC Container Registry by entering the following command and credentials.
  • Page 97 Installing Software on Air-Gapped DGX A100 Systems 6. Transfer the image to the air-gapped system using removable media such as a USB flash drive. 7. Load the NVIDIA Docker image. $ docker load -i framework.tar 8. Verify the image is on your system.
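The transfer flow described above can be sketched as a small dry-run script. Only `docker load -i framework.tar` comes from the snippet itself. The `docker pull`, `docker save -o`, and `docker images` commands are assumed to correspond to the steps that are not shown here, and the image name is a hypothetical placeholder rather than a real NGC path.

```shell
#!/bin/sh
# Sketch only: save an image on a connected system, load it on the air-gapped one.
# IMAGE is a hypothetical placeholder; substitute the NGC image you actually need.
IMAGE="${IMAGE:-nvcr.io/nvidia/example-framework:tag}"
# Archive name used in the guide's load command.
ARCHIVE="${ARCHIVE:-framework.tar}"

# Print each command by default; set DRY_RUN=0 to execute it for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# On the system with internet access (assumed steps): pull, then save to a tar file.
export_image() {
    run docker pull "$IMAGE"
    run docker save -o "$ARCHIVE" "$IMAGE"
}

# On the air-gapped system, after copying the archive from removable media.
import_image() {
    run docker load -i "$ARCHIVE"
    run docker images
}

export_image
import_image
```

Running the script as-is only prints the commands, which is a convenient way to confirm the image and archive names before executing anything on either system.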
  • Page 98: Appendix B. Safety

    Appendix B. Safety This section provides information about how to safely use the DGX A100 system. B.1.  Safety Information To reduce the risk of bodily injury, electrical shock, fire, and equipment damage, read this document and observe all warnings and precautions in this guide before installing or maintaining your server product.
  • Page 99 The rail racks are designed to carry only the weight of the server system. Do not use rail-mounted equipment as a workspace. Do not place additional load onto any rail-mounted equipment.
  • Page 100: B.3. Intended Application Uses

    B.4.  Site Selection Here is some information about how to select the correct site for the DGX A100 system. Choose a site that is: ‣ Clean, dry, and free of airborne particles (other than normal room dust).
  • Page 101: B.6. Electrical Precautions

    The power cord must have a safety ground pin or contact that is suitable for the electrical outlet. ‣ The power supply cord(s) is/are the main disconnect device to AC power. The socket outlet(s) must be near the equipment and readily accessible for disconnection.
  • Page 102: B.7. System Access Warnings

    B.7.  System Access Warnings Here is some information about system access warnings for the DGX A100 system. To avoid personal injury or property damage, the following safety instructions apply whenever accessing the inside of the product: ‣...
  • Page 103: B.9. Electrostatic Discharge

    (for example, the use of power strips). B.9.  Electrostatic Discharge Here is some information about how to handle electrostatic discharge (ESD) in the DGX A100 system. CAUTION: ESD can damage drives, boards, and other parts. We recommend that you perform all procedures at an ESD workstation.
  • Page 104: B.10.  Other Hazards

        NVIDIA Bezel. The bezel’s decorative metal foam contains some nickel. The metal foam is not intended for direct and prolonged skin contact. Please use the handles to remove, attach or carry the bezel. While nickel exposure is unlikely to be a problem, you should be aware of the possibility in case you’re susceptible to nickel-related reactions.
  • Page 105 ‣ Access is through the use of a TOOL or lock and key, or other means of security, and is controlled by the authority responsible for the location.
  • Page 106: Appendix C. Compliance

    Appendix C. Compliance The NVIDIA DGX A100 Server is compliant with the regulations listed in this section. C.1.  United States Federal Communications Commission (FCC) FCC Marking (Class A) This device complies with part 15 of the FCC Rules. Operation is subject to the following two...
  • Page 107: C.3.  Canada

    EMC Directive A, I.T.E Equipment. ‣ Low Voltage Directive for electrical safety. ‣ RoHS Directive for hazardous substances. ‣ Energy-related Products Directive (ErP). The full text of the EU declaration of conformity is available at the following internet address: www.nvidia.com/support.
  • Page 108: C.5. Australia And New Zealand

    A copy of the Declaration of Conformity to the essential requirements may be obtained directly from NVIDIA GmbH (Bavaria Towers – Blue Tower, Einsteinstrasse 172, D-81677 Munich, Germany). C.5.  Australia and New Zealand Australian Communications and Media Authority...
  • Page 109 This is a Class A product. In a domestic environment this product may cause radio interference, in which case the user may be required to take corrective actions. VCCI-A. Japan RoHS Material Content Declaration
  • Page 111: C.8.  South Korea

    Industrial (Class A) electromagnetic wave suitability equipment. The seller or user should take note of this, and the equipment is to be used in places other than the home. Korea RoHS Material Content Declaration
  • Page 112: C.9.  China

    C.9.  China China Compulsory Certificate No certification is needed for China. The NVIDIA DGX A100 is a server with power consumption greater than 1.3 kW. China RoHS Material Content Declaration
  • Page 114: C.10.  Taiwan

    C.10.  Taiwan Bureau of Standards, Metrology & Inspection (BSMI) Taiwan RoHS Material Content Declaration
  • Page 115: C.11. Russia/Kazakhstan/Belarus

    the use of hazardous substances in electrical and radio-electronic products" (TR EAEU 037/2016) Federal Agency of Communication (FAC) This device complies with the rules set forth by the Federal Agency of Communications and the Ministry of Communications and Mass Media. Federal Security Service notification has been filed.
  • Page 116: C.12.  Israel

    2016". It does not contain lead, mercury, hexavalent chromium, polybrominated biphenyls or polybrominated diphenyl ethers in concentrations exceeding 0.1 weight % and 0.01 weight % for cadmium, except for where allowed pursuant to the exemptions set in Schedule 2 of the Rule.
  • Page 117: C.14.  South Africa

    SI 2012/3032: The Restriction of the Use of Certain Hazardous Substances in Electrical and Electronic Equipment (As Amended) A copy of the Declaration of Conformity to the essential requirements may be obtained directly from NVIDIA Ltd. (100 Brook Drive, 3rd Floor Green Park, Reading RG2 6UJ, United Kingdom).
  • Page 118 Copyright © 2022 NVIDIA Corporation & Affiliates. All rights reserved. NVIDIA Corporation  |  2788 San Tomas Expressway, Santa Clara, CA 95051 www.nvidia.com...
