Chapter 1. Introduction to the NVIDIA DGX H100 System

The NVIDIA DGX H100 System is the universal system purpose-built for all AI infrastructure and workloads, from analytics to training to inference. The system is built on eight NVIDIA H100 Tensor Core...
NVIDIA DGX H100 User Guide

1.1. Hardware Overview

1.1.1. DGX H100 Models and Component Descriptions

There are two models of the NVIDIA DGX H100 system: the NVIDIA DGX H100 640GB system and the NVIDIA DGX H100 320GB system.

Table 1. Component Description...
BMC will be available.

1.1.4. DGX H100 Locking Power Cord Specification

The DGX H100 is shipped with a set of six (6) locking power cords that have been qualified for use with the DGX H100 to ensure regulatory compliance.

Warning: To avoid electric shock or fire, only use the NVIDIA-provided power cords to connect power to the DGX H100.
Locking/Unlocking the PSU Side (Cords with Twist-Lock Mechanism)

Power Supply (System) side - Twist locking
▶ To INSERT or REMOVE, make sure the cable is UNLOCKED, then push it into or pull it out of the socket.
Heat Output 38,557 BTU/hr

1.1.7. Front Panel Connections and Controls

This section provides information about the front panel, connections, and controls of the DGX H100 system.

1.1.7.1. With a Bezel

Here is an image of the DGX H100 system with a bezel.
Control        Description
Power Button   Press to turn the DGX H100 system On or Off.
               ▶ Green flashing (1 Hz): Standby (BMC booted)
               ▶ Green flashing (4 Hz): POST in progress
               ▶ Green solid On: Power On...
Important: Refer to the section First Boot Setup for instructions on how to properly turn the system on or off.

1.1.8. Rear Panel Modules

Here is an image that shows the rear panel modules on DGX H100.
1.1.9. Motherboard Connections and Controls

Here is an image that shows the motherboard connections and controls in a DGX H100 system.
Reset button   Press to manually reset the BMC.

Refer to Network Connections, Cables, and Adaptors for details on the network connections.

1.1.10. Motherboard Tray Components

Here is an image that shows the motherboard tray components in DGX H100.
1.1.11. GPU Tray Components

Here is an image of the GPU tray components in a DGX H100 system.
1.2. Network Connections, Cables, and Adaptors

This section provides information about network connections, cables, and adaptors.

1.2.1. Network Ports

Here is an image that shows the network ports on a DGX H100 system.
1.2.3. Network Modules

▶ New form factor for aggregate PCIe network devices
▶ Consolidates four ConnectX-7 networking cards into a single device
▶ Two networking modules are installed on interposer board
▶ Interposer board connects to CPUs on one end and to GPU tray on the other
▶ ...
1.2.4. Supported Network Cables and Adaptors

The DGX H100 system is not shipped with network cables or adaptors. You will need to purchase supported cables or adaptors for your network. The ConnectX-7 firmware determines which cables and adaptors are supported. For a list of cables...
▶ NVIDIA System Management (NVSM)
Provides active health monitoring and system alerts for NVIDIA DGX nodes in a data center. It also provides simple commands for checking the health of the DGX H100 system from the command line.
▶ Data Center GPU Management (DCGM)
This software enables node-wide administration of GPUs and can be used for cluster and data-center level management.
(daemon for managing cache data storage)

1.5. Customer Support

Contact NVIDIA Enterprise Support for assistance in reporting, troubleshooting, or diagnosing problems with your DGX H100 system. Also contact NVIDIA Enterprise Support for assistance in moving the DGX H100 system.
▶ ...
DGX OS Server software installs Docker Engine which uses the 172.17.xx.xx subnet by default for Docker containers. If the DGX H100 system is on the same subnet, you will not be able to establish a network connection to the DGX H100 system.
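As a quick way to check for the conflict described above, the following Python sketch tests whether an address falls inside Docker's default bridge range. It assumes that "172.17.xx.xx" corresponds to the 172.17.0.0/16 network; the function name and the sample addresses are illustrative and are not part of DGX OS.

```python
import ipaddress

# Assumption: "172.17.xx.xx" is read as the 172.17.0.0/16 network.
DOCKER_DEFAULT_SUBNET = ipaddress.ip_network("172.17.0.0/16")


def conflicts_with_docker_default(address: str) -> bool:
    """Return True if the address lies inside Docker's default bridge subnet."""
    return ipaddress.ip_address(address) in DOCKER_DEFAULT_SUBNET


if __name__ == "__main__":
    # Sample addresses, for illustration only.
    for candidate in ("172.17.4.20", "10.0.0.15"):
        print(candidate, conflicts_with_docker_default(candidate))
```

If the check reports a conflict for the address you plan to assign to the system, either the system address or the Docker subnet would need to change before the system is reachable.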
Chapter 2. Connecting to the DGX H100
2.1.2. Remote Connection through the BMC

Here is some information about how you can remotely connect to DGX H100 through the BMC.

NVIDIA recommends that customers follow best security practices for BMC management (IPMI port).
From the navigation menu, click Remote Control. The Remote Control page enables you to open a virtual Keyboard/Video/Mouse (KVM) on the DGX H100 system, as if you were using a physical monitor and keyboard connected to the front of the system.
2.2. SSH Connection to the OS

After the system has been configured, you can also establish an SSH connection to the DGX H100 OS through the network port. Refer to Network Ports to identify the port to use.
Chapter 3. First Boot Setup

This section provides information about the setup process after you first boot the DGX H100 system.

While NVIDIA partner network personnel or NVIDIA field service engineers will install the DGX H100 system at the site and perform the first boot setup, the first boot setup instructions are provided here for reference and to support any reimaging of the server.
▶ Using the Remote BMC

Refer to First Boot Process for DGX Servers in the NVIDIA DGX OS 6 User Guide for information about the following topics:
▶ Optionally encrypt the root file system.
3.2.2. Enabling the SRP Daemon

The NVIDIA networking drivers provide the SRP daemon software. The daemon is disabled by default. Enabling the daemon is required if you want to use RDMA over InfiniBand. You can enable the daemon by running the following commands:...
Chapter 4. Quickstart and Basic Operation

4.1. Installation and Configuration

Before you install DGX H100, ensure that you have given all relevant site information to your Installation Partner.

Important: Your DGX H100 System must be installed by NVIDIA partner network personnel or NVIDIA field service engineers.
Observe the following startup and shutdown instructions.

4.4.1. Startup Considerations

To keep your DGX H100 running smoothly, allow up to a minute of idle time after reaching the login prompt. This ensures that all components can complete their initialization.
--gpus all --rm nvcr.io/nvidia/cuda:12.1.1-ubuntu22.04 nvidia-smi

The preceding command pulls the nvidia/cuda container image layer by layer, then runs the nvidia-smi command. When complete, the output shows the NVIDIA Driver version and a description of each installed GPU.

For more information, refer to Containers For Deep Learning Frameworks User Guide.
20 minutes.
sudo nvsm stress-test --force

4.7. Running NGC Containers with GPU Support

To obtain the best performance when running NGC containers on DGX H100 systems, the following methods of providing GPU support for Docker containers are available:
▶ ...
GPU-accelerated containers using this command and the new runtime will be used.
▶ Use docker run with nvidia as the default runtime.
You can set nvidia as the default runtime, for example, by adding the following line to the /etc/docker/daemon.json configuration file as the first entry.
"default-runtime": "nvidia",
Here is an example of how the added line appears in the JSON file.
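As a sketch of the edit described above, the following Python function takes the text of a daemon.json file and returns it with "default-runtime" placed as the first entry. The function name and the sample configuration are hypothetical; the sketch only transforms text and does not write to /etc/docker/daemon.json.

```python
import json


def set_default_runtime(daemon_json_text: str, runtime: str = "nvidia") -> str:
    """Return daemon.json content with "default-runtime" as the first entry."""
    # Treat an empty file as an empty configuration object.
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    # Drop any existing value so the key can be re-inserted at the front.
    config.pop("default-runtime", None)
    updated = {"default-runtime": runtime, **config}
    return json.dumps(updated, indent=4)


if __name__ == "__main__":
    # Hypothetical existing configuration, for illustration only.
    existing = (
        '{"runtimes": {"nvidia": '
        '{"path": "nvidia-container-runtime", "runtimeArgs": []}}}'
    )
    print(set_default_runtime(existing))
```

Docker reads this file when the daemon starts, so a change to it typically takes effect only after the Docker service is restarted.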
4.8. Managing CPU Mitigations

DGX OS Server includes security updates to mitigate CPU speculative side-channel vulnerabilities. These mitigations can decrease the performance of deep learning and machine learning workloads.

If your installation of DGX systems incorporates other measures to mitigate these vulnerabilities, such as measures at the cluster level, you can disable the CPU mitigations for individual DGX nodes and thereby increase performance.
4.8.2. Disabling CPU Mitigations

Caution: Performing the following instructions will disable the CPU mitigations provided by the DGX OS Server software.

Install the nv-mitigations-off package.
sudo apt install nv-mitigations-off -y
Reboot the system.
Verify CPU mitigations are disabled.
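One way to approach the verification step is to read the kernel's vulnerability status files. The following sketch assumes a Linux kernel that exposes /sys/devices/system/cpu/vulnerabilities and treats any status that begins with "Mitigation:" as an active mitigation. The helper names are illustrative, and this is not necessarily the verification method that the guide prescribes.

```python
from pathlib import Path

# Assumption: the running kernel exposes per-vulnerability status files here.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_statuses(directory: Path = VULN_DIR) -> dict[str, str]:
    """Map each vulnerability name to the status string the kernel reports."""
    if not directory.is_dir():
        return {}
    return {
        entry.name: entry.read_text().strip()
        for entry in sorted(directory.iterdir())
        if entry.is_file()
    }


def active_mitigations(statuses: dict[str, str]) -> list[str]:
    """Return the names whose status reports an active mitigation."""
    return sorted(
        name for name, status in statuses.items()
        if status.startswith("Mitigation:")
    )


if __name__ == "__main__":
    remaining = active_mitigations(read_statuses())
    print("Active mitigations:", remaining or "none")
```

Some entries may continue to report a mitigation regardless of boot options, so a non-empty result is a prompt to inspect the individual entries rather than proof that the package had no effect.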
Chapter 5. SBIOS Settings

The NVIDIA DGX H100 system comes with a system BIOS with optimized settings for the DGX system. There might be situations where the settings need to be changed, such as changes in the boot order, changes to enable PXE booting, or changes in the BMC network settings.
The following instructions describe how to set the boot order at boot time. You can also set the boot order from the SBIOS setup > Boot screen. Access the DGX H100 console, either from a locally connected keyboard and mouse or through the BMC remote console.
Select the boot device. The following figure shows virtual media selected.
Connect to the BMC web interface and click power on/reboot.
From an operating system command line, run sudo reboot.
▶ Connect to the DGX H100 SOL console:
ipmitool -I lanplus -H <ip-address> -U admin -P dgxluna.admin sol activate
Press the Del or F2 key when the system is booting.
Chapter 6. Using the Baseboard Management Controller (BMC)

The NVIDIA DGX H100 system comes with a baseboard management controller (BMC) for monitoring and controlling various hardware devices on the system. It monitors system sensors and other parameters.

6.1. Connecting to the BMC

Here are the steps to connect to the BMC on a DGX H100 system.
6.2. Overview of BMC Controls

The left-side navigation menu bar on the BMC main page contains the primary controls.
Table 8. BMC Main Controls

Control       Description
Quick Links   Provides quick access to several tasks.
Dashboard     Displays the overall information about the status of the device.
Sensor        Provides status and readings for system sensors, such as SSD, PSUs, voltages, CPU temperatures, DIMM temperatures, and fan speeds.
6.3. Changing the BMC Login Credentials

To change your credentials or add or remove users, perform the following steps:
Select Settings from the left-side navigation menu.
Select the User Management card.
Click the help icon (?) for information about configuring users and creating a password.
Click Active Directory Settings or LDAP/E-Directory Settings and follow the instructions.

6.6. Configuring Platform Event Filters

From the side navigation menu, click Settings and then click Platform Event Filters.

The Event Filters page shows all configured event filters and available slots. You can modify an existing event filter entry or add a new one on this page.
▶ To view available configured and unconfigured slots, click All in the upper-left corner of the page.
▶ To view available configured slots, click Configured in the upper-left corner of the page.
▶ To view available unconfigured slots, click UnConfigured in the upper-left corner of the page.
▶ Issuer information
▶ Valid Date range
▶ Issued to information

6.7.2. Generating the SSL Certificate

Here is some information about generating an SSL certificate.
From the SSL Setting page, click Generate SSL Certificate.
Enter the information as described in the following table.
6.7.3. Uploading the SSL Certificate

In BMC, you can upload your SSL certificate.
Make sure the certificate and key meet the following requirements:
▶ SSL certificates and keys must both use the .pem file extension.
▶ ...
Select Server CA Configuration.
Select Enroll Cert.
Select Enroll Cert Using File.
Select the device where you stored the certificate.
Navigate the file structure and select the certificate.
Chapter 7. Security

This section provides information about security measures in the DGX H100 system.

7.1. User Security Measures

The NVIDIA DGX H100 system is a specialized server designed to be deployed in a data center. It must be configured to protect the hardware from unauthorized access and unapproved use. The DGX H100 system is designed with a dedicated BMC Management Port and multiple Ethernet network ports.
7.3. Secure Data Deletion

This section explains how to securely delete data from the DGX H100 system SSDs to permanently destroy all the data that was stored there. This process performs a more secure SSD data deletion than merely deleting files or reformatting the SSDs.
7.3.2. Procedure

Here are the instructions to securely delete data from the DGX H100 system SSDs.
Boot the system from the ISO image, either remotely or from a bootable USB key.
At the GRUB menu, select:
▶ ...
The DGX System firmware supports Redfish APIs. Redfish is DMTF’s standard set of APIs for managing and monitoring a platform. By default, Redfish support is enabled in the DGX H100 BMC and the BIOS. By using the Redfish interface, administrator-privileged users can browse physical resources at the chassis and system level through the REST API interface.
"0.2.0.7" // ...

▶ Update DGX H100 system components
To update the HGX component in your DGX H100 system, you need to specify HGX_0 as the target regardless of the HGX component that you want to update.
▶ Update DGX HGX H100 components
To update DGX H100 system components, you need to specify the component name as a target in a JSON file. The following example updates the host BMC:
echo "{\"Targets\":[\"/redfish/v1/UpdateService/FirmwareInventory/HostBMC_0\"]}"...
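The targets JSON shown in the echo command above can also be generated programmatically, which avoids hand-escaping the quotation marks. The following Python sketch builds the same structure for a given component name. The function name is illustrative, and the only component names taken from this guide are HostBMC_0 and HGX_0.

```python
import json

# Path prefix taken from the example above.
FIRMWARE_INVENTORY = "/redfish/v1/UpdateService/FirmwareInventory"


def build_targets_json(component: str) -> str:
    """Return the compact targets JSON for one firmware inventory component."""
    payload = {"Targets": [f"{FIRMWARE_INVENTORY}/{component}"]}
    # Compact separators match the form used in the echo example.
    return json.dumps(payload, separators=(",", ":"))


if __name__ == "__main__":
    print(build_targets_json("HostBMC_0"))
    # prints {"Targets":["/redfish/v1/UpdateService/FirmwareInventory/HostBMC_0"]}
```

The returned string can be written to the JSON file that the update request references.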
On success, the command returns a 204 HTTP status code. If you attempt to set the flag to the currently set value, the command returns a 400 HTTP status code.

To get the value of the ForceUpdate parameter:
curl -k -u <bmc-user>:<password>...
curl -k -u <bmc-user>:<password> --request POST --location 'https://<bmc-ip-address>/redfish/v1/Systems/DGX/Actions/ComputerSystem.Reset' --header 'Content-Type: application/json' --data '{"ResetType": "GracefulShutdown"}'

8.2.6. SEL Logs

To view all the SEL entries using Redfish:
curl -k -u <bmc-user>:<password> --location --request GET 'https://<bmc-ip-address>/...
"TaskState": "New"

Monitor the returned task until it completes. Change the task number as appropriate:
curl -k -u <bmc-user>:<password> --request GET 'https://<bmc-ip-address>/redfish/v1/TaskService/Tasks/1'

After the task state reports Complete, download the attachments:
curl -k -u <bmc-user>:<password>...
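The monitoring step above amounts to a polling loop. The following Python sketch shows one possible shape for that loop. It assumes the task resource reports the standard Redfish TaskState values (for example, "Completed" as the finished state), and it takes the state lookup as a caller-supplied function so that no particular HTTP client is implied. The names wait_for_task, task_url, and fetch_state are hypothetical.

```python
import time
from typing import Callable

# Assumption: terminal values follow the Redfish TaskState enumeration.
TERMINAL_STATES = {"Completed", "Exception", "Killed", "Cancelled"}


def task_url(bmc_ip: str, task_id: int) -> str:
    """Build the task URL in the form used by the curl example above."""
    return f"https://{bmc_ip}/redfish/v1/TaskService/Tasks/{task_id}"


def wait_for_task(
    fetch_state: Callable[[], str],
    interval_s: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_state until it returns a terminal state, then return it."""
    for _ in range(max_polls):
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        sleep(interval_s)
    raise TimeoutError(f"task did not finish after {max_polls} polls")


if __name__ == "__main__":
    # Simulated state sequence; a real fetch_state would GET task_url(...)
    # and return the "TaskState" field of the JSON response.
    simulated = iter(["New", "Running", "Completed"])
    print(wait_for_task(lambda: next(simulated), interval_s=0.0))
```

Returning the terminal state, instead of only waiting for success, lets the caller distinguish a completed task from one that ended in an exception before it attempts the download.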
Chapter 9. Safety

This section provides information about how to safely use the DGX H100 system.

9.1. Safety Information

To reduce the risk of bodily injury, electrical shock, fire, and equipment damage, read this document and observe all warnings and precautions in this guide before installing or maintaining your server product.
Indicates hot components or surfaces.
Indicates do not touch fan blades; doing so may result in injury.
Shock hazard: The product might be equipped with multiple power cords.
- To remove all hazardous voltages, disconnect all power cords.
- High leakage current: a ground (earth) connection to the power supply is essential before connecting the supply.
▶ In regions that are susceptible to electrical storms, we recommend you plug your system into a surge suppressor and disconnect telecommunication lines to your modem during an electrical storm.
▶ Provided with a properly grounded wall outlet.
9.6.2. Power Cord Warnings

Caution: To avoid electrical shock or fire, check the power cord(s) that will be used with the product as follows:
▶ Do not attempt to modify or use the AC power cord(s) if they are not the exact type required to fit into the grounded electrical outlets.
Caution: To avoid injury, do not contact moving fan blades. Your system is supplied with a guard over the fan. Do not operate the system without the fan guard in place.

9.8. Rack Mount Warnings

The following installation guidelines are required by UL to maintain safety compliance when installing your system into a rack.
9.10.2. NICKEL

NVIDIA Bezel. The bezel’s decorative metal foam contains some nickel. The metal foam is not intended for direct and prolonged skin contact. Please use the handles to remove, attach, or carry the bezel. While nickel exposure is unlikely to be a problem, you should be aware of the possibility in case you are susceptible to nickel-related reactions.
Do not attempt to disassemble, puncture, or otherwise damage a battery.

9.10.4. Cooling and Airflow

Caution: Carefully route cables as directed to minimize airflow blockage and cooling problems. For proper cooling and airflow, operate the system only with the chassis covers installed.
Chapter 10. Compliance

The NVIDIA DGX H100 Server is compliant with the regulations listed in this section.

10.1. United States

Federal Communications Commission (FCC)
FCC Marking (Class A)

This device complies with part 15 of the FCC Rules. Operation is subject to the following two conditions: (1) this device may not cause harmful interference, and (2) this device must accept any interference received, including any interference that may cause undesired operation of the device.
The full text of the EU declaration of conformity is available at the following URL: http://www.nvidia.com/support

A copy of the Declaration of Conformity to the essential requirements may be obtained directly from NVIDIA GmbH (Bavaria Towers – Blue Tower, Einsteinstrasse 172, D-81677 Munich, Germany).
10.5. Australia and New Zealand

Australian Communications and Media Authority

This product meets the applicable EMC requirements for Class A, I.T.E equipment.

10.6. Brazil

INMETRO

10.7. Japan

Voluntary Control Council for Interference (VCCI)
This is a Class A product. In a domestic environment this product may cause radio interference, in which case the user may be required to take corrective actions. VCCI-A.

Japan RoHS Material Content Declaration
10.8. South Korea

Korean Agency for Technology and Standards (KATS)

Class A Equipment (Industrial Broadcasting & Communication Equipment). This equipment is industrial (Class A) electromagnetic wave suitability equipment. Sellers and users should take note of this, and the equipment is intended for use in places other than the home.
Korea RoHS Material Content Declaration

10.9. China

China Compulsory Certificate

No certification is needed for China. The NVIDIA DGX H100 is a server with power consumption greater than 1.3 kW.
China RoHS Material Content Declaration
Taiwan RoHS Material Content Declaration

10.11. Russia/Kazakhstan/Belarus

Customs Union Technical Regulations (CU TR)

This device complies with the technical regulations of the Customs Union (CU TR).

Technical Regulation of the Customs Union: On the Safety of Low-Voltage Equipment (ТР ТС 004/2011)
Technical...
10.12. Israel

10.13. India

Bureau of Indian Standards (BIS)

Authenticity may be verified by visiting the Bureau of Indian Standards website at http://www.bis.gov.
SI 2012/3032: The Restriction of the Use of Certain Hazardous Substances in Electrical and Electronic Equipment (As Amended)

A copy of the Declaration of Conformity to the essential requirements may be obtained directly from NVIDIA Ltd. (100 Brook Drive, 3rd Floor Green Park, Reading RG2 6UJ, United Kingdom)

10.14. South Africa...
Chapter 11. Third-Party License Notices

This NVIDIA product contains third party software that is being made available to you under their respective open source software licenses. Some of those licenses also require specific legal information to be included in the product. This section provides such information.
INFORMATION) ARISING OUT OF YOUR USE OF OR INABILITY TO USE THE SOFTWARE, EVEN IF MTI HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Because some jurisdictions prohibit the exclusion or limitation of liability for consequential or incidental damages, the above limitation may not apply to you.
NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.
OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CON-...