NVIDIA DGX H100/H200 Service
Manual
NVIDIA Corporation
Sep 05, 2024

Summary of Contents for Nvidia DGX H200

  • Page 1 NVIDIA DGX H100/H200 Service Manual NVIDIA Corporation Sep 05, 2024...
  • Page 3: Table Of Contents

    Contents: 1 Introduction; Customer-replaceable Components; Recommended Tools; Customer Support...
  • Page 4 Next Steps (page 44); 7 U.2 NVMe Cache Drive Post-Installation Tasks; Recreating the Cache RAID 0 Volume...
  • Page 5 15.3 Prepare the System for Replacement (page 92); 15.4 Remove the PCI Ethernet Card...
  • Page 6 20.7 Japan (page 133); 20.8 South Korea...
  • Page 7 NVIDIA DGX H100/H200 Service Manual The NVIDIA DGX H100/H200 Service Manual is also available as a PDF. Contents...
  • Page 8 NVIDIA DGX H100/H200 Service Manual Contents...
  • Page 9: Introduction

    NVIDIA DGX Systems Support page. Contact NVIDIA Enterprise Support to obtain an RMA number for any system or component that needs to be returned for repair or replacement. When replacing a component, use only the replacement supplied to you by NVIDIA.
  • Page 10: Recommended Tools

    1.3. Customer Support Contact NVIDIA Enterprise Support for assistance in reporting, troubleshooting, or diagnosing problems with your DGX H100/H200 system. Also contact NVIDIA Enterprise Support for assistance in installing or moving the DGX H100/H200 system. For details on how to obtain support, visit the NVIDIA Enterprise Support web site (https://www.nvidia.
  • Page 11: Running The Pre-Flight Test

    1.4. Running the Pre-flight Test Instructions for running the DGX stress test. NVIDIA recommends running the pre-flight stress test before putting a system into a production environment or after servicing. You can specify running the test on the GPUs, CPU, memory, and storage, and also specify the duration of the tests.
  • Page 13: Front Fan Module Replacement

    Insert new fan module Confirm new fan module is working correctly through BMC or the operating system tools Return/ship the failed unit to NVIDIA Enterprise Support using the packaging provided 2.2. Identifying a Failed Fan Module You can identify a failed fan module using any of the following methods: ▶...
  • Page 14 NVIDIA DGX H100/H200 Service Manual Viewing the Fan Module LEDs Remove the bezel to expose the fan modules (refer to Removing and Attaching the Bezel). After you remove the bezel, the system looks like the following figure. Identify the failed fan using the fan module fault LED as shown in the following figure.
  • Page 15 NVIDIA DGX H100/H200 Service Manual following figure. Running the Show Fans command ▶ From the operating system, run: sudo nvsm show fans View the command output for any alerts, failures, or an unhealthy status. Viewing Fan Modules from the BMC web user interface Identify the faulty fan module using the BMC dashboard.
  • Page 16 NVIDIA DGX H100/H200 Service Manual There are two fans in the fan module, identified by SPD_FAN_SYSn_F and SPD_FAN_SYSn_R, where n is the module ID. If either fan fails, then the entire module must be replaced. Use the nvsm command to confirm the fan issue.
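The naming rule above can be turned into a small lookup. The following is a minimal sketch, assuming sensor names follow the SPD_FAN_SYSn_F / SPD_FAN_SYSn_R pattern exactly as written in this manual. The helper name fan_module_info is hypothetical and is not part of nvsm or the BMC tooling.

```shell
# Split a fan sensor name of the form SPD_FAN_SYSn_F or SPD_FAN_SYSn_R
# into the module ID (n) and the fan suffix (F or R).
# fan_module_info is a hypothetical helper, not an nvsm command.
fan_module_info() {
  case "$1" in
    SPD_FAN_SYS*_F|SPD_FAN_SYS*_R)
      id=${1#SPD_FAN_SYS}    # strip the fixed prefix, leaving e.g. "3_R"
      suffix=${id##*_}       # text after the last underscore: F or R
      id=${id%_*}            # text before it: the module ID
      case "$id" in
        ''|*[!0-9]*)
          echo "unrecognized sensor name: $1" >&2
          return 1
          ;;
      esac
      echo "module=$id fan=$suffix"
      ;;
    *)
      echo "unrecognized sensor name: $1" >&2
      return 1
      ;;
  esac
}

# Example with an illustrative sensor name.
fan_module_info SPD_FAN_SYS3_R
```

Because either fan failing means the whole module is replaced, only the module ID matters when ordering the part; the suffix is useful mainly for matching the alert to the sensor list.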
  • Page 17: Replacing And Returning The Front Fan Module

    NVIDIA DGX H100/H200 Service Manual sudo nvsm show fans View the output and confirm that the status is unhealthy for the same fan. 2.3. Replacing and Returning the Front Fan Module Remove the new fan module from its packaging and be ready to install it.
  • Page 18 NVIDIA DGX H100/H200 Service Manual Confirm that the fan module is healthy and working properly by performing the following actions: ▶ Using the BMC web user interface ▶ Verifying that the amber LED on the fan module is extinguished ▶ Running the sudo nvsm show fans command...
  • Page 19: Power Supply Replacement

    Chapter 3. Power Supply Replacement This topic describes how to replace the power supplies (PSUs) of the NVIDIA DGX™ H100/H200 system. 3.1. Power Supply Replacement Overview This is a high-level overview of the steps needed to replace a power supply.
  • Page 20 Access the rear of the system and view the status LEDs while the system is powered on. Both LEDs are solid green if the PSU is good. If either LED is not solid green or is blinking, contact NVIDIA Enterprise Support to troubleshoot the issue. Chapter 3. Power Supply Replacement...
  • Page 21 NVIDIA DGX H100/H200 Service Manual Running the Show PSUs Command ▶ Run the following command to display information about the PSUs: sudo nvsm show psus The output shows information for each PSU. Look for any that do not report Status_Health=OK.
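Scanning the output by eye can be replaced with a small filter. This is a sketch under assumptions: it assumes each PSU block in the output contains one line with the Status_Health key followed by = and a value, with or without spaces around the equals sign. The sample lines and the value Critical are illustrative, and count_unhealthy_psus is a hypothetical helper name.

```shell
# Count Status_Health lines whose value is anything other than OK.
# Reads text on standard input, so it can be tested without a DGX system.
count_unhealthy_psus() {
  awk '
    /Status_Health/ {
      split($0, kv, "=")            # kv[2] holds the value after "="
      value = kv[2]
      gsub(/[ \t\r]/, "", value)    # drop surrounding whitespace
      if (value != "OK") bad++
    }
    END { print bad + 0 }           # print 0 when nothing was flagged
  '
}

# Example with illustrative input lines.
printf '%s\n' \
  'Status_Health = OK' \
  'Status_Health = Critical' \
  'Status_Health=OK' | count_unhealthy_psus
```

On a live system the same function could be fed from the command shown above, for example by piping sudo nvsm show psus into count_unhealthy_psus.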
  • Page 22 NVIDIA DGX H100/H200 Service Manual ▶ Confirm the PSU temperature readings: Run the ipmitool command to view information about the PSUs: sudo ipmitool sdr | grep -i psu Look for power supplies with no temperature reading or an output reading that is close to, or equal to, zero.
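The check for missing or zero readings can be sketched as a filter over the ipmitool sdr output, which is laid out as pipe-separated columns (sensor name, reading, status). The sensor names below are hypothetical, and the sketch flags only readings that are non-numeric or exactly zero; what counts as "close to zero" is left to the reader. flag_suspect_psu_sensors is a hypothetical helper name.

```shell
# Print the names of sensors whose reading column is non-numeric
# (for example "no reading") or equal to zero.
flag_suspect_psu_sensors() {
  awk -F'|' '
    NF >= 2 {
      name = $1
      reading = $2
      gsub(/^[ \t]+|[ \t]+$/, "", name)      # trim the name column
      gsub(/^[ \t]+|[ \t]+$/, "", reading)   # trim the reading column
      split(reading, parts, " ")             # parts[1] is the numeric part
      if (parts[1] !~ /^-?[0-9]+(\.[0-9]+)?$/ || parts[1] + 0 == 0)
        print name
    }
  '
}

# Example with illustrative sensor names and values.
printf '%s\n' \
  'PSU0_TEMP | 35 degrees C | ok' \
  'PSU1_TEMP | no reading | ns' \
  'PSU2_PWR | 0 Watts | ok' | flag_suspect_psu_sensors
```

On a system this would typically sit at the end of the pipeline from the manual: sudo ipmitool sdr, then grep -i psu, then flag_suspect_psu_sensors.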
  • Page 23: Preparing The Power Supply For Replacement

    Targets: Verbs: show Obtain the replacement PSU (of the same manufacturer) from NVIDIA Enterprise Support. 3.3. Preparing the Power Supply for Replacement If the system is on, make sure at least 4 other power supplies are working by confirming the IN and OUT LEDs are lit green: Note: If insufficient PSUs are present and working, power off the system.
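The decision in this step reduces to a single comparison. A minimal sketch, assuming the count passed in is the number of other PSUs (excluding the one being replaced) whose IN and OUT LEDs are green; psu_swap_action is a hypothetical helper name.

```shell
# Minimum number of other working PSUs required by this manual
# before replacing a PSU while the system is on.
MIN_OTHER_WORKING_PSUS=4

# Print the action to take for a given count of other working PSUs.
psu_swap_action() {
  if [ "$1" -ge "$MIN_OTHER_WORKING_PSUS" ]; then
    echo "replace-while-on"
  else
    echo "power-off-first"
  fi
}

psu_swap_action 5
psu_swap_action 3
```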
  • Page 24: Replacing The Power Supply

    NVIDIA DGX H100/H200 Service Manual After the new power supply arrives, look at the system and identify which one needs to be replaced. The system is capable of operating at full capacity with four fully working power supplies. If the system is on, make sure that at least four power supplies are fully functional.
  • Page 25: Locking Power Cords

    From the BMC web user interface, confirm the power supply sensors are OK. Run the nvsm show health command and confirm the output does not report any errors. After the replacement is complete, return the broken power supply to NVIDIA Enterprise Support. 3.5. Locking Power Cords How to use the twisting locking power cords that ship with the system.
  • Page 26 NVIDIA DGX H100/H200 Service Manual To remove the cable from the power supply, twist the locking ring to the unlocked position and pull the cable out of the plug. Chapter 3. Power Supply Replacement...
  • Page 27: Motherboard Tray - Opening And Closing The Io Door

    Chapter 4. Motherboard Tray - Opening and Closing the IO door You will need to completely remove the motherboard tray from the server in order to service the following components. If this is the case, please refer to the section that describes the procedure to remove the motherboard.
  • Page 28: Release The Motherboard

    NVIDIA DGX H100/H200 Service Manual 4.2. Release the Motherboard Unlock the motherboard by loosening the captive screws that hold the ejection levers in place: Pull the ejection levers to disengage the midplane connectors: Chapter 4. Motherboard Tray - Opening and Closing the IO door...
  • Page 29: Pull Motherboard From Chassis

    NVIDIA DGX H100/H200 Service Manual 4.3. Pull Motherboard from Chassis Pull the motherboard out until the locking mechanism in the lid engages and prevents further movement. Unscrew the thumb screws indicated by the green arrows in the following figure to release lid...
  • Page 30: Open The Motherboard Io Door

    NVIDIA DGX H100/H200 Service Manual 4.4. Open the Motherboard IO Door Fold the lid IO opening section as shown in the following figure: Secure the folding section until it stays in place so you can work on the IO section of the motherboard: Chapter 4.
  • Page 31: Close The Motherboard Io Door

    NVIDIA DGX H100/H200 Service Manual 4.5. Close the Motherboard IO Door Before closing the lid, make sure all components are properly installed and that nothing is blocking the lid. Slide the lid as shown in the following figure to close the motherboard IO section:...
  • Page 32: Lock The Motherboard Lid

    NVIDIA DGX H100/H200 Service Manual 4.6. Lock the Motherboard Lid Close the lid so that you can lock it in place: Use the thumb screws indicated in the following figure to secure the lid to the motherboard tray. Open the tray levers: Push the motherboard tray into the system chassis until the levers on both sides engage with the sides.
  • Page 33 NVIDIA DGX H100/H200 Service Manual After the levers are fully closed, tighten the green thumbscrews to hold the ejection levers in place: 4.7. Insert the Motherboard...
  • Page 34: Finalize Motherboard Closing

    NVIDIA DGX H100/H200 Service Manual 4.8. Finalize Motherboard Closing ▶ Use the labels on the cables to reconnect them to the correct ports. After all cables are installed, plug the locking power cables in and power the system on. Chapter 4. Motherboard Tray - Opening and Closing the IO door...
  • Page 35: Motherboard Tray - Removal And Installation

    Chapter 5. Motherboard Tray - Removal and Installation You will need to completely remove the motherboard tray from the server in order to service the following components. If this is the case, please refer to the section that describes the procedure to remove the motherboard.
  • Page 36: Release The Motherboard

    NVIDIA DGX H100/H200 Service Manual 5.2. Release the Motherboard Unlock the motherboard by loosening the captive screws that hold the ejection levers in place: Pull the ejection levers to disengage the midplane connectors: Chapter 5. Motherboard Tray - Removal and Installation...
  • Page 37: Pull Motherboard From Chassis

    NVIDIA DGX H100/H200 Service Manual 5.3. Pull Motherboard from Chassis Make sure that you have a solid flat surface where you can rest the motherboard tray. Pull the motherboard tray out until the locking mechanism in the lid engages and prevents further movement.
  • Page 38: Remove The Motherboard Tray Lid

    NVIDIA DGX H100/H200 Service Manual ▶ Do not hold the motherboard tray by the ejection handles. The handles can bend or break. ▶ Be careful with the connectors at the back of the module to prevent damage. Place the motherboard tray on a solid, flat surface.
  • Page 39: Close The Motherboard Tray Lid

    NVIDIA DGX H100/H200 Service Manual ▶ After the triangular markers align, lift the tray lid to remove it. Optional: Depending on the procedure that you need to perform, remove the air baffles from the motherboard. 5.5. Close the Motherboard Tray Lid Before you perform the following steps, ensure that all components are installed correctly so that they do not interfere with the air baffles or tray lid.
  • Page 40: Insert The Motherboard Tray Into The Chassis

    NVIDIA DGX H100/H200 Service Manual Tighten the two lid screws on the port side of the motherboard tray, as shown in the following figure: Tighten the two lid screws on the connector side of the motherboard tray, as shown in the following figure: 5.6.
  • Page 41 NVIDIA DGX H100/H200 Service Manual Push the motherboard tray into the chassis until the levers on both sides engage with the sides: 5.6. Insert the Motherboard Tray into the Chassis...
  • Page 42: Insert The Motherboard

    NVIDIA DGX H100/H200 Service Manual 5.7. Insert the Motherboard Use the levers to engage the midplane connectors: After the levers are fully closed, tighten the green thumbscrews to hold the ejection levers in place: Chapter 5. Motherboard Tray - Removal and Installation...
  • Page 43: Finalize Motherboard Closing

    NVIDIA DGX H100/H200 Service Manual 5.8. Finalize Motherboard Closing ▶ Use the labels on the cables to reconnect them to the correct ports. After all cables are installed, plug the locking power cables in and power the system on. 5.8. Finalize Motherboard Closing...
  • Page 45: Nvme Cache Drive Replacement Overview

    Insert new SSD Power on the system Rebuild the RAID volume and mount the filesystem Ship back the failed unit to NVIDIA Enterprise Support using the packaging provided 6.2. Identifying the Failed U.2 NVMe SSD Identifying the Failed NVMe from the Front If physical access to the system is available, you can identify a failed drive by the illuminated amber LED.
  • Page 46: Identifying The Nvme Manufacturer And Model

    EncryptionStatus = Unlocked CapacityBytes = 3840755982336 Id = nvme5n1 Targets: Verbs: show Refer to the Manufacturer and Model fields in the output. Request a replacement NVMe from NVIDIA Enterprise Support, specifying this information. Chapter 6. U.2 NVMe Cache Drive Replacement...
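The CapacityBytes value in the output is easier to compare against a drive's marketed size after converting it to decimal terabytes. A minimal sketch, assuming the decimal convention (1 TB = 10^12 bytes) that drive vendors use; bytes_to_tb is a hypothetical helper name.

```shell
# Convert a byte count to decimal terabytes with two decimal places.
bytes_to_tb() {
  awk -v b="$1" 'BEGIN { printf "%.2f TB\n", b / 1e12 }'
}

# The CapacityBytes value shown in the excerpt above.
bytes_to_tb 3840755982336
```

For the value in the excerpt this prints 3.84 TB, which can help confirm that the replacement drive matches the capacity class of the failed one.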
  • Page 47: Replacing The U.2 Nvme Drive

    NVIDIA DGX H100/H200 Service Manual 6.4. Replacing the U.2 NVMe Drive Make sure that you requested and obtained the replacement drive from NVIDIA Enterprise Support. Back up any critical data to a network shared volume or some other means of backup.
  • Page 48: Insert The U.2 Nvme Drive

    NVIDIA DGX H100/H200 Service Manual Remove the drive: 6.5. Insert the U.2 NVMe Drive Open the lever on the drive and insert the replacement drive in the same slot: Chapter 6. U.2 NVMe Cache Drive Replacement...
  • Page 49 NVIDIA DGX H100/H200 Service Manual Close the lever and secure it in place: Confirm the drive is flush with the system: 6.5. Insert the U.2 NVMe Drive...
  • Page 50: Next Steps

    NVIDIA DGX H100/H200 Service Manual Install the bezel after the drive replacement is complete. Power on the system. 6.6. Next Steps ▶ U.2 NVMe Cache Drive Post-Installation Tasks. Chapter 6. U.2 NVMe Cache Drive Replacement...
  • Page 51: Nvme Cache Drive Post-Installation Tasks

    If the cache volume was locked with an access key, unlock the drives: sudo nv-disk-encrypt disable The disk encryption packages must be installed on the system. Refer to the NVIDIA DGX H100/H200 User Guide for more information. Recreate the cache volume and the /raid filesystem:...
  • Page 52: Returning The Nvme Drive

    Note: If your organization purchased a media retention policy, you might be able to keep failed drives for destruction. Check with NVIDIA Enterprise Support on the status of the policy for specifics. Chapter 7. U.2 NVMe Cache Drive Post-Installation Tasks...
  • Page 53: Nvme Boot Drive Replacement Overview

    Overview This is a high-level overview of the procedure to replace a boot drive. Determine which M.2 device needs to be replaced with the help of NVIDIA Enterprise Support Get a replacement M.2 disk from NVIDIA Enterprise Support Make sure the system is shut down If cables don’t reach, label all cables and unplug them from the motherboard tray...
  • Page 54: Identify The Failed M.2 Nvme

    NVIDIA DGX H100/H200 Service Manual 8.2. Identify the Failed M.2 NVMe The NVIDIA DGX™ H100/H200 system automatically sets the failed M.2 drive offline when it detects the failure. The boot drives are mirrored, so the mdadm command-line utility can identify the drive to replace.
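The exact mdadm invocation is cut off in this excerpt, so the following sketch swaps in a different, standard source of the same information: the kernel's /proc/mdstat file, where a failed RAID member is marked with (F) after its name. The array and device names in the sample are illustrative, and failed_md_members is a hypothetical helper name. If the failed drive has already been removed from the array, no (F) marker remains and the array only shows a missing member such as [U_], which this sketch does not report.

```shell
# Print "array: member" for every RAID member marked failed, (F),
# in /proc/mdstat-style text read from standard input.
failed_md_members() {
  awk '
    /^md/ {
      for (i = 1; i <= NF; i++)
        if ($i ~ /\(F\)$/) {
          member = $i
          sub(/\[.*/, "", member)   # drop the "[n](F)" suffix
          print $1 ": " member
        }
    }
  '
}

# Example with illustrative array and partition names.
printf '%s\n' \
  'md0 : active raid1 nvme1n1p2[1](F) nvme0n1p2[0]' \
  '      1874715648 blocks super 1.2 [2/1] [U_]' | failed_md_members
```

On a live system the input would come from the file itself, for example failed_md_members < /proc/mdstat.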
  • Page 55 NVIDIA DGX H100/H200 Service Manual Rotate the locking mechanism for the PCI carrier out of the way: Loosen the captive screw on the support bracket of the M.2 riser card: 8.3. Remove the M.2 Boot Drive Carrier...
  • Page 56 NVIDIA DGX H100/H200 Service Manual Pull the M.2 riser card from the slot: Lift the M.2 riser card to remove it from the system: Chapter 8. M.2 NVMe Boot Drive Replacement...
  • Page 57: Remove The M.2 Drive

    NVIDIA DGX H100/H200 Service Manual 8.4. Remove the M.2 Drive Before attempting to remove one of the M.2 NVMe drives, make sure that you performed the following prerequisites: ▶ Determined the location ID of the faulty M.2 drive. ▶ Obtained the replacement M.2 drive and have saved the packaging for use when returning the faulty drive.
  • Page 58 NVIDIA DGX H100/H200 Service Manual Pull the left end of the M.2 drive up about 30°: To pull the M.2 drive out, raise it slightly, up to 30°, and pull the drive off the socket as shown in the following figure:...
  • Page 59: Replace The M.2 Drive

    NVIDIA DGX H100/H200 Service Manual 8.5. Replace the M.2 Drive To insert the M.2 drive, set it at an angle and insert it into the connector: Lower the M.2 drive and align it with the screw post: Install and tighten the screw to secure the drive to the riser:...
  • Page 60: Install The M.2 Boot Drive Carrier And Close The System

    NVIDIA DGX H100/H200 Service Manual 8.6. Install the M.2 Boot Drive Carrier and Close the System Position the M.2 riser card into the system: Install the M.2 carrier card into the PCI riser by aligning it with the slot and then pressing it against the riser: Chapter 8.
  • Page 61 NVIDIA DGX H100/H200 Service Manual Tighten the captive screw on the support bracket of the M.2 riser card: Close the latch to secure the M.2 carrier in place: 8.6. Install the M.2 Boot Drive Carrier and Close the System...
  • Page 62: Integrate The New Drive And Complete Installation

    NVIDIA DGX H100/H200 Service Manual Tighten the thumb screw to make sure the locking mechanism stays in place: 8.7. Integrate the New Drive and Complete Installation Return the motherboard to its regular position and power on the system. Refer to Motherboard Tray - Opening and Closing the IO door for more information.
  • Page 63 In this case, make sure the name of the replacement drive is correct and try again. Use the packaging from the new drive to ship the failed drive back to NVIDIA Enterprise Support. Note: If your organization purchased a media retention policy, you might be able to keep failed drives for destruction.
  • Page 65: Boot Drive Assembly Replacement

    Note: If your organization purchased a media retention policy, you might be able to keep failed drives for destruction. Check with NVIDIA Enterprise Support on the status of the policy for specifics. Get a replacement M.2 boot drive assembly from NVIDIA Enterprise Support Make sure the system is shut down If cables don’t reach, label all cables and unplug them from the motherboard tray...
  • Page 66: Preparing The System For Replacement

    This failure is hard to diagnose because the system won’t boot, as both boot drives are unavailable. After the replacement part arrives from NVIDIA, shut down the system from the front power button or from the BMC user interface and proceed by opening the IO door of the motherboard. Refer to Motherboard Tray - Opening and Closing the IO door to get access to the M.2 boot drive carrier.
  • Page 67 NVIDIA DGX H100/H200 Service Manual Loosen the captive screw on the support bracket of the M.2 riser card: Pull the M.2 riser card from the slot: 9.3. Remove the M.2 Boot Drive Carrier...
  • Page 68 NVIDIA DGX H100/H200 Service Manual Lift the M.2 riser card to remove it from the system: Chapter 9. M.2 Boot Drive Assembly Replacement...
  • Page 69: Install The M.2 Boot Drive Carrier And Close The System

    NVIDIA DGX H100/H200 Service Manual 9.4. Install the M.2 Boot Drive Carrier and Close the System Position the M.2 riser card into the system: Install the M.2 carrier card into the PCI riser by aligning it with the slot and then pressing it against the riser: Tighten the captive screw on the support bracket of the M.2 riser card:...
  • Page 70 NVIDIA DGX H100/H200 Service Manual Close the latch to secure the M.2 carrier in place: Tighten the thumb screw to make sure the locking mechanism stays in place: Chapter 9. M.2 Boot Drive Assembly Replacement...
  • Page 71: Re-Install The System And Complete The Procedure

    Reinstall the system following the instructions in the DGX OS User Guide. Confirm the system is in working order by running: sudo nvsm show health Use the packaging from the new component to ship the failed one back to NVIDIA Enterprise Support. 9.5. Re-Install the System and Complete the Procedure...
  • Page 73: Dimm Replacement

    Insert the motherboard tray into the system Plug in all cables using the labels as a reference Power on the system Verify that all DIMMs are now healthy with sudo nvsm show health Ship back the failed unit to NVIDIA Enterprise Support using the packaging provided...
  • Page 74: Identifying The Failed Dimm

    From the console, run the following nvsm command to identify memory alerts: sudo nvsm show health Determine the DIMM manufacturer. sudo nvsm show memory Request the replacement DIMM from NVIDIA Enterprise Support, specifying the manufacturer. 10.3. Replacing the DIMM Power off the system. Remove the motherboard tray. Refer to...
  • Page 75 NVIDIA DGX H100/H200 Service Manual Remove the DIMM. Press down on the side latches at both ends of the DIMM socket to push them away from the DIMM. This should unseat the DIMM from the socket. 10.3. Replacing the DIMM...
  • Page 76: Finalize Dimm Replacement

    NVIDIA DGX H100/H200 Service Manual To install the DIMM, make sure both levers are in the open position. Make sure the DIMM is correctly aligned with the key in the right position and press down on the DIMM until it clicks in the socket and the levers close.
  • Page 77 NVIDIA DGX H100/H200 Service Manual Power on the system. Log in and use the nvsm command to confirm the system is healthy: sudo nvsm show health Ship the bad DIMM back to NVIDIA Enterprise Support. 10.4. Finalize DIMM Replacement...
  • Page 79: Network Interface Card Replacement

    Chapter 11. Network Interface Card Replacement 11.1. Network Card Replacement Overview This is a high-level overview of the procedure to replace one or more network cards on the NVIDIA DGX™ H100/H200 system. Identify the failed card Get a replacement Ethernet card from NVIDIA Enterprise Support Make sure the system is shut down If cables don’t reach, label all cables and unplug them from the motherboard tray...
  • Page 80: Remove The Non-Functional Card

    NVIDIA DGX H100/H200 Service Manual After you rule out external connectivity issues, contact NVIDIA Enterprise Support to receive a replacement card. When you receive the card, begin the replacement by performing the following actions: ▶ Power off the system.
  • Page 81: Install The New Card And Close The Lock

    NVIDIA DGX H100/H200 Service Manual Remove the card from the system: 11.4. Install the New Card and Close the Lock Position the PCI card in the system: Push the card into the PCI slot: 11.4. Install the New Card and Close the Lock...
  • Page 82 NVIDIA DGX H100/H200 Service Manual Close the latch to lock the PCI cards in place: Secure the locking mechanism by tightening the black thumb screw: Chapter 11. Network Interface Card Replacement...
  • Page 83: Finalize The Network Interface Card Replacement

    NVIDIA DGX H100/H200 Service Manual 11.5. Finalize the Network Interface Card Replacement Refer to Motherboard Tray - Opening and Closing the IO door for information about performing the following actions: Close the motherboard tray IO door. Lock the motherboard lid.
  • Page 85: Updating The Connectx-7 Firmware

    Firmware After replacing or installing the ConnectX-7 cards, make sure the firmware on the cards is up to date. Refer to the NVIDIA DGX H100/H200 Firmware Update Guide to find the most recent firmware version. Download the firmware from https://network.nvidia.com/support/firmware/connectx7ib/.
  • Page 87: Connectx-7 I/O Replacement

    Slide the motherboard back into the system Plug in all cables using the labels as a reference Power on the system Update the firmware if necessary and test the ConnectX-7 IO card Ship back the failed unit to NVIDIA Enterprise Support using the packaging provided...
  • Page 88: Prepare The System For Replacement

    13.2. Prepare the System for Replacement First, identify which IO card to replace. Use the nvsm command or network tools to determine which card failed. After you have this information, contact NVIDIA Enterprise Support to get a replacement. When the card arrives, power off the system.
  • Page 89: Remove An Ipex Cable

    NVIDIA DGX H100/H200 Service Manual Before you pull the card too far, remove the white and black IPEX cables from the card. The white cable connects on top of the card and the black cable connects on the bottom (heatsink) of the card: Follow the instructions in the next steps to remove and insert the IPEX connectors.
  • Page 90: Insert An Ipex Cable

    NVIDIA DGX H100/H200 Service Manual Push the cable away from the connector: 13.6. Insert an IPEX Cable Align the IPEX cable to the connector: Press the cable into the connector: Confirm the cable is in the connector: Close the latching mechanism:...
  • Page 91: Install Connectx Card

    NVIDIA DGX H100/H200 Service Manual Make sure the cable is locked to the connector on the board: 13.7. Install ConnectX Card After you connect the IPEX cables, install the new card in the slot: Confirm the card is in place and that the cables are connected:...
  • Page 92: Install The I/O Card Above The Connectx Card

    Update the firmware on the card. Refer to the NVIDIA ConnectX-7 User Guide. Use the nvsm command to confirm that the system is working correctly: sudo nvsm show health Use the packaging from the new component to ship the failed one back to NVIDIA Enterprise Support. Chapter 13. ConnectX-7 I/O Replacement...
  • Page 93: Front Console Board Replacement Overview

    Chapter 14. Front Console Board Replacement 14.1. Front Console Board Replacement Overview This is a high-level overview of the procedure to replace the front console board on the NVIDIA DGX™ H100/H200 system. Unpack the new front console board Shut down the system...
  • Page 94 NVIDIA DGX H100/H200 Service Manual Caution: Static Sensitive Devices: Be sure to observe best practices for electrostatic discharge (ESD) protection. This includes making sure personnel and equipment are connected to a common ground, such as by wearing a wrist strap connected to the chassis ground, and placing components on static-free work surfaces.
  • Page 95 NVIDIA DGX H100/H200 Service Manual Tighten the screws: 14.2. Front Console Board Replacement...
  • Page 96 Power on the system and confirm the ports work. Run sudo nvsm show health to confirm the temperature sensor is working properly. Replace the bezel. Ship back the failed unit to NVIDIA Enterprise Support using the packaging provided. Chapter 14. Front Console Board Replacement...
  • Page 97: Motherboard Tray Battery Replacement

    15.1. Motherboard Tray Battery Replacement Overview You can replace the motherboard tray battery of the NVIDIA DGX™ H100/H200 system by performing the following high-level steps: Get a replacement battery - type CR2032. Shut down the system.
  • Page 98: Identify A Failed Battery

    Call NVIDIA Enterprise Support to confirm that the battery is the right component to replace. Note: The CR2032 battery is not provided by NVIDIA, but it is easy to find at a convenience store. After you purchase a battery, perform the following procedures.
  • Page 99 NVIDIA DGX H100/H200 Service Manual Rotate the locking mechanism for the PCI carrier out of the way: Pull the card out of the slot: 15.4. Remove the PCI Ethernet Card...
  • Page 100: Remove The Connectx Card

    NVIDIA DGX H100/H200 Service Manual Remove the card: 15.5. Remove the ConnectX Card Pull the card out of the slot: Before you pull the card too far, remove the white and black IPEX cables from the card. The white cable connects on top of the card and the black cable connects on the bottom (heatsink) of the card: Chapter 15.
  • Page 101: Remove An Ipex Cable

    NVIDIA DGX H100/H200 Service Manual Follow the instructions in the next steps to remove and insert the IPEX connectors. 15.6. Remove an IPEX Cable Repeat this process for both white and black cables. Lift the locking door: Push the cable away from the connector:...
  • Page 102: Replace The Battery

    NVIDIA DGX H100/H200 Service Manual 15.7. Replace the Battery Use a thin tool to gently lift the battery from the battery holder: Rotate the battery as shown in the following figure: Replace the battery with a new CR2032, installing it in the battery holder. Make sure the positive side is on top: Chapter 15.
  • Page 103: Insert An Ipex Cable

    NVIDIA DGX H100/H200 Service Manual 15.8. Insert an IPEX Cable Align the IPEX cable to the connector: Press the cable into the connector: Confirm the cable is in the connector: 15.8. Insert an IPEX Cable...
  • Page 104: Install Connectx Card

    NVIDIA DGX H100/H200 Service Manual Close the latching mechanism: Make sure the cable is locked to the connector on the board: 15.9. Install ConnectX Card After you connect the IPEX cables, install the new card in the slot: Chapter 15. Motherboard Tray Battery Replacement...
  • Page 105: Install The Pci Ethernet Card

    NVIDIA DGX H100/H200 Service Manual Confirm the card is in place and that the cables are connected: 15.10. Install the PCI Ethernet Card Position the card in the system: 15.10. Install the PCI Ethernet Card...
  • Page 106 NVIDIA DGX H100/H200 Service Manual Push the card into the PCI slot: Close the latch to lock the PCI cards in place: Chapter 15. Motherboard Tray Battery Replacement...
  • Page 107: Power On The System And Confirm Replacement

    NVIDIA DGX H100/H200 Service Manual Tighten the thumbscrew to make sure the locking latch mechanism stays in place: 15.11. Power On the System and Confirm Replacement Close the motherboard tray IO door and insert the motherboard tray. Refer to Motherboard Tray - Opening and Closing the IO door for more information.
  • Page 108 NVIDIA DGX H100/H200 Service Manual sudo date [MMDDhhmm[[CC]YY][.ss]] Sync the date and time to the hardware real time clock: sudo hwclock -w Reset the BMC: sudo ipmitool mc reset cold Confirm that the time and date on the system are updated: sudo nvsm show health Chapter 15.
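The MMDDhhmm[[CC]YY][.ss] argument is easy to get wrong by hand. A minimal sketch for building it from a readable timestamp, assuming GNU date (the -d option is a GNU extension); date_set_arg is a hypothetical helper name and the timestamp is an example value.

```shell
# Turn a readable timestamp into the MMDDhhmmCCYY.ss form that
# the date command accepts when setting the clock.
date_set_arg() {
  date -d "$1" +%m%d%H%M%Y.%S
}

# Example: 5 September 2024, 14:30:15 local time.
date_set_arg '2024-09-05 14:30:15'
```

The printed string can then be passed to sudo date before running the hwclock and BMC reset steps above. The function only formats the value; it does not change the system clock.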
  • Page 109: Trusted Platform Module Replacement

    16.1. Trusted Platform Module Replacement Overview This is a high-level overview of the procedure to replace the trusted platform module (TPM) on the NVIDIA DGX™ H100/H200 system. If enabled, disable drive encryption. Shut down the system. Label all motherboard tray cables and unplug them.
  • Page 110: Prepare The System For Replacement

    NVIDIA DGX H100/H200 Service Manual 16.2. Prepare the System for Replacement If data drives are encrypted, the tpm2 OS package is installed, and the TPM is enabled in SBIOS, disable encryption: sudo nv-disk-encrypt disable Power down the system. Remove the motherboard tray. Refer to...
  • Page 111 NVIDIA DGX H100/H200 Service Manual Rotate the OSFP carrier module to access the TPM, as shown in the following diagram: Replace the TPM. Make sure that you position the TPM in the same direction as the original. 16.3. Replace the TPM Module...
  • Page 112: Install Osfp Carrier Module

    NVIDIA DGX H100/H200 Service Manual 16.4. Install OSFP Carrier Module Rotate the OSFP carrier module to return it to the original position. While you rotate the module, pull the module toward the DIMMs so that the ports do not interfere with the motherboard tray...
  • Page 113: Finalize Tpm Replacement

    NVIDIA DGX H100/H200 Service Manual 16.5. Finalize TPM replacement Install the air baffles, close the motherboard, and install the tray in the chassis. Refer to Motherboard Tray - Removal and Installation for more information. Plug in all cables. Install all power cords.
  • Page 115: Removing And Attaching The Bezel

    Chapter 17. Removing and Attaching the Bezel 17.1. Bezel Removal Grab the bezel on both sides by the side handles. Pull the bezel away from the system with a horizontal motion to release it from the magnets that keep it in place.
  • Page 116: Bezel Installation

    NVIDIA DGX H100/H200 Service Manual 17.2. Bezel Installation Align the pins on the bezel to the notches on the system fascia. Chapter 17. Removing and Attaching the Bezel...
  • Page 117 NVIDIA DGX H100/H200 Service Manual Attach the bezel to the system making sure the pins fit in the notches and that the magnetic latch holds the bezel securely in place. 17.2. Bezel Installation...
  • Page 119: Rack Mount Kit Replacement

    Chapter 18. Rack Mount Kit Replacement Remove the two front screws and washers. Remove the two rear screws. Use the clips to release the front and rear from each side of the kit. Remove the cage nuts from the rack posts. Install on the new rack by using the clips to position the kit at the right height. Use the template to install the cage nuts in the right location. Use the four screws and two washers to secure the rack mount kit in place...
  • Page 120: Remove Rack Mount Kit - Front Rack

    On the lower part, there is a lip, labeled ‘1’, that, when installed in a rack, holds the system in place as if it were a shelf. On either end, labeled ‘2’ on the diagram, there are spring-loaded prongs that fit into the rack’s holes (either square or round).
  • Page 121: Remove Rack Mount Kit - Rear

    18.3. Remove Rack Mount Kit - Rear To release the rear of the rack mount kit, remove the round head screw and keep it with the other screws and washers.
  • Page 122 Pull on the metal clip and slide the rail away from the post so the prongs are free from the rack.
  • Page 123: Confirm Necessary Screws And Washers

    18.4. Confirm Necessary Screws and Washers These items are in the rack mount kit box with the rack mount kit. All these components should have been removed from the previous installation. Note: The front screws are different from the screws used for the back of the rack mount kit. If the correct screws are not used in the front, the server will not be flush when pushed against the rack and it will be difficult to secure the other eight captive screws.
  • Page 124: Install Cage Nuts Using Template

    18.5. Install Cage Nuts Using Template A printed copy of this template is included as part of the rack kit. Use it to align the desired location of the system with the positions where the included cage nuts should be installed. The template is double sided so it can be used as a reference on the left and right posts of the rack.
  • Page 125 NVIDIA DGX H100/H200 Service Manual Note: RACKS WITH C-CHANNEL POSTS: They have an obstruction that prevents the rack mount kit from being installed in the front-most post - use a third pair of cage nuts so the bottom system screws have something to engage with.
  • Page 126: Install Rack Mount Kit - Front

    NVIDIA DGX H100/H200 Service Manual 18.6. Install Rack Mount Kit - Front To install the rack mount kit on the rack, start with either side. We will describe the installation of the left side. The first step is to align the lip to the bottom of the rack unit where the system needs to be installed as shown in the diagram.
  • Page 127: Install Rack Mount Kit - Rear

    NVIDIA DGX H100/H200 Service Manual 18.7. Install Rack Mount Kit - Rear To install the rear section of the rack mount kit, follow the same steps to align the bottom lip to the bottom of where the system should be.
  • Page 128 NVIDIA DGX H100/H200 Service Manual Repeat the procedure for the right side rack mount kit. Chapter 18. Rack Mount Kit Replacement...
  • Page 129: Safety

    Chapter 19. Safety This section provides information about how to safely use the NVIDIA DGX™ H100/H200 system. 19.1. Safety Information To reduce the risk of bodily injury, electrical shock, fire, and equipment damage, read this document and observe all warnings and precautions in this guide before installing or maintaining your server product.
  • Page 130: Intended Application Uses

    Indicates hot components or surfaces. Indicates that you must not touch fan blades; doing so may result in injury. Shock hazard: The product might be equipped with multiple power cords. - To remove all hazardous voltages, disconnect all power cords. - High leakage current: a ground (earth) connection to the power supply is essential before connecting the supply.
  • Page 131: Equipment Handling Practices

    NVIDIA DGX H100/H200 Service Manual ▶ In regions that are susceptible to electrical storms, we recommend you plug your system into a surge suppressor and disconnect telecommunication lines to your modem during an electrical storm. ▶ Provided with a properly grounded wall outlet.
  • Page 132: Power Cord Warnings

    NVIDIA DGX H100/H200 Service Manual 19.6.2. Power Cord Warnings Caution: To avoid electrical shock or fire, check the power cord(s) that will be used with the product as follows: ▶ Do not attempt to modify or use the AC power cord(s) if they are not the exact type required to fit into the grounded electrical outlets.
  • Page 133: Rack Mount Warnings

    Caution: To avoid injury, do not contact moving fan blades. Your system is supplied with a guard over the fan; do not operate the system without the fan guard in place. 19.8. Rack Mount Warnings The following installation guidelines are required by UL to maintain safety compliance when installing your system into a rack.
  • Page 134: Electrostatic Discharge

    19.10.2. NICKEL NVIDIA Bezel. The bezel’s decorative metal foam contains some nickel. The metal foam is not intended for direct and prolonged skin contact. Please use the handles to remove, attach or carry the bezel. While nickel exposure is unlikely to be a problem, you should be aware of the possibility in case you are susceptible to nickel-related reactions.
  • Page 135: Cooling And Airflow

    NVIDIA DGX H100/H200 Service Manual Do not attempt to disassemble, puncture, or otherwise damage a battery. 19.10.4. Cooling and Airflow Caution: Carefully route cables as directed to minimize airflow blockage and cooling problems. For proper cooling and airflow, operate the system only with the chassis covers installed.
  • Page 137: Compliance

    Chapter 20. Compliance The NVIDIA DGX™ H100/H200 System is compliant with the regulations listed in this section. 20.1. United States Federal Communications Commission (FCC) FCC Marking (Class A) This device complies with part 15 of the FCC Rules. Operation is subject to the following two conditions: (1) this device may not cause harmful interference, and (2) this device must accept any interference received, including any interference that may cause undesired operation of the device.
  • Page 138: Canada

    The full text of the EU declaration of conformity is available at the following URL: http://www.nvidia.com/support. A copy of the Declaration of Conformity to the essential requirements may be obtained directly from NVIDIA GmbH (Bavaria Towers – Blue Tower, Einsteinstrasse 172, D-81677 Munich, Germany).
  • Page 139: Australia And New Zealand

    NVIDIA DGX H100/H200 Service Manual 20.5. Australia and New Zealand Australian Communications and Media Authority This product meets the applicable EMC requirements for Class A, I.T.E equipment. 20.6. Brazil INMETRO 20.7. Japan Voluntary Control Council for Interference (VCCI) 20.5. Australia and New Zealand...
  • Page 140 NVIDIA DGX H100/H200 Service Manual This is a Class A product. In a domestic environment this product may cause radio interference, in which case the user may be required to take corrective actions. VCCI-A. Japan RoHS Material Content Declaration Chapter 20. Compliance...
  • Page 141: South Korea

    20.8. South Korea Korean Agency for Technology and Standards (KATS) Class A Equipment (Industrial Broadcasting & Communication Equipment). This equipment is industrial (Class A) electromagnetic wave suitability equipment. The seller or user should take notice of this, and the equipment is intended for use in places other than the home.
  • Page 142: China

    Korea RoHS Material Content Declaration 20.9. China China Compulsory Certificate No certification is needed for China. The NVIDIA DGX H100/H200 is a server with power consumption greater than 1.3 kW.
  • Page 143 NVIDIA DGX H100/H200 Service Manual China RoHS Material Content Declaration 20.9. China...
  • Page 144: Taiwan

    NVIDIA DGX H100/H200 Service Manual 20.10. Taiwan Bureau of Standards, Metrology & Inspection (BSMI) Chapter 20. Compliance...
  • Page 145: Russia/Kazakhstan/Belarus

    Taiwan RoHS Material Content Declaration 20.11. Russia/Kazakhstan/Belarus Customs Union Technical Regulations (CU TR) This device complies with the technical regulations of the Customs Union (CU TR). Technical Regulation of the Customs Union: On the Safety of Low-Voltage Equipment (TR CU 004/2011). Technical...
  • Page 146: Israel

    20.12. Israel 20.13. India Bureau of Indian Standards (BIS) Authenticity may be verified by visiting the Bureau of Indian Standards website at http://www.bis.gov.
  • Page 147: South Africa

    SI 2012/3032: The Restriction of the Use of Certain Hazardous Substances in Electrical and Electronic Equipment (As Amended) A copy of the Declaration of Conformity to the essential requirements may be obtained directly from NVIDIA Ltd. (100 Brook Drive, 3rd Floor Green Park, Reading RG2 6UJ, United Kingdom). 20.14. South Africa...
  • Page 149: Third-Party License Notices

    Chapter 21. Third-Party License Notices This NVIDIA product contains third party software that is being made available to you under their respective open source software licenses. Some of those licenses also require specific legal information to be included in the product. This section provides such information.
  • Page 150: Mellanox (Ofed)

    NVIDIA DGX H100/H200 Service Manual INFORMATION) ARISING OUT OF YOUR USE OF OR INABILITY TO USE THE SOFTWARE, EVEN IF MTI HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Because some jurisdictions prohibit the exclusion or limitation of liability for consequential or incidental damages, the above limitation may not apply to you.
  • Page 151: Notices

    NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.
  • Page 152: Trademarks

    OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CON-...

This manual is also suitable for:

DGX H100