Nvidia DGX A100 Service Manual

NVIDIA DGX A100 System
Service Manual
DU-10044-001 _v01 | August 2020

Summary of Contents for Nvidia DGX A100

  • Page 1 NVIDIA DGX A100 System Service Manual, DU-10044-001 _v01, August 2020
  • Page 2: Table Of Contents

    8.1. Recreating the Cache RAID 0 Volume
    8.2. Returning the NVMe Drive
    Chapter 9. M.2 NVMe Boot Drive Replacement
    9.1. M.2 NVMe Boot Drive Replacement Overview
    9.2. Identifying the Failed M.2 NVMe
    9.3. Replacing the M.2 NVMe Drive
  • Page 3

    Chapter 15. Motherboard Tray Battery Replacement
    15.1. Motherboard Tray Battery Replacement Overview
    15.2. Replacing the Motherboard Tray Battery
    Chapter 16. Removing and Attaching the Bezel
    Chapter 17. Installing the Rack Mount Kit
    17.1. Installing the Rails
    17.2. Installing the Cage Nuts
  • Page 4: List of Figures

    Figure 1. NVMe Drives: PCIe to Slot Mapping
  • Page 5: List of Tables

    Table 1. Network Card Slot IDs
  • Page 7: Chapter 1. Introduction

    A100 system components. Be sure to familiarize yourself with the NVIDIA Terms & Conditions documents before attempting to perform any modification or repair to the DGX A100 system. These Terms & Conditions for the DGX A100 system can be found through the NVIDIA DGX Systems Support page.
  • Page 8: Recommended Tools

    Contact NVIDIA Enterprise Support for assistance in reporting, troubleshooting, or diagnosing problems with your DGX A100 system. Also contact NVIDIA Enterprise Support for assistance in installing or moving the DGX A100 system. For details on how to obtain support, visit the NVIDIA Enterprise Support web site (https://www.nvidia.com/en-us/support/enterprise/). 1.4. ...
  • Page 9: Chapter 2. Front Fan Module Replacement

    1. Identify the failed front fan module through the BMC or the fan module LED and submit a service ticket to NVIDIA Enterprise Support. 2. Get a replacement from NVIDIA Enterprise Support. 3. Remove the failed fan module using the fan numbering diagram as a reference.
  • Page 10 Using the BMC Dashboard and NVSM 1. Identify the faulty fan module using the BMC dashboard. a). Log on to the BMC. b). Click Sensor from the left navigation menu, then review the Normal Sensors section.
  • Page 11: Replacing And Returning The Front Fan Module

    1. Remove the new fan module from its packaging and be ready to install it. 2. Remove the failed fan module by pressing on the release button on the top of the module and pulling on the handle.
  • Page 12 Viewing the state of the fan module on the BMC dashboard. ‣ Using NVSM (sudo nvsm show fans). 5. Use the packaging to pack up the bad fan and follow the shipping instructions to return the bad fan to NVIDIA Enterprise Support.
  • Page 13: Chapter 3. Power Supply Replacement

    This chapter describes how to replace one of the DGX A100 system power supplies (PSUs). 3.1. Power Supply Replacement Overview This is a high-level overview of the steps needed to replace a power supply. 1. Identify the failed power supply through the BMC and submit a service ticket.
  • Page 14 Look for power supplies with no temperature reading or an output reading close to or equal to zero. Both NVSM and the BMC identify each power supply as PSUx, where x is from 0 to 5. The following diagram shows the physical location of each PSU.
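
The two symptoms named on this page (no temperature reading, or an output reading at or near zero) can be expressed as a small filter. The following is only an illustrative sketch and not NVIDIA tooling: the `PsuReading` record, its field names, and the 5 W "close to zero" threshold are assumptions made for the example. Only the PSU0 to PSU5 naming and the two symptoms come from the manual.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed cutoff for "close to or equal to zero"; the manual gives no number.
NEAR_ZERO_WATTS = 5.0


@dataclass
class PsuReading:
    """Hypothetical record of one PSU's sensor values."""
    name: str                        # "PSU0" .. "PSU5", as named by NVSM and the BMC
    temperature_c: Optional[float]   # None models "no temperature reading"
    output_watts: float


def suspect_psus(readings: List[PsuReading]) -> List[str]:
    """Return names of PSUs showing either symptom described in the manual."""
    return [
        r.name
        for r in readings
        if r.temperature_c is None or r.output_watts <= NEAR_ZERO_WATTS
    ]


# Example data (invented): five healthy PSUs and one with no readings.
readings = [PsuReading(f"PSU{i}", 35.0, 450.0) for i in range(6)]
readings[2] = PsuReading("PSU2", None, 0.0)
print(suspect_psus(readings))
```

A PSU flagged this way is only a candidate; the manual's procedure still has you confirm it and open a service ticket.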
  • Page 15: Replacing The Power Supply

    ‣ If the three remaining PSUs are working and energized, then you do not need to shut down power to the DGX A100 system. ‣ If fewer than three PSUs are working and energized, then shut down power to the DGX A100 system.
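
The shutdown rule above reduces to a single comparison. This sketch is one reading of the excerpt, under the assumption that three or more working, energized PSUs are sufficient; the function name and the constant are illustrative and do not come from the manual.

```python
# Threshold taken from the rule above: fewer than three working PSUs
# means the system should be powered down before the replacement.
MIN_WORKING_PSUS = 3


def must_shut_down(working_psus: int) -> bool:
    """Return True if the system should be shut down before replacing a PSU.

    `working_psus` counts the remaining PSUs that are working and energized,
    not including the one being replaced.
    """
    if working_psus < 0:
        raise ValueError("working_psus cannot be negative")
    return working_psus < MIN_WORKING_PSUS


for count in (2, 3):
    action = "shut down first" if must_shut_down(count) else "hot-swap is allowed"
    print(f"{count} working PSUs: {action}")
```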
  • Page 16 Viewing the PSU status from the BMC dashboard-> page. ‣ Running nvsm show health to confirm all power supplies are healthy. Pack the old power supply and ship it back to NVIDIA Enterprise Support.
  • Page 17: Chapter 4. Motherboard Tray - Accessing In Place

    Motherboard tray battery ‣ Single-port or dual-port CX-6 PCI network adapter card 4.1. Accessing the Motherboard Tray 1. Loosen the two motherboard thumbscrews and then pull the handles out to eject the motherboard tray.
  • Page 18 Motherboard Tray - Accessing in Place 2. Pull the motherboard tray out of the system until it locks, then loosen the two thumbscrews holding the lid in place. 3. Lift the rear section of the motherboard lid.
  • Page 19: Replacing The Motherboard Tray

    4.2. Replacing the Motherboard Tray 1. Close the lid to the motherboard tray. 2. Tighten the two thumbscrews and then push the motherboard tray into the system.
  • Page 20 Motherboard Tray - Accessing in Place 3. Close the handles to secure the motherboard tray in place. 4. Tighten the motherboard tray thumbscrews to complete the motherboard insertion.
  • Page 22: Chapter 5. Motherboard Tray - Removal And Installation

    You will need to completely remove the motherboard tray from the server in order to service the following components. ‣ DIMMs (either adding or replacing) 5.1. Removing the Motherboard Tray 1. Loosen the two motherboard thumbscrews and then pull the handles out to eject the motherboard tray.
  • Page 23 Place the tray on a solid, flat work surface. 3. Loosen the two rear thumbscrews on the motherboard lid.
  • Page 24 Motherboard Tray - Removal and Installation 4. Loosen the two front thumbscrews on the motherboard tray lid. 5. Lift the lid off of the tray and set aside.
  • Page 25: Reinstalling The Motherboard Tray

    6. Remove all three air baffles to allow access to the DIMMs. 5.2. Reinstalling the Motherboard Tray 1. Reinstall the three air baffles.
  • Page 26 Motherboard Tray - Removal and Installation 2. Replace and secure the lid. a). Install the lid. b). Tighten the rear thumbscrews.
  • Page 27 Motherboard Tray - Removal and Installation c). Tighten the front thumbscrews. 3. Slide the motherboard tray into the slot, open the tray handles, and then continue pushing the motherboard tray in.
  • Page 28 Motherboard Tray - Removal and Installation 4. Close the handles to secure the motherboard tray in place. 5. Tighten the motherboard tray thumbscrews to complete the motherboard insertion.
  • Page 30: Chapter 6. U.2 Nvme Cache Drive Upgrade From 4 To 8

    U.2 NVMe Cache Drive Upgrade Overview This is a high-level overview of the steps needed to upgrade the DGX A100 system's cache size. 1. Identify the manufacturer and model of the currently installed NVMe drives. 2. Place an order for four additional NVMe drives.
  • Page 31: Installing The Optional Nvme Drives

    5. Install the four additional NVMe drives in slots 1, 3, 5, and 7. a). Unlock the release lever and then slide the drive into the slot until the front face is flush with the other drives. b). Close the lever and lock it in place.
  • Page 32 U.2 NVMe Cache Drive Upgrade from 4 to 8 6. Power on the system. Perform the tasks described in the chapter U.2 NVMe Cache Drive Post-Installation Tasks.
  • Page 33: Chapter 7. U.2 Nvme Cache Drive Replacement

    /raid 6. Confirm the system is healthy by running nvsm show health. 7. Ship the failed unit back to NVIDIA Enterprise Support using the provided packaging. 7.2. Identifying the Failed U.2 NVMe Identifying the Failed NVMe from the Front If physical access to the system is available, you can identify a failed drive by the illuminated amber LED.
  • Page 34 U.2 NVMe Cache Drive Replacement Identifying the Failed NVMe from the Console To identify the failed NVMe drive from the DGX A100 console, enter the following and then look for drive alerts in the output to identify the failed drive.
  • Page 35: Replacing The U.2 Nvme Drive

    NVMe from NVIDIA Enterprise Support, specifying this information. 7.3.  Replacing the U.2 NVMe Drive 1. Be sure you have requested and obtained the replacement drive from NVIDIA Enterprise Support. 2. Back up any critical data to a network shared volume or some other means of backup.
  • Page 36 U.2 NVMe Cache Drive Replacement 6. Power on the system. Perform the tasks described in the chapter U.2 NVMe Cache Drive Post-Installation Tasks.
  • Page 37: Chapter 8. U.2 Nvme Cache Drive Post-Installation Tasks

    ~ : @ % ^ + = _ , sudo nv-disk-encrypt lock ‣ To allow the encryption software to randomly generate the passwords, issue the following. sudo nv-disk-encrypt init -k <your-vault-password> -g -r
  • Page 38: Returning The Nvme Drive

    NVIDIA Enterprise Support. Note: If your organization has purchased a media retention policy, you may be able to keep failed drives for destruction. Check with NVIDIA Enterprise Support on the status of the policy for specifics.
  • Page 39: Chapter 9. M.2 Nvme Boot Drive Replacement

    13. Ship back the failed unit to NVIDIA Enterprise Support using the packaging provided. 9.2. Identifying the Failed M.2 NVMe The DGX A100 system automatically sets the failed M.2 drive offline when it detects the failure. 1. Identify which of the M.2 drives has failed (nvme0n1 or nvme1n1). sudo nvsm show health 2.
  • Page 40: Replacing The M.2 Nvme Drive

    3. Make a note of the device name for the failed drive (nvme0 or nvme1) and the device name for the good drive (nvme0 or nvme1). You will need this information when rebuilding the RAID 1 array after replacing the drive. 4. Obtain the replacement from NVIDIA Enterprise Support. 9.3.  Replacing the M.2 NVMe Drive Before attempting to replace one of the M.2 NVMe drives, be sure to have performed the...
  • Page 41 Using a Phillips #1 screwdriver, loosen the black screw that secures the drive in place. Note: The screw is not a captive screw and can drop. Be careful when loosening the screw to avoid dropping and losing it.
  • Page 42 Pull the drive to disconnect from the connector on the riser board, then insert the new drive into the connector on the riser board. e). Place the drive against the card and secure by tightening the screw using a Phillips #1 screwdriver.
  • Page 43: Rebuilding The Boot Drive Raid 1 Volume

    Rebuild the RAID 1 array according to the instructions in the section Rebuilding the Boot Drive RAID 1 Volume. 9.4. Rebuilding the Boot Drive RAID 1 Volume After replacing a faulty M.2 OS drive, you must rebuild the RAID 1 array.
  • Page 44: Returning The Nvme Drive

    1. If you have not already done so, boot the DGX A100 system and log in. 2. Rebuild the boot drive mirror. In the following steps, replace X with the number that corresponds to the replaced drive.
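
The bookkeeping needed for the rebuild (which drive was replaced, which one is still good, and the value of X) can be sketched as below. This is an illustration only. It assumes that X is the index embedded in the device name (0 for nvme0, 1 for nvme1), which the excerpt implies but does not state outright; the function name is hypothetical.

```python
# The two M.2 boot drives named in the manual.
M2_BOOT_DRIVES = ("nvme0", "nvme1")


def describe_replacement(failed_device: str):
    """Return (X, good_drive) for a failed M.2 boot drive.

    Accepts either the controller name ("nvme0") or the namespace
    name reported by the health check ("nvme0n1").
    """
    # Strip the "n1" namespace suffix if it is present.
    base = failed_device[:-2] if failed_device.endswith("n1") else failed_device
    if base not in M2_BOOT_DRIVES:
        raise ValueError(f"not an M.2 boot drive: {failed_device!r}")
    x = M2_BOOT_DRIVES.index(base)       # assumed meaning of X
    good_drive = M2_BOOT_DRIVES[1 - x]   # the other member of the mirror
    return x, good_drive


x, good = describe_replacement("nvme1n1")
print(f"replaced drive number X = {x}, surviving drive = {good}")
```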
  • Page 45: Chapter 10. M.2 Boot Drive Riser Assembly Replacement

    13. Ship back the failed unit to NVIDIA Enterprise Support using the packaging provided. 10.2. Determining a Failed M.2 NVMe Riser Assembly The following are the conditions for which NVIDIA Enterprise Support may instruct that the M.2 riser assembly be replaced:
  • Page 46: Replacing The M.2 Nvme Riser Assembly

    ‣ The DGX A100 cannot be booted. ‣ The boot drives cannot be seen from the SBIOS. ‣ The system indicates that the boot drives are not available when booting from the ISO image.
  • Page 47 Refer to the instructions in the section Replacing the Motherboard Tray. 7. Connect all the cables to the motherboard tray. 8. Re-install the DGX OS server software. See the DGX A100 User Guide for detailed instructions.
  • Page 48: Returning The Riser Assembly

    NVIDIA Enterprise Support. Note: If your organization has purchased a media retention policy, you may be able to keep failed drives for destruction. Check with NVIDIA Enterprise Support on the status of the policy for specifics.
  • Page 49: Chapter 11. Dimm Replacement

    1. Use the nvsm health commands to identify the failed DIMM. 2. Get a replacement DIMM from NVIDIA Enterprise Support. 3. Shut down the system. 4. Label all motherboard tray cables and unplug them. 5. Remove the motherboard tray and place on a solid flat surface.
  • Page 50: Replacing The Dimm

    DIMM ID of A1. Properties: system_name = .. component_id = CPU1_DIMM_A1. The output provides other information about the alert that can be provided to NVIDIA Enterprise Support. 3. Determine the DIMM manufacturer. sudo nvsm show memory 4. Request the replacement DIMM from NVIDIA Enterprise Support, specifying the manufacturer.
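
The `component_id` value in the alert carries both the CPU number and the DIMM ID. A minimal sketch of pulling them apart follows. It assumes every ID has the shape `CPU<n>_DIMM_<slot>` seen in the single example above (`CPU1_DIMM_A1`); other shapes are rejected rather than guessed at, and the function name is illustrative.

```python
import re

# Pattern inferred from the one example in the manual: CPU1_DIMM_A1.
_COMPONENT_ID = re.compile(r"^CPU(\d+)_DIMM_([A-Z]\d+)$")


def parse_dimm_component_id(component_id: str):
    """Split an alert's component_id into (cpu_number, dimm_id)."""
    match = _COMPONENT_ID.match(component_id.strip())
    if match is None:
        raise ValueError(f"unrecognized component_id: {component_id!r}")
    return int(match.group(1)), match.group(2)


cpu, dimm = parse_dimm_component_id("CPU1_DIMM_A1")
print(f"CPU {cpu}, DIMM {dimm}")
```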
  • Page 51 DIMM Replacement 5. Remove the DIMM. a). Press down on the side latches at both ends of the DIMM socket to push them away from the DIMM. This should unseat the DIMM from the socket.
  • Page 52 Position the DIMM over the socket, making sure that the notch on the DIMM lines up with the key in the slot, then press the DIMM down into the socket until the side latches click in place. c). Make sure that the latches are up and locked in place.
  • Page 53 10. Power on the system and log in. 11. Confirm that the system is healthy. sudo nvsm show health sudo nvsm show /systems/localhost/memory/alerts There should be no new alerts listed. 12. Ship the bad DIMM back to NVIDIA Enterprise Support.
  • Page 54: Chapter 12. Dimm Upgrade

    12. Verify that all DIMMs as well as the system are healthy using nvsm. 12.2. Identifying the DIMM Manufacturer 1. Determine the DIMM manufacturer. sudo nvsm show memory 2. Request the additional DIMMs from NVIDIA Enterprise Support, specifying the manufacturer. 12.3. Upgrading the DIMM
  • Page 55 3. Remove the motherboard tray. Refer to the instructions in the section Removing the Motherboard Tray. 4. Using the diagram label on the lid as a guide, locate the DIMMs to be installed during the upgrade.
  • Page 56 5. Remove the air baffles. Press down on the side latches at both ends of the air baffle to eject the module from the slot, then pull the air baffle out of the slot.
  • Page 57 9. Connect all the cables to the motherboard tray. 10. Install all the power cords. 11. Power on the system and log in. 12. Confirm that the total memory is now 2 TB. lsmem Total online memory:
  • Page 58 DIMM Upgrade 13. Confirm that the system is healthy. sudo nvsm show health
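
The memory check in step 12 can be automated by reading the "Total online memory:" line from captured `lsmem` output. The sketch below makes several assumptions: that the value is printed with a single-letter binary suffix (K, M, G, or T), that 2 TB in the manual corresponds to a reported `2T`, and that the sample text is representative. The sample output is invented for illustration and was not captured from a DGX A100.

```python
import re

# Binary multipliers for the single-letter size suffixes.
_UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}
_TOTAL_LINE = re.compile(r"\s*Total online memory:\s*([\d.]+)\s*([KMGT])")


def total_online_bytes(lsmem_output: str) -> int:
    """Extract the total online memory, in bytes, from lsmem text output."""
    for line in lsmem_output.splitlines():
        match = _TOTAL_LINE.match(line)
        if match:
            return int(float(match.group(1)) * _UNITS[match.group(2)])
    raise ValueError("no 'Total online memory:' line found")


def has_two_tib(lsmem_output: str) -> bool:
    """Return True if the reported total is exactly 2T."""
    return total_online_bytes(lsmem_output) == 2 * 2**40


# Illustrative sample only; real output will contain more lines.
sample = (
    "Memory block size:         2G\n"
    "Total online memory:       2T\n"
    "Total offline memory:      0B\n"
)
print(has_two_tib(sample))
```

On a live system the text would come from running `lsmem` and capturing its standard output; that step is left out here so the sketch stays runnable anywhere.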
  • Page 59: Chapter 13. Network Card Replacement

    1. Use the nvsm show commands to identify the failed network card. 2. Get a replacement card from NVIDIA Enterprise Support. 3. Shut down the system. 4. Label all motherboard tray cables and unplug them. 5. Remove the motherboard tray and open the lid.
  • Page 60: Replacing The Vertical Network Card

    1. Power down the system. 2. Label all network, monitor, and USB cables connected to the motherboard tray for easy identification when reconnecting. 3. Unplug all power cords, and all network, monitor, and USB cables.
  • Page 61: Replacing The Horizontal Network Card

    ‣ Determined the location ID of the faulty network card needing replacement. See Identifying the Failed Network Card. ‣ Obtained the replacement network card and have saved the packaging for use when returning the faulty component.
  • Page 62 Accessing the Motherboard Tray. 5. Unlock the horizontal network card. a). Loosen the black thumbscrew that secures the PCIe card locking mechanism in place. b). Open the locking mechanism by turning 90 degrees or more.
  • Page 63 6. Replace the card. a). Pull the network card out of the riser card slot. b). Replace the old network card with the new one. c). Install the network card into the riser card slot.
  • Page 64 Network Card Replacement 7. Lock the network card in place. a). Close the locking mechanism by turning it back into its slot. b). Tighten the black thumbscrew to secure the card in place.
  • Page 65 9. Connect all cables back into the network card ports. 10. Power on the system and log in. 11. Confirm that the system is healthy. sudo nvsm show health There should be no new alerts listed.
  • Page 66: Chapter 14. Front Console Board Replacement

    6. Tighten the screws. 7. Power on the system and confirm the ports work. 8. Ship the failed unit back to NVIDIA Enterprise Support using the provided packaging. 14.2.  Replacing the Front Console Board A front console board malfunction can be determined in a few ways.
  • Page 67 2. Remove the bezel. 3. Replace the front console board. a). Using a Phillips #2 screwdriver, loosen the two captive screws that secure the front console board. b). Replace the front console board. c). Tighten the screws.
  • Page 68 Front Console Board Replacement 4. Confirm functionality. a). Power on the system. b). Issue the following to confirm the temperature sensor is working properly. sudo nvsm show health 5. Return the old module to NVIDIA Enterprise Services.
  • Page 69: Chapter 15. Motherboard Tray Battery Replacement

    15.1. Motherboard Tray Battery Replacement Overview This is a high-level overview of the procedure to replace the DGX A100 system motherboard tray battery. 1. Get a replacement battery - type CR2032. 2. Shut down the system.
  • Page 70 Call NVIDIA Enterprise Support to confirm that the battery is the right component to replace. The CR2032 battery is not provided by NVIDIA, but can be purchased from a convenience store. CAUTION: Static Sensitive Devices: - Be sure to observe best practices for electrostatic discharge (ESD) protection.
  • Page 71 Motherboard Tray Battery Replacement 6. Replace the battery. a). Locate the battery, using the following image as a guide.
  • Page 72 Use a small flat-head screwdriver or similar thin tool to gently lift the battery from the battery holder. c). Replace the battery with a new CR2032, installing it in the battery holder. 7. Re-insert the IO card, the M.2 riser card, and the air baffle into their respective slots.
  • Page 73 Sync the date and time to the hardware real time clock. sudo hwclock -w c). Reset the BMC. sudo ipmitool mc reset cold 12. Confirm that the time and date on the system are updated. sudo nvsm show health
  • Page 74: Chapter 16. Removing And Attaching The Bezel

    1. Grab the bezel on both sides by the side handles, then pull directly away from the system to disengage from the magnetic latch.
  • Page 75 Removing and Attaching the Bezel 2. To replace the bezel, align the bezel alignment pins with the chassis, then let the magnetic latch complete the attachment of the bezel.
  • Page 76: Chapter 17. Installing The Rack Mount Kit

    17.1. Installing the Rails Follow these instructions to install the DGX A100 server rack mount kit. The rack mount kit acts as a shelf in the rack; it does not allow the system to be moved once installed. All components are serviceable from the front or rear, so this movement is not necessary.
  • Page 77 Make sure that the rail is level and the attachment on the rear post is at the same rack unit as the front. a). Insert the spring-loaded prongs into the holes on the rear rack post.
  • Page 78 The bottom lip is at the same height on all four posts. ‣ The metal clips are properly attached. ‣ Four screws are installed - flat head on the front and pan head on the back.
  • Page 79: Installing The Cage Nuts

    17.2. Installing the Cage Nuts The DGX A100 server is secured to the rack using four captive screws - one at each corner of the front of the unit. ‣ If your rack has round holes with 10-32 threads, then the screws will attach directly to the rack mounting holes.
  • Page 80 Rail kits attached to Type A racks require two (2) cage nuts installed; top positions only. ‣ Rail kits attached to Type B racks require four (4) cage nuts installed; both top and bottom positions.
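
The cage nut requirements on this page amount to a two-row lookup table, sketched below as a planning aid. The dictionary and function names are illustrative; the counts and positions are the ones stated above. The excerpt does not define what distinguishes a Type A rack from a Type B rack, so the sketch treats the type as an input you already know.

```python
# Cage nut count and positions per rack type, as stated in the manual.
CAGE_NUT_REQUIREMENTS = {
    "A": {"count": 2, "positions": ("top",)},
    "B": {"count": 4, "positions": ("top", "bottom")},
}


def cage_nuts_needed(rack_type: str) -> dict:
    """Return the cage nut count and positions for a Type A or Type B rack."""
    key = rack_type.strip().upper()
    if key not in CAGE_NUT_REQUIREMENTS:
        raise ValueError(f"unknown rack type: {rack_type!r}")
    return CAGE_NUT_REQUIREMENTS[key]


for rack in ("A", "B"):
    need = cage_nuts_needed(rack)
    print(f"Type {rack}: {need['count']} cage nuts at {', '.join(need['positions'])}")
```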
  • Page 81 NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.