A100 system components. Be sure to familiarize yourself with the NVIDIA Terms & Conditions documents before attempting to perform any modification or repair to the DGX A100 system. These Terms & Conditions for the DGX A100 system can be found through the NVIDIA DGX Systems Support page.
Contact NVIDIA Enterprise Support for assistance in reporting, troubleshooting, or diagnosing problems with your DGX A100 system. Also contact NVIDIA Enterprise Support for assistance in installing or moving the DGX A100 system. For details on how to obtain support, visit the NVIDIA Enterprise Support web site (https://www.nvidia.com/en-us/support/enterprise/). 1.4. ...
1. Identify the failed front fan module through the BMC or the fan module LED and submit a service ticket to NVIDIA Enterprise Support.
2. Get a replacement from NVIDIA Enterprise Support.
3. Remove the failed fan module using the fan numbering diagram as a reference.
Using the BMC Dashboard and NVSM
1. Identify the faulty fan module using the BMC dashboard.
a). Log on to the BMC.
b). Click Sensor from the left navigation menu, then review the Normal Sensors section.
NVIDIA DGX A100 System DU-10044-001 _v01 | 4...
1. Remove the new fan module from its packaging and be ready to install it.
2. Remove the failed fan module by pressing the release button on the top of the module and pulling on the handle.
‣ Viewing the state of the fan module on the BMC dashboard.
‣ Using NVSM (sudo nvsm show fans).
5. Use the packaging to pack up the bad fan and follow the shipping instructions to return it to NVIDIA Enterprise Support.
Chapter 3. Power Supply Replacement
This chapter describes how to replace one of the DGX A100 system power supplies (PSUs).
3.1. Power Supply Replacement Overview
This is a high-level overview of the steps needed to replace a power supply.
1. Identify the failed power supply through the BMC and submit a service ticket.
Look for power supplies with no temperature reading or an output reading close to or equal to zero. Both NVSM and the BMC identify each power supply as PSUx, where x is from 0 to 5. The following diagram shows the physical location of each PSU.
‣ If the three remaining PSUs are working and energized, then you do not need to shut down power to the DGX A100 system.
‣ If fewer than three PSUs are working and energized, then shut down power to the DGX A100 system.
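The shutdown rule in the two bullets above can be expressed as a small check. This is an illustrative sketch only: the function name `must_shut_down` and the constant `MIN_WORKING_PSUS` are hypothetical, and the sketch assumes that having more than three working PSUs is also sufficient, which the text implies but does not state outright.

```python
# Minimum number of working, energized PSUs that must remain before the
# failed unit can be swapped with the system powered on (from the text above).
MIN_WORKING_PSUS = 3


def must_shut_down(working_psus: int) -> bool:
    """Return True if the system should be powered down before the swap.

    `working_psus` is the number of PSUs that are still working and
    energized, not counting the failed unit being replaced.
    """
    if working_psus < 0:
        raise ValueError("working_psus cannot be negative")
    return working_psus < MIN_WORKING_PSUS
```

For example, `must_shut_down(3)` returns `False` (hot swap is acceptable), while `must_shut_down(2)` returns `True`.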
‣ Viewing the PSU status from the BMC dashboard-> page.
‣ Running nvsm show health to confirm all power supplies are healthy.
Pack the old power supply and ship it back to NVIDIA Enterprise Support.
‣ Motherboard tray battery
‣ Single-port or dual-port CX-6 PCI network adapter card
4.1. Accessing the Motherboard Tray
1. Loosen the two motherboard thumbscrews and then pull the handles out to eject the motherboard tray.
Motherboard Tray - Accessing in Place
2. Pull the motherboard tray out of the system until it locks, then loosen the two thumbscrews holding the lid in place.
3. Lift the rear section of the motherboard lid.
4.2. Replacing the Motherboard Tray
1. Close the lid to the motherboard tray.
2. Tighten the two thumbscrews and then push the motherboard tray into the system.
3. Close the handles to secure the motherboard tray in place.
4. Tighten the motherboard tray thumbscrews to complete the motherboard insertion.
You will need to completely remove the motherboard tray from the server in order to service the following components.
‣ DIMMs (either adding or replacing)
5.1. Removing the Motherboard Tray
1. Loosen the two motherboard thumbscrews and then pull the handles out to eject the motherboard tray.
Place the tray on a solid, flat work surface.
3. Loosen the two rear thumbscrews on the motherboard lid.
Motherboard Tray - Removal and Installation
4. Loosen the two front thumbscrews on the motherboard tray lid.
5. Lift the lid off of the tray and set it aside.
6. Remove all three air baffles to allow access to the DIMMs.
5.2. Reinstalling the Motherboard Tray
1. Reinstall the three air baffles.
2. Replace and secure the lid.
a). Install the lid.
b). Tighten the rear thumbscrews.
c). Tighten the front thumbscrews.
3. Slide the motherboard tray into the slot, open the tray handles, and then continue pushing the motherboard tray in.
4. Close the handles to secure the motherboard tray in place.
5. Tighten the motherboard tray thumbscrews to complete the motherboard insertion.
U.2 NVMe Cache Drive Upgrade Overview
This is a high-level overview of the steps needed to upgrade the DGX A100 system's cache size.
1. Identify the manufacturer and model of the currently installed NVMe drives.
2. Place an order for four additional NVMe drives.
5. Install the four additional NVMe drives in slots 1, 3, 5, and 7.
a). Unlock the release lever and then slide the drive into the slot until the front face is flush with the other drives.
b). Close the lever and lock it in place.
U.2 NVMe Cache Drive Upgrade from 4 to 8
6. Power on the system. Perform the tasks described in the chapter U.2 NVMe Cache Drive Post-Installation Tasks.
/raid
6. Confirm the system is healthy by running nvsm show health.
7. Ship the failed unit back to NVIDIA Enterprise Support using the provided packaging.
7.2. Identifying the Failed U.2 NVMe
Identifying the Failed NVMe from the Front
If physical access to the system is available, you can identify a failed drive by the illuminated amber LED.
U.2 NVMe Cache Drive Replacement
Identifying the Failed NVMe from the Console
To identify the failed NVMe drive from the DGX A100 console, enter the following and then look for drive alerts in the output to identify the failed drive.
NVMe from NVIDIA Enterprise Support, specifying this information.
7.3. Replacing the U.2 NVMe Drive
1. Be sure you have requested and obtained the replacement drive from NVIDIA Enterprise Support.
2. Back up any critical data to a network shared volume or some other means of backup.
6. Power on the system. Perform the tasks described in the chapter U.2 NVMe Cache Drive Post-Installation Tasks.
NVIDIA Enterprise Support. Note: If your organization has purchased a media retention policy, you may be able to keep failed drives for destruction. Check with NVIDIA Enterprise Support on the status of the policy for specifics.
13. Ship back the failed unit to NVIDIA Enterprise Support using the packaging provided.
9.2. Identifying the Failed M.2 NVMe
The DGX A100 system automatically sets the failed M.2 drive offline when it detects the failure.
1. Identify which of the M.2 drives has failed (nvme0n1 or nvme1n1).
sudo nvsm show health
2.
3. Make a note of the device name for the failed drive (nvme0 or nvme1) and the device name for the good drive (nvme0 or nvme1). You will need this information when rebuilding the RAID 1 array after replacing the drive.
4. Obtain the replacement from NVIDIA Enterprise Support.
9.3. Replacing the M.2 NVMe Drive
Before attempting to replace one of the M.2 NVMe drives, be sure to have performed the...
Using a Phillips #1 screwdriver, loosen the black screw that secures the drive in place. Note: The screw is not a captive screw and can drop. Be careful when loosening the screw to avoid dropping and losing it.
Pull the drive to disconnect it from the connector on the riser board, then insert the new drive into the connector on the riser board.
e). Place the drive against the card and secure it by tightening the screw using a Phillips #1 screwdriver.
Rebuild the RAID 1 array according to the instructions in the section Rebuilding the Boot Drive RAID 1 Volume.
9.4. Rebuilding the Boot Drive RAID 1 Volume
After replacing a faulty M.2 OS drive, you must rebuild the RAID 1 array.
M.2 NVMe Boot Drive Replacement
1. If you have not already done so, boot the DGX A100 system and log in.
2. Rebuild the boot drive mirror. In the following steps, replace X with the number that corresponds to the replaced drive.
13. Ship back the failed unit to NVIDIA Enterprise Support using the packaging provided.
10.2. Determining a Failed M.2 NVMe Riser Assembly
The following are the conditions under which NVIDIA Enterprise Support may instruct that the M.2 riser assembly be replaced:
M.2 Boot Drive Riser Assembly Replacement
‣ The DGX A100 cannot be booted.
‣ The boot drives cannot be seen from the SBIOS.
‣ The system indicates that the boot drives are not available when booting from the ISO image.
Refer to the instructions in the section Replacing the Motherboard Tray.
7. Connect all the cables to the motherboard tray.
8. Re-install the DGX OS server software. See the DGX A100 User Guide for detailed instructions.
NVIDIA Enterprise Support. Note: If your organization has purchased a media retention policy, you may be able to keep failed drives for destruction. Check with NVIDIA Enterprise Support on the status of the policy for specifics.
1. Use the commands to identify the failed DIMM.
nvsm health
2. Get a replacement DIMM from NVIDIA Enterprise Support.
3. Shut down the system.
4. Label all motherboard tray cables and unplug them.
5. Remove the motherboard tray and place it on a solid, flat surface.
DIMM ID of A1.
Properties:
system_name = ..
component_id = CPU1_DIMM_A1
The output provides other information about the alert that can be provided to NVIDIA Enterprise Support.
3. Determine the DIMM manufacturer.
sudo nvsm show memory
4. Request the replacement DIMM from NVIDIA Enterprise Support, specifying the manufacturer.
DIMM Replacement
5. Remove the DIMM.
a). Press down on the side latches at both ends of the DIMM socket to push them away from the DIMM. This should unseat the DIMM from the socket.
Position the DIMM over the socket, making sure that the notch on the DIMM lines up with the key in the slot, then press the DIMM down into the socket until the side latches click in place.
c). Make sure that the latches are up and locked in place.
10. Power on the system and log in.
11. Confirm that the system is healthy.
sudo nvsm show health
sudo nvsm show /systems/localhost/memory/alerts
There should be no new alerts listed.
12. Ship the bad DIMM back to NVIDIA Enterprise Support.
12. Verify that all DIMMs as well as the system are healthy using nvsm.
12.2. Identifying the DIMM Manufacturer
1. Determine the DIMM manufacturer.
sudo nvsm show memory
2. Request the additional DIMMs from NVIDIA Enterprise Support, specifying the manufacturer.
12.3. Upgrading the DIMM
3. Remove the motherboard tray. Refer to the instructions in the section Removing the Motherboard Tray.
4. Using the diagram label on the lid as a guide, locate the DIMMs to be installed during the upgrade.
5. Remove the air baffles. Press down on the side latches at both ends of the air baffle to eject the module from the slot, then pull the air baffle out of the slot.
9. Connect all the cables to the motherboard tray.
10. Install all the power cords.
11. Power on the system and log in.
12. Confirm that the total memory is now 2 TB.
lsmem
Total online memory:
DIMM Upgrade
13. Confirm that the system is healthy.
sudo nvsm show health
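The memory check in step 12 can be scripted by reading the `Total online memory:` line from the `lsmem` output. The following is a minimal sketch, not part of the NVIDIA procedure: the function name `total_online_memory_bytes` and the sample text are hypothetical, and the parser assumes the value is either a plain byte count or a number followed by a single-letter binary suffix (for example `2T`).

```python
import re
from typing import Optional

# Binary multipliers for the single-letter size suffixes assumed here.
_UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def total_online_memory_bytes(lsmem_output: str) -> Optional[int]:
    """Return the 'Total online memory' value in bytes, or None if absent."""
    pattern = re.compile(r"\s*Total online memory:\s*([\d.]+)\s*([BKMGT])?\s*$")
    for line in lsmem_output.splitlines():
        match = pattern.match(line)
        if match:
            value, unit = match.groups()
            # A missing suffix is treated as a raw byte count.
            return int(float(value) * _UNITS[unit or "B"])
    return None


# Hypothetical sample text for illustration; not captured from a real system.
SAMPLE = (
    "Memory block size:         2G\n"
    "Total online memory:       2T\n"
    "Total offline memory:      0B\n"
)

EXPECTED_BYTES = 2 * 1024**4  # 2 TB target from step 12, read as binary units
upgrade_ok = total_online_memory_bytes(SAMPLE) == EXPECTED_BYTES
```

On a live system the text would come from running `lsmem` rather than from `SAMPLE`; whether the reported total matches exactly may depend on how the platform reserves memory, so treat a near miss as a prompt to inspect the output by hand.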
1. Use the commands to identify the failed network card.
nvsm show
2. Get a replacement card from NVIDIA Enterprise Support.
3. Shut down the system.
4. Label all motherboard tray cables and unplug them.
5. Remove the motherboard tray and open the lid.
1. Power down the system.
2. Label all network, monitor, and USB cables connected to the motherboard tray for easy identification when reconnecting.
3. Unplug all power cords, and all network, monitor, and USB cables.
‣ Determined the location ID of the faulty network card needing replacement. See Identifying the Failed Network Card.
‣ Obtained the replacement network card and have saved the packaging for use when returning the faulty component.
Accessing the Motherboard Tray.
5. Unlock the horizontal network card.
a). Loosen the black thumbscrew that secures the PCIe card locking mechanism in place.
b). Open the locking mechanism by turning it 90 degrees or more.
6. Replace the card.
a). Pull the network card out of the riser card slot.
b). Replace the old network card with the new one.
c). Install the network card into the riser card slot.
Network Card Replacement
7. Lock the network card in place.
a). Close the locking mechanism by turning it back into its slot.
b). Tighten the black thumbscrew to secure the card in place.
9. Connect all cables back into the network card ports.
10. Power on the system and log in.
11. Confirm that the system is healthy.
sudo nvsm show health
There should be no new alerts listed.
6. Tighten the screws.
7. Power on the system and confirm the ports work.
8. Ship the failed unit back to NVIDIA Enterprise Support using the provided packaging.
14.2. Replacing the Front Console Board
A front console board malfunction can be determined in a few ways.
2. Remove the bezel.
3. Replace the front console board.
a). Using a Phillips #2 screwdriver, loosen the two captive screws that secure the front console board.
b). Replace the front console board.
c). Tighten the screws.
Front Console Board Replacement
4. Confirm functionality.
a). Power on the system.
b). Issue the following to confirm the temperature sensor is working properly.
sudo nvsm show health
5. Return the old module to NVIDIA Enterprise Services.
Chapter 15. Motherboard Tray Battery Replacement
15.1. Motherboard Tray Battery Replacement Overview
This is a high-level overview of the procedure to replace the DGX A100 system motherboard tray battery.
1. Get a replacement battery - type CR2032.
2. Shut down the system.
Call NVIDIA Enterprise Support to confirm that the battery is the right component to replace. The CR2032 battery is not provided by NVIDIA, but can be purchased from a convenience store. CAUTION: Static Sensitive Devices: - Be sure to observe best practices for electrostatic discharge (ESD) protection.
Motherboard Tray Battery Replacement
6. Replace the battery.
a). Locate the battery, using the following image as a guide.
Use a small flat-head screwdriver or similar thin tool to gently lift the battery from the battery holder.
c). Replace the battery with a new CR2032, installing it in the battery holder.
7. Re-insert the IO card, the M.2 riser card, and the air baffle into their respective slots.
Sync the date and time to the hardware real-time clock.
sudo hwclock -w
c). Reset the BMC.
sudo ipmitool mc reset cold
12. Confirm that the time and date on the system are updated.
sudo nvsm show health
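The three commands above can be collected into one scripted sequence. This is a sketch under stated assumptions rather than an NVIDIA-provided tool: the names `POST_BATTERY_COMMANDS` and `run_post_battery_steps` are hypothetical, the commands themselves are copied from the steps above, and the default is a dry run that only reports what would be executed.

```python
import subprocess
from typing import List

# Commands taken from the steps above, in the order the text gives them.
POST_BATTERY_COMMANDS = (
    ("sudo", "hwclock", "-w"),                    # write system time to the hardware clock
    ("sudo", "ipmitool", "mc", "reset", "cold"),  # cold-reset the BMC
    ("sudo", "nvsm", "show", "health"),           # final health check
)


def run_post_battery_steps(dry_run: bool = True) -> List[str]:
    """Return the command lines in order; execute them only if dry_run is False."""
    planned = []
    for command in POST_BATTERY_COMMANDS:
        planned.append(" ".join(command))
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(command, check=True)
    return planned
```

Calling `run_post_battery_steps()` with the default argument only lists the commands. The sketch does not wait for the BMC to come back after the cold reset, so in practice a pause before the health check may be needed.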
Chapter 16. Removing and Attaching the Bezel
1. Grab the bezel on both sides by the side handles, then pull directly away from the system to disengage it from the magnetic latch.
2. To replace the bezel, align the bezel alignment pins with the chassis, then let the magnetic latch complete the attachment of the bezel.
Chapter 17. Installing the Rack Mount
17.1. Installing the Rails
Follow these instructions to install the DGX A100 server rack mount kit. The rack mount kit acts as a shelf in the rack; it does not allow the system to be moved once installed. All components are serviceable from the front or rear, so this movement is not necessary.
Make sure that the rail is level and the attachment on the rear post is at the same rack unit as the front.
a). Insert the spring-loaded prongs into the holes on the rear rack post.
‣ The bottom lip is at the same height on all four posts.
‣ The metal clips are properly attached.
‣ Four screws are installed - flat head on the front and pan head on the back.
Installing the Rack Mount Kit
17.2. Installing the Cage Nuts
The DGX A100 server is secured to the rack using four captive screws - one at each corner of the front of the unit.
‣ If your rack has round holes with 10-32 threads, then the screws will attach directly to the rack mounting holes.
‣ Rail kits attached to Type A racks require two (2) cage nuts installed; top positions only.
‣ Rail kits attached to Type B racks require four (4) cage nuts installed; both top and bottom positions.
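The cage nut requirement in the two bullets above amounts to a lookup by rack type, which can be handy when preparing parts for several racks. The following sketch is illustrative only; the names `CAGE_NUTS`, `cage_nuts_required`, and `cage_nut_positions` are hypothetical, and the counts and positions are taken directly from the text.

```python
# Cage nut count and positions per rack type, as stated in the text above.
CAGE_NUTS = {
    "A": (2, ("top",)),
    "B": (4, ("top", "bottom")),
}


def cage_nuts_required(rack_type: str) -> int:
    """Return how many cage nuts the rail kit needs for the given rack type."""
    key = rack_type.strip().upper()
    if key not in CAGE_NUTS:
        raise ValueError(f"unknown rack type: {rack_type!r}")
    return CAGE_NUTS[key][0]


def cage_nut_positions(rack_type: str) -> tuple:
    """Return the positions ('top', 'bottom') where cage nuts are installed."""
    key = rack_type.strip().upper()
    if key not in CAGE_NUTS:
        raise ValueError(f"unknown rack type: {rack_type!r}")
    return CAGE_NUTS[key][1]
```

For instance, `cage_nuts_required("B")` returns `4` and `cage_nut_positions("A")` returns `("top",)`.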
NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.