NVMe Boot Drive Replacement Overview; Identifying the Failed M.2 NVMe - NVIDIA DGX A100 Service Manual

Chapter 10. M.2 NVMe Boot Drive
10.1. M.2 NVMe Boot Drive Replacement Overview
This is a high-level overview of the procedure to replace a boot drive.
1. With the help of NVIDIA Enterprise Support, determine which M.2 drive needs to be replaced.
2. Get a replacement from NVIDIA Enterprise Support.
3. Power down the system.
4. Label all cables and unplug them from the motherboard tray.
5. Slide the motherboard out until it locks in place.
6. Open the rear compartment and pull out the M.2 riser card with both M.2 disks attached.
7. Replace the failed M.2 device on the riser card.
8. Install the M.2 riser card with both M.2 disks.
9. Close the rear motherboard compartment and then slide the motherboard back into the system.
10. Plug in all cables using the labels as a reference.
11. Power on the system.
12. Confirm the M.2 RAID 1 mirror is synchronizing.
13. Ship the failed unit back to NVIDIA Enterprise Support using the packaging provided.
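
The check in step 12 can be sketched from a shell once the system is back up. The sketch assumes the boot mirror is a Linux software RAID (md) array whose state the kernel reports in /proc/mdstat. This section does not say how the mirror is implemented, so treat that as an assumption. The function name mirror_sync_status is hypothetical. It prints any resync or recovery progress line, prefixed with the array name, and returns non-zero when nothing is synchronizing.

```shell
#!/bin/sh
# Sketch only: assumes the boot mirror is a Linux md (software RAID) array.
# mirror_sync_status is a hypothetical helper name, not an NVIDIA tool.
# Usage: mirror_sync_status [mdstat-file]   (defaults to /proc/mdstat)
mirror_sync_status() {
    awk '
        # An array entry starts at column 0, e.g. "md0 : active raid1 ..."
        /^md[^ ]* :/ { array = $1 }
        # While a mirror rebuilds, md adds a line with "recovery =" or "resync ="
        /(recovery|resync) *=/ {
            sub(/^[ \t]+/, "")          # trim leading indentation
            print array ": " $0
            found = 1
        }
        END { exit (found ? 0 : 1) }    # non-zero when nothing is syncing
    ' "${1:-/proc/mdstat}"
}
```

Call it with no argument to read /proc/mdstat, or pass a saved copy of that file as the first argument. A non-zero return can mean either that the rebuild has already finished or that it never started, so the result is best read alongside the health report from NVSM.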

10.2. Identifying the Failed M.2 NVMe

The DGX A100 system automatically sets the failed M.2 drive offline when it detects the failure.
1. Identify which of the M.2 drives has failed (nvme0n1 or nvme1n1).
   $ sudo nvsm show health
2. You can confirm this by issuing the following.
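
As a generic cross-check that is not taken from this manual, and assuming the boot mirror is a Linux software RAID (md) array, the kernel's /proc/mdstat marks a faulty member with an "(F)" suffix and shows an underscore in the status field (for example [U_]) for a missing or failed slot. The helper name degraded_md_arrays below is hypothetical. It is a minimal sketch that reports which arrays are degraded and which member devices are flagged as failed, so the result can be compared with the nvme0n1 or nvme1n1 device named by NVSM.

```shell
#!/bin/sh
# Sketch only: assumes a Linux md (software RAID) boot mirror.
# degraded_md_arrays is a hypothetical helper name, not an NVIDIA tool.
# Usage: degraded_md_arrays [mdstat-file]   (defaults to /proc/mdstat)
degraded_md_arrays() {
    awk '
        /^md[^ ]* :/ {
            array = $1
            # Members that md has marked faulty carry an "(F)" suffix
            for (i = 3; i <= NF; i++)
                if ($i ~ /\(F\)/) {
                    dev = $i
                    sub(/\[.*/, "", dev)     # "devX[1](F)" -> "devX"
                    print array ": failed member " dev
                }
        }
        # In the status field, "_" marks a missing or failed slot, e.g. [U_]
        /\[U*_[U_]*\]/ { print array ": degraded" }
    ' "${1:-/proc/mdstat}"
}
```

No output means that no array is reported as degraded. A drive that has been removed from the array entirely shows up only as a "degraded" line, without a "failed member" line, because it no longer appears in the member list.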