Startup Considerations; Shutdown Considerations; Verifying Functionality - Quick Health Check - Nvidia DGX A100 User Manual

4.4.1. Startup Considerations

To keep your DGX A100 running smoothly, allow up to a minute of idle time after reaching the
login prompt. This ensures that all components can complete their initialization.
4.4.2. Shutdown Considerations

When shutting down the DGX A100, always initiate the shutdown from the operating system,
with a momentary press of the power button, or with Graceful Shutdown from the BMC. Wait
until the system enters a powered-off state before performing any maintenance.
WARNING: Risk of Danger - Removing power cables or using Power Distribution Units (PDUs)
to shut off the system while the Operating System is running may cause damage to sensitive
components in the DGX A100 server.
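As an illustration only, the two software-initiated paths above might be wrapped as follows. This is a sketch under assumptions: the helper name shutdown_command and the BMC_HOST and BMC_USER variables are hypothetical, and it assumes that an IPMI soft power-off request (ipmitool chassis power soft) is an acceptable equivalent of the BMC Graceful Shutdown option. The helper only prints the command so that it can be reviewed before it is run.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the DGX software: prints a graceful-shutdown
# command instead of executing it.
# BMC_HOST and BMC_USER are placeholder variable names chosen for this sketch.
# ipmitool -E reads the BMC password from the IPMI_PASSWORD environment variable.
shutdown_command() {
    case "$1" in
        os)
            # Orderly shutdown requested from the running operating system.
            echo "sudo shutdown -h now"
            ;;
        bmc)
            # Assumption: an IPMI soft power-off stands in for the BMC
            # Graceful Shutdown option.
            echo "ipmitool -I lanplus -H ${BMC_HOST:-<bmc-host>} -U ${BMC_USER:-<bmc-user>} -E chassis power soft"
            ;;
        *)
            echo "usage: shutdown_command os|bmc" >&2
            return 1
            ;;
    esac
}
```

For example, shutdown_command os prints the operating-system command. In either case, wait for the powered-off state before touching power cables or PDUs.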
4.5. Verifying Functionality - Quick Health Check
NVIDIA provides customers a diagnostics and management tool called NVIDIA System
Management, or NVSM. The nvsm command can be used to determine the system's health,
identify component issues and alerts, or run a stress test to make sure all components are
in working order while under load. The use of Docker is key to getting the most performance
out of the system since NVIDIA has optimized containers for all the major frameworks and
workloads used on DGX systems.
Perform the following steps to run a health check on the DGX A100 system and to verify
the Docker and NVIDIA driver installation.
1. Establish an SSH connection to the DGX A100 System.
2. Run a basic system check.
$ sudo nvsm show health
3. Verify that the output summary shows that all checks are Healthy and that the overall
system status is Healthy.
4. Verify that Docker is installed by viewing the installed Docker version.
$ sudo docker --version
The command should return a version string such as "Docker version 19.03.5-ce". The actual
version may differ depending on the specific release of the DGX OS Server software.
5. Verify connection to the NVIDIA repository and that the NVIDIA Driver is installed.
$ sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:11.0-base nvidia-smi
Docker pulls the nvidia/cuda container image layer by layer, then runs nvidia-smi.
When completed, the output should show the NVIDIA Driver version and a description of
each installed GPU.
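
The steps above can be sketched as a small script, run on the DGX A100 after connecting over SSH. This is a minimal sketch, not an NVIDIA-provided tool: the function names docker_version_from and quick_health_check are hypothetical, and the script only stops when a command itself fails. It does not interpret the nvsm output, so the summary must still be read to confirm that every check reports Healthy.

```shell
#!/usr/bin/env bash
# Sketch only: the function names below are hypothetical and are not part of
# NVSM or Docker.

# Extract the version token from a line such as "Docker version 19.03.5-ce".
# Some Docker releases append ", build <hash>", so a trailing comma is stripped.
docker_version_from() {
    local line="$1"
    local version
    version="$(printf '%s\n' "$line" | awk '$1 == "Docker" && $2 == "version" { print $3 }')"
    printf '%s\n' "${version%,}"
}

# Run the three checks in order and stop at the first command that fails.
quick_health_check() {
    # Step 2: basic system check. Read the summary for "Healthy" manually.
    sudo nvsm show health || return 1

    # Step 4: confirm that Docker is installed and report its version.
    local version
    version="$(docker_version_from "$(sudo docker --version)")"
    if [ -z "$version" ]; then
        echo "Docker version not found" >&2
        return 1
    fi
    echo "Docker version: $version"

    # Step 5: pull the CUDA base image and run nvidia-smi inside the container.
    sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:11.0-base nvidia-smi
}
```

Calling quick_health_check runs the checks in sequence. The parsing helper can also be used on its own, for example docker_version_from "$(sudo docker --version)".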
NVIDIA DGX A100
Quick Start and Basic Operation
DU-09821-001 _v01   |   25
