Resolving a GPU, PCIe adapter, or device problem - IBM Power System 8335-GCA Manual

Problem analysis, system parts, and locations

Yes: This ends the procedure.
No: Go to "Collecting diagnostic data" on page 109. Then, go to "Contacting IBM service and support" on page 110. This ends the procedure.

Resolving a GPU, PCIe adapter, or device problem

Learn how to access log files, identify types of events, and review a list of potential problems and
service actions.
1. Are all of the adapters in the system missing or failed?
   Yes: Replace the system backplane.
   - If your system is an 8335-GCA or 8335-GTA, go to "8335-GCA and 8335-GTA locations" on page 111 to identify the physical location and the removal and replacement procedure.
   - If your system is an 8335-GTB, go to "8335-GTB locations" on page 121 to identify the physical location and the removal and replacement procedure.
   - If your system is an 8348-21C, go to "8348-21C locations" on page 133 to identify the physical location and the removal and replacement procedure.
   No: Continue with the next step.
2. To identify the correct service procedure to perform by using operating system log information,
complete the following steps:
a. Log in as the root user.
b. At the command prompt, type dmesg and press Enter.
3. Scan the operating system logs for the first occurrence of keywords, such as fail, failure, or failed.
When you find a keyword that accompanies one or more of the resource names in the following table,
a service action is required. Use the following table to determine the service procedure to perform for
your type of problem.
Table 1. Resource names, examples, and service procedures for different types of operating system logs.
| Resource name | Example of a log requiring a service action | Type of problem | Service procedure |
|---|---|---|---|
| aacraid | PCI error detected 2 | RAID (Note: This adapter is available only for 8348-21C systems.) | Go to "Resolving a RAID adapter problem" on page 14. |
| eth1, eth2, eth3 | Failed to re-initialize device | Network | Go to "Resolving a network adapter problem" on page 15. |
| NVRM | aborting RmInitAdapter failed! | Graphics | Go to "Resolving a graphics processing unit problem" on page 16. |
| nvidia-nvlink | IBMNPU: NPU FENCE detected, machine power cycle required | Graphics | Go to "Resolving a graphics processing unit problem" on page 16. |
| nvme | Failed status: ffffffff, reset controller | NVMe Flash adapter (Note: This adapter is available only for 8335-GCA systems.) | Go to "Resolving an NVMe Flash adapter problem" on page 19. |
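
The scan described in steps 2 and 3 can be sketched in Python as follows. This is a minimal illustration, not an IBM tool. The function names, the regular expressions for the resource names, and the sample log text are assumptions made for this example; the sample lines are composed from the resource names and example messages in Table 1 and are not captured system output. The sketch matches only the three keywords that step 3 names, so a line such as "PCI error detected 2" is reported only if you extend the keyword pattern.

```python
import re
import subprocess

# Rows transcribed from Table 1: (resource-name pattern, type of problem,
# service procedure title, page). The regular expressions are this sketch's
# own assumption about how each resource name appears in a log line.
SERVICE_PROCEDURES = [
    (re.compile(r"\baacraid\b"), "RAID",
     "Resolving a RAID adapter problem", 14),
    (re.compile(r"\beth\d+\b"), "Network",
     "Resolving a network adapter problem", 15),
    (re.compile(r"\bNVRM\b"), "Graphics",
     "Resolving a graphics processing unit problem", 16),
    (re.compile(r"\bnvidia-nvlink\b"), "Graphics",
     "Resolving a graphics processing unit problem", 16),
    (re.compile(r"\bnvme\d*\b"), "NVMe Flash adapter",
     "Resolving an NVMe Flash adapter problem", 19),
]

# The keywords that step 3 names: fail, failure, failed (any letter case).
KEYWORDS = re.compile(r"\bfail(?:ure|ed)?\b", re.IGNORECASE)


def read_dmesg():
    """Return the kernel log as text (step 2b). Usually requires root."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout


def first_service_action(log_text, keywords=KEYWORDS):
    """Return (line, problem type, procedure title, page) for the first log
    line that contains both a keyword and a Table 1 resource name, or None."""
    for line in log_text.splitlines():
        if not keywords.search(line):
            continue
        for pattern, problem_type, title, page in SERVICE_PROCEDURES:
            if pattern.search(line):
                return line, problem_type, title, page
    return None


# Illustrative input only: composed from Table 1, not real dmesg output.
SAMPLE_LOG = """\
usb 1-1: new high-speed USB device found
NVRM: aborting RmInitAdapter failed!
nvme nvme0: Failed status: ffffffff, reset controller
"""

match = first_service_action(SAMPLE_LOG)
if match is None:
    print("No keyword found next to a Table 1 resource name.")
else:
    line, problem_type, title, page = match
    print(f"{problem_type} problem: {line}")
    print(f'Go to "{title}" on page {page}.')
```

On a live system, pass the result of read_dmesg() to first_service_action() instead of the sample text. The function stops at the first matching line because step 3 asks for the first occurrence of a keyword.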