Diagnostics; Diagnostic Tools - IBM Flex System p270 Compute Node Installation And Service Manual

Diagnostics

Use the available diagnostic tools to help solve any problems that might occur in the compute node.
The first and most crucial component of a solid serviceability strategy is the ability to accurately and
effectively detect errors when they occur. While not all errors are a threat to system availability, those that
go undetected are dangerous because the system does not have the opportunity to evaluate and act if
necessary. POWER7 processor-based systems are specifically designed with error-detection mechanisms
that extend from processor cores and memory to power supplies and hard drives.
POWER7 processor-based systems contain specialized hardware detection circuitry for detecting
erroneous hardware operations. Error-checking hardware ranges from parity error detection coupled with
processor instruction retry and bus retry, to ECC correction on caches and system buses.
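The difference between detection and correction can be illustrated with a minimal sketch of even parity. This is not IBM's circuitry, and the function names are hypothetical. A single parity bit reveals that one bit flipped but not which one, which is why parity detection is paired with a retry, while ECC carries enough redundancy to correct the error in place.

```python
def even_parity_bit(word: int) -> int:
    """Return the bit that makes the total count of 1 bits even."""
    return bin(word).count("1") % 2


def store(word: int) -> tuple[int, int]:
    """Pair a data word with its parity bit, as a protected register might."""
    return word, even_parity_bit(word)


def parity_ok(word: int, parity: int) -> bool:
    """Recompute parity on read and compare it with the stored bit."""
    return even_parity_bit(word) == parity


data, parity = store(0b1011_0010)
single_flip = data ^ (1 << 3)   # one bit corrupted in transit
double_flip = data ^ 0b11       # two bits corrupted in transit

print(parity_ok(data, parity))         # clean read passes the check
print(parity_ok(single_flip, parity))  # single-bit error is detected
print(parity_ok(double_flip, parity))  # double-bit error goes unnoticed
```

The last call shows the limit of plain parity: an even number of flipped bits leaves the parity unchanged, so stronger codes are needed where that risk matters.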
IBM hardware error checkers have these distinct attributes:
• Continuous monitoring of system operations to detect potential calculation errors
• Attempted isolation of physical faults based on runtime detection of each unique failure
• Initiation of a wide variety of recovery mechanisms designed to correct a problem
POWER7 processor-based systems include extensive hardware and firmware recovery logic.

Machine check handling

Machine checks are handled by firmware. When a machine check occurs, the firmware analyzes the error
to identify the failing device and creates an error log entry.
If the system degrades to the point that the service processor cannot reach standby state, the error
cannot be analyzed. If the error occurs during hypervisor activities, the hypervisor initiates a
system reboot.
In partitioned mode, an error that occurs during partition activity is reported to the operating system in
the partition.
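The cases described above can be summarized as a small decision function. This is only a reading aid: the names `Activity` and `machine_check_outcome` are hypothetical, and the order in which the conditions are tested is an assumption, since the text does not state a precedence.

```python
from enum import Enum


class Activity(Enum):
    """Where the system was executing when the machine check occurred."""
    HYPERVISOR = "hypervisor"
    PARTITION = "partition"
    OTHER = "other"


def machine_check_outcome(sp_reaches_standby: bool,
                          activity: Activity,
                          partitioned: bool) -> str:
    """Map the conditions from the text to the described outcome.

    The ordering of the checks is an assumption made for illustration.
    """
    if not sp_reaches_standby:
        # Service processor cannot reach standby: no analysis is possible.
        return "error cannot be analyzed"
    if activity is Activity.HYPERVISOR:
        return "hypervisor initiates a system reboot"
    if partitioned and activity is Activity.PARTITION:
        return "error is reported to the operating system in the partition"
    # Default case: firmware identifies the failing device and logs it.
    return "firmware analyzes the error and creates an error log entry"


print(machine_check_outcome(True, Activity.PARTITION, partitioned=True))
```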

Diagnostic tools

Tools are available to help you diagnose and solve hardware-related problems.
• Power-on self-test (POST) progress codes (checkpoints), error codes, and isolation procedures
The POST checks the hardware at system initialization. IPL diagnostic functions test some system
components and interconnections. The POST generates eight-digit checkpoints to mark the progress of
powering up the compute node.
Use the management module to view progress codes.
The documentation of a progress code includes recovery actions for system hangs. See "POST progress
codes (checkpoints)" on page 224 for more information.
If the service processor detects a problem during POST, an error code is logged in the management
module event log. Error codes are also logged in the Linux syslog or AIX diagnostic log, if possible.
See "System reference codes (SRCs)" on page 107.
The service processor can generate codes that point to specific isolation procedures. See "Service
processor problems" on page 457.
• Light path diagnostics
Use the light path diagnostic LEDs to identify failing hardware. If the enclosure fault LED on the front
or rear of the IBM Flex System Enterprise Chassis is lit, one or more fault LEDs on the compute node
will also be lit. Use the light path diagnostic LEDs on the compute node to help identify the failing
item.
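When progress codes are copied by hand from the management module, a quick format check can catch transcription slips before a code is looked up. The sketch below assumes that an eight-digit checkpoint is written as eight hexadecimal characters. The text above states only "eight-digit", so the hexadecimal alphabet is an assumption, and the helper name and sample values are made up for illustration.

```python
import re

# Assumption: a checkpoint is exactly eight hexadecimal characters.
_CHECKPOINT_RE = re.compile(r"[0-9A-F]{8}")


def looks_like_checkpoint(code: str) -> bool:
    """Return True if the string has the assumed checkpoint shape."""
    return _CHECKPOINT_RE.fullmatch(code.strip().upper()) is not None


# Made-up sample values, not taken from the manual.
for sample in ["ABCD1234", " abcd1234 ", "ABCD123", "ABCD123G"]:
    print(repr(sample), looks_like_checkpoint(sample))
```

A check like this validates only the shape of the string. Whether a code is a real checkpoint, and what it means, still has to be looked up in the progress code documentation.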