Diagnosing - IBM Power 720 Overview

4.3.2 Diagnosing

General diagnostic objectives are to detect and identify problems so that they can be
resolved quickly. Elements of the IBM diagnostics strategy include the following items:
- Provide a common error code format equivalent to a system reference code, system
  reference number, checkpoint, or firmware error code.
- Provide fault detection and problem isolation procedures. Support remote connection
  ability to be used by the IBM Remote Support Center or IBM Designated Service.
- Provide interactive intelligence within the diagnostics with detailed online failure
  information while connected to an IBM back-end system.
Using the extensive network of advanced and complementary error detection logic that is built
directly into hardware, firmware, and operating systems, the IBM Power Systems servers can
perform considerable self-diagnosis.
Because of the FFDC technology designed into IBM servers, re-creating diagnostics for
failures or requiring user intervention is not necessary. Solid and intermittent errors are
designed to be correctly detected and isolated at the time the failure occurs. Runtime and
boot time diagnostics fall into this category.
Boot time
When an IBM Power Systems server powers up, the service processor initializes the system
hardware. Boot-time diagnostic testing uses a multitier approach for system validation,
starting with managed low-level diagnostics that are supplemented with system firmware
initialization and configuration of I/O hardware, followed by OS-initiated software test routines.
Boot-time diagnostic routines include the following items:
- Built-in self-tests (BISTs) for both logic components and arrays ensure the internal
  integrity of components. Because the service processor assists in performing these tests,
  the system is enabled to perform fault determination and isolation, whether or not the
  system processors are operational. Boot-time BISTs can also find faults undetectable by
  processor-based power-on self-test (POST) or diagnostics.
- Wire-tests discover and precisely identify connection faults between components such as
  processors, memory, or I/O hub chips.
- Initialization of components such as ECC memory, typically by writing patterns of data and
  allowing the server to store valid ECC data for each location, can help isolate errors.
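The idea of writing data patterns to memory and checking the result can be illustrated with a small simulation. This is only a sketch under stated assumptions: the FaultyMemory class, the stuck-at-zero fault model, and the 0x55/0xAA patterns are hypothetical and chosen for illustration. It shows a plain write and read-back pattern test, not the ECC logic or firmware routines that Power Systems servers actually use.

```python
class FaultyMemory:
    """Simulated byte-wide memory (hypothetical model, not IBM firmware).

    Selected cells can have bits that are stuck at 0.
    """

    def __init__(self, size, stuck_at_zero=None):
        self.cells = [0] * size
        # Maps address -> bit mask of bits that always read back as 0.
        self.stuck_at_zero = stuck_at_zero or {}

    def write(self, addr, value):
        mask = self.stuck_at_zero.get(addr, 0)
        self.cells[addr] = (value & 0xFF) & ~mask

    def read(self, addr):
        return self.cells[addr]


def pattern_test(mem, size, patterns=(0x55, 0xAA)):
    """Write each pattern to every cell, read it back, and
    return {address: mask of bits that failed at least once}."""
    faults = {}
    for pattern in patterns:
        for addr in range(size):
            mem.write(addr, pattern)
        for addr in range(size):
            diff = mem.read(addr) ^ pattern
            if diff:
                faults[addr] = faults.get(addr, 0) | diff
    return faults


# Inject one fault: bit 2 of address 5 is stuck at 0.
mem = FaultyMemory(16, stuck_at_zero={5: 0b0000_0100})
faults = pattern_test(mem, 16)
print(faults)
```

Using two complementary patterns means every bit is exercised as both 0 and 1, so a stuck bit is caught by at least one pass and the failing address and bit position are isolated.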
To minimize boot time, the system determines which of the diagnostics are required to be
started to ensure correct operation, based on the way that the system was powered off, or on
the boot-time selection menu.
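The selection logic described above might look roughly like the following sketch. Everything in it is an assumption made for illustration: the Shutdown categories, the contents of FAST_SET and FULL_SET, the menu_override parameter, and the rule that an abnormal shutdown triggers the full set. The source states only that the choice depends on how the system was powered off or on the boot-time selection menu.

```python
from enum import Enum, auto


class Shutdown(Enum):
    CLEAN = auto()     # orderly power-off (assumed category)
    ABNORMAL = auto()  # for example, power loss (assumed category)


# Hypothetical diagnostic sets; the real grouping is not given in the source.
FAST_SET = ["critical BIST", "ECC memory initialization"]
FULL_SET = ["logic BIST", "array BIST", "wire-test", "ECC memory initialization"]


def select_diagnostics(last_shutdown, menu_override=None):
    """Pick the diagnostics to run at boot.

    menu_override stands in for the boot-time selection menu and may be
    "fast", "full", or None (no explicit choice).
    """
    if menu_override == "fast":
        return FAST_SET
    if menu_override == "full":
        return FULL_SET
    # Assumed policy: be thorough after an abnormal power-off.
    if last_shutdown is Shutdown.ABNORMAL:
        return FULL_SET
    return FAST_SET


print(select_diagnostics(Shutdown.CLEAN))
print(select_diagnostics(Shutdown.ABNORMAL))
print(select_diagnostics(Shutdown.ABNORMAL, menu_override="fast"))
```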
Run time
All Power Systems servers can monitor critical system components during run time, and they
can take corrective actions when recoverable faults occur. IBM hardware error-check
architecture provides the ability to report non-critical errors in an out-of-band communications
path to the service processor without affecting system performance.
A significant part of IBM runtime diagnostic capabilities originates with the service processor.
Extensive diagnostic and fault analysis routines have been developed and improved over
many generations of POWER processor-based servers, and enable quick and accurate
predefined responses to both actual and potential system problems.
The service processor correlates and processes runtime error information using logic derived
from IBM engineering expertise to count recoverable errors (called thresholding) and predict

IBM Power 720 and 740 Technical Overview and Introduction
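Thresholding, as mentioned above, means counting recoverable errors rather than reacting to each one. A minimal sketch of the idea follows, assuming a sliding time window per component. The ErrorThreshold class, the limit of 3 errors, the 60-second window, and the component name "dimm0" are all hypothetical; the source does not describe how the service processor implements its counters.

```python
from collections import defaultdict, deque


class ErrorThreshold:
    """Count recoverable errors per component within a sliding time window.

    Hypothetical illustration only; limit and window_s are assumed values.
    """

    def __init__(self, limit=3, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self._events = defaultdict(deque)  # component -> recent timestamps

    def record(self, component, timestamp):
        """Record one recoverable error.

        Returns True when the number of errors inside the window
        has reached the limit for this component.
        """
        events = self._events[component]
        events.append(timestamp)
        # Discard events that have aged out of the window.
        while events and timestamp - events[0] > self.window_s:
            events.popleft()
        return len(events) >= self.limit


monitor = ErrorThreshold(limit=3, window_s=60.0)
results = [monitor.record("dimm0", t) for t in (0.0, 10.0, 100.0, 110.0, 120.0)]
print(results)
```

In this run the first two errors age out before the third arrives, so the threshold is reached only when three errors fall within the same 60-second window. Isolated recoverable errors are tolerated, while a cluster of them is flagged for attention.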
