Reporting Problems - IBM Power 595 Technical Overview And Introduction

initialization and configuration of I/O hardware, followed by OS-initiated software test routines.
Boot-time diagnostic routines include:
Built-in self-tests (BISTs) for both logic components and arrays ensure the internal integrity
of components. Because the service processor assists in performing these tests, the system
can perform fault determination and isolation whether or not the system processors are
operational. Boot-time BISTs can also find faults undetectable by processor-based power-on
self-test (POST) or diagnostics.
Wire tests discover and precisely identify connection faults between components such as
processors, memory, or I/O hub chips.
Initialization of components such as ECC memory, typically by writing patterns of data and
allowing the server to store valid ECC data for each location, can help isolate errors.
To minimize boot time, the system determines which of the diagnostics are required to be
started to ensure correct operation based on the way the system was powered off, or on the
boot-time selection menu.
Runtime
All POWER6 processor systems can monitor critical system components during runtime, and
they can take corrective actions when recoverable faults occur. IBM's hardware error check
architecture provides the ability to report non-critical errors in an out-of-band communications
path to the service processor without affecting system performance.
A significant part of IBM's runtime diagnostic capabilities originates with the POWER6 service
processor. Extensive diagnostic and fault analysis routines have been developed and
improved over many generations of POWER processor-based servers, and enable quick and
accurate predefined responses to both actual and potential system problems.
The service processor correlates and processes runtime error information, using logic
derived from IBM's engineering expertise, to count recoverable errors (called thresholding)
and predict when corrective actions must be automatically initiated by the system. These
actions can include:
Requests for a part to be replaced
Dynamic (online) invocation of built-in redundancy for automatic replacement of a failing
part
Dynamic deallocation of failing components so that system availability is maintained
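
The thresholding idea described above can be illustrated with a minimal sketch. It is not the service processor's actual logic, which the text does not detail. The class name, the sliding time window, the limit of three errors, and the component label "dimm0" are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class ErrorThreshold:
    """Count recoverable errors per component within a sliding time window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit                # hypothetical: errors tolerated per window
        self.window = window_seconds      # hypothetical: window length in seconds
        self.events = defaultdict(deque)  # component -> timestamps of recent errors

    def record(self, component, timestamp):
        """Log one recoverable error; return True when the threshold is reached."""
        recent = self.events[component]
        recent.append(timestamp)
        # Drop errors that have aged out of the window.
        while recent and timestamp - recent[0] > self.window:
            recent.popleft()
        return len(recent) >= self.limit


# Three errors within 60 seconds trip the threshold; a much later one does not.
monitor = ErrorThreshold(limit=3, window_seconds=60.0)
results = [monitor.record("dimm0", t) for t in (0.0, 10.0, 20.0, 200.0)]
print(results)  # [False, False, True, False]
```

In a design like this, a True result is the point at which one of the corrective actions listed above (a replacement request, invocation of redundancy, or deallocation) would be triggered.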
Device drivers
In certain cases, diagnostics are best performed by operating system-specific drivers, most
notably for I/O devices that are owned directly by a logical partition. In these cases, the operating
system device driver often works in conjunction with I/O device microcode to isolate and
recover from problems. Potential problems are reported to an operating system device driver,
which logs the error. I/O devices can also include specific exercisers that can be invoked by
the diagnostic facilities for problem recreation if required by service procedures.

4.3.5 Reporting problems

In the unlikely event that a system hardware failure or an environmentally induced failure is
diagnosed, POWER6 processor systems report the error through a number of mechanisms.
This ensures that appropriate entities are aware that the system can be operating in an error
state. However, a crucial piece of a solid reporting strategy is ensuring that a single error
communicated through multiple error paths is correctly aggregated, so that later notifications
are not accidentally duplicated.
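
The aggregation requirement can be sketched as follows. This is only an illustration, assuming each error carries a stable identifier that is the same on every reporting path. The identifiers ("E-0001", "E-0002") and path names are hypothetical, and the text does not describe how POWER6 systems actually correlate reports.

```python
class ErrorAggregator:
    """Merge reports of the same error that arrive over different paths."""

    def __init__(self):
        self.seen = {}  # error identifier -> set of paths that reported it

    def report(self, error_id, path):
        """Record a report; return True only the first time error_id is seen."""
        first = error_id not in self.seen
        self.seen.setdefault(error_id, set()).add(path)
        return first


aggregator = ErrorAggregator()
notifications = [
    aggregator.report("E-0001", "service_processor"),  # first sighting: notify
    aggregator.report("E-0001", "os_error_log"),       # same error, second path
    aggregator.report("E-0002", "os_error_log"),       # different error: notify
]
print(notifications)  # [True, False, True]
```

Keeping the set of paths per error, rather than discarding later reports, preserves the information that the error was seen through more than one channel while still sending a single notification.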
Chapter 4. Continuous availability and manageability