Predictive Failure Analysis - IBM p5 590 System Handbook

Integrated hardware error detection and fault isolation has been a key
component of IBM's UNIX server design strategy since 1997. FFDC check
stations are carefully positioned within the server logic and data paths to ensure
that potential errors can be quickly identified and accurately tracked to an
individual field replaceable unit (FRU). These checkers are collected in a series
of Fault Isolation Registers (FIR), where they can easily be accessed by the
service processor. All communication between the service processor and the FIR
is accomplished out of band. That is, operation of the error detection mechanism
is transparent to an operating system. This entire structure is below the
architecture and is not seen, nor accessed, by system level activities.
In this environment, strategically placed error checkers are continuously
operating to precisely identify error signatures within defined hardware fault
domains. IBM servers are designed so that in the unlikely event that a fatal
hardware error occurs, FFDC, coupled with extensive error analysis and
reporting firmware in the service processor, should allow IBM to isolate a
hardware failure to a single FRU. In this event, the FRU part number will be
included in the extensive error log information captured by the service processor.
In select cases, a set of FRUs will be identified when the fault is on an interface
between two or more FRUs. For example, three FRUs may be called out when
the system cannot differentiate between a failed driver on one component, the
corresponding receiver on a second, or the interconnect fabric. In either case, it
is IBM's maintenance practice for the p5-590 and p5-595 systems to replace all of
the identified components as a group. Meeting rigorous goals for fault isolation
requires a reliability, availability, and serviceability methodology that carefully
instruments the entire system logic design with meticulously placed error
checkers.
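
The mapping from latched checker bits to a FRU callout described above can be illustrated with a small sketch. This is not IBM's firmware: the bit positions, checker names, and FRU identifiers (FIR_MAP, FRU-CPU-0, and so on) are hypothetical and chosen only to show how a single FRU is called out for an isolated fault and a group of FRUs for an interface fault.

```python
# Hypothetical illustration only: bit positions, checker names, and FRU
# identifiers are invented and do not describe real p5 hardware or firmware.
from dataclasses import dataclass


@dataclass(frozen=True)
class Checker:
    name: str
    frus: tuple  # FRU identifiers called out when this checker fires


# One entry per FIR bit position (bit 0 is the least significant bit).
FIR_MAP = {
    0: Checker("l2_cache_parity", ("FRU-CPU-0",)),
    1: Checker("memory_ecc_uncorrectable", ("FRU-DIMM-3",)),
    # Interface fault: driver, receiver, and interconnect cannot be
    # told apart, so all three FRUs are called out as a group.
    2: Checker("io_bus_crc", ("FRU-CPU-0", "FRU-IOHUB-1", "FRU-BACKPLANE")),
}


def call_out_frus(fir_value: int) -> list:
    """Return the de-duplicated FRU list for every bit set in fir_value."""
    frus = []
    for bit, checker in sorted(FIR_MAP.items()):
        if fir_value & (1 << bit):
            for fru in checker.frus:
                if fru not in frus:
                    frus.append(fru)
    return frus
```

Under these assumptions, a register value with only bit 1 set yields one FRU, while a value with bit 2 set yields the whole three-FRU group, which matches the practice of replacing all identified components together.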

6.3.2 Predictive failure analysis

Statistically, there are two main periods in which a component is likely to fail
catastrophically: shortly after being manufactured, and when it has reached the
end of its useful life. Between these two periods, the failure rate for a given
component is generally low, and failures are normally gradual. A complete failure
usually follows some degradation, be it in the form of temporary errors, degraded
performance, or degraded function.
The p5-590 and p5-595 can monitor critical components such as
processors, memory, cache, I/O subsystem, PCI-X slots, adapters, and internal
disks, and detect possible indications of failure. Because these components are
monitored continuously, when one of them reaches a threshold the system can
isolate and deallocate the failing component without a system outage, thereby
avoiding a partition or complete system failure.
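
The threshold behavior described above can be sketched as a simple per-component counter. This is a minimal illustration under stated assumptions, not the system's actual algorithm: the class name PredictiveMonitor, the idea of counting recoverable errors, and the threshold value are all hypothetical, since the handbook does not say what is counted or where the thresholds are set.

```python
# Hypothetical sketch: the counting rule and threshold are assumptions,
# not the documented behavior of the p5-590 or p5-595.
from collections import defaultdict


class PredictiveMonitor:
    def __init__(self, threshold: int):
        self.threshold = threshold
        self.counts = defaultdict(int)  # recoverable errors seen per component
        self.deallocated = set()        # components already taken out of use

    def record_recoverable_error(self, component: str) -> bool:
        """Count one error; return True if this call deallocates the component."""
        if component in self.deallocated:
            return False  # already isolated, nothing more to do
        self.counts[component] += 1
        if self.counts[component] >= self.threshold:
            # Isolate the component before it fails outright.
            self.deallocated.add(component)
            return True
        return False
```

With a threshold of 3, the first two errors on a component are only counted, and the third triggers deallocation while every other component stays in service.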