Detecting Errors; Error Checkers, Fault Isolation Registers, And First-Failure Data Capture - IBM Power System E850C Technical Overview And Introduction

customer, an IBM service support representative (SSR), or an authorized warranty service
provider.
The serviceability features that are delivered in this system provide a highly efficient service
environment by incorporating the following attributes:
- A design for customer setup (CSU), customer installable features (CIFs), and
  customer-replaceable units (CRUs)
- Error detection and fault isolation (ED/FI) incorporating First-Failure Data Capture (FFDC)
- Converged service approach across multiple IBM server platforms
- Concurrent Firmware Maintenance (CFM)
This section provides an overview of how these attributes contribute to efficient service in the
progressive steps of error detection, analysis, reporting, notification, and repair found in all
POWER processor-based systems.

4.5.1 Detecting errors

The first and most crucial component of a solid serviceability strategy is the ability to
accurately and effectively detect errors when they occur.
Although not all errors are a threat to system availability, those that go undetected can cause
problems because the system has no opportunity to evaluate and act if necessary. POWER
processor-based systems employ IBM z Systems™ server-inspired error detection
mechanisms, extending from processor cores and memory to power supplies and storage
devices.

4.5.2 Error checkers, fault isolation registers, and First-Failure Data Capture

IBM POWER processor-based systems contain specialized hardware detection circuitry that
is used to detect erroneous hardware operations. Error checking hardware ranges from parity
error detection that is coupled with Processor Instruction Retry and bus retry, to ECC
correction on caches and system buses.
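
The pairing of parity detection with a retry can be illustrated in software. The following is a minimal sketch only, not how the hardware implements it: the functions `parity_bit`, `checked_read`, and `make_flaky_source` are hypothetical names introduced for illustration, and the flaky source simply simulates a transient single-bit error.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") % 2


def checked_read(read_fn, max_retries: int = 3):
    """Retry read_fn() until the (word, parity) pair it returns is consistent.

    Returns (word, attempts_used). Raises RuntimeError if the parity
    error persists for every attempt.
    """
    for attempt in range(1, max_retries + 1):
        word, stored_parity = read_fn()
        if parity_bit(word) == stored_parity:
            return word, attempt
    raise RuntimeError("parity error persisted across all retries")


def make_flaky_source(word: int, bad_reads: int):
    """Hypothetical source whose first `bad_reads` reads flip one bit."""
    state = {"remaining": bad_reads}
    good_parity = parity_bit(word)

    def read():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            # Single-bit flip; the stored parity no longer matches.
            return word ^ 0b1, good_parity
        return word, good_parity

    return read
```

With one simulated transient error, `checked_read(make_flaky_source(0b1011, 1))` returns `(11, 2)`: the first read fails the parity check and the retry succeeds. Parity alone can only detect a single-bit error, which is why it is paired with a retry, whereas ECC can correct the error in place.
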
Within the processor and memory subsystem error checkers, error-check signals are
captured and stored in hardware fault isolation registers (FIRs). The associated logic
circuitry limits the domain of an error to the first checker that encounters it. In this
way, runtime error diagnosis can be deterministic: for every check station, the unique
error domain for that checker is defined and mapped to the field-replaceable units (FRUs)
that can be repaired when necessary.
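
The idea of attributing an error to the first checker that sees it, and mapping that checker to a FRU, can be sketched as a small software model. This is an illustrative assumption only: the class `FaultIsolationRegister`, the checker numbering, and the `CHECKER_TO_FRU` table are hypothetical and do not reflect the actual register layout of any POWER processor.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical checker-to-FRU map, made up for this example.
CHECKER_TO_FRU = {
    0: "processor module",
    1: "memory DIMM",
    2: "system backplane",
}


@dataclass
class FaultIsolationRegister:
    bits: int = 0                         # one bit per checker that fired
    first_checker: Optional[int] = None   # latched once, at the first error

    def report(self, checker_id: int) -> None:
        """Record that a checker fired; remember only the first one."""
        self.bits |= 1 << checker_id
        if self.first_checker is None:
            self.first_checker = checker_id

    def isolate(self) -> Optional[str]:
        """Return the FRU mapped to the first checker, or None if no error."""
        if self.first_checker is None:
            return None
        return CHECKER_TO_FRU[self.first_checker]
```

If checker 1 fires and the error then propagates to checker 2, `bits` records both (`0b110`), but `isolate()` still returns `"memory DIMM"` because only the first checker defines the error domain.
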
Integral to the Power Systems design is the concept of First-Failure Data Capture (FFDC).
FFDC is a technique that relies on sufficient error-checking stations and coordinated fault
reporting so that faults are detected and the root cause of each fault is isolated. FFDC
also requires that the necessary fault information can be collected at the time of failure
without needing to re-create the problem or run an extended tracing or diagnostics program.
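
The principle of collecting fault information at the time of failure, rather than re-creating the problem later, can be sketched as an always-on bounded trace that is frozen when the first failure occurs. This is a software analogy under stated assumptions, not the hardware mechanism: the class `FirstFailureCapture` and its methods are hypothetical names used for illustration.

```python
from collections import deque


class FirstFailureCapture:
    def __init__(self, history: int = 4):
        self.recent = deque(maxlen=history)  # always-on, bounded trace
        self.snapshot = None                 # filled once, at first failure

    def record(self, event: str) -> None:
        """Log a routine event; the oldest entry is dropped when full."""
        self.recent.append(event)

    def fail(self, reason: str) -> None:
        """Freeze the trace at the first failure; later failures are ignored."""
        if self.snapshot is None:
            self.snapshot = {"reason": reason, "trace": list(self.recent)}
```

Because the trace is already being kept when the fault occurs, the snapshot holds the events that led up to the first failure, and later events or secondary failures do not overwrite it.
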
For the vast majority of faults, a good FFDC design means that the root cause is isolated at
the time of the failure without intervention by a service representative. For all faults, good
FFDC design still makes failure information available to the service representative. This
information can be used to confirm the automatic diagnosis. More detailed information can be
collected by a service representative for rare cases where the automatic diagnosis is not
adequate for fault isolation.