Other Processor Chip Functions; Other Fault Error Handling - IBM Power Systems E870 Technical Overview And Introduction

Nothing in the system terminates simply because such an event is encountered. Instead, the
hardware monitors the use of marked pages. If the marked data is never used, hardware
replacement is requested, but nothing terminates as a result. Software layers
are not required to handle such faults.
Further action is needed only when the data is loaded to be processed by a processor core or
sent out to an I/O adapter. In such cases, if the data is owned by a partition, the
partition operating system might be responsible for terminating itself or just the program that
uses the marked page. If the data is owned by the hypervisor, the hypervisor might choose to
terminate, resulting in a system-wide outage.
However, exposure to such events is minimized because cache lines can be deleted,
which prevents an uncorrectable fault in a particular cache line from recurring.
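
The handling described above can be summarized as a small decision table. The following Python sketch is illustrative only: the names Owner, Action, and handle_marked_data are hypothetical and do not correspond to any IBM firmware or hypervisor interface.

```python
from enum import Enum


class Owner(Enum):
    PARTITION = "partition"
    HYPERVISOR = "hypervisor"


class Action(Enum):
    # Marked data that is never used: replacement is requested, nothing terminates.
    NONE = "no termination; hardware replacement requested"
    # Partition-owned data: the partition OS decides the scope of termination.
    PARTITION_DECIDES = "partition OS may terminate itself or only the affected program"
    # Hypervisor-owned data: termination here means a system-wide outage.
    HYPERVISOR_MAY_TERMINATE = "hypervisor may terminate (system-wide outage)"


def handle_marked_data(consumed: bool, owner: Owner) -> Action:
    """Return the possible outcome for a page that carries a mark.

    consumed is True only if the data is loaded by a processor core
    or sent out to an I/O adapter.
    """
    if not consumed:
        return Action.NONE
    if owner is Owner.PARTITION:
        return Action.PARTITION_DECIDES
    return Action.HYPERVISOR_MAY_TERMINATE
```

The point of the sketch is that ownership matters only once the data is consumed; an unused marked page never reaches the termination branches.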

4.3.8 Other processor chip functions

A processor chip contains other functions besides the processor cores.
POWER8 processors have built-in accelerators that can be used as application resources to
handle functions such as random number generation. POWER8 also introduces a controller
for attaching cache-coherent adapters that are external to the processor module. The
POWER8 design can "freeze" the function that is associated with some of
these elements without taking a system-wide checkstop. Depending on the code that uses these
features, a "freeze" event might be handled without an application or partition outage.
As indicated elsewhere, single-bit errors, even solid faults, within internal or external
processor "fabric busses" are corrected by the error correction code (ECC) that is used. POWER8
processor-to-processor module fabric busses also use a spare data lane so that a single
failure can be repaired without calling for the replacement of hardware.
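
This document does not describe the specific error correction code used on the fabric busses. As a generic illustration of how an ECC can correct a single-bit error, the following Python sketch uses the textbook Hamming(7,4) code. It is chosen for brevity and is an assumption for illustration, not the POWER8 implementation; the function names encode and decode are hypothetical.

```python
def encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(c):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the flipped bit, or 0 if none.
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


# Simulate a "solid" fault: the same bit position is flipped on every transfer.
data = [1, 0, 1, 1]
word = encode(data)
word[4] ^= 1  # flip the bit at codeword position 5
recovered = decode(word)
```

Because the syndrome identifies the failing position on every transfer, a persistent fault on one bit is corrected each time it occurs, which is the property the paragraph above relies on.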

4.3.9 Other fault error handling

Not all processor module faults can be corrected by these techniques. Therefore, provision
is still made for some faults to cause a system-wide outage. In such a "platform" checkstop
event, the ED/FI capabilities that are built in to the hardware and the dedicated service processor
work to isolate the root cause of the checkstop and unconfigure the faulty element where
possible, so that the system can reboot with the failed component unconfigured from the
system.
The auto-restart (reboot) option, when enabled, can reboot the system automatically following
an unrecoverable firmware error, firmware hang, hardware failure, or environmentally induced
(AC power) failure.
The auto-restart (reboot) option must be enabled from the Advanced System Management
Interface (ASMI) or from the Control (Operator) Panel.
IBM Power Systems E870 and E880 Technical Overview and Introduction
Draft Document for Review October 14, 2014 10:19 am
