IBM Power 595 Technical Overview And Introduction page 149

POWER6 processor instruction retry
To achieve the highest levels of server availability and integrity, FFDC and recovery
safeguards must protect the validity of user data anywhere in the server, including all the
internal storage areas and the buses used to transport data. Equally important is to
authenticate the correct operation of internal latches (registers), arrays, and logic within a
processor core that comprise the system execution elements (branch unit, fixed instruction,
floating point instruction unit and so forth) and to take appropriate action when a fault (error)
is discovered.
The POWER6 microprocessor has incrementally improved the ability of a server to identify
potential failure conditions by including enhanced error check logic, and has dramatically
improved the capability to recover from core fault conditions. Each core in a POWER6
microprocessor includes an internal processing element known as the Recovery Unit (r unit).
Using the Recovery Unit and associated logic circuits, the POWER6 microprocessor takes a
snapshot, or checkpoint, of the architected core internal state before each instruction is
processed by one of the core's nine instruction execution units.
If a fault condition is detected during any cycle, the POWER6 microprocessor uses the saved
state information from the r unit to effectively roll back the internal state of the core to the
start of instruction processing, allowing the instruction to be retried from a known good
architectural state. This procedure is called processor instruction retry. In addition, using the
POWER Hypervisor and service processor, architectural state information from one recovery unit
can be loaded into a different processor core, allowing an entire instruction stream to be restarted
on a substitute core. This is called alternate processor recovery.
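The checkpoint, roll back, and retry sequence described above can be illustrated with a small software model. This is only a sketch under stated assumptions: `ToyCore`, `TransientFault`, `restart_on_substitute`, and `add_five` are hypothetical names invented for illustration, the retry limit is arbitrary, and the real mechanism is implemented in hardware and firmware rather than in code like this.

```python
import copy


class TransientFault(Exception):
    """Hypothetical stand-in for a fault flagged by the error check logic."""


class ToyCore:
    """Toy model of one core; not the real POWER6 design."""

    def __init__(self):
        # Architected state: a program counter and four registers.
        self.state = {"pc": 0, "regs": [0, 0, 0, 0]}
        # Plays the role of the checkpoint held by the recovery unit.
        self.checkpoint = copy.deepcopy(self.state)

    def execute(self, instr, fault_on_attempt=lambda attempt: False, max_retries=2):
        """Run one instruction and return how many retries were needed."""
        # Checkpoint the architected state before the instruction is processed.
        self.checkpoint = copy.deepcopy(self.state)
        for attempt in range(max_retries + 1):
            try:
                instr(self.state)
                if fault_on_attempt(attempt):  # assumed fault injector
                    raise TransientFault()
                self.state["pc"] += 1
                return attempt
            except TransientFault:
                # Roll back to the known good state, then retry.
                self.state = copy.deepcopy(self.checkpoint)
        # The fault did not clear: treat it as a solid (hard) fault.
        raise RuntimeError("fault persists after retries")


def restart_on_substitute(failed_core):
    """Load the saved state into a different core (alternate processor recovery)."""
    substitute = ToyCore()
    substitute.state = copy.deepcopy(failed_core.checkpoint)
    substitute.checkpoint = copy.deepcopy(failed_core.checkpoint)
    return substitute


def add_five(state):
    """Example instruction: add 5 to register 0."""
    state["regs"][0] += 5
```

In this model, calling `ToyCore().execute(add_five, fault_on_attempt=lambda a: a == 0)` fails once, rolls back, and succeeds on the retry, while a fault injector that always fires ends in the `RuntimeError` path, at which point `restart_on_substitute` can resume the instruction on another core from the saved checkpoint.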
POWER6 processor-based systems include a suite of mainframe-inspired processor
instruction retry features that can significantly reduce situations that could result in checkstop:
Processor instruction retry: Automatically retry a failed instruction and continue with the
task. By combining enhanced error identification information with an integrated Recovery
Unit, a POWER6 microprocessor can use processor instruction retry to transparently
operate through (recover from) a wider variety of fault conditions (for example,
non-predicted fault conditions undiscovered through predictive failure techniques) than
could be handled in earlier POWER processor cores. For transient faults, this mechanism
allows the processor core to recover completely from what would otherwise have caused
an application, partition, or system outage.
Alternate processor recovery: For solid (hard) core faults, retrying the operation on the
same processor core is not effective. For many such cases, the alternate processor
recovery feature deallocates and deconfigures a failing core, moving the instruction
stream to, and restarting it on, a spare core. The POWER Hypervisor and POWER6
processor-based hardware can accomplish these operations without application
interruption, allowing processing to continue unimpeded, as follows:
a. Identifying a spare processor core.
Using an algorithm similar to that employed by dynamic processor deallocation, the
POWER Hypervisor manages the process of acquiring a spare processor core.
b. Using partition availability priority.
Starting with POWER6 technology, partitions receive an integer rating with the lowest
priority partition rated at 0 and the highest priority partition valued at 255. The default
value is set at 127 for standard partitions and 192 for Virtual I/O Server (VIOS)
partitions. Partition availability priorities are set for both dedicated and shared
partitions. To initiate alternate processor recovery when a spare processor is not
available, the POWER Hypervisor uses the partition availability priority to identify low
priority partitions and keep high priority partitions running at full capacity.
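The role of partition availability priority can be sketched as a simple selection over partitions. The 0 to 255 range and the defaults of 127 and 192 come from the text above; everything else is an assumption made for illustration: `Partition`, `pick_donor`, and the rule of taking a core from the lowest priority partition that still has one are hypothetical and do not describe the actual POWER Hypervisor algorithm.

```python
from dataclasses import dataclass

DEFAULT_PRIORITY = 127  # default for standard partitions (from the text)
VIOS_PRIORITY = 192     # default for Virtual I/O Server partitions (from the text)


@dataclass
class Partition:
    name: str
    cores: int
    priority: int = DEFAULT_PRIORITY

    def __post_init__(self):
        # Availability priority is an integer from 0 (lowest) to 255 (highest).
        if not 0 <= self.priority <= 255:
            raise ValueError("priority must be between 0 and 255")


def pick_donor(partitions):
    """Return the lowest priority partition that still has a core to give up.

    Assumed policy for illustration only. Ties go to the partition that
    appears first in the list. Returns None if no partition has a core.
    """
    candidates = [p for p in partitions if p.cores > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p.priority)


partitions = [
    Partition("vios", cores=2, priority=VIOS_PRIORITY),
    Partition("prod", cores=4, priority=200),
    Partition("test", cores=2, priority=10),
    Partition("dev", cores=1),  # falls back to the default of 127
]
donor = pick_donor(partitions)
```

With these example values the donor is the `test` partition, so the higher priority `prod` and `vios` partitions keep their full capacity.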
Chapter 4. Continuous availability and manageability