Serviceability - IBM Power Systems S822LC Technical Overview And Introduction

Special Uncorrectable Error handling
Special Uncorrectable Error (SUE) handling prevents an uncorrectable error in memory or
cache from immediately causing a machine check (MC) with uncorrectable error (UE). Instead, the
system marks the data so that an MC with UE is generated only if the data is read again.
Termination can then be limited to the program, partition, or hypervisor that owns the data. If
the data is referenced by an I/O adapter, the adapter is frozen if the data is transferred to an
I/O device.
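The deferred behavior that is described above can be pictured with a small software model. The following Python sketch is only an illustration under stated assumptions: the names Memory, PoisonedDataError, and detect_uncorrectable_error are hypothetical and do not correspond to any POWER8 firmware or hypervisor interface. It shows that detecting the error only marks the data, and that the fault is raised, together with the owner of the data, when the marked data is read.

```python
class PoisonedDataError(Exception):
    """Stands in for a machine check with uncorrectable error (hypothetical name)."""

    def __init__(self, address, owner):
        super().__init__(f"UE consumed at {address:#x} (owner: {owner})")
        self.address = address
        self.owner = owner


class Memory:
    """Toy memory that marks bad locations instead of failing immediately."""

    def __init__(self):
        self._cells = {}        # address -> (value, owner)
        self._marked = set()    # addresses that hold uncorrectable data

    def write(self, address, value, owner):
        self._cells[address] = (value, owner)

    def detect_uncorrectable_error(self, address):
        # Detection only marks the data; nothing is terminated at this point.
        self._marked.add(address)

    def read(self, address):
        value, owner = self._cells[address]
        if address in self._marked:
            # The fault surfaces only when the marked data is consumed.
            raise PoisonedDataError(address, owner)
        return value


mem = Memory()
mem.write(0x1000, 42, owner="partition-A")
mem.write(0x2000, 7, owner="partition-B")
mem.detect_uncorrectable_error(0x1000)

unaffected = mem.read(0x2000)       # data owned by partition-B is still readable
try:
    mem.read(0x1000)
except PoisonedDataError as err:
    affected_owner = err.owner      # only this owner needs to be terminated
```

In this model, a marked location that is never read again never raises the error, which is the case where no outage occurs at all.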
Processor Instruction Retry and other try again techniques
Within the processor core, soft error events might occur that interfere with the various
computation units. When such an event can be detected before a failing instruction is
completed, the processor hardware might try the operation again by using the advanced RAS
feature that is known as Processor Instruction Retry.
Processor Instruction Retry allows the system to recover from soft faults that would otherwise result
in outages of applications or the entire server. Try-again techniques are used in other parts of
the system as well. Faults that are detected on the memory bus that connects processor
memory controllers to DIMMs can be tried again. In POWER8 processor-based systems, the
memory controller is designed with a replay buffer that allows memory transactions to be tried
again after certain faults internal to the memory controller are detected. This function
complements the try-again abilities of the memory buffer module.
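As a loose software analogy for the try-again techniques that are described above, the following Python sketch reruns an operation when a transient fault is detected before the operation completes. It is a sketch under stated assumptions, not a description of the hardware: SoftError, run_with_retry, make_flaky, and the max_retries limit are all hypothetical, and the source does not state how many times the hardware tries an operation again.

```python
class SoftError(Exception):
    """A transient fault detected before the operation completed (hypothetical)."""


def run_with_retry(operation, max_retries=3):
    """Run operation; on SoftError, try again up to max_retries more times.

    Returns a tuple of (result, number of retries that were needed).
    The retry limit is an assumption made for this illustration.
    """
    retries = 0
    while True:
        try:
            return operation(), retries
        except SoftError:
            if retries >= max_retries:
                raise               # persistent fault: give up and report it
            retries += 1


def make_flaky(failures, result):
    """Build an operation that fails the given number of times, then succeeds."""
    remaining = [failures]

    def operation():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise SoftError("transient fault")
        return result

    return operation


# Two transient faults are absorbed without the caller seeing an error.
value, retries_needed = run_with_retry(make_flaky(2, "done"))
```

A fault that keeps recurring exhausts the retry limit and is reported to the caller, which mirrors the distinction between soft faults and solid faults.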
Other processor chip functions
Within a processor chip, there are other functions besides just processor cores.
POWER8 processors have built-in accelerators that can be used as application resources to
handle such functions as random number generation. POWER8 also introduces a controller
for attaching cache-coherent adapters that are external to the processor module. The
POWER8 design contains a function to "freeze" the function that is associated with some of
these elements, without taking a system-wide checkstop. Depending on the code that uses
these features, a "freeze" event might be handled without an application or partition outage.
As indicated elsewhere, single-bit errors, even solid faults, within internal or external
processor fabric buses are corrected by the ECC that is used. POWER8
processor-to-processor module fabric buses also use a spare data lane so that a single
failure can be repaired without calling for the replacement of hardware.
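The spare data lane idea can be illustrated with a toy model. The following Python sketch assumes a bus with a fixed number of data lanes plus one spare. The LaneBus class and its method names are hypothetical, and the real lane-steering logic in the fabric bus hardware is not described in this document. The sketch shows only the principle that the first lane failure is absorbed by the spare, while a second failure cannot be.

```python
class LaneBus:
    """Toy bus with a fixed number of data lanes plus one spare (hypothetical)."""

    def __init__(self, data_lanes):
        # Logical lane i is carried on physical lane mapping[i].
        self.mapping = list(range(data_lanes))
        self.spare = data_lanes     # physical index of the idle spare lane

    def report_lane_failure(self, physical_lane):
        """Steer traffic away from a failed lane.

        Returns True if the bus is still fully functional afterward,
        or False if no spare is left and service is needed.
        """
        if physical_lane == self.spare:
            self.spare = None       # the idle spare itself failed
            return True
        if physical_lane not in self.mapping:
            return True             # lane was already retired earlier
        if self.spare is None:
            return False            # spare already consumed
        logical = self.mapping.index(physical_lane)
        self.mapping[logical] = self.spare
        self.spare = None
        return True


bus = LaneBus(data_lanes=4)
first_ok = bus.report_lane_failure(2)    # repaired by using the spare lane
second_ok = bus.report_lane_failure(0)   # no spare left for a second failure
```

After the first failure in this model, logical lane 2 is carried on physical lane 4, which is why no hardware needs to be replaced for a single failure.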

2.3.4 Serviceability

The server is designed for system installation and setup, feature installation and removal,
proactive maintenance, and corrective repair that is performed by the client:
Customer Install and Setup (CSU)
Customer Feature Install (CFI)
Customer Repairable Units (CRU)
Warranty service upgrades are offered for an onsite repair (OSR) by an IBM System Services
Representative (SSR), or an authorized warranty service provider.
The system is designed with a 5-year mean time between failures (MTBF). If something needs to
be serviced or relocated, Table 2-2 on page 37 lists whether an item can be repaired
concurrently and whether it requires an IBM SSR to repair.
IBM Power Systems S822LC for High Performance Computing
