Serviceability - IBM Power Systems S822LC Technical Overview And Introduction

Special Uncorrectable Error handling

Special Uncorrectable Error (SUE) handling prevents an uncorrectable error in memory or cache from immediately causing a machine check (MC) with uncorrectable error (UE). Instead, the system marks the data so that if the data is ever read again, it generates an MC with UE. Termination can then be limited to the program, partition, or hypervisor that owns the data. If the data is referenced by an I/O adapter, it freezes if the data is transferred to an I/O device.
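The deferred behavior described above can be illustrated with a small software model. This is only a sketch: the `Memory` and `MachineCheck` names are hypothetical, and the real mechanism is implemented in hardware and firmware, not in application code. The point it shows is that detection only marks the data, and the error is raised when the marked data is consumed.

```python
class MachineCheck(Exception):
    """Hypothetical stand-in for a machine check with uncorrectable error."""


class Memory:
    """Toy memory model in which bad data is marked instead of failing at once."""

    def __init__(self, size):
        self.cells = [0] * size
        self.marked = set()  # addresses that hold data marked as uncorrectable

    def report_uncorrectable(self, addr):
        # Detection alone does not raise anything: the data is only marked.
        self.marked.add(addr)

    def read(self, addr, owner):
        # The error surfaces only when the marked data is actually read,
        # so the consequence can be attributed to the reader (the owner).
        if addr in self.marked:
            raise MachineCheck(f"marked data at address {addr} read by {owner}")
        return self.cells[addr]
```

In this model, a marked address that is never read again never raises, and reads of unmarked addresses by other owners continue normally.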

Processor Instruction Retry and other try again techniques

Within the processor core, soft error events might occur that interfere with the various computation units. When such an event can be detected before a failing instruction is completed, the processor hardware might try the operation again by using the advanced RAS feature that is known as Processor Instruction Retry.

Processor Instruction Retry allows the system to recover from soft faults that would otherwise result in outages of applications or the entire server.

Try-again techniques are used in other parts of the system as well. Faults that are detected on the memory bus that connects processor memory controllers to DIMMs can be tried again. In POWER8 processor-based systems, the memory controller is designed with a replay buffer that allows memory transactions to be tried again after certain faults internal to the memory controller are detected. This function complements the try-again abilities of the memory buffer module.
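The general try-again idea can be sketched in software as a bounded retry loop. The names `TransientFault`, `run_with_retry`, and `make_flaky`, and the retry limit of 3, are assumptions made for illustration; the source does not describe how many times the hardware retries or how it escalates a persistent fault.

```python
class TransientFault(Exception):
    """Hypothetical stand-in for a soft fault detected before completion."""


def run_with_retry(operation, max_retries=3):
    """Run operation, retrying on TransientFault up to max_retries times.

    Returns (result, attempts). A fault that persists after the last retry
    is re-raised so that it is escalated rather than retried forever.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return operation(), attempts
        except TransientFault:
            if attempts > max_retries:
                raise


def make_flaky(failures, result):
    """Build an operation that fails `failures` times and then succeeds."""
    remaining = [failures]

    def operation():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientFault("soft error")
        return result

    return operation
```

For example, `run_with_retry(make_flaky(2, 42))` fails twice and succeeds on the third attempt, while an operation that keeps failing raises `TransientFault` after the initial attempt plus three retries.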

Other processor chip functions

Within a processor chip, there are other functions besides just processor cores.

POWER8 processors have built-in accelerators that can be used as application resources to handle such functions as random number generation. POWER8 also introduces a controller for attaching cache-coherent adapters that are external to the processor module. The POWER8 design contains a function to "freeze" the function that is associated with some of these elements, without taking a system-wide checkstop. Depending on the code that uses these features, a "freeze" event might be handled without an application or partition outage.

As indicated elsewhere, single-bit errors, even solid faults, within internal or external processor fabric buses are corrected by the ECC that is used. POWER8 processor-to-processor module fabric buses also use a spare data lane so that a single failure can be repaired without calling for the replacement of hardware.
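To make the idea of single-bit correction concrete, the following sketch uses a textbook Hamming(7,4) code. This is a stand-in chosen for brevity: the source does not state which ECC the POWER8 fabric buses use, and the function names here are hypothetical. The sketch shows that any single flipped bit in a code word, whether the fault is transient or solid, can be located and corrected from the parity checks alone.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit code word (positions 1..7)."""
    c = [0] * 8  # index 0 unused so that indices match bit positions
    c[3], c[5], c[6], c[7] = data
    c[1] = c[3] ^ c[5] ^ c[7]  # parity over positions 1, 3, 5, 7
    c[2] = c[3] ^ c[6] ^ c[7]  # parity over positions 2, 3, 6, 7
    c[4] = c[5] ^ c[6] ^ c[7]  # parity over positions 4, 5, 6, 7
    return c[1:]


def hamming74_decode(word):
    """Return (data_bits, syndrome); syndrome is the corrected position or 0."""
    c = [0] + list(word)
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome] ^= 1  # flip the single bit that the syndrome points to
    return [c[3], c[5], c[6], c[7]], syndrome
```

A code of this kind corrects one bad bit per word; it does not by itself repair the underlying fault, which is where a mechanism such as the spare data lane described above comes in.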

2.3.4 Serviceability

The server is designed for system installation and setup, feature installation and removal, proactive maintenance, and corrective repair that is performed by the client:

Customer Install and Setup (CSU)

Customer Feature Install (CFI)

Customer Repairable Units (CRU)

Warranty service upgrades are offered for an onsite repair (OSR) by an IBM System Services Representative (SSR), or an authorized warranty service provider.

The system is designed with a 5-year MTBF. If something needs to be serviced or relocated, Table 2-2 on page 37 lists whether an item can be concurrently repaired and whether it requires an IBM SSR to repair.

IBM Power Systems S822LC for High Performance Computing
