Uncorrectable Error Introduction; Processor Core/Cache Correctable Error Handling; Processor Instruction Retry And Other Try Again Techniques - IBM Power Systems E870 Technical Overview And Introduction

4.3.2 Uncorrectable error introduction

An uncorrectable error can be defined as a fault that can cause incorrect instruction execution within logic functions, or an uncorrectable error in data that is stored in caches, registers, or other data structures. In less sophisticated designs, a detected uncorrectable error nearly always results in the termination of the entire system. More advanced system designs might, in some cases, be able to terminate just the application that was using the failed hardware. Such designs might require that uncorrectable errors are detected by the hardware and reported to software layers, which are then responsible for determining how to minimize the impact of the faults.

The advanced RAS features that are built in to POWER8 processor-based systems handle certain "uncorrectable" errors in ways that minimize the impact of the faults, even keeping an entire system up and running after experiencing such a failure.

Depending on the fault, such recovery may use the virtualization capabilities of PowerVM in such a way that the operating system and any applications that are running in the system are neither impacted nor required to participate in the recovery.

4.3.3 Processor Core/Cache correctable error handling

Level 2 (L2) and Level 3 (L3) caches and directories can correct single-bit errors and detect double-bit errors (SEC/DED ECC). Soft errors that are detected in the Level 1 (L1) caches are also correctable by a try again operation that is handled by the hardware. Internal and external processor "fabric" buses have SEC/DED ECC protection as well.

SEC/DED capabilities are also included in other data arrays that are not directly visible to customers.
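
The SEC/DED behavior described above (correct any single-bit error, detect any double-bit error) can be illustrated with a toy extended Hamming code over 4 data bits. This is a minimal sketch for illustration only: the function names and the 8-bit code size are assumptions, and the sketch does not represent the ECC that POWER8 hardware actually uses.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]   # d[0] is the least significant bit
    code = [0] * 8        # index 0: overall parity; indices 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]       # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]       # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]       # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2                 # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                     # XOR of the positions of set bits
    odd_parity = sum(code) % 2 == 1

    if odd_parity:
        # An odd number of flips: treat as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even parity but a nonzero syndrome: two bits flipped. Detect only.
        return "uncorrectable", None
    else:
        status = "ok"

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any one bit of `encode(n)` yields `("corrected", n)` from `decode`, while flipping any two distinct bits yields `("uncorrectable", None)`. That second case is the kind of event that the uncorrectable error handling in this chapter is concerned with.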

Beyond soft error correction, the intent of the POWER8 design is to manage a solid correctable error in an L2 or L3 cache by using techniques to delete a cache line with a persistent issue, or to repair a column of an L3 cache dynamically by using spare capability. Information about column and row repair operations is stored persistently for processors, so that more permanent repairs can be made during processor reinitialization (during a system reboot, or during an individual core Power On Reset that uses the Power On Reset Engine).
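
The idea of deleting a cache line with a persistent issue can be sketched as a simple threshold policy. This is an illustration only: the class name `LineDeleteTracker`, the threshold of 3, and the policy of counting correctable errors per line are all assumptions made for the example, not a description of how POWER8 decides that an error is solid.

```python
from collections import Counter


class LineDeleteTracker:
    """Hypothetical policy: retire a cache line after repeated correctable errors."""

    def __init__(self, threshold=3):
        self.threshold = threshold      # assumed value, chosen for illustration
        self.ce_counts = Counter()      # correctable errors seen per line
        self.deleted = set()            # lines taken out of service

    def record_correctable_error(self, line):
        """Count one correctable error; return True if the line is deleted now."""
        if line in self.deleted:
            return False
        self.ce_counts[line] += 1
        if self.ce_counts[line] >= self.threshold:
            self.deleted.add(line)
            return True
        return False

    def usable(self, line):
        """A deleted line is no longer used to hold data."""
        return line not in self.deleted
```

A single soft error leaves the line in service, whereas a line that keeps producing correctable errors is retired before the fault can combine with a second bit flip into an uncorrectable error.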

4.3.4 Processor Instruction Retry and other try again techniques

Within the processor core, soft error events might occur that interfere with the various computation units. When such an event can be detected before a failing instruction is completed, the processor hardware might be able to try the operation again by using the advanced RAS feature that is known as Processor Instruction Retry. Processor Instruction Retry allows the system to recover from soft faults that would otherwise result in outages of applications or the entire server.

Try again techniques are used in other parts of the system as well. Faults that are detected on the memory bus that connects processor memory controllers to DIMMs can be tried again. In POWER8 systems, the memory controller is designed with a replay buffer that allows memory transactions to be tried again after certain faults internal to the memory controller are detected. This complements the try again abilities of the memory buffer module.
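
The try again approach with a replay buffer can be sketched in software terms as follows. This is a minimal sketch under stated assumptions: `TransientFault`, `FlakyLink`, `issue_with_replay`, and the retry limit of 3 are hypothetical names and values introduced for the example, and the real mechanism is implemented in hardware.

```python
class TransientFault(Exception):
    """Stands in for a detected soft error (hypothetical)."""


class FlakyLink:
    """Simulated link that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def send(self, transaction):
        self.calls += 1
        if self.calls <= self.failures:
            raise TransientFault("simulated soft error")
        return f"done:{transaction}"


def issue_with_replay(transaction, send, max_retries=3):
    """Issue a transaction, replaying the buffered copy after transient faults.

    Returns (result, attempts). Raises RuntimeError if the fault persists,
    which models escalation to uncorrectable error handling.
    """
    buffered = transaction                       # kept until the operation completes
    for attempt in range(1, max_retries + 2):    # one initial try plus the retries
        try:
            return send(buffered), attempt
        except TransientFault:
            continue                             # fault detected: replay from buffer
    raise RuntimeError("fault persisted after retries")
```

For example, `issue_with_replay("write 0x10", FlakyLink(2).send)` succeeds on the third attempt, while a link that fails more than `max_retries` times in a row causes the error to be escalated instead of retried indefinitely.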

IBM Power Systems E870 and E880 Technical Overview and Introduction

Draft Document for Review October 14, 2014 10:19 am
