Special Uncorrectable Error Handling - IBM Power 720 Overview

L2 and L3 array protection
The L2 and L3 caches in the POWER7+ processor are protected with a double-bit detect,
single-bit correct error-correcting code (ECC). Single-bit errors are corrected before the data
is forwarded to the processor, and the corrected data is then written back to the L2 and L3 cache.
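
As an illustration of the single-bit correct, double-bit detect principle, the following is a minimal sketch using an extended Hamming (8,4) code. The code size, bit layout, and function names are assumptions chosen for brevity; the source does not describe the actual ECC layout used in the POWER7+ caches.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity; 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    for p in (1, 2, 4):
        # Parity bit p covers every position whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2         # makes the total number of 1 bits even
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i               # XOR of set positions is 0 for a valid word
    odd_parity = sum(code) % 2 == 1

    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd number of flipped bits: treat as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code = list(code)
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble
```

Flipping any one bit of an encoded word yields "corrected" with the original data, while flipping any two bits yields "uncorrectable", which is the case that the cache-line purge and SUE mechanisms in this section deal with.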
In addition, the caches maintain a cache-line-delete capability. When the number of correctable
errors detected on a cache line reaches a threshold, the data in the cache line can be purged and
the cache line removed from further operation without requiring a reboot. An ECC uncorrectable
error detected in the cache can also trigger a purge and delete of the cache line. This results
in no loss of operation if an unmodified copy of the data is held in system memory, because the
cache line can be reloaded from main memory. Modified data is handled through Special
Uncorrectable Error handling.
Deleted L2 and L3 cache lines are marked for persistent deconfiguration on subsequent
system reboots until they can be replaced.
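
The cache-line-delete policy described above can be sketched as a small bookkeeping structure. This is an illustration only: the class name, the return strings, and the threshold value are hypothetical, since the source does not state the real threshold or how the hardware tracks errors.

```python
from dataclasses import dataclass, field

CE_THRESHOLD = 3  # hypothetical value; the source does not give the real threshold


@dataclass
class CacheLineHealth:
    threshold: int = CE_THRESHOLD
    ce_counts: dict = field(default_factory=dict)  # line index -> correctable errors seen
    deleted: set = field(default_factory=set)      # lines removed from operation

    def report_error(self, line, correctable):
        """Record an ECC event on a cache line and return the action taken."""
        if line in self.deleted:
            return "already-deleted"
        if correctable:
            self.ce_counts[line] = self.ce_counts.get(line, 0) + 1
            if self.ce_counts[line] < self.threshold:
                return "corrected"
        # Either the correctable-error threshold was reached or the error
        # was uncorrectable: purge the data and stop using the line.
        self.deleted.add(line)
        return "purged-and-deleted"

    def usable(self, line):
        return line not in self.deleted
```

In this sketch the `deleted` set stands in for the persistent deconfiguration record: a real system would keep it across reboots so that the same lines stay out of service until the hardware is replaced.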

4.2.5 Special Uncorrectable Error handling

While it is rare, an uncorrectable data error can occur in memory or a cache. IBM POWER
processor-based systems attempt to limit the impact of an uncorrectable error to the least
possible disruption, using a well-defined strategy that first considers the data source.
Sometimes, an uncorrectable error is temporary in nature and occurs in data that can be
recovered from another repository, as in the following examples:
- Data in the instruction L1 cache is never modified within the cache itself. Therefore, an
  uncorrectable error discovered in the cache is treated like an ordinary cache miss, and
  correct data is loaded from the L2 cache.
- The L2 and L3 caches of POWER7+ processor-based systems can hold an unmodified
  copy of data in a portion of main memory. In this case, an uncorrectable error simply
  triggers a reload of a cache line from main memory.
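
The "consider the data source first" strategy in the examples above can be summarized as a short decision function. The enum, function name, and return strings are illustrative assumptions, not firmware interfaces.

```python
from enum import Enum


class Source(Enum):
    L1_INSTRUCTION = "L1 instruction cache"
    L2 = "L2 cache"
    L3 = "L3 cache"


def recovery_path(source, modified):
    """Pick a recovery path for an uncorrectable error found in a cache."""
    if source is Source.L1_INSTRUCTION:
        # Instruction L1 data is never modified in place, so the
        # 'modified' flag is irrelevant for this source.
        return "treat as cache miss; reload from L2"
    if not modified:
        # A clean L2/L3 line still has a valid copy in main memory.
        return "reload cache line from main memory"
    # The only up-to-date copy is the corrupted one.
    return "no clean copy; use SUE handling"
```

Only the last branch, a modified line with no other valid copy, leads to the Special Uncorrectable Error handling described next.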
In cases where the data cannot be recovered from another source, a technique called Special
Uncorrectable Error (SUE) handling is used to prevent an uncorrectable error in memory or
cache from immediately causing the system to terminate. Rather, the system tags the data
and determines whether it will ever be used again:
- If the error is irrelevant, SUE will not force a checkstop.
- If the data is used, termination can be limited to the program, kernel, or hypervisor owning
  the data, or to a freeze of the I/O adapters controlled by an I/O hub controller if the data is
  going to be transferred to an I/O device.
When an uncorrectable error is detected, the system modifies the associated ECC word,
thereby signaling to the rest of the system that the "standard" ECC is no longer valid. The
service processor is then notified and takes appropriate actions. When the system is running
AIX 5.2 (or later) or Linux and a process attempts to use the data, the operating system is
informed of the error. It might then terminate itself, or terminate only the specific process
associated with the corrupt data, depending on the operating system and firmware level and
whether the data was associated with a kernel or non-kernel process.
It is only in the case where the corrupt data is used by the POWER Hypervisor that the entire
system must be rebooted, thereby preserving overall system integrity.
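
The deferred nature of SUE handling (tag the data when the error is detected, act only if the data is later consumed) can be sketched as follows. All names and action strings are hypothetical, the boolean flag merely stands in for the modified ECC word, and the kernel case is simplified: as noted above, the actual outcome depends on the operating system and firmware level.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Consumer(Enum):
    USER_PROCESS = auto()
    KERNEL = auto()
    HYPERVISOR = auto()
    IO_TRANSFER = auto()


# Smallest scope of termination for each kind of consumer (simplified).
SUE_ACTIONS = {
    Consumer.USER_PROCESS: "terminate process",
    Consumer.KERNEL: "terminate operating system",
    Consumer.HYPERVISOR: "reboot system",
    Consumer.IO_TRANSFER: "freeze I/O adapters under the I/O hub",
}


@dataclass
class MemoryWord:
    value: int
    sue: bool = False  # stands in for the special ECC marking


def mark_uncorrectable(word):
    """Detection time: tag the data instead of stopping the system."""
    word.sue = True


def consume(word, consumer):
    """Use time: only now does the error have any effect."""
    if not word.sue:
        return "ok"
    return SUE_ACTIONS[consumer]
```

A tagged word that is never passed to `consume` causes no action at all in this model, which corresponds to the case where the error is irrelevant and no checkstop is forced.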
IBM Power 720 and 740 Technical Overview and Introduction
