Special Uncorrectable Error Handling; PCI Extended Error Handling - IBM BladeCenter PS700 Technical Overview And Introduction

4.3.5 Special uncorrectable error handling

Although rare, an uncorrectable data error can occur in memory or a cache. IBM POWER
processor-based systems attempt to keep the disruption caused by an uncorrectable error
to a minimum by using a well-defined strategy that first considers the data source.
Sometimes, an uncorrectable error is temporary in nature and occurs in data that can be
recovered from another repository. See the following examples:
- Data in the instruction L1 cache is never modified within the cache itself. Therefore, an
  uncorrectable error discovered in the cache is treated as an ordinary cache miss, and
  correct data is loaded from the L2 cache.
- The L2 and L3 caches of the POWER7 processor-based systems can hold an unmodified
  copy of data in a portion of main memory. In this case, an uncorrectable error would trigger
  a reload of a cache line from main memory.
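The source-first strategy in the two examples above can be summarized as a small decision function. The following is a conceptual Python sketch only, not IBM firmware or hardware logic; the names `CacheLevel`, `Recovery`, and `choose_recovery` are hypothetical and introduced here purely for illustration.

```python
from enum import Enum


class CacheLevel(Enum):
    """Where the uncorrectable error was found (hypothetical names)."""
    L1_INSTRUCTION = "L1 instruction cache"
    L2 = "L2 cache"
    L3 = "L3 cache"


class Recovery(Enum):
    """Possible reactions, following the examples in the text."""
    RELOAD_FROM_L2 = "treat as cache miss, reload from L2"
    RELOAD_FROM_MEMORY = "reload cache line from main memory"
    SUE_HANDLING = "no clean copy elsewhere, use SUE handling"


def choose_recovery(level: CacheLevel, modified: bool) -> Recovery:
    """Pick a recovery action for an uncorrectable error (illustrative model)."""
    # Instruction L1 data is never modified in the cache itself, so a
    # correct copy can always be fetched again from L2.
    if level is CacheLevel.L1_INSTRUCTION:
        return Recovery.RELOAD_FROM_L2
    # An unmodified L2/L3 line is a copy of main memory and can be reloaded.
    if not modified:
        return Recovery.RELOAD_FROM_MEMORY
    # Modified data has no other repository to recover from.
    return Recovery.SUE_HANDLING
```

For instance, `choose_recovery(CacheLevel.L3, modified=False)` selects a reload from main memory, whereas a modified L2 line falls through to SUE handling.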
In cases where the data cannot be recovered from another source, a technique called Special
Uncorrectable Error (SUE) handling is used to prevent an uncorrectable error in memory or
cache from immediately causing the system to terminate. Rather, the system tags the data
and determines whether it will ever be used again. Note the following information:
- If the error is irrelevant, it does not force a check stop.
- If the data is used, termination can be limited to the program, kernel, or hypervisor
  owning the data. Also possible is the freezing of the I/O adapters that are controlled by an
  I/O hub controller if data is to be transferred to an I/O device.
When an uncorrectable error is detected, the system modifies the associated ECC word,
thereby signaling to the rest of the system that the standard ECC is no longer valid. The
service processor is notified and takes appropriate actions. When AIX (V5.2 and later)
or Linux is running and a process attempts to use the data, the operating system is informed
of the error and might terminate, or might only terminate a specific process associated with
the corrupt data. This depends on the operating system and firmware level and whether the
data was associated with a kernel or non-kernel process.
Only when the corrupt data is used by the POWER Hypervisor must the entire
system be rebooted, thereby preserving overall system integrity.
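The key idea of SUE handling is that detection and consequence are decoupled: the data is only marked when the error is found, and action is taken later, and only if the data is consumed. The Python sketch below models that deferral under simplifying assumptions. It is not the actual firmware or operating system behavior, which the text notes depends on the operating system and firmware level. The names `Owner`, `TaggedData`, `on_uncorrectable_error`, and `on_use` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Owner(Enum):
    """Who consumes the data (hypothetical categories from the text)."""
    PROCESS = "non-kernel process"
    KERNEL = "operating system kernel"
    HYPERVISOR = "POWER Hypervisor"
    IO_TRANSFER = "data bound for an I/O device"


@dataclass
class TaggedData:
    owner: Owner
    sue_tagged: bool = False  # stands in for the modified ECC word


def on_uncorrectable_error(data: TaggedData) -> str:
    """Detection only marks the data; nothing is terminated yet."""
    data.sue_tagged = True
    return "tag data, notify service processor"


def on_use(data: TaggedData) -> str:
    """Consequence when something actually consumes the data."""
    if not data.sue_tagged:
        return "normal operation"
    # Simplified mapping; real outcomes vary with OS and firmware level.
    outcomes = {
        Owner.PROCESS: "terminate the owning process",
        Owner.KERNEL: "terminate the operating system",
        Owner.HYPERVISOR: "reboot the entire system",
        Owner.IO_TRANSFER: "freeze I/O adapters under the I/O hub controller",
    }
    return outcomes[data.owner]
```

In this model, tagged data that is never passed to `on_use` causes no termination at all, which corresponds to the case where the error is irrelevant and no check stop is forced.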
Depending on system configuration and source of the data, errors encountered during I/O
operations might not result in a machine check. Instead, the incorrect data is handled by the
processor host bridge (PHB) chip. When the PHB chip detects a problem, it rejects the data,
preventing it from being written to the I/O device. The PHB enters a freeze mode that halts
normal operations. Depending on the model and type of I/O being used, the freeze might
include the entire PHB chip, or a single bridge. This results in the loss of all I/O operations
that use the frozen hardware until a power-on reset of the PHB is performed. The impact to
partitions depends on how the I/O is configured for redundancy. In a server configured for
fail-over availability, redundant adapters spanning multiple PHB chips can enable the system
to recover transparently, without partition loss.
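The effect of redundancy on a PHB freeze can be illustrated with a small model. The Python sketch below assumes, for simplicity, that a freeze takes out an entire PHB chip (the text notes it might affect only a single bridge) and that each partition is described by the set of PHBs hosting its redundant adapters. The function name `partitions_losing_io` and the identifiers `lpar1`, `lpar2`, `phb0`, and `phb1` are hypothetical.

```python
from typing import Dict, Set


def partitions_losing_io(
    phbs_by_partition: Dict[str, Set[str]],
    frozen_phbs: Set[str],
) -> Set[str]:
    """Return partitions whose every adapter path sits on a frozen PHB.

    A partition keeps its I/O if at least one of its redundant adapters
    is attached to a PHB that is not frozen.
    """
    return {
        partition
        for partition, phbs in phbs_by_partition.items()
        # Partitions with no I/O at all are not counted as affected.
        if phbs and phbs <= frozen_phbs
    }


# Example configuration (invented for illustration).
config = {
    "lpar1": {"phb0", "phb1"},  # adapters span two PHB chips
    "lpar2": {"phb0"},          # single PHB, no redundancy
}
affected = partitions_losing_io(config, frozen_phbs={"phb0"})
```

With `phb0` frozen, only `lpar2` ends up in `affected`; `lpar1` can continue through its adapter on `phb1`, matching the fail-over configuration described above.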

4.3.6 PCI extended error handling

IBM estimates that PCI adapters can account for a significant portion of the hardware-based
errors on a large server. Although servers that rely on boot-time diagnostics can identify
failing components to be replaced by hot-swap and reconfiguration, runtime errors pose a
more significant problem.
PCI adapters are generally complex designs involving extensive on-board instruction
processing, often on embedded microcontrollers. They tend to use industry standard-grade
IBM BladeCenter PS700, PS701, and PS702 Technical Overview and Introduction
