Memory Error Recovery Mechanisms - IBM p5 590 System Handbook


- Bit-steering to redundant memory in the event of a failed memory module, to keep the server operational
- Bit-scattering, which allows error correction and continued operation in the presence of a complete chip failure (Chipkill recovery)
- Single-bit error correction and double-bit error detection using ECC, without reaching error thresholds, for main memory, L2 cache, L3 cache, and the fabric bus
- L1 cache is protected by parity and re-fetches data from L2 cache when errors are detected
- L3 cache line deletes extended from 2 to 10 for additional self-healing
- Memory scrubbing to help prevent soft-error memory faults
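The single-bit correction and double-bit detection behavior listed above can be illustrated with a toy code. The sketch below uses an extended Hamming (8,4) code, which is an assumption chosen for brevity: the handbook does not state the code or word width the p5 hardware uses, and real memory ECC operates on much wider words. The function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SEC-DED) codeword.

    Index 0 holds the overall parity bit; indices 1..7 follow the classic
    Hamming(7,4) layout: p1 p2 d0 p4 d1 d2 d3.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) & 1             # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble) with status 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    overall = sum(code) & 1                 # 1 means an odd number of bits flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1                 # single-bit error; 0 is the parity bit itself
        status = "corrected"
    else:
        return "uncorrectable", None        # even parity but nonzero syndrome: two flips
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any one bit of `encode(n)` yields `("corrected", n)` from `decode`, while flipping any two bits yields `("uncorrectable", None)`. A memory scrubber can be thought of as running this kind of decode-and-rewrite pass over memory in the background, so that isolated soft errors are corrected before a second flip accumulates in the same word.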
Figure 6-3 graphically represents the redundancy and error recovery
mechanisms on the main memory.
Figure 6-3 Memory error recovery mechanisms
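The idea behind bit-scattering can be sketched as a layout question: if each memory chip contributes at most one bit to any given ECC word, a complete chip failure shows up as one bad bit per word, which single-bit correction can repair. The geometry below (8 chips, 4 bits per chip) and the function names are illustrative assumptions; the handbook does not describe the actual mapping used by the hardware.

```python
from collections import Counter


def scattered_layout(num_chips, bits_per_chip):
    """Map (chip, bit) -> ECC word so that each chip gives one bit to each word."""
    return {(c, b): b for c in range(num_chips) for b in range(bits_per_chip)}


def packed_layout(num_chips, bits_per_chip):
    """Naive alternative: consecutive bits fill a word, so a chip's bits stay together."""
    word_size = num_chips  # same word width as the scattered layout
    return {(c, b): (c * bits_per_chip + b) // word_size
            for c in range(num_chips) for b in range(bits_per_chip)}


def worst_errors_per_word(layout, failed_chip):
    """Largest number of bad bits any single ECC word sees when one chip fails."""
    hits = Counter(word for (chip, _), word in layout.items() if chip == failed_chip)
    return max(hits.values())


if __name__ == "__main__":
    for name, build in (("scattered", scattered_layout), ("packed", packed_layout)):
        layout = build(8, 4)
        worst = max(worst_errors_per_word(layout, chip) for chip in range(8))
        print(f"{name}: worst case {worst} bad bit(s) per ECC word")
```

With the scattered layout, every chip failure costs each word at most one bit. With the packed layout, all four bits of a failed chip land in the same word, which exceeds what a single-error-correcting code can repair.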
Uncorrectable error handling
While it is a rare occurrence, an uncorrectable data error can occur in memory or
a cache, despite all precautions built into the server. In servers prior to IBM's
POWER4 processor-based offerings, this type of error would eventually result in
a system crash. The IBM eServer p5 systems extend the POWER4 technology
design and include techniques for handling these errors.
On these servers, when an uncorrectable error (UE) is identified at one of the
many checkers strategically deployed throughout the system's central electronic
complex, the detecting hardware modifies the ECC word associated with the
data, creating a special ECC code. This code indicates that an uncorrectable
error has been identified at the data source and that the data in the standard
ECC word is no longer valid. The check hardware also signals the service
processors and identifies the source of the error. The active service processor
then takes appropriate action to handle the error. This technique is named
special uncorrectable error (SUE) handling.
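The flow described above can be modeled in a few lines. This is a minimal sketch under loud assumptions: the reserved value `SUE_MARK`, the single-parity `toy_check` function, and the `ServiceProcessor` class are all hypothetical stand-ins, and the toy treats every check mismatch as uncorrectable. It also assumes that a checker seeing an already-marked word passes it along without reporting again, which the handbook implies but does not spell out.

```python
from dataclasses import dataclass, field

SUE_MARK = 0xFF  # hypothetical reserved check value meaning "data is known bad"


def toy_check(data: int) -> int:
    """Stand-in for real ECC check bits: a single parity bit (0 or 1)."""
    return bin(data).count("1") & 1


@dataclass
class EccWord:
    data: int
    check: int


@dataclass
class ServiceProcessor:
    log: list = field(default_factory=list)

    def report(self, source: str) -> None:
        # Record which part of the system first detected the error.
        self.log.append(source)


def checker(word: EccWord, source: str, sp: ServiceProcessor) -> EccWord:
    """Model of one hardware checker on the data path."""
    if word.check == SUE_MARK:
        return word                          # already flagged at its source
    if word.check != toy_check(word.data):
        sp.report(source)                    # signal the service processor
        return EccWord(word.data, SUE_MARK)  # replace check bits with the special code
    return word


def consume(word: EccWord):
    """A consumer refuses data that carries the special code."""
    return None if word.check == SUE_MARK else word.data
```

In this model a corrupted word is marked once, at the first checker that sees it, and the service processor log records only that source. Later checkers and consumers recognize the special code and know the data is invalid, without raising a second, misleading error report.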
Chapter 6. Reliability, availability, and serviceability


This manual is also suitable for:

P5 595
