Reliability, Fault Tolerance, And Data Integrity; Memory Error Correction Extensions; Redundancy For Array Self-Healing - IBM 9123710 - eServer OpenPower 710 Introduction Manual

3.1 Reliability, fault tolerance, and data integrity

The reliability of the OpenPower 710 server starts with components, devices, and subsystems
that are designed to be fault-tolerant. During the design and development process,
subsystems go through rigorous verification and integration testing processes. During system
manufacturing, systems go through a thorough testing process designed to help ensure the
highest level of product quality.
The OpenPower 710 server L3 cache and system memory offer ECC (error checking and
correcting) fault-tolerant features. ECC is designed to correct environmentally induced,
single-bit, intermittent memory failures and single-bit hard failures. With ECC, the
likelihood of memory failures will be substantially reduced.
ECC also provides double-bit memory error detection that helps protect data integrity in
the event of a double-bit memory failure.
System memory also provides 4-bit packet error detection that helps to protect data
integrity in the event of a DRAM chip failure.
The system bus, I/O bus, and PCI buses are designed with parity error detection.
Linux supports disk mirroring (RAID 1). This is supported in software using the md driver.
Some of the hardware RAID adapters supported under Linux also support mirroring.
The Journaled File System maintains file system consistency and reduces the likelihood
of data loss when the system is abnormally halted due to a power failure.
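The single-bit correction and double-bit detection described above can be illustrated with a toy code. The following is a minimal sketch, assuming an extended Hamming(8,4) code over 4 data bits; the function names `encode` and `decode` are hypothetical, and the actual code and word width used by the OpenPower 710 memory controller are not described in this section.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SEC-DED codeword (index 0 = overall parity)."""
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    for p in (1, 2, 4):
        # Check bit p covers every position whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2  # overall parity over positions 1..7
    return code


def decode(code):
    """Return (nibble, status); nibble is None when the error cannot be corrected."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i  # XOR of set positions is 0 for a valid codeword
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity: exactly one bit flipped, and the syndrome is its
        # position (0 means the overall parity bit itself was hit).
        code = code.copy()
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
single = word.copy()
single[5] ^= 1                      # simulate a single-bit failure
double = single.copy()
double[2] ^= 1                      # simulate a second failure in the same word
print(decode(word))                 # (11, 'ok')
print(decode(single))               # (11, 'corrected')
print(decode(double))               # (None, 'uncorrectable')
```

The overall parity bit is what separates the two cases: a single flipped bit makes the overall parity odd, while two flipped bits leave it even but still produce a non-zero syndrome, so the error is reported rather than miscorrected.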

3.1.1 Memory error correction extensions

The OpenPower 710 server uses Error Checking and Correcting (ECC) circuitry for memory
reliability, fault tolerance, and integrity.
Memory has single-error-correct and double-error-detect ECC circuitry designed to
correct single-bit memory failures. The double-bit detection is designed to help maintain
data integrity by detecting and reporting multiple errors beyond what the ECC circuitry can
correct.
The memory chips are organized such that the failure of any specific memory module only
affects a single-bit within an ECC word (bit-scattering), thus allowing for error correction
and continued operation in the presence of a complete chip failure (Chipkill™ recovery).
The memory also utilizes memory scrubbing and thresholding to determine when spare
memory modules, within each bank of memory, if available, should be used to replace
ones that have exceeded their threshold value (dynamic bit-steering). Memory scrubbing
is the process of reading the contents of the memory during idle time and checking and
correcting any single-bit errors that have accumulated by passing the data through the
ECC logic. This function is a hardware function on the memory controller chip and does
not influence normal system memory performance.

3.1.2 Redundancy for array self-healing

Although the most likely failure event in a processor is a soft single-bit error in one of its
caches, there are other events that can occur, and they need to be distinguished from one
another.
For the L1, L2, and L3 caches and their directories, hardware and firmware keep track of
whether permanent errors are being corrected beyond a threshold. If this threshold is
exceeded, a deferred repair error log is created. Additional run-time availability actions,
such as CPU vary off (note 1) or L3 cache line delete, are also initiated.

Note 1: This RAS function is only available for a Linux operating system running the 2.6 kernel.

IBM eServer OpenPower 710 Technical Overview and Introduction
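The threshold-driven behavior described in 3.1.1 and 3.1.2 can be sketched in a few lines. This is an illustrative model only, not the firmware's logic: the class `ErrorThreshold`, its method names, the returned status strings, and the threshold and spare counts are all assumptions made for the example.

```python
from collections import Counter


class ErrorThreshold:
    """Toy model: count corrected errors per component and act past a threshold."""

    def __init__(self, threshold, spares):
        self.threshold = threshold        # corrected errors tolerated per component
        self.spares = spares              # spare components available for steering
        self.counts = Counter()
        self.steered = set()              # components already replaced by a spare
        self.deferred_repair_log = []     # components flagged for later service

    def record_corrected_error(self, component):
        if component in self.steered:
            return "already replaced"
        self.counts[component] += 1
        if self.counts[component] <= self.threshold:
            return "corrected"
        if self.spares > 0:
            # Threshold exceeded and a spare exists: switch over to the spare.
            self.spares -= 1
            self.steered.add(component)
            return "steered to spare"
        # Threshold exceeded with no spare left: log once for deferred repair.
        if component not in self.deferred_repair_log:
            self.deferred_repair_log.append(component)
        return "deferred repair logged"


tracker = ErrorThreshold(threshold=2, spares=1)
for _ in range(3):
    print(tracker.record_corrected_error("module0"))
for _ in range(3):
    print(tracker.record_corrected_error("module1"))
print(tracker.deferred_repair_log)
```

In this sketch the first component to exceed its threshold consumes the only spare, and the second is recorded for deferred repair, mirroring the distinction the text draws between replacing a failing part at run time and logging it for later service.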
