Reliability, Fault Tolerance, And Data Integrity; Memory Error Correction Extensions; Redundancy For Array Self-Healing - IBM 9123710 - eServer OpenPower 710 Introduction Manual

Technical guide


3.1 Reliability, fault tolerance, and data integrity

The reliability of the OpenPower 710 server starts with components, devices, and subsystems

that are designed to be fault-tolerant. During the design and development process,

subsystems go through rigorous verification and integration testing processes. During system

manufacturing, systems go through a thorough testing process designed to help ensure the

highest level of product quality.

The OpenPower 710 server L3 cache and system memory offer ECC (error checking and

correcting) fault-tolerant features. ECC is designed to correct environmentally induced,

single-bit, intermittent memory failures and single-bit hard failures. With ECC, the

likelihood of memory failures will be substantially reduced.

ECC also provides double-bit memory error detection that helps protect data integrity in

the event of a double-bit memory failure.

System memory also provides 4-bit packet error detection that helps to protect data

integrity in the event of a DRAM chip failure.

The system bus, I/O bus, and PCI buses are designed with parity error detection.
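As a minimal sketch of what parity error detection provides, the Python below computes a single even-parity bit over a data word. The 8-bit width and the function names are illustrative assumptions, not the actual bus design. It shows that one parity bit detects any single-bit flip but misses a double-bit flip, which is why parity is a detection mechanism only and cannot correct errors.

```python
def even_parity_bit(word: int) -> int:
    """Return the bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


def parity_ok(word: int, parity: int) -> bool:
    """Check a received word against the parity bit sent with it."""
    return even_parity_bit(word) == parity


data = 0b10110010                 # hypothetical 8-bit bus transfer
parity = even_parity_bit(data)    # sent alongside the data

single_flip = data ^ 0b00001000   # one bit corrupted in transit
double_flip = data ^ 0b00011000   # two bits corrupted in transit

print(parity_ok(data, parity))         # clean transfer passes
print(parity_ok(single_flip, parity))  # single-bit error is detected
print(parity_ok(double_flip, parity))  # double-bit error goes unnoticed
```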

Linux supports disk mirroring (RAID 1). This is supported in software using the md driver.

Some of the hardware RAID adapters supported under Linux also support mirroring.

The Journaled File System maintains file system consistency and reduces the likelihood

of data loss when the system is abnormally halted due to a power failure.

3.1.1 Memory error correction extensions

The OpenPower 710 server uses Error Checking and Correcting (ECC) circuitry for memory

reliability, fault tolerance, and integrity.

Memory has single-error-correct and double-error-detect ECC circuitry designed to

correct single-bit memory failures. The double-bit detection is designed to help maintain

data integrity by detecting and reporting multiple errors beyond what the ECC circuitry can

correct.
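The single-error-correct, double-error-detect behavior described above can be illustrated with a small extended Hamming code. This is a sketch under stated assumptions: it protects a 4-bit value with 4 check bits, whereas the actual word width and code used by the OpenPower 710 memory controller are not given here. The functions `encode` and `decode` are hypothetical names chosen for the example.

```python
def encode(nibble: int) -> list[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1, 2 and 4 hold the
    Hamming check bits; indices 3, 5, 6 and 7 hold the data bits.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def decode(code: list[int]) -> tuple:
    """Return (data, status): status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i              # XOR of set positions locates one bad bit
    overall_odd = sum(code) % 2 == 1   # overall parity violated?
    fixed = list(code)
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        fixed[syndrome] ^= 1           # single-bit error: flip it back
        status = "corrected"
    else:
        return None, "uncorrectable"   # double-bit error: detect and report only
    data = fixed[3] | fixed[5] << 1 | fixed[6] << 2 | fixed[7] << 3
    return data, status


word = encode(0b1010)
one_bad = list(word)
one_bad[5] ^= 1
two_bad = list(one_bad)
two_bad[2] ^= 1
print(decode(word), decode(one_bad), decode(two_bad))
```

A single flipped bit is repaired transparently, while two flipped bits are reported rather than silently miscorrected, which is the data-integrity property the double-bit detection is meant to provide.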

The memory chips are organized such that the failure of any specific memory module only

affects a single bit within an ECC word (bit-scattering), thus allowing for error correction

and continued operation in the presence of a complete chip failure (Chipkill™ recovery).
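To make the bit-scattering idea concrete, the sketch below compares two ways of wiring memory chip outputs into ECC words. The geometry (8 chips, 4 bits per chip, 4 ECC words) and both layout functions are hypothetical and chosen only for illustration; the real module organization is not described in this section.

```python
CHIPS = 8           # hypothetical number of memory chips
BITS_PER_CHIP = 4   # hypothetical x4 devices


def scatter(chip: int, pin: int) -> tuple[int, int]:
    """Scattered layout: each pin of a chip feeds a different ECC word."""
    return pin, chip


def pack(chip: int, pin: int) -> tuple[int, int]:
    """Packed layout: all pins of a chip land in the same ECC word."""
    return chip // 2, (chip % 2) * BITS_PER_CHIP + pin


def bad_bits_per_word(failed_chip: int, layout) -> dict[int, list[int]]:
    """List the corrupted bit positions in each ECC word if one chip fails."""
    hits: dict[int, list[int]] = {}
    for pin in range(BITS_PER_CHIP):
        word, bit = layout(failed_chip, pin)
        hits.setdefault(word, []).append(bit)
    return hits


print(bad_bits_per_word(3, scatter))  # one bad bit in each of four words
print(bad_bits_per_word(3, pack))     # four bad bits in a single word
```

With the scattered layout, a complete chip failure shows up as at most one bad bit per ECC word, which single-error-correct logic can repair. With the packed layout the same failure puts four bad bits into one word, beyond what that logic can correct.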

The memory also utilizes memory scrubbing and thresholding to determine when spare

memory modules, within each bank of memory, if available, should be used to replace

ones that have exceeded their threshold value (dynamic bit-steering). Memory scrubbing

is the process of reading the contents of the memory during idle time and checking and

correcting any single-bit errors that have accumulated by passing the data through the

ECC logic. This function is a hardware function on the memory controller chip and does

not influence normal system memory performance.
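The interaction of scrubbing, thresholding, and spare modules can be sketched as simple bookkeeping. On the real system this is done in memory controller hardware; the class `BankMonitor`, the function `scrub_pass`, the threshold of 3, and the single spare below are all hypothetical values chosen for the example.

```python
from collections import Counter


class BankMonitor:
    """Counts corrected errors per module within one bank of memory."""

    def __init__(self, threshold: int, spares: int):
        self.threshold = threshold   # hypothetical corrected-error limit
        self.spares = spares         # spare modules available in this bank
        self.corrected = Counter()
        self.steered = set()         # modules already replaced by a spare

    def record_correction(self, module: int) -> bool:
        """Record one corrected error; return True if a spare was switched in."""
        if module in self.steered:
            return False
        self.corrected[module] += 1
        if self.corrected[module] > self.threshold and self.spares > 0:
            self.spares -= 1
            self.steered.add(module)
            return True
        return False


def scrub_pass(modules_with_errors: list[int], monitor: BankMonitor) -> list[int]:
    """Feed the corrections from one idle-time scrub pass to the monitor.

    Returns the modules that were steered to a spare during this pass.
    """
    return [m for m in modules_with_errors if monitor.record_correction(m)]


monitor = BankMonitor(threshold=3, spares=1)
print(scrub_pass([2, 2, 2], monitor))  # module 2 reaches, but does not exceed, the threshold
print(scrub_pass([2, 5], monitor))     # module 2 exceeds it and takes the spare
print(scrub_pass([5] * 10, monitor))   # module 5 exceeds it, but no spare is left
```

The "if available" condition in the text corresponds to the `spares > 0` check: once the spare is consumed, further modules that exceed the threshold keep running on ECC correction alone.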

3.1.2 Redundancy for array self-healing

Although the most likely failure event in a processor is a soft single-bit error in one of its

caches, there are other events that can occur, and they need to be distinguished from one

another.

For the L1, L2, and L3 caches and their directories, hardware and firmware keep track of

whether permanent errors are being corrected beyond a threshold. If this threshold is

exceeded, a deferred repair error log is created. Additional run-time availability actions,

such as CPU vary off or L3 cache line delete, are also initiated.

This RAS function is only available for a Linux operating system running the 2.6 kernel.
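The threshold tracking described for the cache arrays can be sketched as follows. This is an illustration only: the class `CacheArrayTracker` is hypothetical, and the rule used here to tell a soft error from a permanent one (an error that recurs on a line that has already been corrected once is treated as permanent) is an assumption, not the documented hardware and firmware behavior.

```python
from dataclasses import dataclass, field


@dataclass
class CacheArrayTracker:
    threshold: int                                   # hypothetical limit
    permanent_count: int = 0
    seen: set = field(default_factory=set)           # lines corrected once
    deleted_lines: set = field(default_factory=set)  # lines taken out of use
    error_log: list = field(default_factory=list)

    def report_corrected_error(self, line: int) -> str:
        """Classify a corrected error and act if the threshold is exceeded."""
        if line in self.deleted_lines:
            return "ignored"
        if line not in self.seen:
            self.seen.add(line)
            return "soft"                 # first occurrence: assumed transient
        self.permanent_count += 1         # recurrence: assumed permanent
        if self.permanent_count > self.threshold:
            self.deleted_lines.add(line)  # run-time action: cache line delete
            self.error_log.append(f"deferred repair: cache line {line}")
            return "line-deleted"
        return "permanent"


tracker = CacheArrayTracker(threshold=1)
for line in (7, 7, 9, 9, 9):
    print(line, tracker.report_corrected_error(line))
print(tracker.error_log)
```

Keeping the log entry separate from the run-time action mirrors the text: the deferred repair error log records that service is needed, while the line delete keeps the system running in the meantime.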

IBM eServer OpenPower 710 Technical Overview and Introduction

