Self-Healing; N+1 Redundancy - IBM p5 550 Technical Overview And Introduction


The operating system cannot program or access the temperature threshold using the SP.

EPOW events can, for example, trigger the following actions:

Temperature monitoring, which increases fan rotation speed when the ambient temperature is above a preset operating range.

Temperature monitoring warns the system administrator of potential environment-related problems. It also performs an orderly system shutdown when the operating temperature exceeds a critical level.

Voltage monitoring provides a warning and an orderly system shutdown when the voltage is out of the operational specification.
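
The mapping from sensor readings to the actions listed above can be sketched as follows. This is a minimal illustration, not the service processor's firmware logic: every threshold value, the single temperature input, and the names Action, temperature_actions, and voltage_actions are assumptions introduced here, since the document gives no numeric limits.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    INCREASE_FAN_SPEED = "increase_fan_speed"
    WARN_ADMIN = "warn_admin"
    ORDERLY_SHUTDOWN = "orderly_shutdown"


# Hypothetical thresholds for illustration only. The real values are preset
# in the service processor and are not stated in this document.
AMBIENT_MAX_C = 35.0          # assumed top of the preset operating range
WARNING_C = 45.0              # assumed warning level
CRITICAL_C = 55.0             # assumed critical level
VOLTAGE_RANGE = (11.4, 12.6)  # assumed operational specification, in volts


def temperature_actions(temp_c):
    """Return the actions triggered by one temperature reading."""
    if temp_c > CRITICAL_C:
        return [Action.WARN_ADMIN, Action.ORDERLY_SHUTDOWN]
    if temp_c > WARNING_C:
        return [Action.INCREASE_FAN_SPEED, Action.WARN_ADMIN]
    if temp_c > AMBIENT_MAX_C:
        return [Action.INCREASE_FAN_SPEED]
    return [Action.NONE]


def voltage_actions(volts):
    """Return the actions triggered by one voltage reading."""
    low, high = VOLTAGE_RANGE
    if low <= volts <= high:
        return [Action.NONE]
    return [Action.WARN_ADMIN, Action.ORDERLY_SHUTDOWN]
```

The sketch treats ambient and operating temperature as one reading for brevity; a real monitor would likely track them separately.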

3.2.4 Self-healing

For a system to be self-healing, it must be able to recover from a failing component by first detecting and isolating the failed component, taking it off line, fixing or isolating it, and reintroducing the fixed or replacement component into service without any application disruption. Examples include:

Bit steering to redundant memory in the event of a failed memory module to keep the server operational.

Bit-scattering, thus allowing for error correction and continued operation in the presence of a complete chip failure (Chipkill recovery).

Single bit error correction using ECC without reaching error thresholds for main, L2, and L3 cache memory.

L3 cache line deletes extended from 2 to 10 for additional self-healing.

ECC extended to inter-chip connections on fabric and processor bus.

Memory scrubbing to help prevent soft-error memory faults.

Dynamic processor deallocation: a deallocated processor can be replaced by an unused Capacity on Demand processor to keep the system operational.
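
Replacing a deallocated processor with an unused Capacity on Demand processor can be pictured as a pool with spares. The sketch below is an illustration under stated assumptions: the class ProcessorPool, its method names, and the idea that deallocation is triggered by a recoverable-error count crossing a threshold are all introduced here and are not taken from this document.

```python
class ProcessorPool:
    """Toy model: active processors plus unused Capacity on Demand spares."""

    def __init__(self, active, spares, error_threshold):
        self.active = list(active)
        self.spares = list(spares)
        self.deallocated = []
        # Assumed trigger: a per-processor count of recoverable errors.
        self.error_threshold = error_threshold
        self.error_counts = {cpu: 0 for cpu in self.active}

    def report_recoverable_error(self, cpu):
        """Count one recoverable error for an active processor.

        Returns the name of the spare brought into service, or None when
        no replacement happened (threshold not exceeded, unknown processor,
        or no spare left).
        """
        if cpu not in self.error_counts:
            return None
        self.error_counts[cpu] += 1
        if self.error_counts[cpu] > self.error_threshold:
            return self._deallocate(cpu)
        return None

    def _deallocate(self, cpu):
        # Take the failing processor off line.
        self.active.remove(cpu)
        del self.error_counts[cpu]
        self.deallocated.append(cpu)
        if not self.spares:
            return None
        # Bring an unused spare into service in its place.
        replacement = self.spares.pop(0)
        self.active.append(replacement)
        self.error_counts[replacement] = 0
        return replacement
```

For example, with a threshold of 2, the third recoverable error reported against a processor moves it to the deallocated list and activates the first available spare.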

Memory reliability, fault tolerance, and integrity

The p5-550 uses Error Checking and Correcting (ECC) circuitry for system memory to correct single-bit and to detect double-bit memory failures. Detection of double-bit memory failures helps maintain data integrity. Furthermore, the memory chips are organized such that the failure of any specific memory module only affects a single bit within a four-bit ECC word (bit-scattering), thus allowing for error correction and continued operation in the presence of a complete chip failure (Chipkill recovery). The memory DIMMs also use memory scrubbing and thresholding to determine when spare memory modules within each bank of memory should be used to replace ones that have exceeded their threshold of error count (dynamic bit-steering). Memory scrubbing is the process of reading the contents of the memory during idle time and checking and correcting any single-bit errors that have accumulated by passing the data through the ECC logic. This function is a hardware function on the memory controller chip and does not influence normal system memory performance.
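
Single-bit correction, double-bit detection, and scrubbing can be illustrated with a textbook extended Hamming code over 4 data bits. This is a sketch only: it is not the ECC code or word layout used by the p5-550 memory controller, and the function names encode, decode, and scrub are assumptions made for the example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1..7 form a Hamming(7,4)
    codeword with parity bits at positions 1, 2, and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_error = sum(code) % 2  # 1 when an odd number of bits flipped
    if syndrome == 0 and parity_error == 0:
        status = "ok"
    elif parity_error == 1:
        # Single-bit error: the syndrome is the failing position
        # (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


def scrub(memory):
    """Pass every stored word through the ECC logic and rewrite corrected words.

    Returns the number of single-bit errors that were repaired.
    """
    repaired = 0
    for addr, word in enumerate(memory):
        nibble, status = decode(word)
        if status == "corrected":
            memory[addr] = encode(nibble)
            repaired += 1
    return repaired
```

Scrubbing during idle time matters because it repairs single-bit errors before a second flip in the same word can turn them into a detectable but uncorrectable double-bit error.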

3.2.5 N+1 redundancy

The use of redundant parts allows the p5-550 to remain operational with full resources:

Redundant spare memory bits in L1, L2, L3, and main memory

Redundant fans

Redundant power supplies (optional)
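
The N+1 idea can be stated as a simple capacity check: a subsystem that needs N units to run at full resources is fitted with at least N+1, so one failure leaves it fully operational. The function names and the unit counts in the usage note below are hypothetical; the document does not give actual fan or power supply counts.

```python
def is_n_plus_one(required, installed):
    """True when at least one unit beyond the required N is installed."""
    return installed >= required + 1


def remains_operational(required, installed, failed):
    """True when the surviving units still cover the required N."""
    return installed - failed >= required
```

With an assumed requirement of one power supply and two installed, remains_operational(1, 2, 1) holds after a single failure, while a second failure would not be covered.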

