Self-Healing - IBM IntelliStation POWER 285 Technical Overview And Introduction

The operating system cannot program or access the temperature threshold using the SP.
EPOW events can, for example, trigger the following actions:
- Temperature monitoring, which increases the fan rotation speed when the ambient
  temperature is above a preset operating range.
- Temperature monitoring, which warns the system administrator of potential
  environment-related problems and performs an orderly system shutdown when the
  operating temperature exceeds a critical level.
- Voltage monitoring, which provides a warning and an orderly system shutdown when the
  voltage is out of the operational specification.
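
The mapping from sensor readings to EPOW actions described above can be pictured with a small sketch. This is an illustration only, not the service processor firmware: the threshold values, the `Action` names, and the `epow_actions` function are all hypothetical, and the exact pairing of warnings with each condition is an assumption based on the list above.

```python
from enum import Enum


class Action(Enum):
    INCREASE_FAN_SPEED = "increase fan speed"
    WARN_ADMIN = "warn system administrator"
    ORDERLY_SHUTDOWN = "orderly system shutdown"


# Hypothetical thresholds, chosen only for illustration.
AMBIENT_MAX_C = 35.0        # top of the preset operating range
CRITICAL_TEMP_C = 50.0      # critical operating temperature
NOMINAL_VOLTAGE = 12.0      # nominal rail voltage
VOLTAGE_TOLERANCE = 0.05    # allowed deviation as a fraction of nominal


def epow_actions(temp_c: float, voltage: float) -> list[Action]:
    """Return the actions a reading would trigger, in order, without duplicates."""
    actions: list[Action] = []

    def add(action: Action) -> None:
        if action not in actions:
            actions.append(action)

    if temp_c > CRITICAL_TEMP_C:
        # Above the critical level: warn and shut down in an orderly way.
        add(Action.WARN_ADMIN)
        add(Action.ORDERLY_SHUTDOWN)
    elif temp_c > AMBIENT_MAX_C:
        # Above the operating range but not critical: cool harder and warn.
        add(Action.INCREASE_FAN_SPEED)
        add(Action.WARN_ADMIN)

    if abs(voltage - NOMINAL_VOLTAGE) > NOMINAL_VOLTAGE * VOLTAGE_TOLERANCE:
        # Voltage outside the operational specification.
        add(Action.WARN_ADMIN)
        add(Action.ORDERLY_SHUTDOWN)

    return actions


if __name__ == "__main__":
    for reading in [(25.0, 12.0), (40.0, 12.0), (55.0, 12.0), (25.0, 13.0)]:
        print(reading, [a.value for a in epow_actions(*reading)])
```

In the real system these thresholds are held by the service processor and, as noted above, cannot be programmed or read by the operating system.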

3.1.4 Self-healing

For a system to be self-healing, it must be able to recover from a failing component by first
detecting and isolating the failed component, taking it offline, fixing or isolating it, and
reintroducing the fixed or replacement component into service without any application
disruption. Examples include:
- Bit steering to redundant memory in the event of a failed memory chip to keep the server
  operational
- Bit-scattering, thus allowing for error correction and continued operation in the presence
  of a complete chip failure (Chipkill™ recovery)
- Single-bit error-correction using ECC without reaching error thresholds for main, L2, and
  L3 cache memory
- L3 cache line deletes extended from 2 to 10 for additional self-healing
- ECC extended to inter-chip connections on fabric and processor bus
- Memory scrubbing to help prevent soft-error memory faults

Memory reliability, fault tolerance, and integrity

The IntelliStation POWER 285 uses Error Checking and Correcting (ECC) circuitry for system
memory to correct single-bit and to detect double-bit memory failures. Detection of double-bit
memory failures helps maintain data integrity. Furthermore, the memory chips are organized
such that the failure of any specific memory chip only affects a single bit within a four-bit ECC
word (bit-scattering), thus allowing for error correction and continued operation in the
presence of a complete chip failure (Chipkill recovery). The memory DIMMs also use
memory scrubbing and thresholding to determine when spare memory chips within each
bank of memory should be used to replace ones that have exceeded their threshold of error
count (dynamic bit-steering). Memory scrubbing is the process of reading the contents of the
memory during idle time and checking and correcting any single-bit errors that have
accumulated by passing the data through the ECC logic. This function is a hardware function
on the memory controller and does not influence normal system memory performance.
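
The interplay of ECC, bit-scattering, scrubbing, and thresholding can be illustrated with a toy model. The sketch below uses a textbook extended Hamming code (4 data bits in an 8-bit word) as a stand-in; it is not the ECC code used by the POWER5 memory controller, and the names `encode`, `decode`, `scrub`, and the threshold value are hypothetical. Each bit position of a word is assumed to live on a different memory chip, so one failed chip damages at most one bit per word.

```python
from collections import Counter

DATA_POSITIONS = (3, 5, 6, 7)  # Hamming positions that carry the 4 data bits


def encode(nibble: int) -> int:
    """Encode 4 data bits as an 8-bit SEC-DED word (extended Hamming code)."""
    bits = [0] * 8  # index 0 is the overall parity bit, 1..7 are Hamming positions
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    bits[0] = sum(bits[1:]) % 2  # makes the total number of 1s even
    return sum(bit << i for i, bit in enumerate(bits))


def decode(word: int):
    """Return (nibble, status, bad_bit); nibble is None if uncorrectable."""
    bits = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    bad_bit = None
    if sum(bits) % 2 == 1:
        # Odd overall parity: treated as a single-bit error. The syndrome names
        # the flipped position (0 means the overall parity bit itself).
        bad_bit = syndrome
        bits[bad_bit] ^= 1
        status = "corrected"
    elif syndrome:
        return None, "uncorrectable", None  # double-bit error: detected only
    else:
        status = "ok"
    nibble = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, status, bad_bit


def scrub(memory: list, error_counts: Counter, threshold: int = 3) -> list:
    """One scrub pass: rewrite corrected words and count errors per bit lane.

    Returns the bit lanes (chips, in this model) whose error count exceeds
    the threshold and would be candidates for bit-steering to a spare.
    """
    for addr, word in enumerate(memory):
        nibble, status, bad_bit = decode(word)
        if status == "corrected":
            memory[addr] = encode(nibble)  # write the clean word back
            error_counts[bad_bit] += 1
    return sorted(lane for lane, n in error_counts.items() if n > threshold)


if __name__ == "__main__":
    memory = [encode(n) for n in range(16)]
    for addr in (1, 4, 7, 9):        # simulate a flaky chip on bit lane 5
        memory[addr] ^= 1 << 5
    counts = Counter()
    print("lanes over threshold:", scrub(memory, counts))
    print("memory clean again:", memory == [encode(n) for n in range(16)])
```

Because every damaged word in this model has only one bad bit, the scrub pass repairs all of them, and the per-lane counter is what singles out the failing chip.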
