Reliability, Availability, And Serviceability Features - IBM BladeCenter JS12 7998 Installation And User Manual

For more information, see the online information or the Problem Determination
and Service Guide.
• Power throttling
  If your BladeCenter unit supports power management, the power consumption
  of the blade server can be dynamically managed through the management
  module. For more information, see the online management-module
  documentation or the IBM support site at http://www.ibm.com/systems/support/.

Reliability, availability, and serviceability features

Three of the most important features in server design are reliability, availability,
and serviceability (RAS). The reliability of the BladeCenter JS12 starts with
components, devices, and subsystems that are fault tolerant.
Reliability, availability, and serviceability protect the integrity of the data that is
stored in the blade server, maintain the availability of the blade server when you
need it, and enhance the ease with which you can diagnose and correct problems.

Component-level RAS features

The blade server has the following component-level RAS features:
• Alternate processor recovery
• Bit steering
• Chipkill memory for dual inline memory modules (DIMMs)
• Diagnostic support of Ethernet controllers
• Dual inline memory module (DIMM) failure isolation
  – DIMM pair identification through unrecoverable error (UE) checkpointing and
    message-related recovery actions
  – Single DIMM identification through recoverable component error (CE)
    checkpointing and garding
• Dynamic deallocation (runtime POWER6 garding of microprocessor and
  memory)
• L2 cache line delete
• Memory chip kill - Chipkill memory for DIMMs
• Memory Predictive Failure Analysis (PFA) alerts through scrubbing and
  error-checking and correction (ECC)
• Memory scrubbing
• Peripheral component interconnect (PCI) bus parity, ECRC, and surprise link
  down
• PFA thresholding of correctable hardware errors of the microprocessor and L2
  cache
• Processor runtime diagnostics (PRD) that initiates the following actions to
  recover from errors:
  – Self-healing, such as redundant bit steering for memory
  – Deallocation at runtime of a failing resource, such as a processor core or a
    memory page
  – Identifying parts for service
  – Runtime error persistent deallocation, if necessary, for I-cache, D-cache, L2
    cache, L3 cache
  – Transparent microprocessor hardware error recovery (for example, for L2
    cache errors)
• Single processor checkstop (including a partition checkstop)
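
The feature list names PFA thresholding of correctable hardware errors but does not describe how a threshold is applied. The following is only a conceptual sketch of the general idea (count correctable errors in a sliding time window and raise an alert once a limit is reached). It is not the blade server firmware: the class name, the method name, the threshold, and the window length are all hypothetical values chosen for illustration.

```python
from collections import deque


class CorrectableErrorMonitor:
    """Toy PFA-style threshold check (hypothetical, not IBM firmware).

    Flags a component once it logs at least `threshold` correctable
    errors inside a sliding window of `window_seconds`.
    """

    def __init__(self, threshold=3, window_seconds=60.0):
        # Both defaults are made-up illustration values.
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record_error(self, timestamp):
        """Record one correctable error at `timestamp` (seconds).

        Returns True when the number of errors still inside the
        window has reached the threshold.
        """
        self._timestamps.append(timestamp)
        # Drop errors that have aged out of the window.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold
```

With these assumed values, three errors within one minute would trigger an alert, while the same three errors spread several minutes apart would not. Real thresholds and recovery actions are hardware specific; see the Problem Determination and Service Guide.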
