Reliability, Availability, And Serviceability Features - IBM BladeCenter JS12 7998 Installation And User Manual

For more information, see the online information or the Problem Determination and Service Guide.

• Power throttling
  If your BladeCenter unit supports power management, the power consumption of the blade server can be dynamically managed through the management module. For more information, see the online management-module documentation or the IBM support site at http://www.ibm.com/systems/support/.

Reliability, availability, and serviceability features

Three of the most important features in server design are reliability, availability, and serviceability (RAS). The reliability of the BladeCenter JS12 starts with components, devices, and subsystems that are fault tolerant.

Reliability, availability, and serviceability protect the integrity of the data that is stored in the blade server, maintain the availability of the blade server when you need it, and enhance the ease with which you can diagnose and correct problems.

Component-level RAS features

The blade server has the following component-level RAS features:

• Alternate processor recovery
• Bit steering
• Chipkill memory for dual inline memory modules (DIMMs)
• Diagnostic support of Ethernet controllers
• Dual inline memory module (DIMM) failure isolation
  – DIMM pair identification through unrecoverable error (UE) checkpointing and message-related recovery actions
  – Single DIMM identification through recoverable component error (CE) checkpointing and garding
• Dynamic deallocation (runtime POWER6 garding of microprocessor and memory)
• L2 cache line delete
• Memory chip kill - Chipkill memory for DIMMs
• Memory Predictive Failure Analysis (PFA) alerts through scrubbing and error-checking and correction (ECC)
• Memory scrubbing
• Peripheral component interconnect (PCI) bus parity, ECRC, and surprise link down
• PFA thresholding of correctable hardware errors of the microprocessor and L2 cache
• Processor runtime diagnostics (PRD) that initiates the following actions to recover from errors:
  – Self-healing, such as redundant bit steering for memory
  – Deallocation at runtime of a failing resource, such as a processor core or a memory page
  – Identifying parts for service
  – Runtime error persistent deallocation, if necessary, for I-cache, D-cache, L2 cache, and L3 cache
  – Transparent microprocessor hardware error recovery (for example, for L2 cache errors)
• Single processor checkstop (including a partition checkstop)

JS12 Type 7998: Installation and User's Guide
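The general idea behind PFA thresholding of correctable errors, followed by garding (deallocation) of the failing resource, can be illustrated with a minimal sketch. This is not the blade server firmware and does not reflect its actual thresholds or interfaces: the class name PfaMonitor, the threshold of three errors, the time window, and the component label "DIMM1" are all assumptions made for illustration only.

```python
from collections import defaultdict, deque


class PfaMonitor:
    """Hypothetical monitor: counts correctable errors per component and
    gards (deallocates) a component once a threshold is reached."""

    def __init__(self, threshold=3, window_s=3600.0):
        self.threshold = threshold   # assumed value, not an IBM figure
        self.window_s = window_s     # assumed sliding window, in seconds
        self._events = defaultdict(deque)
        self.garded = set()

    def record_correctable_error(self, component, timestamp):
        """Log one corrected error; return True if it triggers a PFA alert."""
        if component in self.garded:
            return False             # already deallocated, nothing to do
        events = self._events[component]
        events.append(timestamp)
        # Discard errors that fall outside the sliding window.
        while events and timestamp - events[0] > self.window_s:
            events.popleft()
        if len(events) >= self.threshold:
            self.garded.add(component)   # stop using the failing resource
            return True
        return False


monitor = PfaMonitor(threshold=3, window_s=60.0)
alerts = [monitor.record_correctable_error("DIMM1", t) for t in (0.0, 10.0, 20.0)]
print(alerts, sorted(monitor.garded))
```

In this sketch, isolated correctable errors are tolerated, and only a burst within the window marks the component for service and removes it from use.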
