Reliability, Availability, And Serviceability Features - IBM Power PS700 Installation And User Manual


Light path diagnostics provides light-emitting diodes (LEDs) to help you diagnose problems. An LED
on the blade server control panel is lit if an unusual condition or a problem occurs. If this happens,
you can look at the LEDs on the system board to locate the source of the problem.
For more information, see the online information or the Problem Determination and Service Guide.
• Power throttling
If your BladeCenter unit supports power management, the power consumption of the blade server can
be dynamically managed through the management module. For more information, see the online
management-module documentation or the IBM support site at http://www.ibm.com/systems/support/.

Reliability, availability, and serviceability features

Three of the most important features in server design are reliability, availability, and serviceability (RAS).
The reliability of the BladeCenter PS700 blade server starts with components, devices, and subsystems
that are fault tolerant.
Reliability, availability, and serviceability protect the integrity of the data that is stored in the blade server,
maintain the availability of the blade server when you need it, and enhance the ease with which you can
diagnose and correct problems.
Component-level RAS features
The blade server has the following component-level RAS features:
• Alternate processor recovery
• Bit steering
• Chipkill memory for dual inline memory modules (DIMMs)
• Diagnostic support of Ethernet controllers
• Dual inline memory module (DIMM) failure isolation
– DIMM pair identification through unrecoverable error (UE) checkpointing and message-related
recovery actions
– Single DIMM identification through recoverable component error (CE) checkpointing and garding
• Dynamic deallocation (runtime POWER7 garding of microprocessor and memory)
• L2 cache line delete
• Memory chip kill - Chipkill memory for DIMMs
• Memory Predictive Failure Analysis (PFA) alerts through scrubbing and error-checking and correction
(ECC)
• Memory scrubbing
• Peripheral component interconnect (PCI) bus parity, ECRC, and surprise link down
• PFA thresholding of correctable hardware errors of the microprocessors and L2 cache
• Processor runtime diagnostics (PRD) that initiate the following actions to recover from errors:
– Self-healing, such as redundant bit steering for memory
– Deallocation at runtime of a failing resource, such as a processor core or a memory page
– Identifying parts for service
– Runtime error persistent deallocation, if necessary, for I-cache, D-cache, L2 cache, and L3 cache
– Transparent microprocessor hardware error recovery (for example, for L2 cache errors)
• Single processor checkstop (including a partition checkstop)
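
As an illustration only, PFA thresholding can be pictured as counting the correctable errors reported for a component within a time window, then flagging the component for deallocation and service once a limit is reached. The following Python sketch is a conceptual model and does not represent the blade server firmware. The class name PfaMonitor, the threshold of 3, the 60-second window, and the component label are all hypothetical values chosen for the example.

```python
from collections import defaultdict, deque


class PfaMonitor:
    """Conceptual model of PFA thresholding (hypothetical, not IBM firmware)."""

    def __init__(self, threshold=3, window_seconds=3600.0):
        self.threshold = threshold            # assumed error limit per window
        self.window_seconds = window_seconds  # assumed sliding-window length
        self._events = defaultdict(deque)     # component -> recent error times
        self.deallocated = set()              # components flagged for service

    def record_correctable_error(self, component, timestamp):
        """Record one correctable error; return True if the component is flagged."""
        if component in self.deallocated:
            return True
        events = self._events[component]
        events.append(timestamp)
        # Discard errors that are older than the sliding window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        # Flag the component once the error count reaches the threshold.
        if len(events) >= self.threshold:
            self.deallocated.add(component)
            return True
        return False


monitor = PfaMonitor(threshold=3, window_seconds=60.0)
results = [monitor.record_correctable_error("DIMM 1", t) for t in (0.0, 10.0, 20.0)]
print(results)
print(sorted(monitor.deallocated))
```

In this model, three errors within the window flag the component, while the same number of errors spread over a longer period do not, which is how thresholding separates a degrading part from occasional transient errors.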