Component Reliability; Extended System Testing And Surveillance - IBM p5 590 System Handbook

6.3.3 Component reliability

The components used in the CEC provide superior levels of reliability and
undergo additional stress testing and screening above and beyond that applied
to the industry-standard components used in several UNIX OS-based systems
today.
Fault avoidance is also enhanced by minimizing the total number of components,
and this is inherent in POWER5 chip technology, with two processors per chip. In
addition, the basic memory DIMM technology has been significantly improved in
reliability through the use of more reliable soldered connections to the memory
cards. Going beyond component reliability, soft errors in internal arrays
throughout the POWER5 chip are systematically masked using internal ECC
recovery techniques whenever an error is detected.
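As an illustration of how an error-correcting code can mask a soft error, the following is a minimal sketch of a textbook Hamming(7,4) code in Python. The handbook does not describe the ECC scheme that POWER5 arrays actually use, so the code choice and the function names hamming74_encode and hamming74_correct are assumptions made for this example only.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (0..15) as a 7-bit Hamming codeword (list of bits).

    Codeword positions 1..7 hold: p1 p2 d0 p4 d1 d2 d3.
    """
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                             # index 0 unused; positions are 1-based
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]      # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]      # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]      # parity over positions 4, 5, 6, 7
    return code[1:]


def hamming74_correct(bits):
    """Return (data, syndrome); a nonzero syndrome is the position that was repaired."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                    # XOR of set positions is 0 for a valid word
    if syndrome:
        code[syndrome] ^= 1                    # flip the faulty bit back
    data = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(data)), syndrome


stored = hamming74_encode(0b1011)
stored[4] ^= 1                                 # simulate a soft error at position 5
value, repaired_at = hamming74_correct(stored)
```

After the simulated bit flip, the decoder still returns the original data value and reports which position it repaired, which is the sense in which the error is masked from the rest of the system. This toy code corrects only a single flipped bit per codeword.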
The POWER5 chip provides additional enhancements such as virtualization and
improved reliability, availability, and serviceability at both chip and system levels.
The chip includes approximately 276 million transistors. Given the large number
of circuits and the small die size, one of the biggest challenges in modern
processor design is controlling chip power consumption in order to limit heat
generation. Unmanaged, the heat can significantly affect the overall reliability of a
server. The introduction of simultaneous multi-threading in POWER5 allows the
chip to execute more instructions per cycle per processor core, increasing total
switching power. In mitigation, POWER5 chips use a fine-grained, dynamic
clock-gating mechanism. This mechanism turns off clocks to a local clock buffer if
dynamic management logic determines that a set of latches driven by the buffer
will not be used in the next cycle. This allows substantial power saving with no
performance impact.
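The effect of clock gating can be sketched with a toy model that counts clock pulses delivered to local clock buffers. This is not a model of the POWER5 circuitry: the buffer names, the schedule, and the function clock_pulses are hypothetical, and the sketch assumes the management logic knows exactly which latches are needed in each cycle.

```python
def clock_pulses(buffers, schedule, gated):
    """Count clock pulses delivered to local clock buffers over a schedule.

    schedule is a list with one entry per cycle; each entry is the set of
    buffers whose latches are actually needed in that cycle.
    """
    pulses = 0
    for needed in schedule:
        for buf in buffers:
            # Without gating every buffer is clocked every cycle;
            # with gating only the buffers that will be used are clocked.
            if not gated or buf in needed:
                pulses += 1
    return pulses


# Hypothetical buffers and a hypothetical five-cycle usage pattern.
buffers = ["fxu", "fpu", "lsu", "bru"]
schedule = [{"fxu", "lsu"}, {"fxu"}, set(), {"fpu", "lsu"}, {"bru"}]

ungated = clock_pulses(buffers, schedule, gated=False)
gated = clock_pulses(buffers, schedule, gated=True)
print(f"pulses without gating: {ungated}, with gating: {gated}")
```

Both runs cover the same number of cycles, so the work completes in the same time; only the number of clock pulses, used here as a rough stand-in for switching power, differs.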

6.3.4 Extended system testing and surveillance

The design of the p5-590 and p5-595 aids in the recognition of intermittent errors
that are either corrected dynamically or reported for further isolation and repair.
Parity checking on the system bus, cyclic redundancy checking (CRC) on the
remote I/O (RIO-2) bus, and the use of error-correcting code (ECC) on memory and
processors contribute to outstanding RAS characteristics.
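The difference between parity checking and CRC checking can be sketched as follows. The handbook does not give the parity scheme of the system bus or the CRC polynomial of the RIO-2 bus, so this example assumes even parity over a single word and uses the CRC-32 from the Python standard library purely as a stand-in. The helper names are hypothetical.

```python
import zlib


def even_parity_bit(word):
    """Parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


def parity_ok(word, parity):
    """True if the received word still matches its parity bit."""
    return even_parity_bit(word) == parity


def frame_with_crc(payload):
    """Append a 4-byte CRC-32 to a payload (illustrative framing only)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def crc_ok(frame):
    """Recompute the CRC over the payload and compare with the trailer."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return zlib.crc32(payload) == received


word = 0b10110010
parity = even_parity_bit(word)
one_flip_detected = not parity_ok(word ^ 0b00000100, parity)   # single-bit error
two_flips_detected = not parity_ok(word ^ 0b00000110, parity)  # double-bit error

frame = bytearray(frame_with_crc(b"example packet"))
frame[0] ^= 0x03                                               # flip two bits in transit
crc_detected = not crc_ok(bytes(frame))
```

A single parity bit catches any odd number of flipped bits but misses the double-bit error, while the CRC catches it. Both checks only detect errors; correcting them is the role of ECC.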
During the boot sequence, built-in self test (BIST) and power-on self test (POST)
routines check the processors, cache, and associated hardware required for a
successful system start. These tests run every time the system is powered on.
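The general idea of a power-on self test can be sketched as a pattern test over a simulated memory. This is not the p5-590 firmware algorithm, which the handbook does not describe. The FaultyMemory class, the pattern_test function, and the choice of test patterns are all assumptions made for illustration.

```python
class FaultyMemory:
    """Simulated byte-addressable memory; one cell may have bits stuck at 0."""

    def __init__(self, size, stuck_addr=None, stuck_mask=0):
        self.size = size
        self.cells = bytearray(size)
        self.stuck_addr = stuck_addr
        self.stuck_mask = stuck_mask

    def write(self, addr, value):
        if addr == self.stuck_addr:
            value &= ~self.stuck_mask & 0xFF   # stuck bits can never be set
        self.cells[addr] = value

    def read(self, addr):
        return self.cells[addr]


def pattern_test(mem, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Write each pattern to every cell, read it back, and return failing addresses."""
    failures = set()
    for pattern in patterns:
        for addr in range(mem.size):
            mem.write(addr, pattern)
        for addr in range(mem.size):
            if mem.read(addr) != pattern:
                failures.add(addr)
    return sorted(failures)


healthy = pattern_test(FaultyMemory(64))
faulty = pattern_test(FaultyMemory(64, stuck_addr=5, stuck_mask=0x10))
```

A healthy memory yields an empty failure list, while the simulated stuck bit is reported at its address. A real self test would report such a failure for isolation and repair rather than simply returning a list.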
Chapter 6. Reliability, availability, and serviceability