IBM's RAS Philosophy - IBM p5 590 System Handbook

[Figure label text: Problem Management; Concurrent Repair; Error Logging / Diagnosis; Threshold / Prediction; Fault Masking; Fault avoidance; Repair, Service; Diagnose, Reconfigure; Recovery, retry]
Figure 6-1 IBM's RAS philosophy
Both the p5-595 and p5-590 are designed to provide new levels of proven,
mainframe-inspired reliability, availability, and serviceability for mission-critical
applications. They come equipped with multiple resources to help identify and resolve
system problems rapidly. During ongoing operation, error checking and
correction (ECC) checks data for errors and can correct them in real time. First
Failure Data Capture (FFDC) capabilities log both the source and root cause of
problems to help prevent the recurrence of intermittent failures that diagnostics
cannot reproduce. Meanwhile, Dynamic Processor Deallocation and dynamic
deallocation of PCI bus slots help to reallocate resources when an impending
failure is detected so applications can continue to run unimpeded.
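To illustrate how an error-correcting code can detect and repair a flipped bit on the fly, here is a minimal sketch using the textbook Hamming(7,4) code. This is an assumption chosen for illustration only: the handbook does not say which ECC scheme the p5-590 and p5-595 hardware uses, and the function names below are hypothetical.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (a list of 0/1 values) into a 7-bit codeword.

    Illustrative only: Hamming(7,4) is a stand-in, not the code
    used by the actual hardware.
    """
    d1, d2, d3, d4 = nibble
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return (data_bits, error_position).

    error_position is 0 when no error was detected, otherwise the
    1-based position of the bit that was corrected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The three syndrome bits, read as a binary number, point at the bad bit.
    error_position = s1 + 2 * s2 + 4 * s3
    if error_position:
        c[error_position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], error_position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a single-bit fault at position 5
    print(hamming74_decode(word))
```

A code like this corrects any single-bit error per codeword; detecting double-bit errors as well would need an extra overall parity bit, which this sketch leaves out.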
The p5-595 and p5-590 also include structural elements to help ensure
outstanding availability and serviceability. The 24-inch system frame includes
hot-swappable disk bays and blind-swap, hot-plug PCI-X slots that allow
administrators to repair, replace or install components without interrupting the
system. Redundant hot-plug power and cooling subsystems provide power and
cooling backup in case units fail, and they allow for easy replacement. In the
event of a complete power failure, early power off warning capabilities are
designed to perform an orderly shutdown. In addition, primary and redundant
battery backup power subsystems, as well as UPSs, are optionally available.
The p5-590 and p5-595 RAS design enhancements can be grouped into four
main areas:
[Figure label text: Failure; Recovery; Remote; Support; Successful; User; Redundant; Notification; Policy; Failover; Fault Isolation; Failure Damage Containment; Analysis; Unsuccessful; Redundancy; HW Retry; Error Detection; Base Hardware and Software Design Integrity; Failure Resilience; Restart; Software; Damage; Retry; Control; Failure Data Capture]
Chapter 6. Reliability, availability, and serviceability