Fault Masking; Resource Deallocation - IBM p5 550 Technical Overview And Introduction

3.2.6 Fault masking

If corrections and retries succeed and do not exceed threshold limits, the system remains
operational with full resources, and no client or IBM customer engineer intervention is
required. Fault masking is used for the following faults:
CEC bus retry and recovery
PCI-X bus recovery
ECC Chipkill soft error

3.2.7 Resource deallocation

If recoverable errors exceed threshold limits, resources can be deallocated while the system
remains operational, allowing maintenance to be deferred to a convenient time.
Dynamic or persistent deallocation
Dynamic deallocation of potentially failing components is non-disruptive, allowing the system
to continue to run. Persistent deallocation occurs when a failed component is detected; the
component is then deactivated at a subsequent reboot.
Dynamic deallocation functions include:
Processor
L3 cache line delete
Partial L2 cache deallocation
PCI-X bus and slots
For dynamic processor deallocation, the service processor performs a predictive failure
analysis based on any recoverable processor errors that have been recorded. If these
transient errors exceed a defined threshold, the event is logged and the processor is
deallocated from the system while the operating system continues to run. This feature
(named CPU Guard) enables maintenance to be deferred until a suitable time. Processor
deallocation can only occur if there are sufficient functional processors (at least two).
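The decision rule for dynamic processor deallocation can be sketched as a small shell function. This is only an illustration of the policy as described, not the service processor's actual logic. The function name and its three inputs are hypothetical, the threshold value is not given in this document, and "at least two" is read here as requiring two or more functional processors before one may be removed.

```shell
#!/bin/sh
# Illustrative sketch only; not service processor firmware.
# should_deallocate_cpu is a hypothetical helper name.
# Arguments (all hypothetical inputs chosen for illustration):
#   $1  number of recoverable errors recorded for the processor
#   $2  threshold (the real threshold is not stated in this document)
#   $3  number of functional processors currently in the system
# Returns 0 (deallocate) only when the error count exceeds the threshold
# and at least two functional processors are present.
should_deallocate_cpu() {
    errors=$1
    threshold=$2
    functional_cpus=$3
    [ "$errors" -gt "$threshold" ] && [ "$functional_cpus" -ge 2 ]
}
```

With these assumed inputs, `should_deallocate_cpu 12 10 4` succeeds, while the same error count on a system with a single functional processor does not.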
To verify whether CPU Guard has been enabled, run the following command:
lsattr -El sys0 | grep cpuguard
If enabled, the output will be similar to the following:
cpuguard enable CPU Guard True
If the output shows CPU Guard as disabled, enter the following command to enable it:
chdev -l sys0 -a cpuguard='enable'
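The check and the fix above can be combined in a short script. The following is a minimal sketch, assuming the lsattr output has the attribute name in the first column and its value in the second, and assuming a disabled setting is reported as `disable`. The helper names `cpuguard_state` and `cpuguard_fix_command` are hypothetical and are not AIX commands.

```shell
#!/bin/sh
# Sketch: interpret `lsattr -El sys0` output for the cpuguard attribute.
# cpuguard_state and cpuguard_fix_command are hypothetical helper names.

# Reads lsattr-style lines on stdin (attribute, value, description, settable)
# and prints the value column for cpuguard, or "unknown" if it is absent.
cpuguard_state() {
    awk '$1 == "cpuguard" { print $2; found = 1 }
         END { if (!found) print "unknown" }'
}

# Prints the enabling command from the text when the state is "disable"
# (assumed spelling of the disabled value); prints nothing otherwise.
cpuguard_fix_command() {
    case "$1" in
        disable) echo "chdev -l sys0 -a cpuguard='enable'" ;;
    esac
}
```

On an AIX system, `lsattr -El sys0 | cpuguard_state` would print the current setting, and passing that value to `cpuguard_fix_command` shows the command to run if CPU Guard is disabled. The sketch only prints the command so that the change remains an explicit administrator action.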
Cache or cache-line deallocation is aimed at performing dynamic reconfiguration to bypass
potentially failing components. This capability is provided for both L2 and L3 caches. Dynamic
run-time deconfiguration is provided if a threshold of L1 or L2 recovered errors is exceeded.
In the case of an L3 cache run-time array single-bit solid error, the spare chip resources are
used to perform a line delete on the failing line.
PCI hot-plug slot fault tracking helps prevent slot errors from causing a system machine
check interrupt and subsequent reboot. This provides superior fault isolation, and the error
affects only the single adapter. Run-time errors on the PCI bus caused by failing adapters will
result in recovery action. If this is unsuccessful, the PCI device will be gracefully shut down.
Chapter 3. Capacity on Demand, RAS, and manageability