Fault Masking; Resource Deallocation - IBM p5 550 Technical Overview And Introduction


3.2.6 Fault masking

If corrections and retries succeed and do not exceed threshold limits, the system remains operational with full resources, and no client or IBM customer engineer intervention is required. This technique applies to the following faults:

CEC bus retry and recovery

PCI-X bus recovery

ECC Chipkill soft error

3.2.7 Resource deallocation

If recoverable errors exceed threshold limits, resources can be deallocated with the system remaining operational, allowing deferred maintenance at a convenient time.
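The threshold rule that separates fault masking from resource deallocation can be pictured with a minimal sketch. The function name `fault_action` and the numeric values are hypothetical, chosen only for illustration; the actual thresholds are set by the platform and are not stated in this document.

```shell
#!/bin/sh
# Illustrative only: fault_action is a hypothetical helper, and the
# threshold values used below are assumptions, not documented limits.
fault_action() {
    error_count=$1   # recoverable errors recorded so far
    threshold=$2     # assumed threshold limit
    if [ "$error_count" -le "$threshold" ]; then
        # Within limits: the fault is masked, full resources remain.
        echo "mask"
    else
        # Limit exceeded: the resource becomes a deallocation candidate.
        echo "deallocate"
    fi
}

fault_action 3 10
fault_action 11 10
```

With the assumed threshold of 10, the first call reports a masked fault and the second reports a deallocation candidate.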

Dynamic or persistent deallocation

Dynamic deallocation of potentially failing components is non-disruptive, allowing the system to continue to run. Persistent deallocation occurs when a failed component is detected; the component is then deactivated at a subsequent reboot.

Dynamic deallocation functions include:

Processor

L3 cache line delete

Partial L2 cache deallocation

PCI-X bus and slots

For dynamic processor deallocation, the service processor performs a predictive failure analysis based on any recoverable processor errors that have been recorded. If these transient errors exceed a defined threshold, the event is logged and the processor is deallocated from the system while the operating system continues to run. This feature (named CPU Guard) enables maintenance to be deferred until a suitable time. Processor deallocation can only occur if there are sufficient functional processors (at least two).

To verify whether CPU Guard has been enabled, run the following command:

lsattr -El sys0 | grep cpuguard

If enabled, the output will be similar to the following:

cpuguard enable CPU Guard True

If the output shows CPU Guard as disabled, enter the following command to enable it:

chdev -l sys0 -a cpuguard='enable'

Cache or cache-line deallocation is aimed at performing dynamic reconfiguration to bypass potentially failing components. This capability is provided for both L2 and L3 caches. Dynamic run-time deconfiguration is provided if a threshold of L1 or L2 recovered errors is exceeded.

In the case of an L3 cache run-time array single-bit solid error, the spare chip resources are used to perform a line delete on the failing line.

PCI hot-plug slot fault tracking helps prevent slot errors from causing a system machine check interrupt and subsequent reboot. This provides superior fault isolation, and the error affects only the single adapter. Run-time errors on the PCI bus caused by failing adapters will result in recovery action. If this is unsuccessful, the PCI device will be gracefully shut down.
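The CPU Guard check described earlier in this section can be wrapped in a small script. The following is a sketch only: `cpuguard_state` is a hypothetical helper, not an AIX command, and it assumes the output has the column layout shown above (attribute, value, description, user-settable flag) and that a disabled setting is reported as `disable`.

```shell
#!/bin/sh
# Sketch: report the CPU Guard setting from `lsattr -El sys0` output.
# cpuguard_state is a hypothetical helper, not part of AIX.
cpuguard_state() {
    # $1: lsattr-style text; the second field of the cpuguard line is the value.
    printf '%s\n' "$1" | awk '
        $1 == "cpuguard" { print $2; found = 1 }
        END { if (!found) print "unknown" }'
}

# Query the live system only where lsattr exists (that is, on AIX).
if command -v lsattr >/dev/null 2>&1; then
    state=$(cpuguard_state "$(lsattr -El sys0)")
    echo "cpuguard: $state"
    # Assumption: a disabled setting appears as the value "disable".
    if [ "$state" = "disable" ]; then
        echo "To enable it, run: chdev -l sys0 -a cpuguard='enable'"
    fi
fi
```

The script only prints the `chdev` command rather than running it, so that changing a system attribute remains an explicit administrator action.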

Chapter 3. Capacity on Demand, RAS, and manageability
