IBM BladeCenter PS703 Technical Overview And Introduction page 146

– The server input voltages are out of operational specification
The service processor can immediately shut down a system in the following
circumstances:
– The temperature exceeds the critical level or remains beyond the warning level
for too long
– Internal component temperatures reach critical levels
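The temperature rule above can be sketched in a few lines of Python. This is an illustrative model only, not service processor firmware: the thresholds, the sample-count limit that stands in for "too long", and the names ThermalPolicy and should_shut_down are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThermalPolicy:
    # All three values are hypothetical; the source gives no numbers.
    warning_c: float = 70.0
    critical_c: float = 85.0
    max_warning_samples: int = 3  # stand-in for "too long"


def should_shut_down(readings, policy=ThermalPolicy()):
    """Return True if the readings call for an immediate shutdown."""
    over_warning = 0
    for temp in readings:
        if temp > policy.critical_c:
            return True  # critical level exceeded
        if temp > policy.warning_c:
            over_warning += 1
            if over_warning > policy.max_warning_samples:
                return True  # beyond the warning level for too long
        else:
            over_warning = 0  # back in the normal range, reset the count
    return False
```

With these assumed defaults, a single reading above 85 triggers a shutdown, and so does a run of four consecutive readings above 70.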
Mutual surveillance
The service processor monitors the operation of the POWER Hypervisor firmware during
the boot process and watches for loss of control during system operation. It also allows
the POWER Hypervisor to monitor service processor activity. The service processor can
take appropriate action, including calling for service, when it detects the POWER
Hypervisor firmware has lost control. Likewise, the POWER Hypervisor can request a
service processor repair action if necessary.
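One common way to implement this kind of two-way monitoring is a heartbeat with a timeout. The source does not say which mechanism the service processor and the POWER Hypervisor actually use, so the sketch below is an assumption; the Heartbeat class, the surveil function, and the timeout values are hypothetical.

```python
class Heartbeat:
    """Tracks the last time a party reported that it is still running."""

    def __init__(self, name, timeout):
        self.name = name
        self.timeout = timeout  # hypothetical timeout, in seconds
        self.last_beat = 0.0

    def beat(self, now):
        self.last_beat = now

    def is_alive(self, now):
        return (now - self.last_beat) <= self.timeout


def surveil(service_processor, hypervisor, now):
    """Each side checks the other; return the actions that result."""
    actions = []
    if not hypervisor.is_alive(now):
        actions.append("service processor: hypervisor lost control, call for service")
    if not service_processor.is_alive(now):
        actions.append("hypervisor: request service processor repair action")
    return actions
```

Timestamps are passed in explicitly rather than read from a clock, which keeps the sketch deterministic and easy to exercise.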
Availability
The auto-restart (reboot) option, when enabled by the BladeCenter AMM, can reboot the
system automatically following an AC power failure.
Fault monitoring
The built-in self-test (BIST) checks processor, cache, memory, and associated hardware
required for proper booting of the operating system when the system is powered on at the
initial install or after a hardware configuration change (for example, an upgrade). If a
non-critical error is detected or if the error occurs in a resource that can be removed from
the system configuration, the booting process is designed to proceed to completion. The
errors are logged in the system nonvolatile random access memory (NVRAM). When the
operating system completes booting, the information is passed from the NVRAM into the
system error log, where it is analyzed by error log analysis (ELA) routines. Appropriate
actions are taken to report the boot time error for subsequent service if required.
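The boot-time flow described above can be modeled roughly as follows. This is a sketch under stated assumptions, not IBM's implementation: the record layout, the function names run_boot and transfer_to_error_log, and the rule that a critical error in a non-removable resource stops the boot are all inferred for illustration. Error log analysis (ELA) itself is not modeled.

```python
def run_boot(bist_results):
    """Simulate BIST handling during boot.

    bist_results is a list of (resource, ok, critical, removable) tuples.
    Returns (booted, nvram), where nvram holds the logged errors.
    """
    nvram = []
    for resource, ok, critical, removable in bist_results:
        if ok:
            continue
        # Every detected error is logged in NVRAM.
        nvram.append({"resource": resource, "critical": critical})
        # Boot proceeds for non-critical errors and for errors in resources
        # that can be removed from the configuration (assumed: stop otherwise).
        if critical and not removable:
            return False, nvram
    return True, nvram


def transfer_to_error_log(nvram, error_log):
    """After the operating system boots, move NVRAM entries to the error log."""
    error_log.extend(nvram)
    nvram.clear()
```

For example, a failed DIMM that can be deconfigured is logged and the boot still completes; the entry then moves to the system error log once the operating system is up.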
Error checkers
IBM POWER processor-based systems contain specialized hardware detection circuitry that
is used to detect erroneous hardware operations. Error checking hardware ranges from parity
error detection coupled with processor instruction retry and bus retry, to ECC correction on
caches and system buses. All IBM hardware error checkers have distinct attributes:
– Continual monitoring of system operations to detect potential calculation errors.
– Attempt to isolate physical faults based on runtime detection of each unique failure.
– Ability to initiate a wide variety of recovery mechanisms designed to correct the problem.
The POWER processor-based systems include extensive hardware and firmware
recovery logic.
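As a simplified illustration of parity error detection coupled with retry, the sketch below checks a single even-parity bit on each transfer and retries on a mismatch. It is a software model of the idea only; the real checkers are hardware circuits, and the names parity_bit, transfer_with_retry, and FlakyBus, as well as the retry limit, are hypothetical.

```python
def parity_bit(word):
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1


def transfer_with_retry(word, bus, max_attempts=3):
    """Send a word over the bus, retrying when the parity check fails.

    Returns (received_word, attempts_used).
    """
    sent_parity = parity_bit(word)
    for attempt in range(1, max_attempts + 1):
        received = bus(word)
        if parity_bit(received) == sent_parity:
            return received, attempt
    raise RuntimeError("uncorrectable bus error after retries")


class FlakyBus:
    """Test double that flips the low bit for the first few transfers."""

    def __init__(self, failures):
        self.failures = failures

    def __call__(self, word):
        if self.failures > 0:
            self.failures -= 1
            return word ^ 0b1  # single-bit corruption
        return word
```

A single parity bit detects only an odd number of flipped bits and cannot correct them, which is why it is paired with retry here; ECC, by contrast, can correct errors in place.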
Fault isolation registers
Error checker signals are captured and stored in hardware fault isolation registers (FIRs). The
associated logic circuitry is used to limit the domain of an error to the first checker that
encounters the error. In this way, runtime error diagnostics can be deterministic so that for
every check station, the unique error domain for that checker is defined and documented.
Ultimately, the error domain becomes the field-replaceable unit (FRU) call, and manual
interpretation of the data is not normally required.
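The idea of limiting an error to the first checker that sees it, and mapping that checker to a FRU, can be sketched as follows. The class name, checker names, and FRU labels are hypothetical, and real FIRs are hardware registers rather than Python objects.

```python
class FaultIsolationRegister:
    """Latches the first checker that reports an error."""

    def __init__(self, fru_by_checker):
        # Each checker has a defined error domain, expressed here as a FRU.
        self.fru_by_checker = dict(fru_by_checker)
        self.first_checker = None

    def report(self, checker):
        if checker not in self.fru_by_checker:
            raise KeyError(f"unknown checker: {checker}")
        # Only the first report is kept; later reports are treated as
        # downstream effects of the same fault.
        if self.first_checker is None:
            self.first_checker = checker

    def fru_call(self):
        """Return the FRU to replace, or None if no error was captured."""
        if self.first_checker is None:
            return None
        return self.fru_by_checker[self.first_checker]
```

Because the checker-to-FRU mapping is fixed in advance, the FRU call follows directly from the latched checker without manual interpretation.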
First-failure data capture (FFDC)
First-failure data capture (FFDC) is an error isolation technique which ensures that when a
fault is detected in a system through error checkers or other types of detection methods, the
IBM BladeCenter PS703 and PS704 Technical Overview and Introduction
