IBM Power 750 Express Hardware Announcement page 16

• Disk drive fault tracking is designed to alert the system administrator to an
impending disk drive failure before it affects customer operations.
Mutual surveillance
The Service Processor monitors the operation of the firmware during the boot
process, and also monitors the Hypervisor for termination. The Hypervisor
monitors the Service Processor and will perform a reset/reload if it detects the loss
of the Service Processor. If the reset/reload does not correct the problem with the
Service Processor, the Hypervisor will notify the operating system and the operating
system can take appropriate action, including calling for service.
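
The Hypervisor's side of this surveillance can be sketched as a small escalation routine. This is only an illustration of the sequence described above, not IBM firmware: the function name and the three callables (`sp_responding`, `reset_reload`, `notify_os`) are hypothetical stand-ins.

```python
def check_service_processor(sp_responding, reset_reload, notify_os):
    """Escalation sketch: detect loss, try reset/reload, then tell the OS.

    All three arguments are hypothetical callables supplied by the caller:
    sp_responding() -> bool, reset_reload() -> None, notify_os(msg) -> None.
    """
    if sp_responding():
        return "healthy"

    # Loss of the Service Processor detected: attempt a reset/reload first.
    reset_reload()
    if sp_responding():
        return "recovered"

    # Reset/reload did not help: hand the problem to the operating system,
    # which can take further action such as calling for service.
    notify_os("Service Processor lost; reset/reload did not recover it")
    return "escalated"
```

The point of the ordering is that the operating system is only involved once the cheaper local recovery step has failed.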
Environmental monitoring functions
POWER7-based servers include a range of environmental monitoring functions:
• Temperature monitoring warns the system administrator of potential
environment-related problems by monitoring the air inlet temperature. When
the inlet temperature rises above a warning threshold, the system initiates an
orderly shutdown. When the temperature exceeds the critical level or if the
temperature remains above the warning level for too long, the system will shut
down immediately.
• Fan speed is controlled by monitoring actual temperatures on critical components
and adjusting accordingly. If internal component temperatures reach critical
levels, the system will shut down immediately, regardless of fan speed. When a
redundant fan fails, the system calls out the failing fan and continues running.
When a nonredundant fan fails, the system shuts down immediately.
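
The shutdown rules in the two bullets above amount to a small decision table, sketched below. The threshold values and the time limit are hypothetical placeholders (the announcement does not state them), and the function names are illustrative only.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue running"
    ORDERLY_SHUTDOWN = "orderly shutdown"
    IMMEDIATE_SHUTDOWN = "immediate shutdown"


# Hypothetical values for illustration; the source gives no numbers.
WARNING_C = 35.0
CRITICAL_C = 45.0
MAX_SECONDS_ABOVE_WARNING = 600


def inlet_action(temp_c, seconds_above_warning):
    """Map an air inlet reading to the action described in the text."""
    if temp_c > CRITICAL_C:
        return Action.IMMEDIATE_SHUTDOWN
    if temp_c > WARNING_C:
        # Staying above the warning level too long escalates the response.
        if seconds_above_warning > MAX_SECONDS_ABOVE_WARNING:
            return Action.IMMEDIATE_SHUTDOWN
        return Action.ORDERLY_SHUTDOWN
    return Action.CONTINUE


def fan_failure_action(fan_is_redundant):
    """A redundant fan is called out and the system keeps running;
    a nonredundant fan failure forces an immediate shutdown."""
    if fan_is_redundant:
        return Action.CONTINUE
    return Action.IMMEDIATE_SHUTDOWN
```

For example, `inlet_action(38.0, 30)` yields an orderly shutdown under these assumed thresholds, while the same reading held for longer than the assumed limit yields an immediate one.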
Availability enhancement functions
The POWER7 family of systems continues to offer and introduce significant
enhancements designed to increase system availability.
POWER7 processor functions
As in POWER6, the POWER7 processor has the ability to do processor instruction
retry and alternate processor recovery for a number of core-related faults. This
significantly reduces exposure to both hard (logic) and soft (transient) errors in
the processor core. Soft failures in the processor core are transient (intermittent)
errors, often due to cosmic rays or other sources of radiation, and generally are not
repeatable. When an error is encountered in the core, the POWER7 processor will
first automatically retry the instruction. If the source of the error was truly transient,
the instruction will succeed and the system will continue as before. On IBM systems
prior to POWER6, this error would have caused a checkstop.
Hard failures are more difficult to handle because they are true logic errors that
recur each time the instruction is repeated. Retrying the instruction will not help in this
situation because the instruction will continue to fail. As in POWER6, POWER7
processors have the ability to extract the failing instruction from the faulty core
and retry it elsewhere in the system for a number of faults, after which the failing
core is dynamically deconfigured and called out for replacement. The entire process
is transparent to the partition owning the failing instruction. These systems are
designed to avoid a full system outage.
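
The two-stage recovery described above (retry on the same core, then move the instruction to another core and deconfigure the faulty one) can be modeled in software as follows. This is a conceptual sketch only: the real mechanism is implemented in hardware and firmware, and `CoreFault`, `run_with_recovery`, and the idea of trying several spare cores in turn are assumptions made for illustration.

```python
class CoreFault(Exception):
    """Raised by a simulated core when an instruction fails."""


def run_with_recovery(instruction, primary, spares, deconfigured):
    """Run instruction(core), modeling instruction retry followed by
    alternate processor recovery.

    instruction  -- hypothetical callable taking a core identifier
    primary      -- core on which the instruction was originally issued
    spares       -- other cores that may take over the instruction
    deconfigured -- list that collects cores removed from service
    """
    # Stage 1: the original attempt plus one automatic retry on the same
    # core. A soft (transient) error is expected to clear on the retry.
    for _ in range(2):
        try:
            return instruction(primary)
        except CoreFault:
            pass

    # Stage 2: the error repeated, so treat it as a hard failure and
    # retry the instruction on another core.
    for spare in spares:
        try:
            result = instruction(spare)
        except CoreFault:
            continue
        # The failing core is deconfigured and called out for replacement.
        deconfigured.append(primary)
        return result

    raise CoreFault("no core could complete the instruction")
```

From the caller's point of view both stages look the same: a result comes back, which mirrors the statement that the process is transparent to the partition owning the instruction.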
POWER7 single processor checkstopping
As in POWER6, POWER7 provides single processor checkstopping. This significantly
reduces the probability of any one processor affecting total system availability.
Partition availability priority
Also available is the ability to assign availability priorities to partitions. If an
alternate processor recovery event requires spare processor resources in order
to protect a workload, when no other means of obtaining the spare resources is
available, the system will determine which partition has the lowest priority and
attempt to claim the needed resource. On a properly configured POWER7 processor-
IBM United States Hardware Announcement
110-009
IBM is a registered trademark of International Business Machines Corporation