First Failure Data Capture; Environmental Monitoring Functions; Error Handling And Reporting - IBM 9123710 - eServer OpenPower 710 Introduction Manual

3.1.6 First Failure Data Capture

Diagnosing problems in a computer is a critical requirement for autonomic computing. The
first step to producing a computer that truly has the ability to self-heal is to create a highly
accurate way to identify and isolate hardware errors. IBM has implemented a server design
that builds in hardware error-check stations that capture and help to identify error conditions
within the server. Each of these checkers is viewed as a diagnostic probe into the server, and,
when coupled with extensive diagnostic firmware routines, allows quick and accurate
assessment of hardware error conditions at run-time.
First Failure Data Capture (FFDC) check stations are carefully positioned within the server
logic and data paths to help ensure that potential errors can be quickly identified and
accurately tracked to an individual field replaceable unit (FRU).
These checkers are collected in a series of Fault Isolation Registers, where they can be
accessed by the service processor.
All communication between the service processor and monitored components is
accomplished out of band. That is, operation of the error-detection mechanism is
transparent to an operating system. This entire structure is below the architecture
and is not seen, nor accessed, by system-level activities.
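
As a rough illustration of how a set of checkers collected in a register can be traced back to a FRU, the following Python sketch decodes a register snapshot into the checkers that fired. The bit positions, checker descriptions, and FRU labels are hypothetical and chosen only for the example; the actual Fault Isolation Register layout is not described here and is handled by the service processor firmware.

```python
from typing import NamedTuple


class Checker(NamedTuple):
    bit: int           # bit position in the register (hypothetical layout)
    description: str   # what the check station detects (hypothetical)
    fru: str           # FRU the checker points to (hypothetical label)


# Hypothetical checker table, for illustration only.
CHECKERS = [
    Checker(0, "L2 cache parity", "processor card"),
    Checker(1, "memory ECC uncorrectable", "DIMM"),
    Checker(2, "I/O bus timeout", "system backplane"),
]


def decode_fir(value: int) -> list[Checker]:
    """Return the checkers whose bits are set in one register snapshot."""
    return [c for c in CHECKERS if value & (1 << c.bit)]


def first_failure(snapshots: list[int]) -> list[Checker]:
    """Return the checkers from the earliest non-zero snapshot.

    Later snapshots may contain secondary errors caused by the first
    one, so only the first non-zero capture is decoded.
    """
    for value in snapshots:
        if value:
            return decode_fir(value)
    return []
```

For example, `first_failure([0, 0b010, 0b111])` decodes only the second snapshot, which in this made-up table points to the DIMM rather than to the errors that followed it.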

3.1.7 Environmental monitoring functions

The following are some of the environmental monitoring functions available for an
OpenPower 710 server.
Temperature monitoring increases the fan rotation speed when the ambient temperature
is above the normal operating range.
Temperature monitoring warns the system administrator of potential
environmental-related problems (for example, air conditioning and air circulation around
the system) so that appropriate corrective actions can be taken before a critical failure
threshold is reached. It also performs an orderly system shutdown when the operating
temperature exceeds the critical level.
Fan speed monitoring provides a warning and an orderly system shutdown when the
speed is out of the operational specification.
Voltage monitoring provides a warning and an orderly system shutdown when the voltages
are out of the operational specification.
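
The monitoring behavior listed above can be sketched as a small policy function. This is a minimal Python sketch, not the server's firmware logic: the threshold values, the `Action` names, and the function names are assumptions made for the example, since the text gives no numeric limits.

```python
from enum import Enum


class Action(Enum):
    RAISE_FAN_SPEED = "raise fan speed"
    WARN = "warn administrator"
    SHUTDOWN = "orderly shutdown"


# Assumed thresholds in degrees Celsius; the real limits are not given here.
AMBIENT_NORMAL_MAX_C = 35.0
TEMP_WARN_C = 40.0
TEMP_CRITICAL_C = 45.0


def temperature_actions(temp_c: float) -> list[Action]:
    """Map a temperature reading to the actions described in the text."""
    if temp_c > TEMP_CRITICAL_C:
        # Above the critical level: shut down in an orderly way.
        return [Action.SHUTDOWN]
    actions = []
    if temp_c > AMBIENT_NORMAL_MAX_C:
        # Above the normal operating range: spin the fans faster.
        actions.append(Action.RAISE_FAN_SPEED)
    if temp_c >= TEMP_WARN_C:
        # Approaching the critical threshold: warn before it is reached.
        actions.append(Action.WARN)
    return actions


def range_actions(value: float, low: float, high: float) -> list[Action]:
    """Fan speed and voltage: warn and shut down when out of specification."""
    if low <= value <= high:
        return []
    return [Action.WARN, Action.SHUTDOWN]
```

With these assumed limits, a reading of 42.0 yields both a fan speed increase and a warning, while a 12 V rail measured at 10.0 against a hypothetical 11.4 to 12.6 window yields a warning followed by an orderly shutdown.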

3.1.8 Error handling and reporting

In the unlikely event of system hardware or environmentally induced failure, the system
run-time error capture capability systematically analyzes the hardware error signature to
determine the cause of failure.
The result of the analysis is stored in the system NVRAM. When the system is
successfully rebooted, either manually or automatically, the error is reported to the
Linux operating system.
Error Log Analysis can be used to display the failure cause and the physical location of
failing hardware.
With the integrated service processor, the system has the ability to automatically send out
an alert via phone line to a pager or call for service in the event of critical system failure. A
hardware fault will also turn on the two Attention Indicators (one located on the front of the
system unit and the other on the rear of the system) to alert the user of an internal
hardware problem. The indicator may also be turned on by the operator as a tool to allow
IBM eServer OpenPower 710 Technical Overview and Introduction