Reporting Problems - IBM Power 570 Technical Overview And Introduction

Draft Document for Review September 2, 2008 5:05 pm

4.3.3 Reporting problems

In the unlikely event that a system hardware or environmentally induced failure is diagnosed, POWER6 processor-based systems report the error through a number of mechanisms. This ensures that appropriate entities are aware that the system may be operating in an error state. However, a crucial piece of a solid reporting strategy is ensuring that a single error communicated through multiple error paths is correctly aggregated, so that later notifications are not accidentally duplicated.

Error logging and analysis

Once the root cause of an error has been identified by a fault isolation component, an error log entry is created with some basic data such as:

An error code uniquely describing the error event

The location of the failing component

The part number of the component to be replaced, including pertinent data like engineering and manufacturing levels

Return codes

Resource identifiers

First Failure Data Capture data

Data containing information on the effect that the repair will have on the system is also included. Error log routines in the operating system can then use this information to decide whether to call home to contact service and support, send a notification message, or continue without an alert.
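As an illustration only, the error log entry and the three possible outcomes described above can be sketched as a small data structure and a decision function. All field names, the `severity` field, and the policy in `choose_action` are hypothetical; the text does not specify how the operating system's error log routines actually make this decision.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    CALL_HOME = "call home"
    NOTIFY = "send notification"
    CONTINUE = "continue without alert"


@dataclass
class ErrorLogEntry:
    error_code: str                 # uniquely describes the error event
    location: str                   # location of the failing component
    part_number: str                # part to be replaced
    engineering_level: str = ""
    manufacturing_level: str = ""
    return_codes: list = field(default_factory=list)
    resource_ids: list = field(default_factory=list)
    ffdc: bytes = b""               # First Failure Data Capture data
    repair_effect: str = ""         # effect the repair will have on the system
    severity: str = "info"          # hypothetical field, not named in the text


def choose_action(entry: ErrorLogEntry) -> Action:
    # Hypothetical policy: the real decision logic is not described here.
    if entry.severity == "serviceable":
        return Action.CALL_HOME
    if entry.severity == "warning":
        return Action.NOTIFY
    return Action.CONTINUE
```

In this sketch, an entry created with `severity="serviceable"` maps to `Action.CALL_HOME`, while an entry left at the default severity maps to `Action.CONTINUE`.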

Remote support

The Resource Monitoring and Control (RMC) application is delivered as part of the base operating system, including the operating system running on the Hardware Management Console. RMC provides a secure transport mechanism across the LAN interface between the operating system and the Hardware Management Console and is used by the operating system diagnostic application for transmitting error information. It performs a number of other functions as well, but these are not used for the service infrastructure.

Manage serviceable events

A critical requirement in a logically partitioned environment is to ensure that errors are not lost before being reported for service, and that each error is reported only once, regardless of how many logical partitions experience the potential effect of the error. The Manage Serviceable Events task on the Hardware Management Console (HMC) is responsible for aggregating duplicate error reports and ensures that all errors are recorded for review and management.
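The aggregation requirement can be sketched as follows. This is a minimal model, not the HMC's implementation: the class names are hypothetical, and the assumption that duplicates are recognized by error code plus location is made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceableEvent:
    error_code: str
    location: str
    reporters: list = field(default_factory=list)  # who reported this event


class EventAggregator:
    """Toy aggregator: one serviceable event per (error code, location)."""

    def __init__(self):
        self._events = {}

    def report(self, error_code, location, reporter):
        """Record a report; return True only for the first occurrence."""
        key = (error_code, location)  # assumed duplicate-detection key
        is_new = key not in self._events
        if is_new:
            self._events[key] = ServiceableEvent(error_code, location)
        # Duplicates are kept as history rather than discarded.
        self._events[key].reporters.append(reporter)
        return is_new

    def events(self):
        return list(self._events.values())
```

If two partitions report the same error, the first call to `report` returns `True` and the second returns `False`, leaving a single event whose `reporters` list names both partitions.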

When a locally or globally reported service request is made to the operating system, the operating system diagnostic subsystem uses the Resource Monitoring and Control (RMC) subsystem to relay error information to the Hardware Management Console. For global events (platform unrecoverable errors, for example), the Service Processor also forwards error notification of these events to the Hardware Management Console, providing a redundant error-reporting path in case of errors in the RMC network.
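The value of the redundant path can be seen in a toy model of which routes carry an event to the Hardware Management Console. The function name and its boolean parameters are hypothetical, and the model deliberately ignores everything except the two paths named in the text.

```python
def delivery_paths(is_global: bool, rmc_network_up: bool) -> set:
    """Return the paths by which an event reaches the HMC in this toy model."""
    paths = set()
    if rmc_network_up:
        # The operating system diagnostic subsystem relays the event over RMC.
        paths.add("RMC")
    if is_global:
        # The Service Processor forwards global events independently of RMC.
        paths.add("service processor")
    return paths
```

Under these assumptions, a global event still has one delivery path when the RMC network is down, and two when it is up, which is why the receiving side must aggregate duplicates.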

The first occurrence of each failure type will be recorded in the Manage Serviceable Events task on the Hardware Management Console. This task will then filter and maintain a history of duplicate reports from other logical partitions or the Service Processor. It then looks across all active service event requests, analyzes the failure to ascertain the root cause and, if enabled,

Chapter 4. Continuous availability and manageability
