IBM Power 595 Technical Overview And Introduction page 168


Error logging and analysis

After the root cause of an error has been identified by a fault isolation component, an error log

entry is created that includes basic data such as:

An error code uniquely describing the error event

The location of the failing component

The part number of the component to be replaced, including pertinent data such as engineering and manufacturing levels

Return codes

Resource identifiers

FFDC data

Data containing information about the effect the repair can have on the system is also included.

Error log routines in the operating system can use this information and decide to call home to

contact service and support, send a notification message, or continue without an alert.
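The error log entry and the three possible outcomes described above can be sketched in Python. This is a minimal illustration, not the operating system's actual routine: the field names, the `needs_service` and `repair_disruptive` flags, the decision policy, and the placeholder values are all assumptions made for the example, because the text does not describe how the decision is made.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    CALL_HOME = "call home"
    NOTIFY = "send notification"
    CONTINUE = "continue without alert"


@dataclass
class ErrorLogEntry:
    error_code: str      # uniquely describes the error event
    location: str        # location of the failing component
    part_number: str     # part to be replaced
    return_codes: list = field(default_factory=list)
    resource_ids: list = field(default_factory=list)
    ffdc: bytes = b""    # first-failure data capture payload
    # The two flags below are hypothetical; they stand in for the
    # "effect of the repair" data and the service decision inputs.
    repair_disruptive: bool = False
    needs_service: bool = True


def decide_action(entry: ErrorLogEntry) -> Action:
    """Hypothetical policy: pick one of the three outcomes named in the text."""
    if entry.needs_service:
        return Action.CALL_HOME
    if entry.repair_disruptive:
        return Action.NOTIFY
    return Action.CONTINUE


# Placeholder values, not real error codes or location codes.
entry = ErrorLogEntry("EXAMPLE-0001", "LOC-EXAMPLE", "PN-EXAMPLE")
action = decide_action(entry)
```

With the default flags, `decide_action` returns `Action.CALL_HOME`; clearing `needs_service` selects one of the other two outcomes.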

Remote support

The Resource Monitoring and Control (RMC) application is delivered as part of the base

operating system, including the operating system running on the HMC. The RMC provides a

secure transport mechanism across the LAN interface between the operating system and the

HMC and is used by the operating system diagnostic application for transmitting error

information. It performs a number of other functions as well, but these are not used for the

service infrastructure.

Manage serviceable events

A critical requirement in a logically partitioned environment is to ensure that errors are not lost

before being reported for service, and that each error is reported only once, regardless

of how many logical partitions experience the potential effect of the error. The Manage

Serviceable Events task on the HMC is responsible for aggregating duplicate error reports,

and ensuring that all errors are recorded for review and management.

When a locally or globally reported service request is made to the operating system, the

operating system diagnostic subsystem uses the RMC subsystem to relay error information to

the HMC. For global events (platform unrecoverable errors, for example), the service

processor also forwards error notification of these events to the HMC, providing a

redundant error-reporting path in case of errors in the RMC network.

The first occurrence of each failure type will be recorded in the Manage Serviceable Events

task on the HMC. This task then filters and maintains a history of duplicate reports from other

logical partitions or the service processor. It then looks across all active service event

requests, analyzes the failure to ascertain the root cause, and, if enabled, initiates a call home

for service. This method ensures that all platform errors will be reported through at least one

functional path, ultimately resulting in a single notification for a single problem.
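The filtering behavior described above (record the first occurrence, keep later reports as history, and call home once) can be sketched as follows. This is a simplified illustration under stated assumptions, not the HMC implementation: the class names, the use of an (error code, location) pair as the duplicate key, and the reporter and placeholder values are all hypothetical, and the root-cause analysis step is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceableEvent:
    error_code: str
    location: str
    first_reporter: str
    duplicates: list = field(default_factory=list)  # later reporters of the same failure
    called_home: bool = False


class ServiceableEventLog:
    def __init__(self, call_home_enabled=True):
        self.call_home_enabled = call_home_enabled
        self.events = {}      # (error_code, location) -> ServiceableEvent
        self.calls_home = []  # keys for which a call home was initiated

    def report(self, reporter, error_code, location):
        """Record a report from a partition or the service processor."""
        key = (error_code, location)  # assumed duplicate key
        event = self.events.get(key)
        if event is not None:
            # Duplicate: keep it in the history, but do not notify again.
            event.duplicates.append(reporter)
            return event
        # First occurrence: open a new serviceable event.
        event = ServiceableEvent(error_code, location, reporter)
        self.events[key] = event
        if self.call_home_enabled:
            event.called_home = True
            self.calls_home.append(key)
        return event


log = ServiceableEventLog()
# The same failure arrives over both paths: RMC from two partitions,
# and directly from the service processor.
for reporter in ("lpar1", "lpar2", "service_processor"):
    log.report(reporter, "EXAMPLE-0001", "LOC-EXAMPLE")
```

After the three reports, `log.calls_home` holds a single entry and the event's `duplicates` list holds the two later reporters, which mirrors the goal of a single notification for a single problem.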

Extended error data (EED)

Extended error data (EED) is additional data collected either automatically at the time of a

failure or manually at a later time. The data collected depends on the invocation method but

includes information like firmware levels, operating system levels, and additional data pertinent to fault isolation.
