CPU and Memory Testing and Error Log Analysis - IBM R30 Operator's Manual

When the System Exerciser is running, most built-in error recovery procedures are turned
off. This can cause occasional errors to be reported that normally have no effect on system
operation. Parts should only be replaced when the following occur:
- A high number of errors is reported in relation to the number of times the device was
  tested.
- Errors reported by the System Exerciser are in the same area as that reported by the
  customer.

CPU and Memory Testing and Error Log Analysis

Except for the floating-point tests, all CPU and memory testing on the system units is done
by POST and BIST. Memory is tested entirely by the POST. On systems with RS.9, RS1,
and RS2 processors, bit steering is used to map out defective bits. The POST provides an
error-free memory map. If POST cannot find at least 2MB of good memory, it halts and
displays an SRN in the LEDs identifying the problem. If POST finds at least 2MB of good
memory, the memory problems are recorded in the IPL Control Block and the system
continues to boot.
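
The POST decision described above can be sketched as follows. This is only an illustration of the 2MB rule, not the actual POST firmware; the function name, the return values, and the way defects are represented are all assumptions made for this example.

```python
# Illustrative sketch only: names and data shapes are hypothetical,
# not taken from the POST firmware.
MIN_GOOD_MEMORY_BYTES = 2 * 1024 * 1024  # the 2MB threshold stated in the text


def post_memory_outcome(good_bytes, defects):
    """Decide what POST does after the memory test.

    good_bytes: amount of error-free memory found, in bytes.
    defects:    descriptions of the memory problems that were detected.

    Returns ("halt_with_srn", []) when less than 2MB of good memory was
    found, otherwise ("continue_boot", defects_recorded), where
    defects_recorded stands in for what goes into the IPL Control Block.
    """
    if good_bytes < MIN_GOOD_MEMORY_BYTES:
        # Not enough usable memory: halt and show an SRN in the LEDs.
        return ("halt_with_srn", [])
    # Enough usable memory: record the problems and keep booting.
    return ("continue_boot", list(defects))
```

For example, `post_memory_outcome(1024 * 1024, ["module 3"])` takes the halt path, while the same call with `8 * 1024 * 1024` continues the boot and carries the defect forward for the Base System Diagnostics to report.
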
If any memory errors were recorded in the IPL Control Block, they are reported by the Base
System Diagnostics, which must be run to analyze the IPL Control Block. Normally, most
memory problems that are detected by the POST are isolated to a single FRU.
The CPU and memory cannot be tested after the AIX-based diagnostics are loaded;
however, they are monitored for correct operation by various checkers such as Checkstop,
Machine Check, Data Storage Interrupt, etc. The checkers may vary by processor type. If
one of these checks intermittently occurs, it is logged in the error log. To analyze these
errors, the Base System Diagnostics must be run in the Problem Determination Mode.
Single-bit memory errors are corrected by ECC (Error Checking and Correction).
Machine Checks occur when there is a double-bit error. Except for 7011 system units,
Machine Check problems are isolated to memory cards and memory modules that were
addressed when the error occurred. Depending on the system type and model, this may be
a single memory card and two memory modules, two memory cards and four memory
modules, or four memory cards and eight memory modules, etc. On 7011 system units,
Machine Checks are isolated to two memory modules and the CPU Planar.
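
The difference between a correctable single-bit error and a double-bit error that can only be detected can be illustrated with a small single-error-correcting, double-error-detecting code. The manual does not say which ECC code the memory hardware uses, so the extended Hamming (8,4) code below is an assumption chosen purely for illustration, and the function names are hypothetical.

```python
# Illustrative sketch only: an extended Hamming (8,4) code, not the code
# actually used by the memory hardware (which the manual does not specify).

def encode(data4):
    """Encode 4 data bits (list of 0/1) into an 8-bit codeword."""
    code = [0] * 8  # index 0 = overall parity, indexes 1..7 = Hamming(7,4)
    code[3], code[5], code[6], code[7] = data4
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes parity of all 8 bits even
    return code


def decode(code):
    """Return (status, data4); status is "ok", "corrected" or "uncorrectable"."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # One bit flipped: the syndrome is its position
        # (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return ("uncorrectable", None)
    return (status, [code[3], code[5], code[6], code[7]])
```

Flipping one bit of an encoded word yields the status "corrected" with the original data restored. Flipping two bits yields "uncorrectable", which corresponds to the double-bit case that raises a Machine Check.
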
Although Checkstops can be caused by things other than the CPU, the diagnostics always
call out the CPU when there is a Checkstop. Machine Checks can cause Checkstops. If both
a Checkstop and a Machine Check are logged, only the Machine Check entry is analyzed.
Note: Normally, the Base System Diagnostics do not analyze any error more than four
days old.
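
The two selection rules above (the four-day age limit and the preference for Machine Check over Checkstop) can be sketched as a filter over log entries. This is a simplified illustration, not the diagnostics' real logic: the tuple layout, the kind strings, and the function name are assumptions, and the "normally" in the note means the real diagnostics may behave differently in some cases.

```python
from datetime import datetime, timedelta

# Illustrative sketch only: entry layout and names are hypothetical.
MAX_AGE = timedelta(days=4)  # the four-day limit from the note


def entries_to_analyze(entries, now):
    """Pick the error-log entries that would be analyzed.

    entries: list of (timestamp, kind) tuples, where kind is a string
             such as "Checkstop" or "Machine Check".
    now:     the current time as a datetime.
    """
    # Drop anything more than four days old.
    recent = [e for e in entries if now - e[0] <= MAX_AGE]
    kinds = {kind for _, kind in recent}
    # If both a Checkstop and a Machine Check are logged,
    # only the Machine Check entry is analyzed.
    if "Checkstop" in kinds and "Machine Check" in kinds:
        recent = [e for e in recent if e[1] != "Checkstop"]
    return recent
```

With a five-day-old Machine Check, a one-day-old Checkstop, and a two-hour-old Machine Check in the log, only the two-hour-old Machine Check is returned.
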

This manual is also suitable for:

R40, R50