Thresholding - IBM eServer xSeries x382 Hardware Maintenance Manual And Troubleshooting Manual

Type 8834

Thresholding

MCA errors are classified into one of three categories: corrected, recoverable, and
fatal. In general, corrected errors will not affect the operation of the system and
therefore may occur repeatedly (fatal and most recoverable errors result in a
system reset). In some cases, such as a stuck bit in a memory DIMM, a corrected
error may occur with very high frequency. In this scenario, the system may
experience performance degradation because of the excessive time spent in the
error logging routines. In addition, the BMC SEL has a finite size and may be
quickly filled with duplicate errors. To help alleviate these problems, a thresholding
algorithm has been applied to the BMC SEL logging routines. If the threshold is
crossed, a special "Event Logging Disabled" SEL entry is created, and the BMC SEL
logging code no longer attempts to send platform event message commands for
that error type to the BMC.
This greatly reduces the amount of time spent in the SEL logging routines and
avoids overrunning the BMC SEL log storage. This thresholding in no way affects
the ability of the OS to receive notification and service CPEIs or CMCIs, nor does it
disable any error correction logic in the chipset. Any disabled event reporting will be
re-enabled on the next reboot.
Corrected errors are grouped into four categories: Microprocessor, Memory, PCI
PERR, and Generic Bus. History for each category is maintained separately.
Thresholding applies only to corrected errors, not to recoverable or fatal errors.
On the xSeries 382, the maximum number of errors that can occur for each
category is 10 within one hour. If this threshold is crossed, a special "Event
Logging Disabled" SEL entry is logged.
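
The thresholding behavior described above can be modeled with a short sketch. This is not the firmware's implementation: the class name SelThrottle, its methods, the sliding one-hour window, and the reading of "crossed" as "more than 10 errors" are all assumptions made for illustration. Only the four categories, the limit of 10, the one-hour period, the "Event Logging Disabled" entry, and the re-enable on reboot come from the text.

```python
from collections import deque

CATEGORIES = ("Microprocessor", "Memory", "PCI PERR", "Generic Bus")
MAX_ERRORS = 10          # from the text: 10 errors per category
WINDOW_SECONDS = 3600    # from the text: within one hour


class SelThrottle:
    """Hypothetical model of per-category thresholding for corrected errors."""

    def __init__(self):
        # History is kept separately for each category, as the text states.
        self.history = {c: deque() for c in CATEGORIES}
        self.disabled = set()
        self.sel = []  # stand-in for the BMC SEL

    def report(self, category, now):
        """Record one corrected error at time `now` (seconds).

        Returns True if a normal SEL entry was logged, False otherwise.
        """
        if category in self.disabled:
            return False  # no further event messages for this category
        times = self.history[category]
        times.append(now)
        # Assumption: a sliding window; entries an hour old or more are dropped.
        while times and now - times[0] >= WINDOW_SECONDS:
            times.popleft()
        if len(times) > MAX_ERRORS:
            # Assumption: "crossed" means the count exceeds the maximum.
            self.disabled.add(category)
            self.sel.append(("Event Logging Disabled", category))
            return False
        self.sel.append(("Corrected error", category))
        return True

    def reboot(self):
        """Disabled event reporting is re-enabled on the next reboot."""
        self.disabled.clear()
        for times in self.history.values():
            times.clear()
```

In this model, a stuck memory bit that produces a burst of corrected errors yields ten normal entries and one "Event Logging Disabled" entry, after which the Memory category stays silent until reboot() is called. The other three categories are unaffected.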
Table 4. SAL 3.0 MCA record event messages (continued)
MCA SAL record     SEL event:                       SEL event:
section type       Sensor type                      Event data bytes
-----------------  -------------------------------  -------------------------------------
PCI components     Critical interrupt:              PCI bus, device, function information
                     PERR
                     SERR
Memory device      Memory error:                    SMBIOS type 16 0-based index
                     Correctable
                     Uncorrectable
Other              Critical interrupt:              None
                     Bus correctable error
                     Bus uncorrectable error
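
The rows of this part of the table can be expressed as a simple lookup structure. The sketch below is illustrative only: the dictionary name SAL_TO_SEL, the field names, and the helper sel_fields are hypothetical, and only the rows shown in this continued portion of Table 4 are covered.

```python
# Values are copied from the table rows; names and layout are illustrative.
SAL_TO_SEL = {
    "PCI components": {
        "sensor_type": "Critical interrupt",
        "events": ("PERR", "SERR"),
        "event_data": "PCI bus, device, function information",
    },
    "Memory device": {
        "sensor_type": "Memory error",
        "events": ("Correctable", "Uncorrectable"),
        "event_data": "SMBIOS type 16 0-based index",
    },
    "Other": {
        "sensor_type": "Critical interrupt",
        "events": ("Bus correctable error", "Bus uncorrectable error"),
        "event_data": None,
    },
}


def sel_fields(section_type):
    """Return the SEL sensor type, events, and event data for a section type.

    Raises KeyError for section types not listed in this part of the table.
    """
    return SAL_TO_SEL[section_type]
```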
