Hardware Uncorrectable Errors; Fatal Errors; Blocking Timeout Fatal Errors; Deadlock Recovery Reset Errors - HP A9834-9001B User's & Service Manual

Hewlett-packard integrity superdome server user service guide

are opened between PDs when it is established that the PDs are up and communication between them is
open. When there is a failure in GSM, the goal is to close the sharing windows between those two cells but not
to affect sharing windows to other cells.
There are two methods to detect GSM errors. The first method is a software-only method, in which software
wraps data with a CRC code and sequence number. Software checks this for each buffer transferred. The
second method has some hardware assistance: the hardware sets some CSR bits whenever a GSM error
occurs. Software checks the CSR bits before using the data.
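The software-only method can be illustrated with a minimal sketch. The manual does not give the buffer layout or the CRC polynomial, so everything here is an assumption made for illustration: the 8-byte header (sequence number followed by CRC), the use of CRC-32 from Python's zlib module, and the names wrap and unwrap are all hypothetical.

```python
import struct
import zlib
from typing import Optional

# Hypothetical header layout: 32-bit sequence number, then 32-bit CRC.
HEADER = struct.Struct("<II")


def wrap(seq: int, payload: bytes) -> bytes:
    """Prefix a buffer with its sequence number and a CRC over both."""
    crc = zlib.crc32(struct.pack("<I", seq) + payload)
    return HEADER.pack(seq, crc) + payload


def unwrap(expected_seq: int, frame: bytes) -> Optional[bytes]:
    """Return the payload, or None if the frame fails any check."""
    if len(frame) < HEADER.size:
        return None  # truncated frame
    seq, crc = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:]
    if zlib.crc32(struct.pack("<I", seq) + payload) != crc:
        return None  # data or header was corrupted in transfer
    if seq != expected_seq:
        return None  # buffer is stale, duplicated, or out of order
    return payload
```

In this sketch the receiver calls unwrap on every buffer before using it. The CRC catches corrupted contents, and the sequence number catches a buffer that is intact but is not the one the receiver expects.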

Hardware Uncorrectable Errors

Hardware uncorrectable errors are detected by the hardware and signaled to software, which is then able to
recover from them. For some of these errors, the hardware must behave differently to enable software recovery.

Fatal Errors

Fatal errors are unrecoverable errors that usually indicate a loss of data. The system prevents corrupt data
from being committed to disk or network, and logs information about the error to aid diagnosis. Once a system
fatal error has been detected, no software recovery is possible. The goal of the sx2000 chipset and
PDC is to bring all interfaces in this PD into fatal error (FE) mode, signal an HPMC, and guarantee a clear
path to fetch PDC. PDC then saves the error logs, cleans up the error logs, and calls the OS HPMC handler.
The OS then makes a memory dump and reboots.

Blocking Timeout Fatal Errors

Blocking timeout errors occur when an interface detects that a required resource is blocked. Timeout errors
that occur when a specific transaction does not complete (TID timeouts) are not considered blocking timeout
errors. When a blocking timeout error has occurred, the interface tries to prevent queues in other interfaces,
cells, and PDs from backing up by throwing away transactions destined for the blocked resource and
returning flow control credits.
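The drop-and-return-credits behavior can be modeled in software as a rough sketch. This is not how the sx2000 hardware is implemented. The Interface and Transaction classes, the credit counter, and the method names are all hypothetical, chosen only to show why discarding transactions for a blocked resource keeps unrelated traffic moving.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Transaction:
    tid: int
    destination: str


@dataclass
class Interface:
    credits: int  # flow-control credits currently available to senders
    queue: deque = field(default_factory=deque)
    blocked: set = field(default_factory=set)
    dropped: list = field(default_factory=list)

    def accept(self, txn: Transaction) -> bool:
        """Take a transaction from a sender; False means no credit is free."""
        if txn.destination in self.blocked:
            self.dropped.append(txn)  # discard at once, consuming no credit
            return True
        if self.credits == 0:
            return False  # queue is full, so the sender backs up
        self.credits -= 1
        self.queue.append(txn)
        return True

    def on_blocking_timeout(self, resource: str) -> None:
        """Mark a resource blocked, drop its queued traffic, return credits."""
        self.blocked.add(resource)
        kept = deque()
        for txn in self.queue:
            if txn.destination == resource:
                self.dropped.append(txn)
                self.credits += 1  # credit goes back to the senders
            else:
                kept.append(txn)
        self.queue = kept
```

Without the drop step, transactions for the blocked resource would hold their credits indefinitely, and senders with traffic for healthy destinations would stall behind them.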

Deadlock Recovery Reset Errors

Deadlock errors are unrecoverable errors that indicate that the chipset is in a deadlock state and must be
reset to enable the CPU to fetch PDC code. Deadlock errors are caused by a defective chipset or CPU (or a
functional bug).
NOTE
After the sx2000 chipset is reset, all GSM sharing regions are disabled, thus providing error
containment and preventing any corruption from spreading to other PDs.

Error Logging

Hardware error handling can be broken into four phases: detection, transaction handling, logging, and state
behavior.

This manual is also suitable for:

Integrity superdome sx2000
