Permanent Failures; Responding To Failures; Error Messages For Fabric Checksums - Extreme Networks BlackDiamond 6804 Troubleshooting Manual

Advanced system diagnostics and troubleshooting guide

Permanent Failures

The most damaging conditions that produce packet error events are those that cause permanent errors.
These errors arise from a failure within the switch fabric that corrupts data in a systematic fashion.
Such permanent hardware defects might or might not affect normal switch operation. They cannot be
resolved by user intervention and will not resolve themselves. You must replace hardware to resolve
permanent errors.

Responding to Failures

Because fabric checksum validation detects and reports both transient and systematic failures, an
administrator must exercise judgment to tell the two apart.
As an example, the following messages are associated with an MSM64i health-check packet problem.
They indicate that the system is running system-health-check to verify internal connectivity.

<CRIT:SYST> CPU health-check packet missing type 0 on slot 5
<CRIT:SYST> CPU health-check packet problem on card 5
<INFO:SYST> card.C 1937: card5 (type 20) is reset due to autorecovery config reset counter is 1

If these messages occur only once or twice, no action is necessary. (Transient problem.)

If these messages recur continuously, remove and re-insert the module in its slot. (If the problem goes
away, this was a systematic, soft-state failure.)

If removing and re-inserting the module does not fix the problem, run extended diagnostics on the
switch, because the messages might point to a systematic, permanent failure.

Hardware replacement is indicated when systematic errors cannot be resolved by normal
troubleshooting methods. That is, you must first demonstrate that an error is both systematic and
permanent before repairing or replacing a component.
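
The escalation described above can be sketched as a small decision function. This is only an illustration, not part of ExtremeWare: the function name, the enum, and the reading of "once or twice" as at most two occurrences are assumptions made for the example.

```python
from enum import Enum


class Failure(Enum):
    TRANSIENT = "transient"
    SOFT_STATE = "systematic, soft-state"
    POSSIBLY_PERMANENT = "possibly systematic and permanent"


# Assumption: "once or twice" is read as at most two occurrences.
TRANSIENT_LIMIT = 2


def triage(occurrences, reseated=False, recurs_after_reseat=False):
    """Return (classification, next_step) for a recurring health-check message.

    occurrences         -- how many times the message has been logged
    reseated            -- whether the module was removed and re-inserted
    recurs_after_reseat -- whether the messages came back after reseating
    """
    if occurrences <= TRANSIENT_LIMIT:
        return Failure.TRANSIENT, "no action necessary"
    if not reseated:
        # Not enough evidence yet to classify the failure.
        return None, "remove and re-insert the module"
    if not recurs_after_reseat:
        return Failure.SOFT_STATE, "no further action"
    # Replacement is only indicated once diagnostics confirm the failure.
    return Failure.POSSIBLY_PERMANENT, "run extended diagnostics"
```

For example, `triage(10, reseated=True, recurs_after_reseat=True)` returns the "possibly systematic and permanent" classification together with the step "run extended diagnostics". The function deliberately never recommends replacement on its own, because the text requires that an error first be shown to be both systematic and permanent.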

Error Messages for Fabric Checksums

Versions of ExtremeWare prior to Version 6.2.2b56 simply logged the fact that a checksum error occurred
on a slot, without much detail about the type of packet or the reason the checksum message was logged.
The following messages are examples of the earlier message format.

01/31/2002 01:30.58 <CRIT:KERN> ERROR: Checksum Error on slot 3
01/31/2002 01:30.58 <CRIT:KERN> ERROR: Checksum Error on Slot 3
01/31/2002 01:30.58 <CRIT:KERN> ERROR: Checksum Error on slot 3

ExtremeWare Release 6.2.2b56 and later provide more detail about the origin of a checksum message:
the message is expanded to describe the type of message and the condition detected. For example, if
the system-health-check subsystem detects a panic, or a condition that requires action by the
subsystem or the administrator, you can expect to see a message similar to this:

04/16/2003 13:17.23 <CRIT:SYST> Sys-health-check [EXT] checksum error on slot 5 prev=0 cur=6
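
When scanning saved logs, it can help to tell the two message formats apart programmatically. The following is a minimal sketch that assumes the log lines look like the samples in this section. The regular expressions and the function name are illustrative assumptions, other tags or fields may appear on a real switch, and the `prev` and `cur` values are captured without interpreting them.

```python
import re

# Earlier format (before 6.2.2b56): only the slot number is reported.
# IGNORECASE accepts both "slot" and "Slot", as seen in the samples.
BASIC_FORMAT = re.compile(
    r"<CRIT:KERN> ERROR: Checksum Error on slot (?P<slot>\d+)",
    re.IGNORECASE,
)

# Later format: the bracketed tag and the prev/cur fields are captured as-is.
# \s+ before "on slot" tolerates a message that was wrapped across lines.
DETAILED_FORMAT = re.compile(
    r"<CRIT:SYST> Sys-health-check \[(?P<tag>\w+)\] checksum error\s+"
    r"on slot (?P<slot>\d+) prev=(?P<prev>\d+) cur=(?P<cur>\d+)"
)


def parse_checksum_message(line):
    """Return a dict describing a checksum log line, or None if it is unrelated."""
    match = DETAILED_FORMAT.search(line)
    if match:
        return {
            "format": "detailed",
            "tag": match.group("tag"),
            "slot": int(match.group("slot")),
            "prev": int(match.group("prev")),
            "cur": int(match.group("cur")),
        }
    match = BASIC_FORMAT.search(line)
    if match:
        return {"format": "basic", "slot": int(match.group("slot"))}
    return None
```

Counting the parsed results per slot gives a quick view of whether a checksum message is isolated or recurring, which is the distinction the troubleshooting steps depend on.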