Permanent Failures; Responding To Reported Failures - Extreme Networks ExtremeWare Version 7.8 Troubleshooting Manual

Failures of this type are the result of software or hardware systems entering an abnormal operating state
in which normal switch operation might, or might not, be impaired.

Permanent Failures

The most damaging packet error events are those caused by permanent errors. Permanent errors arise
from a failure within the switch fabric that corrupts data in a systematic fashion. These permanent
hardware defects might, or might not, affect normal switch operation. They cannot be resolved by user
intervention and will not resolve themselves. You must replace hardware to resolve permanent errors.

Responding to Reported Failures

Before ExtremeWare 7.1, the fabric checksum validation mechanisms in ExtremeWare detected and
reported all checksum validation failures. The resulting mix of message types in the system log
could cause confusion about the true nature of a failure and the appropriate response. This
confusion often led to unnecessary diversion of resources and unnecessary service interruptions,
because operators attempted to respond to reported errors that presented no actual threat to
network operation.
In ExtremeWare 7.1, the responsibility for reporting checksum errors shifted from a low-level
subsystem to a higher-level one. The low-level bus monitoring and data integrity verification
subsystem monitors the operation of all data and control busses within the switch. The higher-level
intelligent layer interprets the test results and reports them to the user. Rather than simply
inserting every checksum validation error in the system log, the higher-level interpreting and
reporting subsystem monitors checksum validation failures and inserts error messages in the system
log only when a systematic hardware problem is the likely cause of the failures.

NOTE
The intent of the higher-level interpreting and reporting subsystem is to remove the burden of
interpreting and classifying messages from the operator. The subsystem automatically differentiates
between harmless checksum error instances and service-impacting checksum error instances.

The interpreting and reporting subsystem uses measurement periods that are divided into a sequence of
20-second windows. Within the period of a window, reports from the low-level bus monitoring
subsystem are collected and stored in an internal data structure for the window. These reports are
divided into two major categories: slow-path reports and fast-path reports.
• Slow-path reports come from monitoring control busses and the CPU-to-switch fabric interface. The
slow-path reporting category is subdivided into different report message subcategories depending
on whether they come from CPU data monitoring, CPU health check tests, or backplane health check
tests.
• Fast-path reports come from direct monitoring of the switch fabric data path. The fast-path reporting
category is subdivided into different report message subcategories, depending on whether they come
from monitoring either internal or external MAC counters associated with each switch fabric in the
switching system.
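
The windowing and categorization described above can be illustrated with a minimal Python sketch. It is not ExtremeWare code: the `Report` type, the `bucket_reports` function, and the category and subcategory identifiers are hypothetical names that paraphrase the text. Only the 20-second window length and the slow-path and fast-path groupings come from the description above.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

WINDOW_SECONDS = 20  # window length stated in the text

# Hypothetical identifiers paraphrasing the report categories in the text.
SUBCATEGORIES = {
    "slow-path": {"cpu-data", "cpu-health-check", "backplane-health-check"},
    "fast-path": {"internal-mac", "external-mac"},
}


@dataclass(frozen=True)
class Report:
    timestamp: float  # seconds since the start of the measurement period
    category: str
    subcategory: str


def bucket_reports(reports):
    """Count reports per (category, subcategory) within each 20-second window."""
    windows = defaultdict(Counter)
    for r in reports:
        if r.subcategory not in SUBCATEGORIES.get(r.category, ()):
            raise ValueError(f"unknown report type: {r.category}/{r.subcategory}")
        index = int(r.timestamp // WINDOW_SECONDS)  # window 0 covers [0, 20)
        windows[index][(r.category, r.subcategory)] += 1
    return dict(windows)


reports = [
    Report(1.5, "slow-path", "cpu-health-check"),
    Report(19.9, "fast-path", "internal-mac"),
    Report(20.0, "fast-path", "internal-mac"),
    Report(25.0, "fast-path", "internal-mac"),
]
windows = bucket_reports(reports)
```

In this sketch the first two reports fall into window 0 and the last two into window 1. How the real subsystem decides from such per-window data that a systematic hardware problem is likely is not specified in this section, so no decision rule is shown.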
Advanced System Diagnostics and Troubleshooting Guide
Failure Modes