Extreme Networks ExtremeWare Version 7.8 Troubleshooting Manual

Advanced system diagnostics

Advanced System Diagnostics and Troubleshooting Guide

ExtremeWare Software Version 7.8
Extreme Networks, Inc.
3585 Monroe Street
Santa Clara, California 95051
(888) 257-3000
http://www.extremenetworks.com
Published: May 2008
Part number: 100279-00 Rev 01


Summary of Contents for Extreme Networks ExtremeWare Version 7.8

  • Page 1: Troubleshooting Guide

    Advanced System Diagnostics and Troubleshooting Guide ExtremeWare Software Version 7.8 Extreme Networks, Inc. 3585 Monroe Street Santa Clara, California 95051 (888) 257-3000 http://www.extremenetworks.com Published: May 2008 Part number: 100279-00 Rev 01...
  • Page 2 Extreme Networks, Inc., which may be registered or pending registration in certain jurisdictions. The Extreme Turbodrive logo is a service mark of Extreme Networks, which may be registered or pending registration in certain jurisdictions. Specifications are subject to change without notice.
  • Page 3: Table Of Contents

    Contents Preface Introduction Terminology Conventions Related Publications Chapter 1 Introduction Introduction Diagnostics: A Brief Historical Perspective Overview of the ExtremeWare Diagnostics Suite Supported Hardware Applicable ExtremeWare Versions Chapter 2 “ i ” Series Switch Hardware Architecture Diagnostics Support The BlackDiamond Systems BlackDiamond 6800 Series Hardware Architecture Differences The BlackDiamond Backplane BlackDiamond I/O Modules...
  • Page 4 Contents Definition of Terms Standard Ethernet Detection for Packet Errors on the Wire Extreme Networks’ Complementary Detection of Packet Errors Between Wires Hardware System Detection Mechanisms Software System Detection Mechanisms Failure Modes Transient Failures Systematic Failures Soft-State Failures Permanent Failures...
  • Page 5 Contents The Role of Processes to Monitor System Operation Power On Self Test (POST) Related Commands Configuring the Boot-Up Diagnostics Runtime (On-Demand) System Diagnostics Runtime Diagnostics on “i” Series Systems Related Commands Running the Diagnostics on BlackDiamond Systems Runtime Diagnostics on “i” Series Alpine and Summit Systems System Impact of Running the Diagnostics on “i”...
  • Page 6 Contents System Impacts of the Transceiver Diagnostics Network Impact of the Transceiver Diagnostics Viewing Diagnostics Results Example Log Messages for Transceiver Diagnostic Failures Examples, show diagnostics Command Example—show switch Command Transceiver Diagnostic Result Analysis FDB Scan Usage Guidelines Related Commands Enabling FDB Scanning Disabling FDB Scanning Configuring the FDB Scan Diagnostics...
  • Page 7 Asia TAC EMEA TAC Japan TAC What Information Should You Collect? Analyzing Data Diagnostic Troubleshooting Extreme Networks’ Recommendations Using Memory Scanning to Screen I/O Modules Appendix A Limited Operation Mode and Minimal Operation Mode Limited Operation Mode Triggering Limited Operation Mode...
  • Page 9: Preface

    Introduction This guide describes how to use the ExtremeWare hardware diagnostics suite to test and validate the operating integrity of Extreme Networks switches. The tools in the diagnostic suite are used to detect, isolate, and treat faults in a system.
  • Page 10: Related Publications

    • ExtremeWare Software User Guide, Software Version 7.7. • ExtremeWare Software Command Reference, Software Version 7.7. • ExtremeWare Error Message Decoder. Documentation for Extreme Networks products is available on the World Wide Web at the following location: http://www.extremenetworks.com/services/documentation/Default.asp Advanced System Diagnostics and Troubleshooting Guide...
  • Page 11: Chapter 1 Introduction

    Introduction This guide describes how to use the ExtremeWare hardware diagnostics suite to test and validate the operating integrity of Extreme Networks switches. The tools in the diagnostic suite are used to detect, isolate, and treat faults in a system.
  • Page 12: Diagnostics: A Brief Historical Perspective

    Introduction Diagnostics: A Brief Historical Perspective Diagnostic utility programs were created to aid in troubleshooting system problems by detecting and reporting faults so that operators or administrators could go fix the problem. While this approach does help, it has some key limitations: •...
  • Page 13: Supported Hardware

    Supported Hardware The ExtremeWare diagnostic suite applies only to Extreme Networks switch products based on the “inferno” series chipset. Equipment based on this chipset is referred to as being “inferno” series or “i”...
  • Page 15: “i” Series Switch Hardware Architecture

    • Summit “i” Series Systems on page 23 Diagnostics Support The ExtremeWare diagnostic suite applies only to Extreme Networks switch products based on the “inferno” series chipset. Equipment based on this chipset is referred to as being “inferno” series or “i”...
  • Page 16: The BlackDiamond Systems

    “i” Series Switch Hardware Architecture The BlackDiamond Systems In the context of the advanced system diagnostics suite, the BlackDiamond family of core chassis switches share the same fundamental hardware architecture: a multislot modular chassis containing a passive backplane that supports redundant load-sharing, hot-swappable switch fabric modules. On BlackDiamond systems, each I/O module and MSM represents an individual switch containing its own switching fabric and packet memory.
  • Page 17: The BlackDiamond Backplane

    The BlackDiamond Systems The BlackDiamond Backplane The BlackDiamond backplane is a passive backplane, meaning that all the active components such as CPUs, ASICs, and memory have been moved onto plug-in modules, such as the I/O modules and MSMs. Figure 2: BlackDiamond passive backplane architecture (BlackDiamond 6808 shown) Switch Module Eight Load-Shared...
  • Page 18: BlackDiamond I/O Modules

    “i” Series Switch Hardware Architecture BlackDiamond I/O Modules Each BlackDiamond I/O module has a built-in switching fabric (see Figure 3) giving the module the capability to switch local traffic on the same module. Traffic that is destined for other modules in the chassis travels across the backplane to the MSMs, where it is switched and sent to its destination I/O module.
  • Page 19: Management Switch Modules

    The BlackDiamond Systems packet memory for temporary storage. Based on the information in memory, such as the FDB, the address filtering and queue management ASIC makes a forwarding decision. If the next hop is a local port (on the same module), the packet is forwarded to the external MAC and PHY for the exit port. If the packet is destined for another module (as either slow path traffic or fast path traffic), the packet is transferred to the internal MAC and then on to the MSM (CPU).
  • Page 20: BlackDiamond MSM Redundancy

    “i” Series Switch Hardware Architecture BlackDiamond MSM Redundancy The CPU subsystems on a pair of BlackDiamond MSMs operate in a master-slave relationship. (See Figure Figure 5: BlackDiamond MSM redundancy scheme MSM64i (A) I/O Module Switching Sub- Fabric system I/O Module I/O Module Fault Tolerant Switch Fabric...
  • Page 21 The BlackDiamond Systems The MSM failover behavior depends on the following factors: • Platform type and equipage (Summit vs. Alpine vs. BlackDiamond) • Software configuration settings for the software exception handling options such as system watchdog, system recovery level, and reboot loop protection. (For more information on the configuration settings, see Chapter “Software Exception...
  • Page 22: Alpine Systems

    “i” Series Switch Hardware Architecture Alpine Systems Like the BlackDiamond systems, the Alpine systems are also based on a multislot modular chassis that uses the inferno chipset, but the Alpine switches differ from the BlackDiamond switches on these points (see Figure •...
  • Page 23: Summit “i” Series Systems

    Summit “i” Series Systems Summit “ i ” Series Systems Unlike the BlackDiamond and Alpine systems, the Summit “i” series stackables are not modular systems: all of the system components are integrated into one unit. (See Figure Figure 7: Summit “ i ” series architecture AFQM ASIC ASIC...
  • Page 25: Overview

    The Ethernet standard contains built-in protections to detect packet errors on the link between devices, but these mechanisms cannot always detect packet errors occurring in the switch fabric of a device. Extreme Networks has incorporated many protection mechanisms to ensure that packet error events are minimized and handled properly.
  • Page 26: Definition Of Terms

    Checksum A value computed by running actual packet data through a polynomial formula. Checksums are one of the tools used by Extreme Networks in attempts to detect and manage packet error events. Packet checksum A checksum value that is computed by the MAC chip when the packet is transferred from the MAC chip to the switch fabric.
  • Page 27: Standard Ethernet Detection For Packet Errors On The Wire

    CRC calculation and CRC validation and discards the packet, and increments the CRC error counter in the MAC device associated with that port. In Extreme Networks devices, ExtremeWare polls the MAC CRC error count registers and makes that information available through the output of the CLI command.
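The CRC validation step described here can be illustrated with a minimal Python sketch. It uses the standard library's `zlib.crc32`, which implements the same CRC-32 polynomial as the Ethernet frame check sequence. The function names, the little-endian trailer layout, and the `crc_errors` counter are assumptions made for illustration; they are not ExtremeWare internals.

```python
import zlib


def append_fcs(payload: bytes) -> bytes:
    """Return the payload followed by a 4-byte CRC-32 trailer.

    The little-endian byte order is an assumption for this sketch.
    """
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def validate_frame(frame: bytes, counters: dict) -> bool:
    """Recompute the CRC over the payload and compare it to the trailer.

    On a mismatch the frame would be discarded and a per-port CRC error
    counter (hypothetical name: crc_errors) is incremented.
    """
    payload, trailer = frame[:-4], frame[-4:]
    ok = zlib.crc32(payload).to_bytes(4, "little") == trailer
    if not ok:
        counters["crc_errors"] = counters.get("crc_errors", 0) + 1
    return ok
```

A receiver would call `validate_frame` once per frame and drop any frame for which it returns `False`; a management task could then poll the counter, much as the text describes ExtremeWare polling the MAC CRC error count registers.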
  • Page 28: Hardware System Detection Mechanisms

    Extreme Networks switch. Hardware System Detection Mechanisms All Extreme Networks switches based on the “i”-series switch fabric validate data integrity internal to the switch fabric using a common checksum verification algorithm. Using Figure 8...
  • Page 29: Software System Detection Mechanisms

    Extreme Networks’ Complementary Detection of Packet Errors Between Wires transmitted, but an invalid CRC value is included with the packet. Therefore, the receiving device will detect an invalid CRC value and will drop the packet. In Summit “i” series stackable switches, the packet checksum is calculated by the MAC ASIC on the receiving port and is compared against the verification checksum calculated by the MAC ASIC on the transmitting port, as described above.
  • Page 30: Failure Modes

    Packet Errors and Packet Error Detection described in the section “System (CPU and Backplane) Health Check” on page 70. For example, the system health check facility can be configured such that ExtremeWare will insert a message into the system log that a checksum error has been detected. Failure Modes Although packet errors are extremely rare events, packet errors can occur anywhere along the data path, along the control path, or while stored in packet memory.
  • Page 31: Permanent Failures

    Failure Modes Failures of this type are the result of software or hardware systems entering an abnormal operating state in which normal switch operation might, or might not, be impaired. Permanent Failures The most detrimental set of conditions that result in packet error events are those that result in permanent errors.
  • Page 32 Packet Errors and Packet Error Detection The slow-path and fast-path categories each have a separate configured threshold and associated action that occurs at the end of the 20-second window: • For the slow-path category, the three types of slow-path subcategory reports are tallied and compared to the configured slow-path subcategory threshold.
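The per-window comparison described above can be sketched in Python as follows. This is a minimal illustration only: the function name, the report tuple shape, and the use of "greater than or equal" for the comparison are assumptions, and the subcategory labels in the usage example are placeholders rather than the names ExtremeWare uses.

```python
from collections import Counter

WINDOW_SECONDS = 20  # window length stated in the text


def evaluate_window(reports, slow_threshold, fast_threshold):
    """Tally one window of error reports and compare against thresholds.

    reports: iterable of (category, subcategory) tuples collected during
    one window, where category is "slow-path" or "fast-path".
    Returns a dict saying whether each category reached its threshold.
    """
    tally = Counter(category for category, _subcategory in reports)
    return {
        "slow-path": tally["slow-path"] >= slow_threshold,
        "fast-path": tally["fast-path"] >= fast_threshold,
    }
```

For example, `evaluate_window([("slow-path", "EDP"), ("slow-path", "EDP")], 2, 2)` reports that the slow-path threshold was reached and the fast-path threshold was not.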
  • Page 33: Health Check Messages

    However, these attributes are currently accessible only under the instruction from Extreme Networks TAC personnel. The default settings for these attributes have been found to work effectively under a broad range of networking conditions and should not require changes.
  • Page 34: Checksum Error Messages

    Packet Errors and Packet Error Detection The intent of these messages is to alert the NOC that the health check error threshold is being exceeded. Closer monitoring is required, but these errors do not necessarily point to a systematic problem. These messages take the general format: date time <...
  • Page 35: Corrective Behavior Messages

    Health Check Messages These messages appear in the log when EDP packets received are corrupted: • <Crit:SYST> Sys-health-check [EDP] checksum error (slow-path) on M-BRD, port 0x03 — (Summit) 701026-00-03 0003Y-00052 • <Crit:SYST> Sys-health-check [EDP] checksum error (slow-path) on BPLNE, port 0x03 —...
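When many such messages accumulate in a collected log, it can help to extract their fields programmatically. The following Python sketch parses the two message shapes quoted above. The regular expression is inferred from those two examples only, so other message variants may not match; the field names are assumptions chosen for readability.

```python
import re

# Pattern inferred from the two example messages in the text.
CHECKSUM_ERROR = re.compile(
    r"<(?P<severity>\w+):(?P<facility>\w+)>\s+Sys-health-check\s+"
    r"\[(?P<packet_type>\w+)\]\s+checksum error\s+\((?P<path>[\w-]+)\)\s+"
    r"on\s+(?P<location>[\w-]+),\s+port\s+(?P<port>0x[0-9A-Fa-f]+)"
)


def parse_checksum_error(line):
    """Return a dict of fields for a checksum error line, or None."""
    match = CHECKSUM_ERROR.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["port"] = int(fields["port"], 16)  # port is logged in hex
    return fields
```

Applied to the Summit example, the parser yields the location `M-BRD`, the path `slow-path`, and port number 3.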
  • Page 36 Packet Errors and Packet Error Detection • Backplane link—Indicates that health check packets were lost on one or more backplane links connecting an MSM module to an I/O module. Either module might be in error; check the transceiver diagnostics. • FDB error—Indicates that a discrepancy was detected during the FDB scan of the RAM memory pool.
  • Page 37: Chapter 4 Software Exception Handling

    Software Exception Handling This chapter describes the software exception handling features built into Extreme hardware and software products to detect and respond to problems to maximize switch reliability and availability. This chapter contains the following sections: • Overview of Software Exception Handling Features on page 37 •...
  • Page 38: System Software Exception Recovery Behavior

    Software Exception Handling The system-watchdog feature is enabled by default. The CLI commands related to system-watchdog operation are: enable system-watchdog disable system-watchdog NOTE During the reboot cycle, network redundancy protocols will work to recover the network. The impact on the network depends on the network topology and configuration (for example, OSPF ECMP versus a large STP network on a single domain).
  • Page 39 Overview of Software Exception Handling Features switch is equipped with MSM-64i modules), or 2) initiate a hitless failover (when the switch is equipped with MSM-3 modules). The watchdog is a software watchdog timer that can be enabled or disabled through CLI commands. The watchdog timer is reset as long as ExtremeWare is functioning well enough to return to the main software exception handling loop where the critical software exception handling tasks, such as tBGTask, handle the process of resetting the watchdog timer and creating log entries.
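The reset-or-expire behavior of a software watchdog timer can be sketched in a few lines of Python. This is a generic illustration of the technique, not the ExtremeWare implementation: the class name, the timeout value, and the explicit `now` argument (used instead of a real clock so the logic is easy to follow) are all assumptions.

```python
class SoftwareWatchdog:
    """Software watchdog that expires if it is not reset in time."""

    def __init__(self, timeout_s, enabled=True):
        self.timeout_s = timeout_s  # hypothetical timeout, in seconds
        self.enabled = enabled      # mirrors enable/disable via the CLI
        self.last_reset = 0.0

    def reset(self, now):
        # Called from the main loop while the software is healthy.
        self.last_reset = now

    def expired(self, now):
        # A disabled watchdog never expires.
        return self.enabled and (now - self.last_reset) > self.timeout_s
```

As long as the main loop keeps calling `reset`, `expired` stays false; if the loop stalls for longer than the timeout, `expired` becomes true and the recovery action would be taken.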
  • Page 40: Configuring System Recovery Actions

    Software Exception Handling Configuring System Recovery Actions ExtremeWare provides a user-configurable system recovery software diagnostic tool whose main function is to monitor the system boot processes. If an error occurs during the POST, the system enters a fail-safe mode that allows the network or system administrator to view logs and troubleshoot the fault.
  • Page 41: Configuring System Recovery Actions On “e” Series Switches

    Configuring System Recovery Actions Configuring System Recovery Actions on “e” Series Switches To specify a system recovery scheme for “e” series switches when a software exception occurs, use this command: configure sys-recovery-level [none | [all | critical]] [reboot] where none specifies that no recovery action is taken when a software exception occurs (no system shutdown or...
  • Page 42 Software Exception Handling back into the network during a scheduled outage window. This might be an advantage if all connected nodes are dual-homed, as a reinsertion will trigger a network reconvergence and an additional service outage. NOTE Under the reboot or shutdown options, network redundancy protocols will work to recover the network. The only difference between these two options, in this case, is that under the...
  • Page 43: Configuring Reboot Loop Protection

    Configuring Reboot Loop Protection Configuring Reboot Loop Protection Reboot loop protection prevents a failure that persists across a reboot from putting the switch into an endless cycle of reboots. Reboot loop protection is helpful to increase network stability in the event that some systematic problem is causing the watchdog timer to expire or a software exception to be triggered repeatedly.
  • Page 44 Software Exception Handling On BlackDiamond switches you can configure the number of times the slave MSM can reboot within a configured time limit or configure the slave MSM to use the global reboot-loop-protection configuration. Use one of the following commands: configure reboot-loop-protection backup-msm threshold <time-interval>...
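The idea behind reboot loop protection, counting reboots inside a time interval and refusing further automatic reboots once a limit is reached, can be sketched in Python as below. The class name, parameter names, and the exact comparison are assumptions for illustration; what the switch does once the limit is reached is not modeled here.

```python
from collections import deque


class RebootLoopGuard:
    """Allow automatic reboots until too many occur within an interval."""

    def __init__(self, threshold, interval_s):
        self.threshold = threshold    # reboot count limit (hypothetical)
        self.interval_s = interval_s  # sliding window length in seconds
        self.reboots = deque()

    def record_reboot(self, now):
        """Record a reboot at time `now`.

        Returns True if another automatic reboot is still allowed,
        False once the count inside the window reaches the threshold.
        """
        self.reboots.append(now)
        # Drop reboots that have fallen out of the time window.
        while self.reboots and now - self.reboots[0] > self.interval_s:
            self.reboots.popleft()
        return len(self.reboots) < self.threshold
```

With a threshold of 3 and a 60-second interval, the third reboot inside one minute returns `False`, while a reboot several minutes later is allowed again because the earlier entries have aged out.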
  • Page 45: Dumping The “i” Series Switch System Memory

    On “i” series switches, you can dump (copy and transfer) the contents of the system DRAM memory to a remote TFTP host so that it can be passed to an Extreme Networks technical support representative who will examine and interpret the dump results. The system dump only works through the Ethernet management port.
  • Page 46: Initiating A Manual System Dump

    Software Exception Handling • Configure the system dump as a system recovery response action. To specify a system memory dump if a software exception occurs, use this command: configure sys-recovery-level [all | critical] system-dump [maintenance-mode | msm-failover | reboot | shutdown]]] where: If any task exception occurs, ExtremeWare logs an error in the system log and automatically initiates a memory dump transfer to a remote TFTP dump server,...
  • Page 47: Example Log For A Software Exception

    Dumping the “i” Series Switch System Memory Example Log for a Software Exception The following log was taken after simulating a BGTask crash. The system recovery level for critical events is set to system shutdown. Hence, when BGTask crashed, all I/O modules in the system were shut down. 12/23/2000 23:15:14.87 <Info:SYST>...
  • Page 49: Diagnostic Test Functionality

    Diagnostics This chapter describes how to configure and use the Extreme Advanced System Diagnostics. This chapter contains the following sections: • Diagnostic Test Functionality on page 49 • System Health Checks: A Diagnostics Suite on page 52 • Power On Self Test (POST) on page 56 •...
  • Page 50: How The Test Affects The Switch

    Diagnostics Some diagnostic tests, such as the slot-based hardware diagnostics (including the packet memory scan), for example, can be run on demand through user CLI commands. Other tests can be run on demand by user CLI commands and can also be configured to observe specific user-selected settings. All of the ExtremeWare diagnostic tests can be coordinated under the umbrella of the ExtremeWare system health check feature, which runs automatic background checks to detect packet memory errors and take automatic action when errors are found.
  • Page 51 Diagnostic Test Functionality Diagnostic tests are processed by the CPU. When invoked, each diagnostic test looks for different things (device problems, communication-path problems, etc.), and uses either the control bus or the data bus, or—in some cases—both buses to perform the test. For example, Figure 9 shows a simplified example of the CPU health check test.
  • Page 52: System Health Checks: A Diagnostics Suite

    Diagnostics Figure 10: Backplane health check paths (BlackDiamond architecture) Control Bus Control Bus NVRAM CPLD Subassembly UART PCMCIA AFQM AFQM ASIC ASIC (Quake) (Quake) FLASH SRAM MGMT Master MSM Daughter Card ASIC ASIC (Twister) (Twister) CPU loads test packet to MSM Fabric. Test packet transferred to I/O module Fabric on data Control Bus...
  • Page 53: The Role Of Memory Scanning And Memory Mapping

    System Health Checks: A Diagnostics Suite — Offer configurable levels — Remove the switch fabric from service for the duration of the tests • Background packet memory scanning and mapping — Checks all packet storage memory for defects — Potentially maps out defective blocks •...
  • Page 54: Modes Of Operation

    Diagnostics scanning and memory mapping diagnostics are used to identify and correct switch fabric checksum errors. Memory scanning and memory mapping are two separate functions: scanning detects the faulted portion of the memory; mapping re-maps the memory to remove the faulted memory section. Memory scanning is designed to help isolate one of the major root causes of fabric checksum errors: single-bit permanent (hard) failures.
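The distinction between scanning (finding the faulted locations) and mapping (removing them from use) can be made concrete with a small Python sketch. It runs against a simulated memory with a stuck bit; the test patterns, the class, and the function names are assumptions for illustration and do not describe how the switch hardware is actually accessed.

```python
TEST_PATTERNS = (0x00, 0xFF, 0x55, 0xAA)  # assumed write/read-back patterns


class SimulatedMemory:
    """Byte-addressable memory with optional stuck-at-one bits (test double)."""

    def __init__(self, size, stuck_at_one=None):
        self.cells = [0] * size
        self.stuck = stuck_at_one or {}  # address -> bitmask forced to 1

    def write(self, addr, value):
        self.cells[addr] = value | self.stuck.get(addr, 0)

    def read(self, addr):
        return self.cells[addr]


def scan(memory, size):
    """Scanning: return addresses that fail any write/read-back pattern."""
    bad = []
    for addr in range(size):
        for pattern in TEST_PATTERNS:
            memory.write(addr, pattern)
            if memory.read(addr) != pattern:
                bad.append(addr)
                break
    return bad


def build_map(size, bad_addresses):
    """Mapping: return the usable addresses with faulted ones removed."""
    excluded = set(bad_addresses)
    return [addr for addr in range(size) if addr not in excluded]
```

A single-bit permanent failure at one address shows up in `scan` as that one address, and `build_map` then yields a memory map that simply skips it.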
  • Page 55: The Role Of Processes To Monitor System Operation

    System Health Checks: A Diagnostics Suite Automatic Mode. Automatic mode for initiating a memory scan is set up when the system health check auto-recovery option is enabled (see “System (CPU and Backplane) Health Check” on page 70). When system health checks fail at the specified frequency, packet memory scanning is invoked automatically. Automatic mode status is listed in the “sys-health-check”...
  • Page 56: Power On Self Test (POST)

    Diagnostics Power On Self Test (POST) The POST actually consists of two test processes: a “pre-POST” portion that runs before the POST, and the POST itself. The entire POST (both portions) runs every time the system is booted. It tests hardware components and verifies basic system integrity.
  • Page 57: Runtime (On-Demand) System Diagnostics

    Runtime (On-Demand) System Diagnostics Runtime (On-Demand) System Diagnostics The ExtremeWare diagnostics test suite offers a set of one-time test routines that can be run on demand by user command. Depending on the switch platform and model (differences in hardware architecture determine which aspects of the diagnostic tests apply), these tests are activated by different commands and different user-configurable options.
  • Page 58: Related Commands

    Diagnostics BlackDiamond systems—whether the module type being tested is an MSM or an I/O module), but adds the following two test sets: — Packet memory test (where possible, this test also attempts to remap up to eight errors) — Additional loop-back tests: Big packet (4k) MAC, transceiver, VLAN •...
  • Page 59: System Impact Of Running The Diagnostics On “i” Series Switches

    Runtime (On-Demand) System Diagnostics System Impact of Running the Diagnostics on “ i ” Series Switches These diagnostics are invasive diagnostics. The diagnostics perform different tests, depending on whether the test is being performed on the CPU subsystem or an individual I/O module. The diagnostics reset and erase all current hardware states.
  • Page 60: Related Commands

    Diagnostics NOTE Only run these diagnostics when the switch can be brought off-line. The tests performed are extensive and affect traffic that must be processed by the system CPU, because the diagnostics themselves are processed by the system CPU. Related Commands run diagnostics show diagnostics Running the Diagnostics on Summit “...
  • Page 61: Memory Scanning And Memory Mapping Behavior

    Automatic Packet Memory Scan (via sys-health-check) Automatic mode status is listed in the “sys-health-check” field of the display for the show switch command. When auto-recovery is configured, an automated background polling task checks every 20 seconds to determine whether any fabric checksum errors have occurred. Three consecutive samples must be corrupted for any module to invoke autoscan.
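The "three consecutive corrupted samples" rule can be expressed as a short Python sketch. The function name and the boolean sample representation are assumptions for illustration; the 20-second interval and the count of three come from the text.

```python
POLL_INTERVAL_SECONDS = 20  # polling period stated in the text
CONSECUTIVE_REQUIRED = 3    # consecutive corrupted samples required


def should_invoke_autoscan(samples):
    """Decide whether a module's poll history triggers an autoscan.

    samples: iterable of booleans in time order, one per poll, True when
    that poll saw fabric checksum errors for the module.
    """
    streak = 0
    for corrupted in samples:
        # A clean sample breaks the streak and restarts the count.
        streak = streak + 1 if corrupted else 0
        if streak >= CONSECUTIVE_REQUIRED:
            return True
    return False
```

Under this rule a single clean poll resets the count, so isolated or intermittent errors do not trigger a scan on their own.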
  • Page 62 Diagnostics Table 6: Auto-recovery memory scanning and memory mapping behavior (continued) Online Offline Errors Platform Mode Mode Detected Behavior BlackDiamond with • MSM64i kept online. two MSM64i • Errors mapped; MSM64i kept online. modules; error on • >7 Errors not mapped; MSM64i kept online. master •...
  • Page 63 Automatic Packet Memory Scan (via sys-health-check) Table 7: Manual diagnostics memory scanning and memory mapping behavior, normal (continued) Online Offline Errors Platform Mode Mode Detected Behavior Summit “ i ” series • Switch kept online. • Errors mapped; switch kept online. •...
  • Page 64: Limited Operation Mode

    Diagnostics Table 8: Manual diagnostics memory scanning and memory mapping behavior, extended (continued) Platform Errors Detected? Behavior BlackDiamond with one MSM64i Switch enters limited commands mode. (or slave MSM64i is offline) Switch kept online. BlackDiamond with two MSM64i Master MSM64i fails over. modules;...
  • Page 65 Automatic Packet Memory Scan (via sys-health-check) During the memory scan, the CPU utilization is high and mostly dedicated to executing the diagnostics—as is normal for running any diagnostic on the modules. During this time, other network activities where this system is expected to be a timely participant could be adversely affected, for example, in networks making use of STP and OSPF.
  • Page 66: Interpreting Memory Scanning Results

    Diagnostics Interpreting Memory Scanning Results If single-bit permanent errors are detected on an “i” series switch during the memory scanning process, these errors will be mapped out of the general memory map with only a minimal loss to the total available memory on the system.
  • Page 67: Per-Slot Packet Memory Scan On BlackDiamond Switches

    Per-Slot Packet Memory Scan on BlackDiamond Switches Per-Slot Packet Memory Scan on BlackDiamond Switches While the system health check auto-recovery mode is effective at recovering from suspected failures, it does not provide the depth of control over recovery options that many network administrators require. The per-slot packet memory scan capability on BlackDiamond switches gives administrators the ability to set the recovery behavior for each module—an important distinction when only certain modules can be taken offline, while others must remain online no matter what the error condition.
  • Page 68: System Impact Of Per-Slot Packet Memory Scanning

    Diagnostics To disable packet memory scanning on a BlackDiamond module and return to the behavior configured for the global system health check facility, use this command: unconfigure packet-mem-scan-recovery-mode slot [msm-a | msm-b | <slot number>] To view the recovery mode configuration for BlackDiamond slots that have per-slot packet memory scanning enabled, use this command: show packet-mem-scan-recovery-mode which displays the following information:...
  • Page 69 Per-Slot Packet Memory Scan on BlackDiamond Switches modules will trigger a reboot if the failed module is the master MSM. A failed MSM-64i in the slave slot is simply removed from service. In general, network redundancy protocols will work to recover the network. The impact on the network depends on the network topology and configuration (for example, OSPF ECMP versus a large STP network on a single domain).
  • Page 70: System (CPU And Backplane) Health Check

    Diagnostics System (CPU and Backplane) Health Check The purpose of the system health check feature is to ensure that communication between the CPU on the management switch module (MSM) and all I/O cards within the chassis is functioning properly. NOTE The system health check feature is supported only on “i”...
  • Page 71: Related Commands

    System (CPU and Backplane) Health Check Related Commands enable sys-health-check disable sys-health-check configure sys-health-check alarm-level [card-down | default | log | system-down | traps] configure sys-health-check auto-recovery <number of tries> [offline | online] (BlackDiamond) configure sys-health-check alarm-level auto-recovery [offline | online] (Alpine or Summit) Health Check Functionality The system health check feature can be configured to operate in one of two mutually-exclusive modes: •...
  • Page 72: Backplane Health Check

    Diagnostics where: Specifies the number of times that the health checker attempts to auto-recover a faulty number of tries module. The range is from 0 to 255 times. The default is 3 times. Specifies that a faulty module is to be taken offline and kept offline if one of the offline following conditions is true: •...
  • Page 73: Viewing Backplane Health Check Diagnostic Results-Show Diagnostics Command

    System (CPU and Backplane) Health Check NOTE Frequent corrupted packets indicate a failure that you need to address immediately. Missed packets are also a problem, but you should consider the total number of missed packets as only a general check of the health of the system. Small numbers (fewer than five) can generally be ignored, as they can be caused by conditions where the CPU becomes too busy to receive the transmitted packets properly, subsequently causing the missed packet count to increase.
  • Page 74 Diagnostics Backplane Health Check Diagnostic Results—Example 1. Example 1 shows the report from one MSM, MSM-A in a BlackDiamond 6808 switch. If two MSMs are in the chassis, both MSM-A and MSM-B are reported. Total Tx Total Rv Total miss Error Pkt Diag fail Last fail time...
  • Page 75 System (CPU and Backplane) Health Check To clarify the relationship between MSM ports, the backplane links, and the I/O module slots shown in Example 1, consider the following annotated adaptation of the example’s output (not actual command output; for instructional purposes only): Module Port Channel...
  • Page 76 Diagnostics Backplane Health Check Diagnostic Results—Example 2. Example 2 shows a report for MSM-A again, but this time with missed and corrupted packets on different channels going to more than one I/O module slot. In example 2, the missed packets and corrupted packets on channels going to more than one I/O module (slots 1, 4, and 7 in this example) indicate what is most likely a problem with MSM-A, itself.
  • Page 77 System (CPU and Backplane) Health Check Backplane Health Check Diagnostic Results—Example 3. Example 3 shows a report for MSM-A again, but with missed and corrupted packets on channels going to the same slot. In example 3, the corrupted packets on channels going to the same I/O module (slot 7 in this example) indicate what is most likely a problem with the I/O module in slot 7.
  • Page 78: Analyzing The Results

    Diagnostics Backplane Health Check Diagnostic Results—Example 4. Example 4 shows a report for MSM-A again, but with small numbers of missed packets on channels going to different slots. In example 4, the small numbers of missed packets (fewer than five) indicate what is most likely not a serious hardware problem.
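The reasoning used in Examples 2 through 4 can be summarized as a rough triage heuristic, sketched here in Python. The input format, the function name, and the returned strings are assumptions for illustration, as is the choice to treat any corrupted packet as significant; the "fewer than five" cutoff for missed packets comes from the text. The result is a starting point for investigation, not a diagnosis.

```python
MISSED_IGNORE_BELOW = 5  # fewer than five missed packets can be ignored


def localize_fault(msm, channel_stats):
    """Suggest where a backplane health check problem most likely lies.

    channel_stats: list of dicts with 'slot', 'missed', and 'corrupted'
    counts, one per MSM channel (hypothetical input format).
    """
    affected = {
        c["slot"]
        for c in channel_stats
        if c["corrupted"] > 0 or c["missed"] >= MISSED_IGNORE_BELOW
    }
    if not affected:
        return "no serious hardware problem indicated"
    if len(affected) == 1:
        # Errors confined to one slot point at that I/O module.
        return f"suspect I/O module in slot {next(iter(affected))}"
    # Errors spread across several slots point at the MSM itself.
    return f"suspect {msm}"
```

Errors on channels to slots 1, 4, and 7 therefore point at the MSM, errors only on channels to slot 7 point at that I/O module, and a few missed packets scattered across slots point at neither.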
  • Page 79: CPU Health Check

    System (CPU and Backplane) Health Check • If a health check checksum error message appears in the log, and the output of the show diagnostics command shows excessive backplane health check error counts, you can usually use those two sources of information to determine the location of the problem. •...
  • Page 80: Viewing CPU Health Check Diagnostic Results-Show Diagnostics Command

    Diagnostics NOTE Be aware that the slot information in the log message might be symptomatic of a problem occurring on another module in the system rather than on the indicated module. When you have observed log messages indicating missed or corrupted health check packets, use the show diagnostics command as the next source of information about health check failures.
  • Page 81 System (CPU and Backplane) Health Check • CPU health check failures might indicate a faulty transceiver on one of the MSMs, but might also indicate other I/O control bus failures. Always use log messages in conjunction with the output of command.
  • Page 82: Transceiver Diagnostics

    Diagnostics Transceiver Diagnostics The transceiver diagnostics test the integrity of the management bus transceivers used for communication between the ASICs in the Inferno chipset and the CPU subsystem. (See Figure 10.) These diagnostics write test patterns to specific ASIC registers, read the registers, then compare results, looking for errors in the communication path.
  • Page 83: System Impacts Of The Transceiver Diagnostics

    1 to 8. If you do not specify a value, the test uses the default of 3 errors. NOTE Extreme Networks recommends against changing the default transceiver test threshold value. The default value of 3 errors is adequate for most networks.
  • Page 84: Viewing Diagnostics Results

    Diagnostics Viewing Diagnostics Results Use the following commands to view information related to the transceiver diagnostic test: show log, show diagnostics, and show switch. Example Log Messages for Transceiver Diagnostic Failures • If the transceiver diagnostic test detects a failure, any of the following messages will appear in the log one time.
  • Page 85: Examples, Show Diagnostics Command

    Transceiver Diagnostics • CARD_HWFAIL_RR_SCNTRL_REG_TIMEOUT • CARD_HWFAIL_BLIZZARD_REGOP_TIMEOUT • CARD_HWFAIL_BLIZZARD_SER_MGMT_REG_TIMEOUT • CARD_HWFAIL_BLIZZARD_STAT_CTRL_REG_TIMEOUT • CARD_HWFAIL_TSUNAMI_REGOP_TIMEOUT • CARD_HWFAIL_TSUNAMI_SER_MGMT_REG_TIMEOUT • CARD_HWFAIL_TSUNAMI_STAT_CTRL_REG_TIMEOUT • CARD_HWFAIL_BLADE_STATUS_REG_TIMEOUT • CARD_HWFAIL_BLADE_CONTROL_REG_TIMEOUT • CARD_HWFAIL_VLAN_LKUP_REG_TIMEOUT • CARD_HWFAIL_DIAG_FAILED • CARD_HWFAIL_DIAG_PMS_FAILED • CARD_HWFAIL_TRANCEIVER_TEST_FAILED • CARD_HWFAIL_SYNC_TEST_FAILED Examples, show diagnostics Command This section provides two examples of the results from the show diagnostics command.
  • Page 86: Example-Show Switch Command

    Diagnostics Example—show diagnostics command (Alpine system). The following example of the show diagnostics command displays the results of the transceiver diagnostics for an Alpine system. Transceiver system health diag result Pass/Fail Counters are in HEX Slot CardType Cardstate Test Pass Fail Time_last_fail ----...
  • Page 87: Transceiver Diagnostic Result Analysis

    Transceiver Diagnostics License: Full L3 + Security SysHealth Check: Enabled. Alarm Level = Log Recovery Mode: None Transceiver Diag: Enabled. Failure action: sys-health-check Fdb-Scan Diag: Enabled. Failure action: sys-health-check System Watchdog: Enabled. Transceiver Diagnostic Result Analysis • If transceiver test error counters are incrementing, but there is no associated log message, the problem is probably a transient problem.
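Because the transceiver report from show diagnostics states that its Pass/Fail counters are in hex, checking whether a Fail counter is incrementing means comparing hex values, not decimal ones. The following is a minimal Python sketch under the assumption that two Fail values were copied by hand from two successive runs of the command; the function names and the sample values are hypothetical.

```python
def fail_counter_delta(earlier_hex, later_hex):
    """Return how much a hex Fail counter grew between two readings."""
    # int(text, 16) parses hexadecimal text such as "1f" or "0x1f".
    return int(later_hex, 16) - int(earlier_hex, 16)


def is_incrementing(earlier_hex, later_hex):
    """True when the later reading is larger than the earlier one."""
    return fail_counter_delta(earlier_hex, later_hex) > 0


# Hypothetical readings: "a" is 10 and "10" is 16, so the counter grew by 6.
print(fail_counter_delta("a", "10"))
print(is_incrementing("a", "10"))
```

Reading the same two values as decimal would give the same result here but the wrong magnitude in general, which is why the conversion is made explicit.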
  • Page 88: Fdb Scan

    Diagnostics FDB Scan The FDB scan diagnostic test addresses the possibility of hardware FDB memory issues where FDB hardware table entries do not match what was written to them by software. The test is a non-invasive test that scans the entire FDB RAM memory pool on all switch fabrics, compares existing software table entries against what is in the hardware table, and reports or otherwise acts on any discrepancies it detects.
  • Page 89: Related Commands

    FDB Scan The failure action that the FDB scan test performs depends on the sys-health-check command configuration. The command configuration options available under the system health check are described in “Health Check Functionality.” Related Commands
      enable fdb-scan [all | slot {{backplane} | <slot number> | msm-a | msm-b}]
      disable fdb-scan
      configure fdb-scan failure-action [log | sys-health-check]
      configure fdb-scan period <1-60>...
  • Page 90: Configuring The Fdb Scan Diagnostics

    Diagnostics Configuring the FDB Scan Diagnostics • To set the interval between FDB scans, use the following command: configure fdb-scan period <1-60> The interval is a number in the range from 1 to 60 seconds. The default is 30 seconds. We recommend a period of at least 15 seconds.
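The limits quoted for the FDB scan period (a range of 1 to 60 seconds, a default of 30, and a recommended minimum of 15) can be checked before the command is typed. This is a minimal Python sketch; the helper name and the returned advisory flag are hypothetical, and only the command text and the numeric limits come from the guide.

```python
DEFAULT_PERIOD = 30      # default interval, in seconds
RECOMMENDED_MIN = 15     # recommended lower bound, in seconds


def fdb_scan_period_command(seconds=DEFAULT_PERIOD):
    """Build the CLI line for an FDB scan period and flag short periods.

    Returns (command_text, below_recommended). Raises ValueError when the
    value falls outside the documented range of 1 to 60 seconds.
    """
    if not 1 <= seconds <= 60:
        raise ValueError("FDB scan period must be between 1 and 60 seconds")
    below_recommended = seconds < RECOMMENDED_MIN
    return f"configure fdb-scan period {seconds}", below_recommended


print(fdb_scan_period_command())
print(fdb_scan_period_command(10))
```

A period below the recommended minimum is still accepted, since the guide permits it; the flag only signals that the value is shorter than advised.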
  • Page 91: Viewing Diagnostics Results

    FDB Scan Viewing Diagnostics Results Use the following commands to view information related to the FDB Scan diagnostic test: show log, show diagnostics, show fdb remap, clear fdb remap, and show switch. Example Log Messages for FDB Scan Diagnostic Failures Look for the following types of messages in the log: slot entry FDB Scan: max number of remaps ( ) exceeded.
  • Page 92: Example Output From The Show Switch Command

    Diagnostics In the example output of the show diagnostics command, in those slots equipped with a module, a non-zero value in the “NumFail” column indicates that a problem has been detected with FDB memory. During the FDB scan, the test attempts to map an error location so that it will not be used. If that location is in use, and the entry cannot be removed safely, FDB scan marks it as suspect (see the description of the example log messages in “Example Log Messages for FDB Scan Diagnostic Failures”...
  • Page 93: Additional Diagnostics Tools

    The recommended ambient operating temperature for Extreme Networks switches is 32° to 104° F (0° to 40° C), but this range represents the absolute limits of the equipment. Whenever possible, the temperature should be kept at approximately 78°...
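The two forms of the operating range quoted above (32° to 104° F and 0° to 40° C) are the same limits in different units. The following minimal Python sketch shows the conversion and a range check; the function names are hypothetical, and the sample reading is illustrative only.

```python
MIN_F, MAX_F = 32, 104   # ambient operating limits from the guide, in °F


def fahrenheit_to_celsius(temp_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


def within_operating_range(temp_f):
    """True when a Fahrenheit reading lies inside the documented limits."""
    return MIN_F <= temp_f <= MAX_F


print(fahrenheit_to_celsius(MIN_F), fahrenheit_to_celsius(MAX_F))
print(within_operating_range(110))
```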
  • Page 94: System Impacts Of Temperature Logging

    To prevent the loss of important log messages, Extreme Networks recommends the use of an external syslog server. For more information about the use of an external syslog server, refer to “Syslog Servers”...
  • Page 95: Disabling Logging To Remote Syslog Server Targets

    Syslog Servers Disabling Logging to Remote Syslog Server Targets To disable logging to all remote syslog server targets, use this command: disable syslog NOTE This command disables logging to all remote syslog server targets, not to the switch targets. This setting is saved in FLASH and will be in effect upon boot up.
  • Page 96: Network Impact Of The Syslog Server Facility

    Additional Diagnostics Tools Network Impact of the Syslog Server Facility Network impact depends on the volume of log messages sent to the syslog server. Even under extreme conditions, the relative brevity of log messages means that a very large message volume should not adversely affect network throughput.
  • Page 97: Running Cable Diagnostics

    Cable Diagnostics Running Cable Diagnostics You can run the CDM tests manually at any time, or you can schedule them to be run automatically. Running CDM tests Manually. To run the tests manually, use this command: run diagnostics cable port [<portlist> | all] This command initiates the CDM to obtain cable diagnostics values for the specified physical ports of the system.
  • Page 98: Viewing And Interpreting Cdm Test Data

    Additional Diagnostics Tools The disable diagnostics cable command also purges the cable diagnostics values for the selected ports from the CDM data structures. Viewing and Interpreting CDM Test Data To display CDM test information currently stored in the CDM data structures, use this command: show diagnostics cable {ports {<portlist>...
  • Page 99 Cable Diagnostics Following is sample detailed diagnostic output from this command: ======================================================== Manual Diagnostics Collected @ Thu Jan 29 02:48:29 2004 ======================================================== Port Speed Avg Len Pair Fault Loc Skew Polarity Cable Pair-Swap Diagnostic (Mbps) (meters) (meters) (ns) Status Chan-AB Chan-CD Mode ----------------------------------------------------------------------------------------------...
  • Page 100 Additional Diagnostics Tools length. For example, a 2% error in the default value of the speed of wave propagation results in a two-meter error for a 100-meter cable. Cable Pair Information. Twisted pair conductors in the RJ-45 Ethernet cable are connected to pins of the PHY in the following pairings: 1-2, 3-6, 4-5, and 7-8.
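The length-error example and the pin pairings above can be worked through in a few lines. This is a minimal Python sketch that assumes the reported length scales in direct proportion to the assumed propagation speed, so a fractional error in the speed gives the same fractional error in the length; the function and constant names are hypothetical.

```python
# Twisted-pair pin pairings listed in the guide for the RJ-45 connector.
PIN_PAIRS = [(1, 2), (3, 6), (4, 5), (7, 8)]


def length_error_m(cable_length_m, speed_error_fraction):
    """Length error caused by a fractional error in propagation speed.

    Assumes reported length is proportional to the assumed speed.
    """
    return cable_length_m * speed_error_fraction


def pair_for_pin(pin):
    """Return the pin pair that contains the given RJ-45 pin number."""
    for pair in PIN_PAIRS:
        if pin in pair:
            return pair
    raise ValueError("RJ-45 pins are numbered 1 to 8")


# A 2% speed error on a 100-meter cable is about a two-meter length error.
print(length_error_m(100, 0.02))
print(pair_for_pin(6))
```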
  • Page 101: Cable Diagnostics For "E" Series Switches

    Cable Diagnostics Extended Cable Status Information (Gigabit link established). When the Gigabit link can be established, the CDM tests report additional status information on approximate cable length, pair skew, polarity swap, and pair swap. • Cable length—After link, the cable diagnostics use a non-TDR method to determine and report the approximate cable length between the near-end port and its far-end link partner.
  • Page 103: Contacting Extreme Technical Support

    If you have a network issue that you are unable to resolve, contact the nearest Extreme Networks TAC. The TAC will create a service request (SR) and manage all aspects of the service request until the question or issue that spawned the service request is resolved.
  • Page 104: Asia Tac

    For a detailed description of the Extreme Networks TAC program and its procedures, including service request information requirements and return materials authorization (RMA) information requirements, please refer to the Extreme Networks What You Need to Know TAC User Guide at this Web location: http://www.extremenetworks.com/services/wwtac/TacUserGuide.asp...
  • Page 105: What Information Should You Collect

    — Trend: Recurrent event? Frequency? Etc. — If the problem was resolved, what steps did you take to diagnose and resolve the problem? • Optional information (upon request from Extreme Networks TAC personnel) — System dump (CPU memory dump) • Additional CLI commands for information include: —...
  • Page 106: Diagnostic Troubleshooting

    FDB scan will mark it as suspect (suspect entries are marked with an “S”). Look at the output of the show fdb remap command. Address suspect entries by manually removing the entries and re-adding them. Consult Extreme Networks TAC if this is not possible. • In the output from the show diagnostics command, if transceiver test error counters are incrementing, it might indicate a transceiver problem.
  • Page 107: Extreme Networks' Recommendations

    Extreme Networks’ Recommendations Extreme Networks strongly recommends that you observe the process shown in Figure 11 and outlined in the steps that follow when dealing with checksum errors. Figure 11: Diagnostic Troubleshooting Process Customer experiences checksum errors on Inferno/Triumph products.
  • Page 108 Did the extended diagnostics (plus the packet memory scan) detect errors? • If no errors were detected, you should call the Extreme Networks TAC. The next action will be determined by the frequency with which the error occurs and other problem details.
  • Page 109: Using Memory Scanning To Screen I/O Modules

    Using Memory Scanning to Screen I/O Modules NOTE Memory scanning is available in ExtremeWare 6.2.2 and later releases, and applies only to “i” series Summit, Alpine, and BlackDiamond switches. To check modules supported by the memory scanning feature, you can screen existing or new modules without having to upgrade or certify new ExtremeWare software on your networks.
  • Page 111: Limited Operation Mode And Minimal Operation Mode

    Limited Operation Mode and Minimal Operation Mode This appendix describes two switch operational modes wherein switch behavior is restricted to protect the stability of the switch and network and to allow troubleshooting or corrective action. The two switch operational modes are limited operation mode and minimal operation mode. They result from different failure conditions, but respond to similar procedures for troubleshooting and correcting their respective failure conditions.
  • Page 112: Triggering Limited Operation Mode

    show tech CLI command to collect system information. 5 Send all of the information about the problem to Extreme Networks technical support. Minimal Operation Mode If the system reboots due to a failure that persists after the reboot, the system will reboot again when it detects the failure again, and will continue that behavior across an endless cycle of reboots—referred to...
  • Page 113: Bringing A Switch Out Of Minimal Operation Mode

    3 Use the reboot command to reboot the system. 4 Use the show tech CLI command to collect system information. 5 Send all of the information about the problem to Extreme Networks technical support.
  • Page 115: Reference Documents

    Other Documentation Resources Extreme Networks customer documentation is available at the following Web site: http://www.extremenetworks.com/services/documentation/ The customer documentation support includes: • ExtremeWare Software User Guide • ExtremeWare Command Reference Guide Use the user guide and the command reference guide to verify whether the configuration is correct.
  • Page 116 Use the release notes to check for known issues, supported limits, bug fixes from higher ExtremeWare versions, etc. (Release notes are available to all customers who have a service contract with Extreme Networks via eSupport. The release notes are provided by product, under the Software Downloads area of eSupport.)
  • Page 117 Index Symbols text, About This Guide CPU health check "e" series 13, 59, 101 packet type descriptors "hitless" MSM failover 21, 39 system health check test "i" series CPU subsystem "inferno" series backplane health check CPU health check test diagnostic processing (figure) active backplane (Alpine systems) CRC, Ethernet active tests...
  • Page 118 Index defined test routines error detection mechanisms Power-on self test. See POST fast-path checksum error fast path checksum errors Real Time Clock (RTC) defined reboot 40, 41 forwarding recovery action FastPOST maintenance mode MSM failover background scan process reboot 40, 41 scan diagnostic shutdown field notices...
  • Page 119 Index tBGTask 40, 41 tConsole tEdpTask 40, 41 tEsrpTask 40, 41 tExcTask tExtTask tLogTask tNetTask 40, 41 tShell tSwFault tSyslogTask systematic errors tasks root task tBGTask 40, 41 tConsole tEdpTask 40, 41 tEsrpTask 40, 41 tExcTask tExtTask tLogTask tNetTask 40, 41 tShell tSwFault tSyslogTask...
  • Page 121: Index Of Commands

    Index of Commands disable syslog 94, 95 disable system-watchdog abort diagnostics cable 96, 97 disable temperature-logging disable transceiver-test (Alpine) disable transceiver-test (BlackDiamond) clear fdb remap clear log clear log diag-status 112, 113 enable diagnostics cable configure diagnostics enable diagnostics cable port configure diagnostics cable enable fdb-scan 88, 89...
  • Page 122 Index of Commands show fdb remap 91, 106 show log 72, 84, 91, 93, 105, 112, 113 show packet-mem-scan-recovery-mode show port rxerrors show ports cable diagnostics show switch 43, 56, 84, 91, 105 show system-dump show tech 105, 112, 113 show version synchronize synchronize command...
