Compaq AlphaServer ES45 Service Manual

AlphaServer ES45
Service Guide
Order Number: EK-ES450-SV. A01
This manual is for service providers and self-maintenance
customers responsible for ES45 systems.
Compaq Computer Corporation

Summary of Contents for Compaq AlphaServer ES45

  • Page 2 Open Group in the United States and other countries. All other product names mentioned herein may be trademarks of their respective companies. Compaq shall not be liable for technical or editorial errors or omissions contained herein. The information in this document is provided “as is” without warranty of any kind and is subject to change without notice.
  • Page 3 Japanese Notice Canadian Notice This Class A digital apparatus meets all requirements of the Canadian Interference-Causing Equipment Regulations. Avis Canadien Cet appareil numérique de la classe A respecte toutes les exigences du Règlement sur le matériel brouilleur du Canada. European Union Notice Products with the CE Marking comply with both the EMC Directive (89/336/EEC) and the Low Voltage Directive (73/23/EEC) issued by the Commission of the European Community.
  • Page 5: Table Of Contents

    Hard Disk Drive Storage
    1.17 System Access
    1.18 Console Terminal
    Chapter 2 Troubleshooting
    Questions to Consider
    Diagnostic Tables
    Service Tools and Utilities
    2.3.1 Error Handling/Logging Tools (Compaq Analyze)
    2.3.2 Loopback Tests
    ...
  • Page 6
    2.4.3 Reviewing Results of the Q-Vet Run
    2.4.4 De-Installing Q-Vet
    Information Resources
    2.5.1 Compaq Service Tools CD
    2.5.2 ES45 Service HTML Help File
    2.5.3 Alpha Systems Firmware Updates
    2.5.4 Fail-Safe Loader
    2.5.5 ...
  • Page 7
    4.21 sys_exer
    4.22 test
    Chapter 5 Error Logs
    Error Log Analysis with Compaq Analyze
    5.1.1 WEB Enterprise Service (WEBES) Director
    5.1.2 Using Compaq Analyze
    5.1.3 Bit to Test
    Fault Detection and Reporting
    Machine Checks/Interrupts ...
  • Page 8
    Chapter 6 System Configuration and Setup
    System Consoles
    6.1.1 Selecting the Display Device
    6.1.2 Setting the Control Panel Message
    Displaying the Hardware Configuration
    Setting Environment Variables
    Setting Automatic Booting
    6.4.1 Setting the Operating System to Auto Start
    Changing the Default Boot Device ...
  • Page 9
    Chapter 8 FRU Removal and Replacement
    FRUs
    8.1.1 Power Cords
    8.1.2 FRU Locations
    8.1.3 Important Information Before Replacing FRUs
    Removing Enclosure Panels
    Accessing the System Chassis in a Cabinet
    Removing Covers from the System Chassis
    Power Supply ...
  • Page 10
    Appendix C DPR Address Layout
    Appendix D Registers
    Ibox Status Register (I_STAT)
    Memory Management Status Register (MM_STAT)
    Dcache Status Register (DC_STAT)
    Cbox Read Register
    Exception Address Register (EXC_ADDR)
    Interrupt Enable and Current Processor Mode Register (IER_CM)
    Interrupt Summary Register (ISUM) ...
  • Page 11 Examples
    3–1 Sample SROM Power-Up Display
    3–2 SRM Power-Up Display
    3–3 Sample Console Event Log
    3–4 Checksum Error and Fail-Safe Load
    4–1 buildfru
    4–2 more el
    4–3 clear_error
    4–4 deposit and examine
    4–5 exer ...
  • Page 12
    Compaq Analyze Initial Screen
    5–2 Problem Reports Screen
    5–3 Compaq Analyze Problem Report Details
    5–4 Compaq Analyze Problem Report Details (Continued)
    6–1 CPU Slot Locations (Pedestal/Rack)
    6–2 CPU Slot Locations (Tower)
    6–3 Stacked and Unstacked DIMMs ...
  • Page 13
    6–5 Memory Configuration (Tower)
    6–6 PCI Slot Locations (Pedestal/Rack)
    6–7 PCI Slot Voltages and Hose Numbers
    6–8 PCI Slot Locations (Tower)
    6–9 PCI Status LEDs
    6–10 Power Supply Locations
    7–1 Data Flow in Through Mode
    7–2 Data Flow in Bypass Mode ...
  • Page 14
    RMC and SPC Jumpers
    B–2 TIG/SROM Jumpers
    B–3 CSB Switchpack E16
    B–4 PCI Board Jumpers
    Tables
    Compaq AlphaServer ES45 Documentation
    1–1 Fan Descriptions
    2–1 Power Problems
    2–2 Problems Getting to Console Mode
    2–3 Problems Reported by the Console ...
  • Page 15
    8–1 FRU List
    8–2 Country-Specific Power Cords
    A–1 SRM Commands Used on ES45 Systems
    B–1 RMC/SPC Jumper Settings
    B–2 TIG/FSL Jumper Descriptions
    B–3 Firmware Function Table (FIR_FUNC)
    B–4 Clock Generator Settings
    B–5 PCI Board Jumper Descriptions
    C–1 DPR Address Layout
    D–1 ...
  • Page 16
    E–1 Information Needed to Isolate Failing DIMMs
    E–2 Determining the Real Failed Array for 4-Way Interleaving
    E–3 Determining the Real Failed Array for 2-Way Interleaving
    E–4 Description of DPR Locations 80, 82, 84, and 86
    E–5 Failing DIMM Lookup Table
    E–6 Syndrome to Data Check Bits Table ...
  • Page 17: Preface

    Preface
    Intended Audience
    This manual is for service providers and self-maintenance customers who are responsible for servicing ES45 systems.
    Document Structure
    This manual uses a structured documentation design. Topics are organized into small sections, usually consisting of two facing pages. Most topics begin with an abstract that provides an overview of the section, followed by an illustration or example.
  • Page 18: Compaq Alphaserver Es45 Documentation

    AG–RPJ5A–TS included)
    Loose Piece Items
    Basic Installation Card: EK–ES450–PD
    Rackmount Installation Guide: EK–ES450–RG
    Rackmount Installation Template: ES–ES450–TP
    Information on the Internet
    Visit the Compaq Web site at www.compaq.com for service tools and more information about the AlphaServer ES45 system.
  • Page 19: Chapter 1 System Overview

    Chapter 1 System Overview
    This chapter provides an overview of the system in these sections:
    • System Architecture
    • System Enclosures
    • System Chassis—Front View/Top View
    • System Chassis—Rear View
    • Hot Swap Module
    • I/O Ports and Slots
    • Control Panel
    • ...
  • Page 20: System Architecture

    System Architecture
    The system uses a switch-based interconnect system that maintains constant performance even as the number of transactions multiplies.
    Figure 1–1 System Block Diagram (labels: command, address, and control lines for each memory array; C-chip; control lines for D-chips; CAPbus; P-chip; 64-bit PCI; ...)
  • Page 21 This system is designed to fully exploit the potential of the Alpha EV68 CB chip by using a switch-based (or point-to-point) interconnect system. With a traditional bus design, the processors, memory, and I/O modules share the bus. As the number of bus users increases, the transactions interfere with one another, increasing latency and decreasing aggregate bandwidth.
  • Page 22: System Enclosures

    System Enclosures
    The ES45 family consists of a standalone tower, a pedestal with expanded storage capacity, and a cabinet.
    Figure 1–2 ES45 Systems (cabinet, pedestal, and tower)
  • Page 23 The ES45 system provides connectors for eight DIMMs on each of the memory motherboards (MMBs) and connectors for ten PCI options on the PCI backplane. The system comes with the following:
    • 1–4 CPUs
    • Up to 32 DIMMs (8 DIMMs on each MMB)
    • ...
  • Page 24: System Chassis-Front View/Top View

    System Chassis—Front View/Top View
    Figure 1–3 Components Top/Front View (Pedestal/Rackmount Orientation)
    Callouts: operator control panel, CD-ROM drive, removable media bays, floppy diskette drive, storage drive bays, fans, CPUs, memory, PCI cards
  • Page 25: System Chassis-Rear View

    System Chassis—Rear View
    Figure 1–4 Rear Components (Pedestal/Rackmount Orientation)
    Callouts: power supplies, PCI bulkhead, I/O ports, power harness access cover, speaker
  • Page 26: Hot Swap Module

    Hot Swap Module
    WARNING: Modules have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module.
    WARNING: To prevent injury, unplug the power cord from each power supply before installing components.
    CAUTION: Hot swap is not currently supported by the operating systems. Do not press switches on the hot swap module while the system is powered.
  • Page 27: Hot Swap Module

    Figure 1–5 Hot Swap Module (open and closed positions)
    Callouts: module release button, momentary hot swap power switch (not supported), communication connector, module release button connection
  • Page 28: I/O Ports And Slots

    I/O Ports and Slots
    Figure 1–6 Rear Connectors (pedestal/rack and tower)
  • Page 29 Rear Panel Connections
    • Modem port: Dedicated 9-pin port for connection by modem to remote management console.
    • COM2 serial port: Extra port to modem or any serial device.
    • Keyboard port: To PS/2-compatible keyboard.
    • Mouse port: To PS/2-compatible mouse.
    • COM1 MMJ-type serial port/terminal port: For connecting a console terminal.
  • Page 30: Control Panel

    Control Panel
    The control panel provides system controls and status indicators. The controls are the Power, Halt, and Reset buttons. A 16-character back-lit alphanumeric display indicates system state. The panel has two LEDs: a green Power OK indicator and an amber Halt indicator.
    Figure 1–7 Control Panel
    Control panel display.
  • Page 31: Halt In/Out

    Power LED (green). Lights when the power button is depressed and system power passes initial checks. Reset button. A momentary contact switch that restarts the system and reinitializes the console firmware. Power-up messages are displayed, and then the console prompt is displayed or the operating system boot messages are displayed, depending on how the startup sequence has been defined.
  • Page 32: System Motherboard

    System Motherboard
    The system motherboard is located on the floor of the system card cage. It has slots for the CPUs and memory motherboards (MMBs) and has the PCI backplane interconnect.
    Figure 1–8 Component and Connector Locations (labels: connector to PCI backplane, CPU2, MMB3, MMB2, ...)
  • Page 33 The system motherboard has the majority of the logic for the system, including:
    • CPU connectors
    • MMB connectors
    • Connector to PCI backplane
    • RMC jumpers
    • Fail-safe loader (FSL) jumpers
    • Vterm and Cterm regulators
    Figure 1–8 shows the location of components and connectors on the system motherboard.
  • Page 34: Cpu Card

    CPU Card
    An ES45 can have up to four CPU cards. The CPU card has an 8-Mbyte second-level cache and a DC-to-DC converter that provides the required voltage to the Alpha chip. Power-up diagnostics are stored in a flash SROM on the card.
    Figure 1–9 CPU Card
    The EV68 CB microprocessor is a superscalar CPU with out-of-order execution...
  • Page 35: Memory Architecture And Options

    1.10 Memory Architecture and Options
    The system has two 256-bit wide memory data buses, which can move large amounts of data simultaneously.
    Figure 1–10 Memory Architecture (labels: MMB3, MMB2, MMB1, MMB0; address arrays 0 & 2; address arrays 1 & 3; 256 data + 32 check bits on each bus; ...)
  • Page 36 Memory Architecture
    Memory throughput in this system is maximized by the following features:
    • Two independent, wide memory data buses
    • Very low memory latency (120 ns) and high bandwidth with a 125 MHz clock
    • ECC memory
    Each data bus is 256 bits (32 bytes) wide. The memory bus speed is 125 MHz. This yields 4 GB/sec bandwidth per bus (32 bytes x 125 MHz = 4 GB/sec).
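    The per-bus bandwidth figure is simply bus width multiplied by clock rate. A minimal Python sketch of that arithmetic follows, assuming one full-width transfer per clock and decimal units (1 GB = 10^9 bytes); the function name is illustrative and not part of any Compaq tool.

```python
def bus_bandwidth_gb_per_s(width_bits: int, clock_mhz: float) -> float:
    """Peak bandwidth in GB/s, assuming one full-width transfer per clock."""
    bytes_per_transfer = width_bits // 8        # 256 bits -> 32 bytes
    bytes_per_second = bytes_per_transfer * clock_mhz * 1e6
    return bytes_per_second / 1e9               # decimal gigabytes


# One 256-bit memory data bus clocked at 125 MHz, as described above.
per_bus = bus_bandwidth_gb_per_s(256, 125)
print(per_bus)  # 4.0
```

    The same helper can be reused for other width and clock combinations, with the caveat that real sustained throughput is lower than this peak figure.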
  • Page 37: Pci Backplane

    1.11 PCI Backplane
    The PCI backplane has four independent 64-bit PCI buses, one at 33 MHz and three at 66 MHz. The PCI buses support 3.3 volt and 5 volt options.
    Figure 1–11 I/O Control Logic
  • Page 38 PCI modules are either designed specifically for 5.0 or 3.3 volt slots, or are universal in design and can plug into either 3.3 or 5.0 volt slots. CAUTION: PCI modules designed specifically for 5.0 volts or 3.3 volts are keyed differently.
  • Page 39: Remote System Management Logic

    1.12 Remote System Management Logic The remote system management logic consists of two major elements: the system power controller (SPC), used to monitor and control system power supplies, regulators, and cooling apparatus; and the remote management console (RMC), which facilitates remote interrogation and control of the system.
  • Page 40 The error log information is written to the DPR by Compaq Analyze (see Chapter 5) and then written back to the EEPROMs by the RMC. This ensures that the error log is available on a FRU after power has been lost.
  • Page 41: System Power Controller (Spc)

    1.12.1 System Power Controller (SPC) The system power controller (SPC) is responsible for sequencing the turn-on/turn-off of all power supplies and regulators, monitoring all system power supplies and regulators, generating hardware resets to all logic elements, and generating power system status signals for use by other functional units within the system.
  • Page 42: Remote Management Console (Rmc)

    1.12.2 Remote Management Console (RMC) The remote management console (RMC) provides a mechanism for remotely monitoring a system and manipulating it on a very low level. It also provides access to the repository for all error information in the system. This provides the operator, either remotely or locally, with the ability to monitor the system (voltages, temperature, fans, error status) and manipulate it (reset, power on/off, halt) without any interaction on the part of the operating system.
  • Page 43: Power Supplies

    1.13 Power Supplies
    The power supplies provide power to components in the system box. The number of power supplies required depends on the system configuration.
    Figure 1–13 Power Supplies (tower and pedestal/rack)
  • Page 44 Two to three power supplies provide power to components in the system box. The system supports redundant power configurations to ensure continued system operation if a power supply fails. See Chapter 6 for power supply configurations. The power supplies select line voltage automatically (100V to 240V and 50 Hz or 60 Hz).
  • Page 45: Fans

    1.14 Fans
    The system has six hot-plug fans that provide front-to-back airflow.
    Figure 1–14 System Fans
  • Page 46: Fan Descriptions

    The system fans are shown in Figure 1–14 and described in Table 1–1. Table 1–1 Fan Descriptions Number Area Cooled Fan Failure Scenario PCI card cage Both fans are powered at all times. If one Removable media fan fails, all other system fans run at 4.5-in.
  • Page 47: Removable Media Storage

    1.15 Removable Media Storage
    The system box houses a CD-ROM drive and a high-density 3.5-inch floppy diskette drive and supports two additional 5.25-inch half-height drives or one additional full-height drive. The 5.25-inch half-height area has a divider that can be removed to mount one full-height 5.25-inch device.
  • Page 48: Hard Disk Drive Storage

    1.16 Hard Disk Drive Storage
    The system chassis can house up to two storage disk cages. The storage subsystem supports "hot pluggable" universal hard disk drives that can be replaced while the storage backplane is powered and operating. You can install six 1-inch universal hard drives in each storage disk cage. See Chapter 8 for information on replacing hard disk drives.
  • Page 49: System Access

    1.17 System Access
    At the time of delivery, the system keys are taped inside the small front door that provides access to the operator control panel and removable media devices.
    Figure 1–17 System Lock and Key (tower and pedestal)
  • Page 50 Both the tower and pedestal systems have a small front door through which the control panel and removable media devices are accessible. At the time of delivery, the system keys are taped inside this door. The tower front door has a lock that lets you secure access to the universal disk drives and to the rest of the system.
  • Page 51: Console Terminal

    1.18 Console Terminal
    The console terminal can be a serial (character cell) terminal connected to the COM1 or COM2 port or a VGA monitor connected to a VGA adapter. A VGA monitor requires a keyboard and mouse.
    Figure 1–18 Console Terminal Connections (Local) (tower and pedestal/rack)
  • Page 53: Chapter 2 Troubleshooting

    Chapter 2 Troubleshooting
    This chapter describes the starting points for diagnosing problems on ES45 systems. The chapter also provides information resources.
    • Questions to Consider
    • Diagnostic Tables
    • Service Tools and Utilities
    • Q-Vet Installation Verification
    • Information Resources
  • Page 54: Questions To Consider

    If you are unable to access the SRM console, enter the RMC CLI and issue commands to determine the hardware status. See Chapter 7. If the operating system has crashed and rebooted, the CCAT (Compaq Crash Analysis Tool), the Compaq Analyze service tools (to interpret error logs), the SRM crash command, and operating system exercisers can be used to diagnose system problems.
  • Page 55: Diagnostic Tables

    Diagnostic Tables
    System problems can be classified into the following five categories. Using these categories, you can quickly determine a starting point for diagnosis and eliminate the unlikely sources of the problem.
    1. Power problems (Table 2–1)
    2. No access to console mode (Table 2–2)
    3. ...
  • Page 56: Power Problems

    Table 2–1 Power Problems
    Symptom: System does not power on.
    Action:
    • Check error messages on the OCP.
    • Check that AC power is plugged in.
    • Check that the ambient room temperature is within environmental specifications (10–35° C, 50–95° F).
    • ...
  • Page 57: Problems Getting To Console Mode

    Table 2–2 Problems Getting to Console Mode
    Symptom: Power-up screen is not displayed at system console.
    Action: Note any error beep codes and observe the OCP display for a failure detected during self-tests. (Reference: Chapter 3)
    Action: Check keyboard and monitor connections. (Reference: Chapter 1)
  • Page 58: Problems Reported By The Console

    Table 2–3 Problems Reported by the Console
    Symptom: No SRM messages are displayed after the “jump to console” message.
    Action: Console firmware is corrupted. Load new firmware with fail-safe loader. (Reference: Chapter 3)
    Symptom: The system attempts to boot from the floppy drive after a...
    Action: The system automatically reverts to the fail-safe... (Reference: Chapter 3 and ...)
  • Page 59: Boot Problems

    Table 2–4 Boot Problems
    Symptom: System cannot find boot device.
    Action: Use the show config and show device commands to check the system configuration for the correct device parameters (node ID, device name, and so on). (Reference: Chapter 6)
    Action: Examine the auto_action, bootdef_dev, boot_osflags, and os_type environment variables.
  • Page 60: Errors Reported By The Operating System

    Action: If the problem is intermittent, ensure that Compaq Analyze has been installed and is running in background mode (GUI does not have to be running) to determine the defective FRU. (Reference: Chapter 5)
  • Page 61: Service Tools And Utilities

    The operating systems provide fault management error detection, handling, notification, and logging. The primary tool for error handling is Compaq Analyze, a fault analysis utility designed to analyze both single and multiple error/fault events. Compaq Analyze uses error/fault data sources other than the traditional binary error log.
  • Page 62: Remote Management Console (Rmc)

    For fatal errors, the operating systems save the contents of memory to a crash dump file. This file can be used to determine why the system crashed. CCAT, the Compaq Crash Analysis Tool, is the primary crash dump analysis tool for analyzing crash dumps on Alpha systems. CCAT compares the results of a crash dump with a set of rules.
  • Page 63: Q-Vet Installation Verification

    Q-Vet Installation Verification
    CAUTION: Customers are not authorized to access, download, or use Q-Vet. Q-Vet is for use by Compaq engineers to verify the system installation. Misuse of Q-Vet may result in loss of customer data.
    Q-Vet is the Qualification Verifier Exerciser Tool that is used by Compaq engineers to exercise systems under development.
  • Page 64 Swap or Pagefile Space The system must have adequate swap space (on Tru64 UNIX) or pagefile space (on OpenVMS) for proper Q-Vet operation. You can set this up either before or after Q-Vet installation. During initialization, Q-Vet will display a message indicating the minimum amount of swap/pagefile needed, if it determines that the system does not have enough.
  • Page 65: Installing Q-Vet

    2.4.1 Installing Q-Vet The procedures for installation of Q-Vet differ between operating systems. You must install Q-Vet on each partition in the system. Install and run Q-Vet from the SYSTEM account on VMS and the root account on UNIX. Remember to install Q-Vet in each partition. Tru64 UNIX 1.
  • Page 66 OpenVMS
    1. Delete any QVETAXPxxx.A or QVETAXPxxx.EXE file from the current directory.
    2. Copy the self-extracting kit image file (QVETAXPxxx.EXE) to the current directory.
    3. It is highly recommended, but not required, that you purge the system disk before installing Q-Vet. This will free up space that may be needed for pagefile expansion during the AUTOGEN phase.
  • Page 67: Running Q-Vet

    Running Q-Vet You must run Q-Vet on each partition in the system to verify the complete system. Compaq recommends that you review the Special Notices and the Testing Notes section of the Release Notes located at http://chump2.mro.cpqcorp.net/qvet/ before running Q-Vet.
  • Page 68 OpenVMS
    Graphical Interface
    1. From the Main Menu, select IVP, Load Script and select Long IVP (the IVP tests will then load into the Q-Vet process window).
    2. Click the Start All button to begin IVP testing.
    Command-Line
    $ vet /int=char
    Q-Vet_setup>...
  • Page 69: Reviewing Results Of The Q-Vet Run

    “Additional information may be available from Compaq Analyze”
    It is recommended that you run Compaq Analyze to review test results. The testing times (for use with Compaq Analyze) are printed to the Q-Vet run window and are available in the summary log.
  • Page 70: Installing Q-Vet

    2.4.4 De-Installing Q-Vet The procedures for de-installation of Q-Vet differ between operating systems. You must de-install Q-Vet from each partition in the system. Failure to do so may result in the loss of customer data at a later date if Q-Vet is misused.
  • Page 71: Information Resources

    Compaq Service Tools CD
    The Compaq Service Tools CD-ROM enables field engineers to upgrade customer systems with the latest version of software when the customer does not have access to Compaq Web pages. The Web site is: http://caspian1.zko.dec.com/service_tools/
    2.5.2 ES45 Service HTML Help File
    The information contained in this guide, including the FRU procedures and illustrations, is available in HTML Help format as part of the Maintenance Kit.
  • Page 72: Fail-Safe Loader

    • If you do not have a Web browser, you can download the files using anonymous ftp: http://gatekeeper.research.compaq.com/pub/Digital/Alpha/firmware/ Individual Alpha system firmware releases that occur between releases of the firmware CD are located in the interim directory: http://gatekeeper.research.compaq.com/pub/Digital/Alpha/firmware/interim/ 2.5.4...
  • Page 73: Supported Options

    The information includes firmware updates, the latest configuration utilities, software patches, lists of supported options, and more. http://www.compaq.com/alphaserver/es40/es40.html
    2.5.8 Supported Options
    A list of options supported on the system is available on the Internet: http://www.compaq.com/alphaserver/es40/
  • Page 75: Chapter 3 Power-Up Diagnostics And Display

    Chapter 3 Power-Up Diagnostics and Display
    This chapter describes the power-up process and RMC, SROM, and SRM power-up diagnostics. The following topics are covered:
    • Overview of Power-Up Diagnostics
    • System Power-Up Sequence
    • Power-Up Displays
    • Power-Up Error Messages
    • ...
  • Page 76: Overview Of Power-Up Diagnostics

    Overview of Power-Up Diagnostics The power-up process begins with the power-on of the power supplies. After the AC and DC power-up sequences are completed, the remote management console (RMC) reads EEROM information and deposits it into the DPR. The SROM minimally tests the CPUs, initializes and tests backup cache, and minimally tests memory.
  • Page 77: System Power-Up Sequence

    System Power-Up Sequence The power-up sequence is described below and illustrated in Figure 3–1. 1. When the power cord is plugged into the wall outlet, 5V auxiliary AC voltage is enabled. The 5 V AUX LEDs on the power supplies are lit, and the system power controller and RMC are initialized.
  • Page 78: Power-Up Sequence

    Figure 3–1 Power-Up Sequence
    Flowchart steps: Apply AC power; 5 V AUX LEDs on PS are lit; OCP Power button = IN; Turn on power supplies; Turn on CPU converters; Turn on VTERM regulators; Set all CPU_DCOK = True; Set SYS_DC_OK = True; Set SYS_RESET = False; Set CPU(n)_RESET = False; Set CPU(n)_RESET = False; ...
  • Page 79 Figure 3–1 Power-Up Sequence (Continued)
    Flowchart steps: SROM Power-Up; Init EV68; Test PCI; Determine Config Good; Reload Using Flash SROM; Init EV68; Test PCI; Release CPUs; B-Cache Tests; Memory Config and Tests; Load SRM
  • Page 80: Power-Up Displays

    Power-Up Displays Power-up information is displayed on the operator control panel and on the console terminal startup screen. Messages sent from the RMC and SROM programs are displayed first, followed by messages from the SRM console. 3.3.1 SROM Power-Up Display The following example describes the SROM power-up sequence and shows the SROM power-up messages and corresponding OCP messages.
  • Page 81 SROM Power-Up Sequence When the system powers up, the SROM code is loaded into the I-cache (instruction cache) on the first available CPU, which becomes the primary CPU. The order of precedence is CPU0, CPU1, and so on. The primary CPU attempts to access the PCI bus.
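    The primary CPU rule described here (first available CPU, in the order CPU0, CPU1, and so on) can be illustrated with a short Python sketch. This is only a model of the selection order under the assumption that availability is known per slot; the function name and the list-of-booleans representation are hypothetical and do not reflect the actual SROM code.

```python
from typing import Optional, Sequence


def select_primary_cpu(available: Sequence[bool]) -> Optional[int]:
    """Return the lowest-numbered available CPU slot, or None if there is none.

    `available[i]` is True when CPU i is present and usable (an assumed
    representation for illustration only).
    """
    for slot, usable in enumerate(available):
        if usable:
            return slot  # order of precedence: CPU0, CPU1, ...
    return None


# CPU0 is missing or failed, so CPU1 becomes the primary CPU.
print(select_primary_cpu([False, True, True, False]))  # 1
```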
  • Page 82 Example 3–1 Sample SROM Power-Up Display (Continued) Bcache ECC data tests in progress Size Mem Bcache TAG lines tests in progress Memory sizing in progress Memory configuration in progress Testing AAR3 Memory data test in progress Memory address test in progress Memory pattern test in progress Testing AAR2 Memory data test in progress...
  • Page 83 SROM Power-Up Sequence The primary CPU initiates all memory tests. The memory is tested for address and data errors for the first 32 MB of memory in each array. It also initializes all the “sized” memory in the system. If a memory failure occurs, an error is reported. An untested memory array is assigned to address 0 and the failed memory array is de-assigned.
  • Page 84: Srm Console Power-Up Display

    3.3.2 SRM Console Power-Up Display When SROM power-up is complete, the primary CPU transfers control to the SRM console program. The console program continues the system initialization. Failures are reported to the console terminal through the power-up screen and a console event log. The following section shows the messages that are displayed once the SROM has transferred control to the SRM console.
  • Page 85 Example 3–2 SRM Power-Up Display (Continued) Hose 0 - PCI bus running at 33Mhz entering idle loop probing hose 0, PCI probing PCI-to-ISA bridge, bus 1 probing PCI-to-PCI bridge, bus 2 bus 0, slot 8 -- pka -- NCR 53C895 bus 0, slot 9 -- eia -- DE600-AA bus 2, slot 0 -- pkb -- NCR 53C875 bus 2, slot 1 -- pkc -- NCR 53C875...
  • Page 86 1024Mb 0000000240000000 2-Way 10240 MB of System Memory AlphaServer ES45 Console V5.9-9, built on June 2001 at 17:09:49 The console is started on the secondary CPUs. The example shows a four- processor system. Various diagnostics are performed. The console terminal displays the SRM console banner and the prompt, Pnn>>>.
  • Page 87: Srm Console Event Log

    3.3.3 SRM Console Event Log
    The SRM console event log helps you troubleshoot problems that do not prevent the system from coming up to the SRM console. The console event log consists of status messages received during power-up self-tests.
    Example 3–3 Sample Console Event Log
    >>>...
  • Page 88: Power-Up Error Messages

    Power-Up Error Messages Error messages at power-up may be displayed by the RMC, SROM, and SRM. A few SROM messages are announced by beep codes. 3.4.1 SROM Messages with Beep Codes Table 3–1 Error Beep Codes Beep Associated Code Messages Meaning Jump to SROM code has completed execution.
  • Page 89 Table 3–1 Error Beep Codes (Continued)
    Beep Code: 1-2-4
    Associated Messages: BC error, CPU error, BC bad
    Meaning: Backup cache (B-cache) error. Indicates a bad CPU.
    Beep Code: 1-3-3
    Associated Messages: No mem
    Meaning: No usable memory detected. Some memory DIMMs may not be properly seated or some DIMM sets may be faulty.
  • Page 90: Checksum Error

    3.4.2 Checksum Error
    If no messages are displayed on the operator control panel after the Jump to Console message, the console firmware is corrupted. When the system detects the error, it attempts to load a utility called the fail-safe loader (FSL) so that you can load new console firmware images. A sequence similar to the one in Example 3–4 occurs.
  • Page 91 The sequence shown in Example 3–4 is as follows: The system detects the checksum error and writes a message to the console screen. The system attempts to automatically load the FSL program from the floppy drive. As the FSL program is initialized, messages similar to the console power-up messages are displayed.
  • Page 92: No Mem Error

    3.4.3 No MEM Error If the SROM code cannot find any usable memory, a 1-3-3 beep code is issued (one beep, a pause, a burst of three beeps, a pause, and another burst of three beeps), and the message “No MEM” is displayed on the OCP.
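    A beep code such as 1-3-3 is a sequence of beep bursts separated by pauses. The Python sketch below shows one way to split a code into bursts and look up its meaning. It includes only the two codes whose meanings appear in this document (1-2-4 and 1-3-3); the dictionary and function names are illustrative assumptions, not part of the system firmware.

```python
from typing import List

# Only the codes documented in Table 3-1 of this text; other codes are omitted.
BEEP_CODES = {
    "1-2-4": "Backup cache (B-cache) error; indicates a bad CPU",
    "1-3-3": "No usable memory detected",
}


def beep_bursts(code: str) -> List[int]:
    """Split a code like '1-3-3' into beeps per burst: [1, 3, 3]."""
    return [int(part) for part in code.split("-")]


def describe_beep_code(code: str) -> str:
    """Return the documented meaning, or a placeholder for unlisted codes."""
    return BEEP_CODES.get(code, "not listed in this sketch")


print(beep_bursts("1-3-3"))         # [1, 3, 3]
print(describe_beep_code("1-3-3"))  # No usable memory detected
```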
  • Page 93: Rmc Error Messages

    3.4.4 RMC Error Messages Table 3–2 lists the fatal error messages that could potentially be displayed on the OCP by the remote management console during power-up. Most fatal error messages prevent the system from completing power-up. The warning messages listed in Table 3–3 require prompt attention but might not prevent the system from completing power-up or booting the operating system.
  • Page 94: Rmc Warning Messages

    Table 3–2 RMC Fatal Error Messages (Continued)
    Message: Bad CPU ROM data
    Meaning: Invalid data in EEROM on the CPU.
    Message: 2.5V bulk failed
    Meaning: 2.5V regulator failed
    Message: AGP config error
    Meaning: Power consumption requirement for AGP failed
    Table 3–3 RMC Warning Messages
    Message: PSn failed...
  • Page 95 Table 3–3 RMC Warning Messages (Continued) Message Meaning CPUn VCORE warn CPU core voltage over or under threshold. “n” is 0, 1, 2, or 3. CPUn VIO warn I/O voltage on CPU over or under threshold. “n” is 0, 1, 2, or 3. CPUn VCACHE warn Cache voltage on CPU over or under threshold.
  • Page 96: Srom Error Messages

    3.4.5 SROM Error Messages The SROM power-up identifies errors that may or may not prevent the system from coming up to the console. It is possible that these errors may prevent the system from successfully booting the operating system. Errors encountered during SROM power-up are displayed on the OCP.
  • Page 97 Table 3–4 SROM Error Messages (Continued) Code SROM Message OCP Message Floppy driver error Flpy Err No real-time clock (TOY) TOY Err Memory data path error Mem Err Memory address line error Mem Err Memory pattern error Mem Err Memory pattern ECC error Mem Err Configuration error on CPU #3 CfgERR 3...
  • Page 98: Forcing A Fail-Safe Floppy Load

    Forcing a Fail-Safe Floppy Load Under some circumstances, you may need to force the activation of the FSL. For example, if you install a system motherboard that has an older version of the firmware than your system requires, you may not be able to bring up the SRM console.
  • Page 99 1. Turn off the system. Unplug the power cord from each power supply and wait for the 5V AUX indicators to extinguish. 2. Remove enclosure covers (tower and pedestal) or the front bezel (rackmount) to access the system chassis. See Chapter 8 for illustrations. 3.
  • Page 100: Updating The Rmc

    Updating the RMC Under certain circumstances, the RMC will not function. If the problem is caused by corrupted RMC flash ROM, you need to update RMC firmware. The RMC will not function if: • No AC power is provided to any of the power supplies. •...
  • Page 101 You can update the remote management console firmware from flash ROM using the LFU. 1. Load the update medium. 2. At the UPD> prompt, exit from the update utility, and answer y to the manual update prompt. Enter update RMC to update the firmware. UPD>...
  • Page 103: Chapter 4 Srm Console Diagnostics

    Chapter 4 SRM Console Diagnostics This chapter describes troubleshooting with the SRM console. The SRM console firmware contains ROM-based diagnostics that allow you to run system-specific or device-specific exercisers. The exercisers run concurrently to provide maximum bus interaction between the console drivers and the target devices.
  • Page 104: Diagnostic Command Summary

    Diagnostic Command Summary Diagnostic commands are used to test the system and help diagnose failures. Table 4–1 gives a summary of the SRM diagnostic commands and related commands. See Chapter 6 for a list of SRM environment variables, and see Appendix A for a list of SRM commands most commonly used for the ES45 system.
  • Page 105 Table 4–1 Summary of Diagnostic and Related Commands (Continued)
    Command | Function
    grep | Searches for “regular expressions” (specific strings of characters) and prints any lines containing occurrences of the strings.
    hd | Dumps the contents of a file (byte stream) in hexadecimal and ASCII.
    hose_x_default_ | Controls the default PCI bus speed for the specified hose when no PCI devices are present.
  • Page 106 Table 4–1 Summary of Diagnostic and Related Commands (Continued) Command Function php_button_test Tests the attention switch of each hot-plug slot on a specified I/O hose. The user is prompted to press the attention switch for each slot that has a blinking green LED.
  • Page 107 Table 4–1 Summary of Diagnostic and Related Commands (Continued) Command Function test Verifies the configuration of the devices in the system. test -lb Runs loopback tests for the COM2 serial port and the parallel port in addition to verifying the configuration of devices. SRM Console Diagnostics 4-5...
  • Page 108 Field Service. Example 4–1 buildfru P00>>> buildfru smb0.mmb0.J3 54-24941-EA NI90200100 P00>>> buildfru smb0.cpu0 30-30158-05.AX05 NI94060554 Compaq P00>>> buildfru -s smb0.mmb0.J3 80 45 P00>>> buildfru -s smb0.mmb0.J3 80 47 46 45 44 43 42 41 Building of the FRU descriptor on a DIMM, passing a part number and a...
  • Page 109 NOTE: Be sure to enter the FRU information carefully. If you enter incorrect information, the callout used by Compaq Analyze will not be accurate. Three areas of the EEPROM can be initialized: the FRU generic data, the FRU specific data, and the system specific data. Each area has its own checksum, which is recalculated any time that segment of the EEPROM is written.
  • Page 110 The ES45 FRU assembly hierarchy has three levels. The FRU types from the top to the bottom of the hierarchy are as follows: Level FRU Type Meaning First Level System motherboard I/O connector module (junk I/O) Operator control panel PWR (0–2) Power supplies Fans Second Level...
  • Page 111 FRU. This ASCII string may be up to 10 characters (extra characters are truncated). This field is optional, unless <alias> is specified. <other> The FRU's Compaq alias number, if one exists. This ASCII string may be up to 16 characters (extras are truncated). This field is optional. <offset>...
  • Page 112 cat el and more el The cat el and more el commands display the contents of the console event log. In Example 4–2, the console reports that CPU 1 did not power up and fans 1 and 2 failed. Example 4–2 more el >>>...
  • Page 113 clear_error The clear_error command clears errors logged in the FRU EEPROMs as reported by the show error command. Example 4–3 clear_error P00>>> clear_error smb0 P00>>> P00>>> clear_error all P00>>> Clears all errors logged in the FRU EEPROM on the system motherboard (SMB0).
  • Page 114: Crash

    crash The SRM crash command forces a crash dump to the selected device for UNIX and OpenVMS systems. P00>>> crash CPU 0 restarting DUMP: 19837638 blocks available for dumping. DUMP: 118178 wanted for a partial compressed dump. DUMP: Allowing 2060017 of the 2064113 available on 0x800001 device string for dump = SCSI 1 1 0 0 0 0 0.
  • Page 115: Deposit And Examine

    deposit and examine The deposit command writes data to the specified address of a memory location, register, or device. The examine command displays the contents of a memory location, register, or a device. Example 4–4 deposit and examine deposit P00>>> dep -b -n 1ff pmem:0 0 P00>>>...
  • Page 116 Deposit The deposit command stores data in the location specified. If no options are given, the system uses the options from the preceding deposit command. If the specified value is too large to fit in the data size listed, the console ignores the command and issues an error.
  • Page 117 Defines data size as byte. Defines data size as word. -l (default) Defines data size as longword. Defines data size as quadword. Defines data size as octaword. Defines data size as hexword. Instruction decode (examine command only) The number of consecutive locations to modify. -n value The address increment size.
  • Page 118 The program counter. The address space is set to GPR. The location immediately following the last location referenced in a deposit or examine command. For physical and virtual memory, the referenced location is the last location plus the size of the reference (1 for byte, 2 for word, 4 for longword).
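
The "next location" rule for physical and virtual memory is simple address arithmetic. The sketch below is illustrative only: the function name `next_location` and the size table are assumptions, with the byte, word, and longword sizes taken from the text and the quadword, octaword, and hexword sizes filled in from the standard Alpha data sizes.

```python
# Data sizes in bytes. Byte, word, and longword come from the text above;
# quadword, octaword, and hexword are the standard Alpha sizes.
DATA_SIZE_BYTES = {
    "byte": 1,
    "word": 2,
    "longword": 4,
    "quadword": 8,
    "octaword": 16,
    "hexword": 32,
}


def next_location(last_address: int, data_size: str = "longword") -> int:
    """Return the address that follows the last referenced location.

    The result is the last address plus the size of the reference.
    Longword is the default data size, as in the console.
    """
    return last_address + DATA_SIZE_BYTES[data_size]


if __name__ == "__main__":
    print(hex(next_location(0x1000)))          # longword reference
    print(hex(next_location(0x1FF, "byte")))   # byte reference
```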
  • Page 119: Exer

    exer The exer command exercises one or more devices by performing specified read, write, and compare operations. Typically exer is run from the built-in console script. Advanced users may want to use the specific options described here. Note that running exer on disks can be destructive.
  • Page 120 P00>>> ls -l dk*.* r--- dka0.0.0.0.0 P00>>> exer dk*.* -bc 10 -sec 20 -m -a 'r' dka0.0.0.0.0 exer completed packet elapsed idle 8192 3325 27238400 1360288 P00>>> exer -eb 64 -bc 4 -a '?w-Rc' dka0 A destructive write test over block numbers 0 through 100 on disk dka0. The packet size is 2048 bytes.
  • Page 121 6. From the current block address on the disk, read a packet into buffer2. 7. Compare buffer1 with buffer2 and report any discrepancies. 8. Repeat the above steps until each block on the disk has been written once and read twice. You can tailor the behavior of exer by using options to specify the following: •...
  • Page 122 the size of device. The default is 1. -bs <block_size> Specifies the block size (hex) in bytes. The default is 200 (hex). -bc <block_per_io> Specifies the number of blocks (hex) per I/O. On devices without length (tape), use the specified packet size or default to 2048.
  • Page 123 • Seek to a random block offset within the -a <action_string> specified range of blocks. exer calls the program, (continued) random, to “deal” each of a set of numbers once. exer chooses a set that is a power of two and is greater than or equal to the block range.
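
Two pieces of exer arithmetic are easy to get wrong at the console: the numeric options are hexadecimal, and the random-seek set is rounded up to a power of two. The sketch below works through both. It assumes that the packet size is the number of blocks per I/O (-bc) multiplied by the block size (-bs), which is consistent with the examples above (-bc 4 gives 2048 bytes, -bc 10 gives 8192 bytes) but is not stated outright. All function names are hypothetical.

```python
def parse_hex_arg(text: str) -> int:
    """Convert an exer numeric option, given in hexadecimal, to an integer."""
    return int(text, 16)


def packet_size_bytes(blocks_per_io: str, block_size: str = "200") -> int:
    """Bytes moved per I/O, assuming blocks per I/O (-bc) times block size (-bs).

    Both arguments are hexadecimal strings; the default block size of
    200 (hex) is 512 bytes.
    """
    return parse_hex_arg(blocks_per_io) * parse_hex_arg(block_size)


def random_set_size(block_range: int) -> int:
    """Smallest power of two that is greater than or equal to block_range."""
    if block_range < 1:
        raise ValueError("block range must be at least 1")
    return 1 << (block_range - 1).bit_length()


if __name__ == "__main__":
    print(parse_hex_arg("64"))        # -eb 64 is block 100 in decimal
    print(packet_size_bytes("4"))     # -bc 4 with the default block size
    print(packet_size_bytes("10"))    # -bc 10 is 16 blocks per I/O
    print(random_set_size(101))       # blocks 0 through 100 inclusive
```

Reading -eb 64 as decimal 64 rather than 100 is the most likely mistake when sizing a destructive test, so converting the arguments explicitly before running exer on a disk is worthwhile.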
  • Page 124 floppy_write The floppy_write script runs a write test on the floppy drive to determine whether or not you can write on the diskette. Use this script if a customer is unable to write data to the floppy. This is a destructive test, so use a blank floppy.
  • Page 125: Grep

    grep The grep command is very similar to the UNIX grep command. It allows you to search for “regular expressions”—specific strings of characters—and prints any lines containing occurrences of the strings. Using grep is similar to using wildcards. Example 4–7 grep P00>>>...
  • Page 126 Syntax grep ( [-{c|i|n|v}] [-f <file>] [<expression>] [<file>...] ) Arguments <expression> Specifies the target regular expression. If any regular expression metacharacters are present, the expression should be enclosed with quotes to avoid interpretation by the shell. <file>... Specifies the files to be searched. If none are present, then standard input is searched.
  • Page 127 4.10 hd The hd command dumps the contents of a file (byte stream) in hexadecimal and ASCII. Example 4–8 hd P00>>> hd -eb 0 dpr:2b00 block 0 00000000 48 45 4C 4C 4F FF FF FF FF FF FF FF FF FF FF FF HELLO...
  • Page 128 Example 4–8 shows a hex dump to DPR location 2b00, ending at block 0. Syntax hd [-{byte|word|long|quad}] [-{sb|eb} <n>] <file>[:<offset>]. Arguments <file>[:<offset>] Specifies the file (byte stream) to be displayed. Options -byte Print out data in byte sizes -word Print out data by word -long Print out data by longword Print out data by quadword...
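
For comparing a console dump against data captured elsewhere, the same layout can be approximated offline. The following Python sketch formats a byte string as an offset, sixteen hexadecimal byte values, and an ASCII column. It is an approximation of the layout in Example 4–8, not the console's implementation; in particular, printing non-printable bytes as dots is an assumption, and the name `hex_dump` is hypothetical.

```python
from typing import List


def hex_dump(data: bytes, width: int = 16) -> List[str]:
    """Format bytes as rows of offset, hex byte values, and ASCII text."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02X}" for byte in chunk)
        # Pad a short final row so the ASCII column stays aligned.
        hex_part = hex_part.ljust(width * 3 - 1)
        # Show printable ASCII as-is and everything else as a dot (assumed).
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x} {hex_part} {ascii_part}")
    return lines


if __name__ == "__main__":
    for line in hex_dump(b"HELLO" + b"\xff" * 11):
        print(line)
```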
  • Page 129: Info

    4.11 info The info command displays registers and data structures. You can enter the command by itself or followed by a number (0–8). If you do not specify a number, a list of selections is displayed and you are prompted to enter a selection.
  • Page 130 2 see the Galaxy console and Alpha Systems V5.0 FRU configuration Tree Specification. • info 3 see the Titan Chipset Engineering Specification. • info 6 and 7 see the AlphaServer ES45 Platform Fault Management Specification. 4-28 ES45 Service Guide...
  • Page 131 Example 4–10 shows an info 1 display. Example 4–10 info 1 P00>>> info 1 pte 000000003FFA8000 0000000100001101 va 0000000010000000 pa 0000000000002000 pte 000000003FFA8008 0000000200001101 va 0000000010002000 pa 0000000000004000 pte 000000003FFA8010 0000000300001101 va 0000000010004000 pa 0000000000006000 pte 000000003FFA8018 0000000400001101 va 0000000010006000 pa 0000000000008000 pte 000000003FFA8020 0000000500001101 va 0000000010008000...
  • Page 132 Example 4–11 shows an info 2 display. Example 4–11 info 2 P00>>> info 2 GCT_ROOT_NODE GCT_NODE 21e000 Type Subtype Hd_extension Size c000 Rev_major Rev_minor 0000000000000000 node_flags saved_owner affinity parent child fw_usage Root->lock ffffffff Root->transient_level Root->current_level Root->console_req 200000 Root->min_alloc 100000 Root->min_align 100000 Root->base_alloc 2000000...
  • Page 133 Example 4–12 shows an info 3 display. Example 4–12 info 3 P00>>> info 0. HWRPB MEMDSC 1. Console PTE 2. GCT/FRU 5 3. Dump System CSRs 4. IMPURE area (abbreviated) 5. IMPURE area (full) 6. LOGOUT area 7. Dump Error Log 8.
  • Page 134 SERREN 000000000000000E 0440 GPERROR 0000000000400000 0500 GPERREN 00000000000007F6 0540 SCTL 0000000002831611 0700 AWSBA0 0000000000800000 : 0000 AWSBA1 0000000080000001 : 0040 AWSBA2 0000000000000000 : 0080 AWSBA3 0000000000000002 : 10c0 AWSM0 0000000000700000 : 1100 AWSM1 000000003FF00000 : 1140 AWSM2 0000000000000000 : 1180 AWSM3 0000000000000000 : 11c0 ATBA0...
  • Page 135 Example 4–13 shows an info 4 display. Example 4–13 info 4 P00>>> info 4 cpu00 cpu01 cpu02 cpu03 per_cpu impure area 00004200 00004800 00004e00 00005400 cns$flag 00000001 00000001 00000001 00000001 : 0000 cns$flag+4 00000000 00000000 00000000 00000000 : 0004 cns$hlt 00000000 00000000 00000000 00000000 : 0008 cns$hlt+4 00000000 00000000 00000000 00000000 : 000c...
  • Page 136 Example 4–14 shows an info 5 display. Example 4–14 info 5 P00>>> info 5 cpu00 cpu01 cpu02 cpu03 per_cpu impure area 00004200 00004800 00004e00 00005400 cns$flag 00000001 00000001 00000001 00000001 : 0000 cns$flag+4 00000000 00000000 00000000 00000000 : 0004 cns$hlt 00000000 00000000 00000000 00000000 : 0008 cns$hlt+4 00000000 00000000 00000000 00000000 : 000c...
  • Page 137 cns$shadow23+4 00000000 00000000 00000000 00000000 : 0314 cns$fpcr 00000000 00000000 00000000 00000000 : 0318 cns$fpcr+4 8ff00000 8ff00000 8ff00000 8ff00000 : 031c cns$va ffffffec fe00385f fe00385f fe00385f : 0320 cns$va+4 ffffffff 00000801 00000801 00000801 : 0324 cns$va_ctl 00000000 00000000 00000000 00000000 : 0328 cns$va_ctl+4 00000000 00000000 00000000 00000000 : 032c cns$exc_addr...
  • Page 138 Example 4–15 shows an info 6 display. Example 4–15 info 6 P00>>> info 6 cpu00 per_cpu logout area 00006000 mchk_crd__flag_frame 00000000 : 0000 mchk_crd__flag_frame+4 00000000 : 0004 mchk_crd__offsets 00000000 : 0008 mchk_crd__offsets+4 00000000 : 000c mchk_crd__mchk_code 00000000 : 0010 mchk_crd__mchk_code+4 00000000 : 0014 mchk_crd__i_stat 00000000 : 0018...
  • Page 139 mchk__dc_stat+4 00000000 : 00d4 mchk__c_addr 00000000 : 00d8 mchk__c_addr+4 00000000 : 00dc mchk__dc1_syndrome 00000000 : 00e0 mchk__dc1_syndrome+4 00000000 : 00e4 mchk__dc0_syndrome 00000000 : 00e8 mchk__dc0_syndrome+4 00000000 : 00ec mchk__c_stat 00000000 : 00f0 mchk__c_stat+4 00000000 : 00f4 mchk__c_sts 00000000 : 00f8 mchk__c_sts+4 00000000 : 00fc mchk__mm_stat...
  • Page 140 Example 4–16 shows an info 7 display. Example 4–16 info 7 P00>>> info 7 Number of Errors Saved = 3 Error 1 0000 : 0001000400050018 Console Uncorrectable Error Frame Header 0008 : 0000300a190f1324 OCT 25 15:19:36 0010 : 0000000300000170 0000 : 00010001000c0108 Processor Machine Check Frame 0008 : 0000000000000000 CPU ID...
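
In the frame header of this example, the quadword at offset 0008 (0000300a190f1324) is printed next to the timestamp OCT 25 15:19:36. Reading the quadword one byte at a time from the low-order end gives 36, 19, 15, 25, and 10 in decimal, which lines up with seconds, minutes, hours, day, and month. The Python sketch below decodes a quadword on that basis. The byte layout is inferred from this single example rather than taken from a specification, so treat it as an assumption; the sixth byte (30 hex here) is left uninterpreted, and the name `decode_timestamp` is hypothetical.

```python
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def decode_timestamp(quadword: int) -> str:
    """Decode the timestamp quadword of an error frame header.

    Assumed layout, inferred from one example: byte 0 is seconds,
    byte 1 minutes, byte 2 hours, byte 3 day, and byte 4 month,
    each stored as a plain binary value.
    """
    second = quadword & 0xFF
    minute = (quadword >> 8) & 0xFF
    hour = (quadword >> 16) & 0xFF
    day = (quadword >> 24) & 0xFF
    month = (quadword >> 32) & 0xFF
    if not 1 <= month <= 12:
        raise ValueError(f"month field out of range: {month}")
    return f"{MONTHS[month - 1]} {day} {hour:02d}:{minute:02d}:{second:02d}"


if __name__ == "__main__":
    print(decode_timestamp(0x0000300A190F1324))
```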
  • Page 141 0000 : 00010004000c0010 Clipper DPR Extended Memory Frame 0008 : DPR AAR0 Config 0009 : DPR AAR0 Size 000a : DPR AAR1 Config 000b : DPR AAR1 Size 000c : DPR AAR2 Config 000d : DPR AAR2 Size 000e : DPR AAR3 Config 000f : DPR AAR3 Size...
  • Page 142 Example 4–17 shows an info 8. Example 4–17 info 8 P00>>> info 8 0. HWRPB MEMDSC 1. Console PTE 2. GCT/FRU 5 3. Dump System CSRs 4. IMPURE area (abbreviated) 5. IMPURE area (full) 6. LOGOUT area 7. Dump Error Log 8.
  • Page 143 4.12 kill and kill_diags The kill and kill_diags commands terminate diagnostics that are currently executing. Example 4–18 kill and kill_diags P00>>> memexer 3 P00>>> show_status Program Device Pass Hard/Soft Bytes Written Bytes Read -------- ------------ ------------ ------ --------- ------------- ----------- 00000001 idle system 0000125e...
  • Page 144: Memexer

    4.13 memexer The memexer command runs a specified number of memory exercisers in the background. Nothing is displayed unless an error occurs. On each pass, each exerciser tests all available memory in blocks twice the size of the backup cache. The following example shows no errors. Example 4–19 memexer P00>>>...
  • Page 145 If the memory configuration is very large, the console might not test all of the memory. The upper limit is 1 GB. Use the show_status command to display the progress of the tests. Use the kill or kill_diags command to terminate the test. Syntax memexer [number] Arguments...
  • Page 146: Memtest

    4.14 memtest The memtest command exercises a specified section of memory. Typically memtest is run from the built-in console script. Advanced users may want to use the specific options described here. Example 4–20 memtest P00>>> sh mem Array Size Base Address Intlv Mode --------- ----------...
  • Page 147 Use the show memory command or an info 0 command to see where memory is located. Starting address Length of the section to test in bytes Passcount. In this example, the test will run for 10 passes. The test detected a failure on DIMM 3, which is located on MMB 2. Use the show_status command to display the progress of the test.
  • Page 148 NOTE: If memtest is used to test large sections of memory, testing may take a while to complete. If you issue a Ctrl/C or kill PID in the middle of testing, memtest may not abort right away. For speed reasons, a check for a Ctrl/C or kill is done outside of any test loops.
  • Page 149 Syntax memtest ( [-sa <start_address>] [-ea <end_address>] [-l <length>] [-bs <block_size>] [-i <address_inc>] [-p <pass_count>] [-d <data_pattern>] [-rs <random_seed>] [-ba <block_address>] [-t <test_mask>] [-se <soft_error_threshold>] [-g <group_name>] [-rb] [-f] [-m] [-z] [-h] [-mb] ) Options Start address. Default is first free space in memzone. End address.
  • Page 150 Options Timer. Prints out the run time of the pass. Default = off . Tests the specified memory address without allocation. Bypasses all checking but allows testing in addresses outside of the main memory heap. Also allows unaligned input. CAUTION: This flag can overwrite the console. If the system hangs, press the Reset button.
  • Page 151: Net

    4.15 net The net command performs maintenance operations on a specified Ethernet port. The net -ic command initializes the MOP counters for the specified Ethernet port, and net -s displays the current status of the port, including the contents of the MOP counters. Example 4–21 net -ic and net -s P00>>>...
  • Page 152 Syntax net [-ic] net [-s] Arguments <port_name> Specifies the Ethernet port on which to operate, either ei*0 or ew*0.
  • Page 153: Nettest

    4.16 nettest The nettest command tests the network ports using MOP loopback. Typically nettest is run from the built-in console script. Advanced users may want to use the specific options and environment variables described here. Example 4–22 nettest P00>>> nettest ei* P00>>>...
  • Page 154 Nettest performs a network test. It can test the ei* or ew* ports in internal loopback, external loopback, or live network loopback mode. Nettest contains the basic options to run MOP loopback tests. Many environment variables can be set from the console to customize nettest before nettest is started.
  • Page 155 Syntax nettest ( [-f <file>] [-mode <port_mode>] [-p <pass_count>] [-sv <mop_version>] [-to <loop_time>] [-w <wait_time>] [<port>] ) Arguments <port> Specifies the Ethernet port on which to run the test. Options -f <file> Specifies the file containing the list of network station addresses to loop messages to.
  • Page 156 -sv <mop_version> Specifies which MOP version protocol to use. If 3, then MOP V3 (DECNET Phase IV) packet format is used. If 4, then MOP V4 (DECNET Phase V IEEE 802.3) format is used. -to <loop_time> Specifies the time in seconds allowed for the loop messages to be returned.
  • Page 157: Set Sys_Serial_Num

    FRU devices that have EEPROMs. The sys_serial_num environment variable can be read by the operating system. IMPORTANT: The system serial number must be set correctly. Compaq Analyze will not work with an incorrect serial number. Example 4–23 set sys_serial_num P00>>>...
  • Page 158: Show Error

    4.18 show error The show error command reports errors logged to the FRU EEPROMs. Example 4–24 show error P00>>> show error SMB0 TDD - Type: 15 Test: 15 SubTest: 15 Error: 15 001f8408 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F ....
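
When show error output is captured from a serial console log, the TDD lines can be picked out with a regular expression. The following Python sketch parses a line in the format shown in Example 4–24. It assumes the four numeric fields are decimal and handles only TDD entries, because the full SDD line format is not shown here; the names `TDD_LINE` and `parse_tdd_line` are hypothetical.

```python
import re
from typing import Dict, Optional, Union

# Matches lines such as:
#   SMB0 TDD - Type: 15 Test: 15 SubTest: 15 Error: 15
TDD_LINE = re.compile(
    r"^(?P<fru>\S+)\s+TDD\s+-\s+"
    r"Type:\s*(?P<type>\d+)\s+"
    r"Test:\s*(?P<test>\d+)\s+"
    r"SubTest:\s*(?P<subtest>\d+)\s+"
    r"Error:\s*(?P<error>\d+)"
)


def parse_tdd_line(line: str) -> Optional[Dict[str, Union[str, int]]]:
    """Return the FRU name and numeric fields of a TDD line, or None."""
    match = TDD_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "fru": fields["fru"],
        "type": int(fields["type"]),
        "test": int(fields["test"]),
        "subtest": int(fields["subtest"]),
        "error": int(fields["error"]),
    }


if __name__ == "__main__":
    print(parse_tdd_line("SMB0 TDD - Type: 15 Test: 15 SubTest: 15 Error: 15"))
```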
  • Page 159 An SDD error has been logged. SDDs (symptom-directed diagnostics) are generic diagnostic exercisers that try to cause random behavior and look for failures or “symptoms.” All SDDs are logged by Compaq Analyze. Three checksum errors have been logged. There was a mismatch between the serial number on the system motherboard and the system serial number.
  • Page 160: Show Error Message Translation

    <fruname> TDD - Type:0 Test: 0 Serious error. Run the Compaq Analyze SubTest: Error: 0 GUI, if necessary, to determine what action to take. If you cannot run Compaq Analyze, replace the module. <fruname> SDD - Type:0 Serious error. Compaq Analyze (CA) has...
  • Page 161: Show Fru

    4.19 show fru The show fru command displays the physical configuration of FRUs. Use show fru -e to display FRUs with errors. Example 4–25 show fru P00>>> build smb0 54-25385-01.e01 ay94412345 P00>>> show fru FRUname Part# Serial# Model/Other Alias/Misc SMB0 00 54-30292-02.A01 SW03300011 SMB0.CPU0...
  • Page 162 FRUs with errors have a non-zero value that represents a bit mask of possible errors. See Table 4–3. Part # The part number of the FRU in ASCII, either a Compaq part number or a vendor part number. Serial # The serial number.
  • Page 163: Bit Assignments For Error Field

    Model/Other Optional data. For Compaq FRUs, the Compaq part alias number (if one exists). For vendor FRUs, the year and week of manufacture. Alias/Misc Miscellaneous information about the FRUs. For Compaq FRUs, a model name, number, or the common name for the entry in the Part # field.
  • Page 164: Show_Status

    4.20 show_status The show_status command displays the progress of diagnostics. The command reports one line of information per executing diagnostic. Many of the diagnostics run in the background and provide information only if an error occurs. Example 4–26 show status P00>>>...
  • Page 165 Process ID The SRM diagnostic for the particular device The ID of the device under test Number of diagnostic passes that have been completed Error count (hard and soft). Soft errors are not usually fatal; hard errors halt the system or prevent completion of the diagnostics. Bytes successfully written by the diagnostic.
  • Page 166 4.21 sys_exer The sys_exer command exercises the devices displayed with the show config command. Tests are run concurrently and in the background. Nothing is displayed after the initial test startup messages unless an error occurs. Example 4–27 sys_exer P00>>> sys_exer Default zone extended at the expense of memzone.
  • Page 167: Sys_Exer

    Use the show_status command to display the progress of diagnostic tests. The diagnostics started by the sys_exer command automatically reallocate memory resources, because these tests require additional resources. Use the init command to reconfigure memory before booting an operating system. Because the sys_exer tests are run concurrently and indefinitely (until you stop them with the init command), they are useful in flushing out intermittent hardware problems.
  • Page 168: Test

    4.22 test The test command verifies all the devices in the system. This command can be used on all supported operating systems. Example 4–28 test -lb P00>>> test -lb Testing the Memory Testing the DK* Disks(read only) No DU* Disks available for testing No DR* Disks available for testing Testing the DQ* Disks(read only) Testing the DF* Disks(read only)
  • Page 169 4. VGA console tests: These tests are run only if the console environment variable is set to serial. The VGA console test displays rows of the word compaq. 5. Network internal loopback tests for EW* networks.
  • Page 171: Chapter 5 Error Logs

    Chapter 5 Error Logs This chapter tells how to interpret error logs reported by the operating system. The following topics are covered: • Error Log Analysis with Compaq Analyze • Fault Detection and Reporting • Machine Checks/Interrupts • Environmental Errors Captured by SRM...
  • Page 172: Error Log Analysis With Compaq Analyze

    Compaq Analyze may or may not be installed on the customer's system with the operating system, depending on the release cycle. If CA is installed, the Compaq Analyze Director starts automatically as part of the system start-up. CA provides automatic background analysis.
  • Page 173 UNIX Indictment For each CPU indictment that is sent to the operating system, a callout report is generated. After the bad component is replaced, the following commands must be executed to bring the new components on-line. The following is an example of using the Indictment command.
  • Page 174: Web Enterprise Service (Webes) Director

    Director. If the Director has stopped running, restart it by following the instructions in the WEBES Compaq Analyze User Guide documentation. Compaq Analyze includes a graphical user interface (WUI) that allows the user to interact with the Director. While only one Director process executes on the machine at any time, many WUI processes can run at the same time, connected to the single Director.
  • Page 175: Using Compaq Analyze

    5.1.2 Using Compaq Analyze After you have logged on to Compaq Analyze, the following screen appears. If an event has occurred, it is listed under “localhost” events. See Figure 5–1. Figure 5–1 Compaq Analyze Initial Screen 1. In this example, the Other Logs file is selected and the Problem Reports display in Figure 5–2 appears.
  • Page 176: Problem Reports Screen

    Figure 5–2 Problem Reports Screen 2. Cpu_Mem_630.sys is selected and the problem reports are listed. You may select any log listed in Other Logs to view a list of all problems found. You may also view each report by clicking on the underlined hot link under Problem Reports.
  • Page 177: Compaq Analyze Problem Report Details

    3. Figure 5–3 provides an example problem report. Figure 5–3 Compaq Analyze Problem Report Details Error Logs...
  • Page 178: Compaq Analyze Problem Report Details (Continued)

    The Managed Entity designator includes the system host name (typically a computer name for networking purposes), the type of computer system (“Compaq AlphaServer ES45”), and the error event identification. The error event identification uses new common event header Event_ID_Prefix and Event_ID_Count components.
  • Page 179 Callout ID The last 12 characters of the Callout ID designator can be used to determine the revision level of the analysis rule-set that is being used. Full Description The Full Description designator provides detailed error information, which can include a description of the detected fault or error condition, the specific address or data bit where this fault or error occurred, the probable FRU list, and service related information.
  • Page 180: Bit To Test

    Table 5–1 Common Event Header Example Table (CEH) V2.0 OS_Type -- OpenVMS AXP Hardware_Arch -- Alpha CEH_Vendor_ID 3,564 -- Compaq Computer Corp Hdwr_Sys_Type -- Titan Corelogic Logging_CPU -- CPU Logging this Event CPUs_In_Active_Set -- Correctable Processor Entry_Type...
  • Page 181 Logout_Frame_CPU_Section Frame_Size x0000 00B0 Frame_Flags x8000 0000 CPU_Area_Offset x0000 0018 System_Area_Offset x0000 0058 Machine Check Logout Mchk_Error_Code x0000 0086 Frame Error Code Value[31:0] CPU Non-Fatal Frame_Rev x0000 0001 I_STAT x0000 0000 0000 0000 Ibox Status Register DC_STAT x0000 0000 0000 0008 Dcache Status Register Dcache ECC during load ECC_Err_Ld[3] instruction...
  • Page 182 Register Nxs[31:29] CPU 0 Source Device P0_Serror x0000 0000 0000 0000 No Error Detected Bus_Source[53:52] GPCI Bus TransAction_Cmd[55:54] x0 DMA Read ECC_Syndrome[63:56] No Data Bit Error P0_GPerror x0000 0000 0000 0000 No Error Detected PCI_Cmd[55:52] Interrupt Acknowledge P0_APerror x0000 0000 0000 0000 No Error Detected PCI_Cmd[55:52] Interrupt Acknowledge...
  • Page 183 System Memory / IO Configuration Subpacket, Version 1 x0000 0000 0000 6005 Memory Array 0 AAR_0 Configuration Register Sa0[8] Non - Split Array Asiz0[15:12] 512 Mb Array0 Base Address [34:24] Addr0[34:24] Bits x0000 0000 2000 6005 Memory Array 1 AAR_1 Configuration Register Sa1[8] Non - Split Array...
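
The bracketed ranges in the event log, such as Asiz0[15:12], name bit positions within the 64-bit register value printed beside them. The Python sketch below extracts the three fields that the log names for an array address register. The field positions come from the log above; rebuilding the base address by shifting the [34:24] field back into place is one reading of that notation and should be treated as an assumption. The names `bit_field` and `decode_aar` are hypothetical.

```python
def bit_field(value: int, high: int, low: int) -> int:
    """Return bits [high:low] of value, inclusive, right-aligned."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def decode_aar(value: int) -> dict:
    """Split an array address register value into the fields named in the log."""
    return {
        # Sa[8]: the log reports 0 as "Non - Split Array".
        "split_array": bit_field(value, 8, 8),
        # Asiz[15:12]: array size code; the log reports this value as 512 MB.
        "asiz": bit_field(value, 15, 12),
        # Addr[34:24]: base address bits, shifted back to their position.
        "base_address": bit_field(value, 34, 24) << 24,
    }


if __name__ == "__main__":
    for name, raw in (("AAR_0", 0x6005), ("AAR_1", 0x20006005)):
        fields = decode_aar(raw)
        print(name, fields["asiz"], hex(fields["base_address"]))
```

With the two values from the log, AAR_0 decodes to a base address of 0 and AAR_1 to a base address of 20000000 (hex), which is 512 MB and therefore sits directly above the 512 MB reported for array 0.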
  • Page 184 PRIGRP[15:8] PPRI[16] PCISPD66[17] GPCI Frequency = 66 MHz 12 DMA Reads Retry w/no CNGSTLT[21:18] delayed Completion PTPDESTEN[29:22] Data Parity Checking DPCEN[30] Disabled Address Parity Checking APCEN[31] Disabled DCR_Timer[33:32] DCR Timer Count = 2^11 EN_Stepping[34] Address Stepping Enabled x3320 6C65 646F 4D20 Pchip0 Aport Control P0_APCTL Register FBTB[0]...
  • Page 185 AMU Enabled to Perform PTE NEWAMU[29] Fetch Xactions PTP Writes Disabled During PTPWAR[30] Pending Reads x7261 7473 2E2E 2E0A Pchip1 Gport Control P1_GPCTL Register PCI Fast Back-To_Back FBTB[0] Xactions Disabled THDIS[1] TLB Anti-Thrashing Enabled GPCI PIO Write Chaining CHAINDIS[2] Disabled Target RetryTimer = 64 PCI TGLAT[4:3] Clocks...
  • Page 186 AGP_SBA_EN[54] SideBand Addressing Enabled AGP_EN[55] AGP Xactions Disabled AGP_Present[57] agp_present = 1 AGP_HP_RD[60:58] 4 Cchip Pending HP Reads AGP_LP_RD[63:61] 3 Cchip Pending LP Reads
  • Page 187: Fault Detection And Reporting

    3. If error/event logging is required, control is passed through the OS Privileged Architecture Library (PAL) handler. The operating system error handler logs the error condition into the binary error log. Compaq Analyze should then diagnose the error to the defective FRU.
  • Page 188: Es45 Fault Detection And Correction

    Table 5–2 ES45 Fault Detection and Correction Component Fault Detection/Correction Capability Alpha 21264 (EV68) Contains error checking and correction (ECC) microprocessor logic for data cycles. Check bits are associated with all data entering and exiting the microprocessor. A single-bit error on any of the four longwords being read can be corrected (per cycle).
  • Page 189: Machine Checks/Interrupts

    Machine Checks/Interrupts The exceptions that result from hardware system errors are called machine checks/interrupts. They occur when a system error is detected during the processing of a data request. During the error-handling process, errors are first handled by the appropriate PALcode error routine and then by the associated operating system error handler.
  • Page 190 Table 5–3 Machine Checks/Interrupts (Continued)
    Error Type | Error Descriptions
    System Correctable Error (620): ES45-specific correctable errors. | System detected ECC single-bit error
    System Uncorrectable Error (660): A system-detected machine check that occurred as a result of an “off-chip”... | Uncorrectable ECC error; Nonexistent memory reference; PCI system bus error (SERR); PCI read data parity error (RDPE); PCI address/command parity error (APE)...
  • Page 191: Error Logging And Event Log Entry Format

    CPU, memory, and I/O. Table 5–4 shows an event structure map for a system uncorrectable PCI target abort error. NOTE: See Appendix D for the source data Compaq Analyze uses to isolate to the FRUs. Error Logs...
  • Page 192: Sample Error Log Event Structure Map (Es45 With 10 Pci Slots)

    Table 5–4 Sample Error Log Event Structure Map (ES45 with 10 PCI Slots) OFFSET(hex) ech0000 NEW COMMON OS HEADER ech+nnnn lfh0000 STANDARD LOGOUT FRAME HEADER lfh+nnnn lfEV680000 COMMON PAL EV68 SECTION lfEV68+nn (first 8 QWs Zeroed) lfctt_A0[u] SESF<63:32> = <39:32>= SESF<31:16>...
  • Page 193: Environmental Errors Captured By Srm

    Environmental Errors Captured by SRM If an environmental error occurs while the SRM console is running, a logout frame similar to Example 5–1 is sent to the console output device. The logout frame is preceded by the message “***unexpected system event through vector 680 on CPU n.” (usually CPU 0.) For register definitions, see Appendix D.
  • Page 195: Chapter 6 System Configuration And Setup

    Chapter 6 System Configuration and Setup This chapter describes how to configure and set up ES45 systems. The following topics are covered: • System Consoles • Displaying the Hardware Configuration • Setting Environment Variables • Setting Automatic Booting • Changing the Default Boot Device •...
  • Page 196: System Consoles

    The procedure for installing Linux on an Alpha system is described in the Alpha Linux installation document for your Linux distribution. The installation document can be downloaded from the following Web site: http://www.compaq.com/alphaserver/linux RMC CLI The remote management console (RMC) provides a command-line interface (CLI) for controlling the system.
  • Page 197: Selecting The Display Device

    6.1.1 Selecting the Display Device The SRM console environment variable determines which display device (VT-type terminal or VGA monitor) receives the console display. The console terminal that displays the SRM user interface can be either a serial terminal (VT320 or higher, or equivalent) or a VGA monitor.
  • Page 198: Setting The Control Panel Message

    6.1.2 Setting the Control Panel Message You can create a customized message to be displayed on the operator control panel after startup self-tests and diagnostics have been completed. When the operating system is running, the control panel displays the console revision.
  • Page 199: Displaying The Hardware Configuration

    Displaying the Hardware Configuration View the system hardware configuration by entering commands from the SRM console. It is useful to view the hardware configuration to ensure that the system recognizes all devices, memory configuration, and network connections. Use the following SRM console commands to view the system configuration. See the Owner’s Guide for details.
  • Page 200: Setting Environment Variables

    Setting Environment Variables Environment variables pass configuration information between the console and the operating system. Their settings determine how the system powers up, boots the operating system, and operates. • To check the setting for a specific environment variable, enter the show envar command, where the name of the environment variable is substituted for envar.
  • Page 201: Env

    set envar The set command sets or modifies the value of an environment variable. It can also be used to create a new environment variable if the name used is unique. Environment variables pass configuration information between the console and the operating system.
  • Page 202: Srm Environment Variables

    Table 6–1 SRM Environment Variables Variable Attributes Description auto_action NV,W Action the console should take following an error halt or power failure. Defined values are: boot—Attempt bootstrap. halt—Halt, enter console I/O mode. restart—Attempt restart. If restart fails, try boot. bootdef_dev NV,W Device or device list from which booting is to be attempted when no path is specified.
  • Page 203 Table 6–1 SRM Environment Variables (Continued) Variable Attributes Description boot_osflags (continued) NV,W boot_flags: The hexadecimal value of the bit number or numbers to set. To specify multiple boot flags, add the flag values (logical OR). 1—Bootstrap conversationally (enables you to modify SYSGEN parameters in SYSBOOT).
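The "add the flag values (logical OR)" rule for boot_osflags can be sketched in Python. This is only an illustration: the value 1 (conversational bootstrap) comes from the table above, while `EXAMPLE_FLAG` and the function name are hypothetical placeholders, not documented flags.

```python
# Sketch: combine boot flag values into the hexadecimal string for boot_osflags.
# Only CONVERSATIONAL = 1 comes from Table 6-1; EXAMPLE_FLAG is a made-up placeholder.
CONVERSATIONAL = 0x1   # 1: bootstrap conversationally (from the table)
EXAMPLE_FLAG = 0x20    # hypothetical second flag, for illustration only


def combine_boot_flags(*flags: int) -> str:
    """OR the individual flag values together and return the result as hex digits."""
    combined = 0
    for flag in flags:
        combined |= flag
    return format(combined, "x")


if __name__ == "__main__":
    # The printed value is what would follow "set boot_osflags" under these assumptions.
    print(combine_boot_flags(CONVERSATIONAL, EXAMPLE_FLAG))
```

When every flag occupies its own bit, adding the values and ORing them give the same result; OR is used here because it stays correct even if the same flag is listed twice.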
  • Page 204 Table 6–1 SRM Environment Variables (Continued) Variable Attributes Description boot_osflags (continued) D—Full dump; implies s as well. By default, if Tru64 UNIX crashes, it completes a partial memory dump. Specifying D forces a full dump at system crash. Common settings are a (autoboot) and Da (autoboot and create full dumps if the system crashes).
  • Page 205 Table 6–1 SRM Environment Variables (Continued) Variable Attributes Description com1_modem NV,W Used to tell the operating system whether a com2_modem modem is present on the COM1 or COM2 ports, respectively. On—Modem is present. Off—Modem is not present (default value). console Sets the device on which power-up output is displayed.
  • Page 206 Table 6–1 SRM Environment Variables (Continued) Variable Attributes Description cpu_enabled Enables or disables a specific secondary CPU. All CPUs are enabled by default. The primary CPU cannot be disabled. The primary CPU is the lowest numbered working CPU. ei*0_inet_init or Determines whether the interface's internal ew*0_inet_init Internet database is initialized from nvram or...
  • Page 207 Table 6–1 SRM Environment Variables (Continued) Variable Attributes Description heap_expand Increases the amount of memory available for the SRM console's heap. Valid selections are: NONE (default) 64KB 128KB 256KB 512KB kbd_hardware Sets the keyboard hardware type as either type PCXAL or LK411 and enables the system to interpret the terminal keyboard layout correctly.
  • Page 208 Table 6–1 SRM Environment Variables (Continued) Variable Attributes Description os_type Sets the default operating system. vms or unix—Sets system to boot the SRM firmware. password Sets a console password. Required for placing the SRM into secure mode. pci_parity Disable or enable parity checking on the PCI bus. On—PCI parity enabled (default value) Off—PCI parity disabled Some PCI devices do not implement PCI parity...
  • Page 209 Table 6–1 SRM Environment Variables (Continued) Variable Attribute Description pk*0_host_id Sets the controller host bus node ID to a value between 0 and 7. 0 to 7—Assigns bus node ID for specified host adapter. pk*0_soft_term Enables or disables SCSI terminators for optional SCSI controllers.
  • Page 210: Setting Automatic Booting

    Setting Automatic Booting Tru64 UNIX and OpenVMS systems are factory set to halt in the SRM console. You can change these defaults, if desired. Systems can boot automatically (if set to autoboot) from the default boot device under the following conditions: •...
  • Page 211: Changing The Default Boot Device

    Changing the Default Boot Device You can change the default boot device with the set bootdef_dev SRM console command. For example, to set the boot device to the IDE CD-ROM, enter commands similar to the following: P00>>>...
  • Page 212: Setting Srm Security

    Setting SRM Security The set password and set secure commands set SRM security. The login command turns off security for the current session. The clear password command returns the system to user mode. The SRM console has two modes, user mode and secure mode. •...
  • Page 213 The password length must be between 15 and 30 alphanumeric characters. Any characters entered after the 30th character are not stored. Example 6–3 set secure P00>>> set secure Console is secure. Please login. P00>>> login Please enter the password: P00>>> b dkb0 The set secure command puts the console into secure mode.
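The password length rule above (15 to 30 alphanumeric characters, with anything after the 30th character not stored) can be checked before typing a password at the console. The following is a minimal sketch; the function name and error messages are illustrative assumptions and are not part of the SRM console.

```python
# Sketch of the SRM password rule described above.
# Assumption: "alphanumeric" means ASCII letters and digits only.
MIN_LEN = 15
MAX_LEN = 30


def stored_password(entered: str) -> str:
    """Return the part of the entry that would be stored, or raise ValueError."""
    if not (entered.isascii() and entered.isalnum()):
        raise ValueError("password must contain only letters and digits")
    if len(entered) < MIN_LEN:
        raise ValueError(f"password must be at least {MIN_LEN} characters")
    # Characters after the 30th are not stored.
    return entered[:MAX_LEN]


if __name__ == "__main__":
    print(len(stored_password("a1" * 20)))  # a 40-character entry is cut to 30
```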
  • Page 214: Configuring Devices

    Configuring Devices Become familiar with the configuration requirements for CPUs and memory before removing or replacing those components. See Chapter 8 for removal and replacement procedures. WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience. Such persons are expected to understand the hazards of working within this equipment and take measures to minimize danger to themselves or others.
  • Page 215: Cpu Configuration

    6.7.1 CPU Configuration Figure 6–1 CPU Slot Locations (Pedestal/Rack) CPU 3 CPU 1 CPU 0 CPU 2 PK0228B System Configuration and Setup 6-21...
  • Page 216: Cpu Card

    Figure 6–2 CPU Slot Locations (Tower) CPU 2 CPU 0 CPU 1 CPU 3 PK0229A CPU Configuration Rules 1. A CPU must be installed in slot 0. The system will not power up without a CPU in slot 0. 2. CPU cards must be installed in numerical order, starting at CPU slot 0. See Figure 6–1 and Figure 6–2.
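The two CPU configuration rules above (a CPU in slot 0, and cards installed in numerical order starting at slot 0) amount to saying that the populated slots form an unbroken run beginning at 0. A small sketch of that check follows; the function name is hypothetical, and the four-slot default is taken from the slot location figures.

```python
# Sketch: check a planned CPU population against the two rules above.
def check_cpu_population(populated_slots, total_slots=4):
    """Return a list of rule violations (empty if the plan follows both rules)."""
    slots = sorted(set(populated_slots))
    problems = []
    if any(s < 0 or s >= total_slots for s in slots):
        problems.append("slot number out of range")
    if 0 not in slots:
        problems.append("no CPU in slot 0")
    # Numerical order starting at slot 0 means slots 0..n-1 with no gaps.
    if slots != list(range(len(slots))):
        problems.append("CPU cards are not in numerical order starting at slot 0")
    return problems


if __name__ == "__main__":
    print(check_cpu_population([0, 1]))  # follows both rules
    print(check_cpu_population([0, 2]))  # slot 1 was skipped
```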
  • Page 217: Memory Configuration

    6.7.2 Memory Configuration Become familiar with the rules for memory configuration before adding DIMMs to the system. Refer to Figure 6–4 or Figure 6–5 and observe the following rules for installing DIMMs. • You can install up to 16 DIMMs or up to 32 DIMMs. •...
  • Page 218: Stacked And Unstacked Dimms

    (see Figure 6–3). Stacked DIMMs provide twice the capacity of unstacked DIMMs, and, at the time of shipment, are the highest capacity DIMMs offered by Compaq. The system may have either stacked or unstacked DIMMs. A memory option consists of a “set” of four DIMMs. The system supports two sets per “array”...
  • Page 219 Only the following DIMMs and DIMM options can be used in the ES45 system. Density DIMM DIMM Option (4 DIMMs per) 128 MB 20-01CBA-09 MS620-AA (512 MB) 256 MB 20-01DBA-09 MS620-BA (1 GB) 512 MB 20-01EBA-09 MS620-CA (2 GB) 1 GB 20-L0FBA-09 MS620-DA (4 GB)∗...
  • Page 220 Memory Performance Considerations Interleaved operations reduce the average latency and increase the memory throughput over non-interleaved operations. With one memory option (4 DIMMs) installed, memory interleaving will not occur. For 2-way interleaving, array 0 & 2 and 1 & 3 must have the same size memory. For 4-way interleaving, array 0 through 3 must have the same size memory.
  • Page 221: Memory Configuration (Pedestal/Rack)

    Figure 6–4 Memory Configuration (Pedestal/Rack) Set # Set # Set # MMB 0 Array 0 Set # Set # 0 & 4 MMB 1 Array 1 Set # 1 & 5 Array 2 MMB 2 Set # 2 & 6 Array 3 Set # 3 &...
  • Page 222: Memory Configuration (Tower)

    Figure 6–5 Memory Configuration (Tower) Set # MMB 3 Set # MMB 2 Set # MMB 1 Set # Array 2 Array 0 Set # 2 & 6 Set # 0 & 4 MMB 0 Array 1 Array 3 Set # 1 & 5 Set # 3 &...
  • Page 223: Pci Configuration

    6.7.3 PCI Configuration PCI modules are designed for either 5.0-volt or 3.3-volt slots, or are universal in design and can plug into either 3.3-volt or 5.0-volt slots. Figure 6–6 PCI Slot Locations (Pedestal/Rack) 10-Slot PK0226C CAUTION: Check the keying before you install the PCI module and do not force it in.
  • Page 224: Pci Slot Voltages And Hose Numbers

    Hose 1 Slot ID 1 Slot ID 2 33 MHz 5.0V Hose 0 Slot ID 9 Hose 3 Slot ID 1 33 MHz 5.0V Hose 0 Slot ID 8 Slot ID 2 PK0974B For more information, see http://www.compaq.com/alphaserver/. 6-30 ES45 Service Guide...
  • Page 225 PCI modules are designed for either 5.0-volt or 3.3-volt slots, or are universal in design and can plug into either 3.3-volt or 5.0-volt slots. CAUTION: Check the keying before you install the PCI module and do not force it in.
  • Page 226: Pci Module Leds

    6.7.4 PCI Module LEDs CAUTION: Hot plug is not currently supported by the operating systems. Figure 6–9 PCI Status LEDs Hot Plug Hot Plug Green Green Amber Amber Side View Rear View MR0073 Status Green Power applied Amber Power fault 6-32 ES45 Service Guide...
  • Page 227: Power Supply Configurations

    6.7.5 Power Supply Configurations Figure 6–10 Power Supply Locations Pedestal/Rack Tower 1 1 1 2 2 2 PK0207B The system can have the following power configurations: Two Power supply System (minimum configuration) • Two CPUs • One storage cage • Four to sixteen DIMMs Redundant Power Supply.
  • Page 228: Booting Linux

    V5.6-3 June 15 2001 08:36:11 P00>> 2. Enter the show device command to determine the unit number of the drive for your boot device, in this case dka0.0.0.17.0. P00>>> sh dev dka0.0.0.17.0 DKA0 COMPAQ BD018122C9 B016 dka200.2.0.7.1 DKA200 COMPAQ BD018122C9 B016 dqa0.0.0.105.0...
  • Page 229 After installing Linux, set boot environment variables appropriately for your installation. The typical values indicating booting from dka0 with the first aboot.conf entry are shown in this example. P00>>> set bootdef_dev dka0 P00>>> set boot_file P00>>> set boot_osflags 0 P00>>> show boot* boot_dev dka0.0.0.17.0 boot_file...
  • Page 230 memcluster 0, usage 1, start 0, end memcluster 1, usage 0, start 362, end 262135 memcluster 2, usage 1, start 262135, end 262144 freeing pages 362:1024 freeing pages 1700:262135 SMP: 4 CPUs probed -- cpu_present_mask = f On node 0 totalpages: 262144 zone(0): 262144 pages.
  • Page 231: Chapter 7 Using The Remote Management Console

    Chapter 7 Using the Remote Management Console You can manage the system through the remote management console (RMC). The RMC is implemented through an independent microprocessor that resides on the system motherboard. The RMC also provides access to the repository for all error information in the system.
  • Page 232: Rmc Overview

    RMC Overview The remote management console provides a mechanism for monitoring the system (voltages, temperatures, and fans) and manipulating it on a low level (reset, power on/off, halt). It also provides functionality to read and write configuration and error log information to FRU error log devices.
  • Page 233 FRU after power has been lost. The RMC console provides several commands for accessing error information in the DPR. See Section 7.6. Compaq Analyze, described in Chapter 5, can access the FRU EEPROM error logs to provide diagnostic information for system FRUs.
  • Page 234: Operating Modes

    Operating Modes The RMC can be configured to manage different data flow paths defined by the com1_mode environment variable. In Through mode (the default), all data and control signals flow from the system COM1 port through the RMC to the active external port. You can also set bypass modes so that the signals partially or completely bypass the RMC.
  • Page 235 Through Mode Through mode is the default operating mode. The RMC routes every character of data between the internal system COM1 port and the active external port, either the local COM1 serial port (MMJ) or the 9-pin modem port. If a modem is connected, the data goes to the modem.
  • Page 236: Bypass Modes

    7.2.1 Bypass Modes For modem connection, you can set the operating mode so that data and control signals partially or completely bypass the RMC. The bypass modes are Snoop, Soft Bypass, and Firm Bypass. Figure 7–2 Data Flow in Bypass Mode System DUART COM1...
  • Page 237 Figure 7–2 shows the data flow in the bypass modes. Note that the internal system COM1 port is connected directly to the modem port. NOTE: You can connect a serial terminal to the modem port in any of the bypass modes. The local terminal is still connected to the RMC and can still connect to the RMC CLI to switch the COM1 mode if necessary.
  • Page 238 After downloading binary files, you can set the com1_mode environment variable from the SRM console to switch back to Snoop mode or other modes for accessing the RMC, or you can hang up the current modem session and reconnect it. Firm Bypass Mode In Firm Bypass mode all data and control signals are routed directly between the system COM1 port and the external modem port.
  • Page 239: Terminal Setup

    Terminal Setup You can use the RMC from a modem hookup or the serial terminal connected to the system. As shown in Figure 7–3, a modem is connected to the dedicated 9-pin modem port and a terminal is connected to the COM1 serial port/terminal port (MMJ). Figure 7–3 Terminal Setup for RMC (Tower View) PK0934A Using the Remote Management Console...
  • Page 240: Connecting To The Rmc Cli

    Connecting to the RMC CLI You type an escape sequence to connect to the RMC CLI. You can connect to the CLI from any of the following: a modem, the local serial console terminal, the local VGA monitor, or the system. The “system” includes the operating system, SRM, or an application.
  • Page 241 Connecting from the Local VGA Monitor To connect to the RMC CLI from the local VGA monitor, the console environment variable must be set to graphics and the SRM console must be running. Invoke the SRM console and enter the rmc command. P00>>>...
  • Page 242: Srm Environment Variables For Com1

    SRM Environment Variables for COM1 Several SRM environment variables allow you to set up the COM1 serial port (MMJ) for use with the RMC. You may need to set the following environment variables from the SRM console, depending on how you decide to set up the RMC. com1_baud Sets the baud rate of the COM1 serial port and the modem port.
  • Page 243: Rmc Command-Line Interface

    RMC Command-Line Interface The remote management console supports setup commands and commands for managing the system. The RMC commands are listed below. clear {alert, port} disable {alert, remote} dump enable {alert, remote} halt {in, out} hangup help or ? power {on, off} quit reset send alert...
  • Page 244 Command Conventions Observe the following conventions for entering RMC commands: • Enter enough characters to distinguish the command. NOTE: The reset and quit commands are exceptions. You must enter the entire string for these commands to work. • For commands consisting of two words, enter the entire first word and at least one letter of the second word.
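The abbreviation conventions above can be illustrated with a short prefix matcher. This is a sketch only: the command list is the partial, one-word portion of the listing shown earlier, and how the real RMC firmware treats an ambiguous entry is not stated in this text, so returning `None` is an assumption.

```python
# Sketch of the RMC abbreviation rule: a unique prefix selects a command,
# except that reset and quit must be typed in full.
COMMANDS = ["clear", "disable", "dump", "enable", "halt",
            "hangup", "help", "power", "quit", "reset"]  # partial list
FULL_ONLY = {"quit", "reset"}


def resolve(typed: str):
    """Return the command that `typed` selects, or None if it selects none."""
    typed = typed.strip()
    if not typed:
        return None
    matches = [c for c in COMMANDS if c.startswith(typed)]
    # reset and quit only count when the entire string was entered.
    matches = [c for c in matches if c not in FULL_ONLY or c == typed]
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    for entry in ("po", "ha", "hal", "res", "reset"):
        print(entry, "->", resolve(entry))
```

Here `ha` selects nothing because it could mean either halt or hangup, while `hal` is long enough to distinguish halt.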
  • Page 245: Defining The Com1 Data Flow

    7.6.1 Defining the COM1 Data Flow Use the set com1_mode command from SRM or RMC to define the COM1 data flow paths. You can set com1_mode to one of the following values: through All data passes through RMC and is filtered for the escape sequence.
  • Page 246: Displaying The System Status

    7.6.2 Displaying the System Status The RMC status command displays the current RMC settings. Table 7–1 explains the status fields. Example 7–2 status RMC> status PLATFORM STATUS On-Chip Firmware Revision: V1.0 Flash Firmware Revision: V1.2 Server Power: ON System Halt: Deasserted RMC Power Control: ON Escape Sequence: ^[^[RMC Remote Access: Enabled...
  • Page 247: Status Command Fields

    Table 7–1 Status Command Fields Field Meaning On-Chip Firmware Revision of RMC firmware on the microcontroller. Revision: Flash Firmware Revision of RMC firmware in flash ROM. Revision: Server Power: ON = System is on. OFF = System is off. System Halt: Asserted = System has been halted.
  • Page 248: Displaying The System Environment

    7.6.3 Displaying the System Environment The env command provides a snapshot of the system environment. Example 7–3 env RMC> env System Hardware Monitor Temperature (warnings at 48.00C, power-off at 53.00C) CPU0: 27.00C CPU1: 28.00C CPU2: 27.00C CPU3: 28.00C Zone0: 26.00C Zone1: 28.00C Zone2: 26.00C Fan RPM Fan1: 2149 Fan2: 2177...
  • Page 249 CPU temperature. In this example four CPUs are present. Temperature of PCI backplane: Zone 0 includes PCI slots 1–3, Zone 1 includes PCI slots 7–10, and Zone 2 includes PCI slots 4–6. Fan RPM. With the exception of Fan 5, all fans are powered as long as the system is powered on.
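When env output is captured from a terminal session, the temperature readings can be compared against the thresholds printed in the header. The sketch below parses the temperature portion of Example 7–3; the line layout of `SAMPLE` is assumed (the example is flattened in this text), the function names are hypothetical, and whether the RMC warns at or above the threshold is also an assumption.

```python
import re

# Temperature portion of Example 7-3; the line breaks are an assumed layout.
SAMPLE = """Temperature (warnings at 48.00C, power-off at 53.00C)
CPU0: 27.00C CPU1: 28.00C CPU2: 27.00C CPU3: 28.00C
Zone0: 26.00C Zone1: 28.00C Zone2: 26.00C"""


def parse_temperatures(text):
    """Return (warning_limit, power_off_limit, {sensor: degrees C})."""
    limits = re.search(r"warnings at ([\d.]+)C, power-off at ([\d.]+)C", text)
    if limits is None:
        raise ValueError("temperature header not found")
    warn = float(limits.group(1))
    power_off = float(limits.group(2))
    readings = {name: float(value)
                for name, value in re.findall(r"(\w+):\s*([\d.]+)C", text)}
    return warn, power_off, readings


def sensors_at_or_above_warning(text):
    """List the sensors whose reading has reached the warning limit."""
    warn, _, readings = parse_temperatures(text)
    return [name for name, value in readings.items() if value >= warn]


if __name__ == "__main__":
    print(sensors_at_or_above_warning(SAMPLE))
```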
  • Page 250: Dumping Dpr Data

    7.6.4 Dumping DPR Data The dump command dumps unformatted data from DPR locations 0–3FFF hex. The information might be useful for system trouble- shooting. Use the DPR address table in Appendix C to analyze the data. Example 7–4 dump RMC> dump Address: 10 Count: ee 0010:03 31 07 28 01 09 00 00 00 00 00 00 00 00 00 00...
  • Page 251 DPR address Number of bytes dumped (in hex). In the example the dump command dumps EF bytes from address 10. Bytes 10:15 are the time stamp. See Appendix C for the meaning of other locations. The dump command allows you to dump data from the DPR. You can use this command locally or remotely if you are not able to access the SRM console because of a system crash.
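A captured dump line can be split into its starting DPR address and data bytes so that individual locations can be looked up in Appendix C. The sketch below uses the one data line shown in Example 7–4 and pulls out bytes 10:15, which the text identifies as the time stamp; the function name is hypothetical, and no interpretation of the byte values is attempted.

```python
# Sketch: split one line of RMC dump output into an address and its bytes.
SAMPLE_LINE = "0010:03 31 07 28 01 09 00 00 00 00 00 00 00 00 00 00"


def parse_dump_line(line: str):
    """Return (start_address, data) for a line of the form 'AAAA:bb bb bb ...'."""
    addr_text, _, data_text = line.partition(":")
    address = int(addr_text, 16)
    data = bytes(int(byte, 16) for byte in data_text.split())
    return address, data


if __name__ == "__main__":
    address, data = parse_dump_line(SAMPLE_LINE)
    # DPR locations 10 through 15 (hex) hold the time stamp.
    timestamp = data[0x10 - address:0x16 - address]
    print(hex(address), timestamp.hex())
```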
  • Page 252: Power On And Off, Reset, And Halt

    7.6.5 Power On and Off, Reset, and Halt The RMC power {on, off}, halt {in, out}, and reset commands perform the same functions as the buttons on the operator control panel. Power On and Power Off The RMC power on command powers the system on, and the power off command powers the system off.
  • Page 253 Halt In and Halt Out The halt in command halts the system. The halt out command releases the halt. When you issue either the halt in or halt out command, the terminal exits RMC and reconnects to the server's COM1 port. Example 7–6 halt in/out RMC>...
  • Page 254: Configuring Remote Dial-In

    7.6.6 Configuring Remote Dial-In Before you can dial in through the RMC modem port or enable the system to call out in response to system alerts, you must configure RMC for remote dial-in. Connect your modem to the 9-pin modem port and turn it on. Connect to the RMC CLI from either the local serial terminal or the local VGA monitor to set up the parameters.
  • Page 255 Sets the password that is prompted for at the beginning of a modem session. The string cannot exceed 14 characters and is not case sensitive. For security, the password is not echoed on the screen. When prompted for verification, type the password again. Sets the initialization string.
  • Page 256: Configuring Dial-Out Alert

    7.6.7 Configuring Dial-Out Alert When you are not monitoring the system from a modem connection, you can use the RMC dial-out alert feature to remain informed of system status. If dial-out alert is enabled, and the RMC detects alarm conditions within the managed system, it can call a preset pager number.
  • Page 257 The elements of the dial string and alert string are shown in Table 7–2. Paging services vary, so you need to become familiar with the options provided by the paging service you will be using. The RMC supports only numeric messages. Sets the string to be used by the RMC to dial out when an alert condition occurs.
  • Page 258: Elements Of Dial String And Alert String

    Table 7–2 Elements of Dial String and Alert String Dial String The dial string is case sensitive. The RMC automatically converts all alphabetic characters to uppercase. ATXDT AT = Attention. X = Forces the modem to dial “blindly” (not seek the dial tone).
  • Page 259: Resetting The Escape Sequence

    7.6.8 Resetting the Escape Sequence The RMC set escape command sets a new escape sequence. The new escape sequence can be any character string, not to exceed 14 characters. A typical sequence consists of two or more control characters. It is recommended that control characters be used in preference to ASCII characters.
  • Page 260: Resetting The Rmc To Factory Defaults

    Resetting the RMC to Factory Defaults If the non-default RMC escape sequence has been lost or forgotten, RMC must be reset to factory settings to restore the default escape sequence. Figure 7–4 RMC Jumpers (Default Positions) 1 2 3 PK0211A NOTE: J1, J2, and J3 are reserved.
  • Page 261 The following procedure restores the default settings: 1. Shut down the operating system and press the Power button on the operator control panel to the OFF position. 2. Unplug the power cord from each power supply. Wait until the +5V Aux LEDs on the power supplies go off before proceeding.
  • Page 262: Troubleshooting Tips

    Troubleshooting Tips Table 7–3 lists possible causes and suggested solutions for symptoms you might see. Table 7–3 RMC Troubleshooting Symptom Possible Cause Suggested Solution You cannot connect to The RMC may be in Issue the show com1_mode the RMC CLI from the Soft Bypass or Firm command from SRM and modem.
  • Page 263 Table 7–3 RMC Troubleshooting (Continued) Symptom Possible Cause Suggested Solution RMC will not answer On AC power-up, Wait 30 seconds after when modem is called. RMC defers powering up the system and (continued from initializing the modem RMC before attempting to previous page) for 30 seconds to allow dial in.
  • Page 265: Chapter 8 Fru Removal And Replacement

    Chapter 8 FRU Removal and Replacement This chapter describes the procedures for removing and replacing FRUs on ES45 systems. Unless otherwise specified, install a FRU by reversing the steps shown in the removal procedures. WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience.
  • Page 266 CAUTION: Static electricity can damage integrated circuits. Always use a grounded wrist strap (29-26246) and grounded work surface when working with internal parts of a computer system. Remove jewelry before working on internal parts of the system. IMPORTANT! After you have replaced FRUs and determined that the system has been restored to its normal operating condition, you must clear the system error information repository (error information logged to the DPR).
  • Page 267: Frus

    FRUs Table 8–1 lists the FRUs by part number and description. Figure 8–1 shows the location of FRUs in the pedestal/rack systems, and Figure 8–2 shows the location of FRUs in the tower system. Table 8–1 FRU List Part # Description Cables 17-04787-01...
  • Page 268 Table 8–1 FRU List (Continued) Part # Description Fans 70-40074-01 Fan assembly, 172 MM Fan 6 70-40073-01 Fan assembly, 120 MM Fans 1 and 2 70-40073-02 Fan assembly, 120 MM Fan 5 70-40072-01 Fan assembly, 120 MM Fan 3 70-40071-01 Fan assembly, 120 MM Fan 4 CPU Module...
  • Page 269 Table 8–1 FRU List (Continued) Part # Description Other Modules and Components 70-33894-02 54-30414-02 PCI Hot swap module 54-30348-02 8-slot MMB for 200-pin DIMMs 54-30348-03 4-slot MMB for 200-pin DIMMs 70-31349-01 Speaker assembly 30-50802-01 Hard drive cage assembly, 6 slot, 1-in. universal drives 54-30292-02 System motherboard 54-25575-02...
  • Page 270: Power Cords

    8.1.1 Power Cords Tower enclosures ordered in North America include a 220 V power cord. Non-North American orders require one country-specific power cord. Pedestal systems ordered in North America include two 220 V power cords. Non-North American orders require two country-specific power cords.
  • Page 271: Fru Locations

    8.1.2 FRU Locations Figure 8–1 and Figure 8–2 show the location of FRUs in the pedestal and rackmount configurations. Figure 8–1 FRUs — Front/Top (Pedestal/Rack View) Memory DIMMs CPU Cards Fans Backplane Fans Primary Drive Cage Floppy Drive Secondary CD-ROM Drive Drive Cage PK0285A FRU Removal and Replacement...
  • Page 272: Frus - Rear (Pedestal/Rack View)

    Figure 8–2 FRUs — Rear (Pedestal/Rack View) I/O Connector Module (Junk I/O) Speaker Power Harness Access Cover Power System Supplies Motherboard PK0286A ES45 Service Guide...
  • Page 273: Important Information Before Replacing Frus

    8.1.3 Important Information Before Replacing FRUs The system must be shut down before you replace most FRUs. The exceptions are power supplies, individual fans, universal hard drives, and PCI cards in slots 4 – 10 (when the operating system supports this function).
  • Page 274 Before Replacing Non Hot-Plug FRUs Follow the procedure below before replacing non hot-plug FRUs. For universal disk drives, you must shut down the operating system, but you do not need to turn off system power. 1. Shut down the operating system. 2.
  • Page 275: Removing Enclosure Panels

    Removing Enclosure Panels Figure 8–3 Enclosure Panel Removal (Tower) PK0221B FRU Removal and Replacement 8-11...
  • Page 276 To Remove Enclosure Panels from a Tower The enclosure panels are secured by captive screws. 1. From the open position, lift up and away to remove the front door. 2. To remove the top panel, loosen the top left and top right screws.
  • Page 277: Enclosure Panel Removal (Pedestal)

    Figure 8–4 Enclosure Panel Removal (Pedestal) PK0234A FRU Removal and Replacement 8-13...
  • Page 278 To Remove Enclosure Panels from a Pedestal The enclosure panels are secured by captive screws. 1. From the open position, lift up and away to remove the front door (the bottom door is removed in the same way). 2. Remove the top enclosure panel by loosening the captive screws shown in .
  • Page 279: Accessing The System Chassis In A Cabinet

    Accessing the System Chassis in a Cabinet In a rackmount system, the system chassis is mounted to slides. WARNING: Pull out the stabilizer bar and extend the leveler foot to the floor before you pull out the system. This precaution prevents the cabinet from tipping over.
  • Page 280: Moving The Inner Race Forward

    WARNING: 1. Make sure that all other hardware in the cabinet is pushed in and attached. 2. The system is very heavy. Do not attempt to lift it manually. Use a material lift or other mechanical device. 3. The inner race must be moved forward prior to installing the system.
  • Page 281: Removing Covers From The System Chassis

    Removing Covers from the System Chassis The system chassis has three covers: the fan cover, the system card cage cover, and the PCI card cage cover. Remove a cover by loosening the quarter-turn captive screw, pulling up on the ring, and sliding the cover from the system chassis.
  • Page 282 Figure 8–7 and Figure 8–8 show the location and removal of covers on the tower and pedestal/rackmount systems, respectively. The numbers in the illustrations correspond to the following: 3mm Allen captive quarter-turn screw that secures each cover. Spring-loaded ring that releases cover. Each cover has a ring. Fan area cover.
  • Page 283: Covers On The System Chassis (Tower)

    Figure 8–7 Covers on the System Chassis (Tower) PK0216A FRU Removal and Replacement 8-19...
  • Page 284: Covers On The System Chassis (Pedestal/Rack)

    Figure 8–8 Covers on the System Chassis (Pedestal/Rack) PK0215A 8-20 ES45 Service Guide...
  • Page 285: Power Supply

    Power Supply Figure 8–9 Replacing or Adding a Power Supply PK0232A FRU Removal and Replacement 8-21...
  • Page 286 WARNING: Hazardous voltages are contained within the power supply. Do not attempt to service. Return to factory for service. The power supply is a hot-plug component. As long as the system has a redundant supply, you can replace a supply while the system is running. Replacing a Power Supply 1.
  • Page 287: Fans

    Fans Figure 8–10 Replacing Fans PK0208a FRU Removal and Replacement 8-23...
  • Page 288 The fans are hot-plug components. You can replace individual fans while the system is running. WARNING: Contact with a moving fan can cause severe injury to fingers. Avoid contact or remove power prior to access. WARNING: High current area. Currents exceeding 240 VA can cause burns or eye injury.
  • Page 289: Universal Hard Disk Drives

    Universal Hard Disk Drives The system uses hot-pluggable universal hard disk drives. Hot-pluggable drives can be replaced without removing power from the system or interrupting the transfer of data over the SCSI bus. WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience.
  • Page 290: Replacing Or Adding A Hard Drive

    Figure 8–11 Replacing or Adding a Hard Drive MR0064 8-26 ES45 Service Guide...
  • Page 291 Installing a Drive 1. Access the storage drive area and remove the drive blank for the next available slot (drives are installed left to right, SCSI ID 0 – 5). 2. Insert the new drive into the cage and push it in while pivoting the release lever in toward the drive.
  • Page 292: Removing The Shipping Bracket

    Removing the Shipping Bracket The shipping bracket provides protection and stabilization for CPU modules during shipment. Figure 8–12 Removing the Shipping Bracket MR0059 8-28 ES45 Service Guide...
  • Page 293 Complete the following procedure to remove the shipping bracket: Unscrew and loosen the slide that holds the CPUs Push the bracket toward point to release it and then pull up at Save the shipping bracket for possible future shipment of the server. NOTE: The shipping bracket is only needed when shipping the server.
  • Page 294: Cpus

    CPUs Shut the system down before adding or replacing a CPU. Figure 8–13 Adding or Replacing CPU Cards Slot 3 Slot 1 Slot 0 Slot 2 PK0240B 8-30 ES45 Service Guide...
  • Page 295 WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience. Such persons are expected to understand the hazards of working within this equipment and take measures to minimize danger to themselves or others.
  • Page 296 1. Shut down the operating system and turn off power to the system. Unplug the power cord from each power supply. 2. Access the system chassis by following the instructions in Section 8.2 or 8.3. 3. Remove the covers from the fan area and the system card cage as explained in Section 8.4.
  • Page 297: Memory Dimms

    DIMMs are manufactured with two types of SDRAMs, stacked and unstacked. Stacked DIMMs provide twice the capacity of unstacked DIMMs and, at the time of shipment, are the highest capacity DIMMs offered by Compaq. Your system may have either stacked or unstacked DIMMs.
  • Page 298: Installing And Removing Mmbs And Dimms

    Figure 8–14 Installing and Removing MMBs and DIMMs (Pedestal/Rack and Tower views)
  • Page 299 WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience. Such persons are expected to understand the hazards of working within this equipment and take measures to minimize danger to themselves or others. WARNING: Modules have parts that operate at high temperatures.
  • Page 300: Aligning Dimm In Mmb

    6. Release the clips (Figure 8–15) on the MMB slot where you will install the DIMM. 7. Install the DIMM and align the notches on the gold fingers with the connector keys. 8. Secure the DIMM with the clips on the MMB slot.
  • Page 301: Determining Memory Configuration

    8.10.1 Determining Memory Configuration For optimum memory utilization and performance, load memory DIMMs into arrays in the following order: 0, 1, 2, 3, 4, 6, 5, and 7. Figure 8–16 Pedestal/Rack Memory Configuration
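The load order quoted above can be expressed as a small lookup. The following Python sketch is illustrative only: the tuple copies the numbers from the text verbatim, and `next_to_populate` is a hypothetical helper name, not part of any Compaq tool.

```python
# Population order quoted in Section 8.10.1 (numbers copied verbatim).
LOAD_ORDER = (0, 1, 2, 3, 4, 6, 5, 7)


def next_to_populate(populated):
    """Return the next number in LOAD_ORDER that is not yet populated.

    `populated` is any collection of already-populated numbers.
    Returns None when every position in LOAD_ORDER is populated.
    """
    for number in LOAD_ORDER:
        if number not in populated:
            return number
    return None
```

For instance, with 0 through 4 already populated, the helper returns 6 rather than 5, matching the order given in the text.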
  • Page 302: Tower Memory Configuration

    Figure 8–17 Tower Memory Configuration Set # MMB 3 Set # MMB 2 Set # MMB 1 Set # Array 0 Array 2 Set # 2 & 6 Set # 0 & 4 MMB 0 Array 1 Array 3 Set # 1 & 5 Set # 3 &...
  • Page 303: Pci Cards

    8.11 PCI Cards Some PCI options require drivers to be installed and configured. These options come with a floppy or a CD-ROM. Refer to the installation document that came with the option and follow the manufacturer's instructions. WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience.
  • Page 304: Installing Or Replacing A Pci Card

    Figure 8–18 Installing or Replacing a PCI Card
  • Page 305 Adding or Replacing a PCI Card CAUTION: Hot plug is not currently supported on the operating system. Do not press switches on the hot swap board. Pressing switches can result in loss of data. Complete the following procedure to add or remove a PCI option card. 1.
  • Page 306: Pci Module Hot Swap Assembly

    Figure 8–19 PCI Module Hot Swap Assembly (Closed Position)
  • Page 307: Replacing The Pci Hot Swap Module

    8.11.1 Replacing the PCI Hot Swap Module Shut the system down before adding or replacing the PCI hot swap module. 1. Halt all applications and power down the system. 2. Unscrew and remove the three M3 x 6 mm screws and attached washers. (Note: Make sure you remove the three lower screws that hold the switch board in place.) 3.
  • Page 308: Ocp Assembly

    8.12 OCP Assembly Figure 8–20 Replacing the OCP Assembly PK0282a Replacing the OCP Assembly Shut the system down before removing the OCP assembly. 1. Press the two tabs on the top of the OCP assembly to release it. 2. Rotate the assembly toward you and lift it out of the two bottom tabs. 3.
  • Page 309: Installing Disk Cages

    8.13 Installing Disk Cages Figure 8–21 Cabling and Preparation for Installing Disk Cages
  • Page 310 WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience. Such persons are expected to understand the hazards of working within this equipment and take measures to minimize danger to themselves or others. WARNING: To prevent injury, unplug the power cord from each power supply before installing components.
  • Page 311: Disk Cage Installation

    Figure 8–22 Disk Cage Installation
  • Page 312 CAUTION: Always plug the cables into the cage that is on the same side as the cage that was removed. 9. Partially slide the drive cage into the system chassis. 10. Connect the power source (located inside enclosure) to the drive cage. 11.
  • Page 313: Cabling A Second Disk Drive Cage

    8.13.1 Cabling a Second Disk Drive Cage If you are installing a second drive cage, refer to the following illustration for cable routing. Figure 8–23 Cabling a Second Disk Cage
  • Page 314: Adding Or Replacing Removable Media

    8.14 Adding or Replacing Removable Media Figure 8–24 Adding a 5.25-Inch Device
  • Page 315 WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience. Such persons are expected to understand the hazards of working within this equipment and take measures to minimize danger to themselves or others. WARNING: To prevent injury, unplug the power cord from each power supply before installing components.
  • Page 316 9. Slide the storage device into the desired storage slot and secure the device to the unit with four of the screws provided inside the removable media drive cage. 10. Pull the floppy cables back in. 11. Slide the removable media cage back in and replace the four screws set aside previously.
  • Page 317: Floppy Drive

    8.15 Floppy Drive Figure 8–25 Replacing the Floppy Drive PK0281A WARNING: To prevent injury, unplug the power cord from each power supply before installing components. WARNING: Modules have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module.
  • Page 318 Replacing the Floppy Drive Shut the system down before removing the floppy drive. 1. Remove the cover to the PCI card cage. 2. Unplug the signal cable and power cable from all devices except the floppy. 3. Remove and set aside the four screws that secure the removable media cage.
  • Page 319: I/O Connector Assembly

    8.16 I/O Connector Assembly Figure 8–26 Replacing the I/O Connector Assembly PK0284A WARNING: To prevent injury, unplug the power cord from each power supply before installing components. WARNING: Modules have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module.
  • Page 320 Replacing the I/O Connector Assembly Shut the system down before removing the I/O connector assembly. 1. Unplug all I/O connectors from the rear of the unit. 2. Remove the cover from the PCI card cage. 3. Remove PCI cards as needed for access. 4.
  • Page 321: Pci Backplane

    8.17 PCI Backplane Figure 8–27 Cables Connected to PCI Backplane PK0279 Cable Connects To: 17-05021-01 CD-ROM 17-03970-04 Floppy 17-04400-06 I/O controller module 70-31349-01 Speaker 17-04785-01 Fans 17-04786-01 Cover sensors 17-03971-07 17-05042-01 Hot swap module FRU Removal and Replacement 8-57...
  • Page 322 WARNING: Modules have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module. WARNING: To prevent injury, unplug the power cord from each power supply before installing components. Disconnecting the Cables Shut the system down before accessing the PCI area. 1.
  • Page 323: Removing The Separators

    Figure 8–28 Removing the Separators
  • Page 324: Replacing The Pci Backplane

    Figure 8–29 Replacing the PCI Backplane
  • Page 325 Replacing the PCI Backplane CAUTION: When removing the PCI backplane, be careful not to flex the board. Flexing the board may damage the BGA component connections. 1. Remove the three screws that secure the base holding the PCI dividers. 2.
  • Page 326: System Motherboard

    8.18 System Motherboard Figure 8–30 Replacing the System Motherboard
  • Page 327 WARNING: To prevent injury, unplug the power cord from each power supply before installing components. WARNING: Modules have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module. CAUTION: When removing the system motherboard, be careful not to flex the board.
  • Page 328 (The serial number is on a label on the back of the system.) For example: P00>>> set sys_serial_num NI900100022 IMPORTANT: The system serial number must be set correctly. Compaq Analyze will not work with an incorrect serial number. The serial number propagates to all FRU devices that have EEPROMs.
  • Page 329: Power Harness

    8.19 Power Harness Figure 8–31 Replacing the Power Harness (Front and Back views)
  • Page 330 WARNING: To prevent injury, unplug the power cord from each power supply before installing components. WARNING: Modules have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module. NOTE: Replacing the power harness requires the removal of other system FRUs.
  • Page 331 12. Remove the two screws and two plastic bushings on each of the three power supply connectors. The screws are located deep inside the power supply cavity. Set aside the screws and bushings for reinstallation. 13. Starting with the left connector (as viewed from the rear of the system), pull the connector to the right and angle it so that you can push the left end out through the opening.
  • Page 333: Srm Console Commands

    Appendix A SRM Console Commands This appendix lists the SRM console commands that are most frequently used with the ES4x family of systems. Table A–1 SRM Commands Used on ES45 Systems Command Function boot Loads and starts the operating system. buildfru Initializes I²C bus EEPROM data structures for the named FRU.
  • Page 334 Table A–1 SRM Commands Used on ES45 Systems (Continued) Command Function exer Exercises one or more devices by performing specified read, write, and compare operations. floppy_write Runs a write test on the floppy drive to determine whether you can write on the diskette. galaxy Same as lpinit.
  • Page 335 Table A–1 SRM Commands Used on ES45 Systems (Continued) Command Function Invokes the remote management console from the local VGA monitor. set envar Sets or modifies the value of an environment variable. set sys_serial_num Sets the system serial number. show envar Displays the state of the specified environment variable.
  • Page 337: Appendix B Jumpers And Switches

    Appendix B Jumpers and Switches This appendix lists and describes the configuration jumpers and switches on the system motherboard and PCI board. Sections are as follows: • RMC and SPC Jumpers on System Motherboard • TIG/SROM Jumpers on System Motherboard •...
  • Page 338: Rmc And Spc Jumpers

    RMC and SPC Jumpers on System Motherboard The RMC jumpers can be used to override the RMC defaults. For example, if a high-speed modem is connected to COM1, you can disable J4 to prevent RMC from receiving characters that might cause interference.
  • Page 339: Rmc/Spc Jumper Settings

    Table B–1 RMC/SPC Jumper Settings Jumper Description 1–2: Disables RMC flash update 2–3: Enables RMC flash update (default) Disabling RMC flash update prevents other operators from erasing or updating the RMC. 1–2: Sets RMC back to defaults 2–3: Normal RMC operating mode (default) If the RMC escape sequence is set to something other than the default, and you have forgotten the sequence, RMC must be reset to factory settings to restore the default escape sequence.
  • Page 340: Tig/Fsl Jumpers On System Motherboard

    TIG/FSL Jumpers on System Motherboard TIG/SROM jumpers allow you to load the TIG if flash RAM is corrupted or load the fail-safe loader (FSL) if SRM firmware is corrupted. Figure B–2 TIG/FSL Jumpers 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3...
  • Page 341: Tig/Fsl Jumper Descriptions

    Table B–2 TIG/FSL Jumper Descriptions Jumper Description 1–2: Load TIG from flash ROM (default) 2–3: Load TIG from serial ROM. This setting allows you to load the TIG if the flash ROM is corrupted. Jumper for enabling fail-safe loader (FSL) FIR_FUNC1 (bit 1) 1–2= 0, 2–3= 1 Must be in default positions over pins 1 and 2 to enable FSL.
  • Page 342 SYS_EXT_DELAY1 (off) SYS_EXT_DELAY0 (on) SYS_FILL_DELAY (on) CPU_CFWD_PSET (off) Reserved Reserved Y_DIV3 (on) Y_DIV2 (on) Y_DIV1 (on) SW10 Y_DIV0 (on) ES45 Service Guide...
  • Page 343: Clock Generator Switch Settings

    Clock Generator Switch Settings Switchpack E16 on the system motherboard sets the frequency of the main clock on the system motherboard. The settings should not be changed. Figure B–3 CSB Switchpack E16 SC0034B Jumpers and Switches B-7...
  • Page 344: Clock Generator Settings

    Table B–4 Clock Generator Settings M0 (on) M1 (off) M2 (off) M3 (off) M4 (off) M5 (on) M6 (on) N0 (off) N1 (on) SW10 XTAL_SEL (off) ES45 Service Guide...
  • Page 345: Jumper On Pci Board

    Jumper on PCI Board You can set J19 on the PCI board to force DTR so that a modem will not be disconnected if the system is power cycled. Figure B–4 PCI Board Jumper 1 2 3 9 10 SC0044B Jumpers and Switches B-9...
  • Page 346 Table B–5 PCI Board Jumper Description Jumper Description 1–2: Do not force COM1 DTR 2–3: Force COM1 DTR (default) This jumper allows you to force DTR. The default position prevents disconnection of the modem on a power cycle. B-10 ES45 Service Guide...
  • Page 347: Setting Jumpers

    Setting Jumpers Review the material in the previous sections of this appendix before setting any system jumpers. First, shut down the system and remove the power cord from each power supply. CAUTION: Static electricity can damage integrated circuits. Always use a grounded wrist strap (29-26246) and grounded work surface when working with internal parts of a computer system.
  • Page 349: Appendix C DPR Address Layout

    Appendix C DPR Address Layout This appendix shows the address layout of the dual-port RAM (DPR). Use the SRM examine dpr:address command (where address is the offset from the base of the DPR) or use the RMC dump command to view locations in the DPR. See Appendix D for definitions of locations written when environmental error events occur.
  • Page 350 Table C–1 DPR Address Layout Location Logical Written (Hex) Indicator Used For SROM EV6 BIST status 1=good 0=bad SROM Bit[7]=Master Bits[0,1]=CPU_ID SROM Test STR status 1=good 0=bad SROM Test CSC status 1=good 0=bad SROM Test Pchip 0 PCTL status 1=good 0=bad SROM Test Pchip 1 PCTL status 1=good 0=bad...
  • Page 351 Table C–1 DPR Address Layout (Continued) Location Logical Written (Hex) Indicator By Used For SROM SROM Power On Error Indication for CPU is “alive.” For example: 0 = no error, 2 = Secondary time-out Error, 3 = Bcache Error 17:1D Unused SROM Last “sync state”...
  • Page 352 Table C–1 DPR Address Layout (Continued) Location Logical Written (Hex) Indicator By Used For SROM Array 0 (AAR 0) Size (x64 Mbytes) 0 = no good memory 1 = 64 Mbyte 2 = 128 Mbyte 4 = 256 Mbyte 8 = 512 Mbyte 10 = 1 Gbyte 20 =...
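The array-size entry above is a simple scaling: the stored value, read as hexadecimal, is multiplied by 64 Mbytes (so 10 hex is 16 x 64 = 1024 Mbytes, the 1 Gbyte shown in the table). A minimal Python sketch of that conversion follows; the function name is hypothetical and the single-byte range check is an assumption.

```python
def array_size_mbytes(dpr_value):
    """Convert a DPR array-size value to Mbytes (value x 64 Mbytes).

    A result of 0 corresponds to "no good memory" in Table C-1.
    Assumes the location holds a single byte.
    """
    if not 0 <= dpr_value <= 0xFF:
        raise ValueError("expected a single byte (0x00-0xFF)")
    return dpr_value * 64
```

Passing `0x08` yields 512 and `0x10` yields 1024, in line with the table entries.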
  • Page 353 Table C–1 DPR Address Layout (Continued) Location Logical Written (Hex) Indicator By Used For 93:96 Temperature from CPU(x) in BCD 97:99 Temperature Zone(x) from 3 PCI temp sensors 9A:9F Fan Status; Raw Fan speed value A0:A9 Failure registers used as part of the 680 machine check logout frame.
  • Page 354 Table C–1 DPR Address Layout (Continued) Location Logical Written (Hex) Indicator By Used For Status of RMC to read SCSI backplane Definition: Bit 0 — SCSI backplane 0 Bit 1 — SCSI backplane 1 Bit 4 — Power supply 0 Bit 5 —...
  • Page 355 100:1FF Copy of EEROM on MMB0 J4 DIMM 1, initially read on I²C bus by RMC when 5 volts supply turned on. Written by Compaq Analyze after error diagnosed to particular FRU 200:2FF Copy of EEROM on MMB0 J8...
  • Page 356 Copy of EEROM on CPB (PCI backplane) 2A00:2AFF 2A00 Copy of EEROM on CSB (motherboard) 2B00:2BFF 2B00 Last EV68 Correctable Error—ASCII character string that indicates correctable error occurred, type, FRU, and so on. Backed up in CSB (motherboard) EEROM. Written by Compaq Analyze ES45 Service Guide...
  • Page 357 2C00 Last Redundant Failure—ASCII character string that indicates redundant failure occurred, type, FRU, and so on. Backed up in system CSB (motherboard) EEROM. Written by Compaq Analyze 2D00:2DFF 2D00 Last System Failure—ASCII character string that indicates system failure occurred, type, FRU, and so on. Backed up in CSB (motherboard) EEROM.
  • Page 358 Table C–1 DPR Address Layout (Continued) Location Logical Written (Hex) Indicator Used For 3418 SROM/SRM Waiting to jump to flag for CPU0 3419 SROM Shadow of value written to EV6 DC_CTL register. 341A:341E SROM Shadow of most recent writes to EV6 CBOX “Write-many”...
  • Page 359 Table C–1 DPR Address Layout (Continued) Location Logical Written (Hex) Indicator Used For 3500:35FF Firmware Used as the dedicated buffer in which SRM writes OCP or FRU EEROM data. Firmware will write this data, RMC will only read this data. 3600:36FF 3600 Reserved...
  • Page 361 Appendix D Registers This appendix describes 21264 (EV68) internal processor registers; 21274 (Titan) system support chipset registers; and dual-port RAM (DPR) registers that are related to general logout frame errors. It also provides CPU and system uncorrectable and correctable machine logout frames and error state bit definitions of all the platform logout frame registers.
  • Page 362: Ibox Status Register (I_STAT)

    Table D–1 Ibox Status Register Fields Name Bits Type Description Reserved <63:41> Reserved for Compaq. <40> RO ProfileMe Mispredict Trap. If the I_STAT<TRP> bit is set, this bit indicates that the profiled instruction caused a mispredict trap. JSR/JMP/RET/COR or HW_JSR/HW_JMP/HW_RET/HW_COR mispredicts do not set...
  • Page 363 Table D–1 Ibox Status Register Fields (Continued) Name Bits Type Description LSO <38> RO ProfileMe Load-Store Order Trap. If the profiled instruction caused a replay trap, this bit indicates that the precise trap cause was an Mbox load-store order replay trap.
  • Page 364 When a parity error is detected, the Icache is flushed, a replay trap back to the address of the error instruction is generated, and a correctable read interrupt is requested. See also I_STAT<LAM>. Reserved <28:0> Reserved for COMPAQ ES45 Service Guide...
  • Page 365: Memory Management Status Register (MM_STAT)

    Table D–2 Memory Management Status Register Fields Name Bits Type Description Reserved <63:11> Reserved for Compaq. <10> DC_TAG_ This bit is set when a D-cache tag parity error occurs during PERR the initial tag probe of a load or store instruction. The error created a synchronous fault to the D_FAULT PALcode entry point and is correctable.
  • Page 366: Dcache Status Register Fields

    Table D–3 Dcache Status Register Fields Name Bits Type Description Reserved <63:5> Reserved for Compaq. <4> Second error occurred. When set, indicates that a second D-cache store ECC error occurred within 6 cycles of the previous D-cache store ECC error. ECC_ERR_LD <3>...
  • Page 367: Cbox Read Register

    D.4 Cbox Read Register The Cbox Read Register is read only by PAL code and is an element in the CPU or system uncorrectable and correctable machine check error logout frame. Table D–4 Cbox Read Register Fields Name Description C_SYNDROME_1<7:0> Syndrome for the upper QW in the OW of victim that was scrubbed.
  • Page 368 Table D–4 Cbox Read Register Fields (Continued) Name Description C_STAT<4:0> Bits Error Status (continued) 01100 ISTREAM_BC_ERR 01101 Reserved 0111X Reserved 10011 DSTREAM_MEM_DBL 10100 DSTREAM_BC_DBL 11011 ISTREAM_MEM_DBL 11100 ISTREAM_BC_DBL C_STS<3:0> If C_STAT equals xxx_MEM_ERR or xxx_BC_ERR, then C_STS contains the status of the block as follows; otherwise, the value of C_STS is X.
  • Page 369: Exception Address Register (EXC_ADDR)
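The C_STAT encodings listed in this part of Table D–4 lend themselves to a lookup table. The Python sketch below covers only the codes visible in this excerpt; codes from the earlier part of the table are deliberately left out and return `None`. The function and dictionary names are hypothetical.

```python
# C_STAT<4:0> codes that appear in this part of Table D-4 only.
C_STAT_CODES = {
    0b01100: "ISTREAM_BC_ERR",
    0b01101: "Reserved",
    0b10011: "DSTREAM_MEM_DBL",
    0b10100: "DSTREAM_BC_DBL",
    0b11011: "ISTREAM_MEM_DBL",
    0b11100: "ISTREAM_BC_DBL",
}


def c_stat_name(c_stat):
    """Return the error-status name for C_STAT<4:0>.

    Returns None for codes that are not listed in this excerpt.
    """
    code = c_stat & 0x1F          # keep bits <4:0>
    if (code >> 1) == 0b0111:     # table entry 0111X (X = don't care)
        return "Reserved"
    return C_STAT_CODES.get(code)
```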

    D.5 Exception Address Register (EXC_ADDR) The Exception Address Register (EXC_ADDR) is a read-only register that is updated by hardware when it encounters an exception or interrupt bit. 2 1 0 PC[63:2] LK99-0018A EXC_ADDR<0> is set if the associated exception occurred in PAL mode. The exception actions are: •...
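As a rough illustration of the EXC_ADDR layout described above (PC in bits <63:2>, PAL-mode flag in bit <0>), a logged value could be split as follows. This is a sketch for reading logout-frame values offline; the function name and dictionary keys are hypothetical.

```python
MASK64 = (1 << 64) - 1


def decode_exc_addr(value):
    """Split a 64-bit EXC_ADDR value into its PC and PAL-mode flag.

    Bit <0> set means the exception occurred in PAL mode;
    bits <63:2> hold the PC (returned here with the low two bits cleared).
    """
    value &= MASK64
    return {
        "pal_mode": bool(value & 0x1),
        "pc": value & ~0x3 & MASK64,
    }
```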
  • Page 370 D.6 Interrupt Enable and Current Processor Mode Register (IER_CM) The Interrupt Enable and Current Processor Mode Register (IER_CM) contains the interrupt enable and current processor mode bit fields. These bit fields can be written either individually or together with a single HW_MTPR instruction.
  • Page 371: IER_CM Register Fields

    Table D–5 IER_CM Register Fields Name Extent Type Description Reserved <63:39> EIEN<5:0> <38:33> External Interrupt Enable SLEN <32> Serial Line Interrupt Enable CREN <31> Corrected Read Error Interrupt Enable PCEN<1:0> <30:29> Performance Counter Interrupt Enables SIEN<15:1> <28:14> Software Interrupt Enables ASTEN <13>...
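The field extents in Table D–5 can be applied mechanically to a raw 64-bit value. The Python sketch below decodes only the fields visible in this excerpt (the table continues past ASTEN); the helper names are hypothetical.

```python
# (high bit, low bit) extents copied from the visible part of Table D-5.
IER_CM_FIELDS = {
    "EIEN": (38, 33),   # External Interrupt Enable
    "SLEN": (32, 32),   # Serial Line Interrupt Enable
    "CREN": (31, 31),   # Corrected Read Error Interrupt Enable
    "PCEN": (30, 29),   # Performance Counter Interrupt Enables
    "SIEN": (28, 14),   # Software Interrupt Enables
    "ASTEN": (13, 13),  # listed at <13>; description truncated in the excerpt
}


def extract(value, high, low):
    """Return bits <high:low> of value as an integer."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def decode_ier_cm(value):
    """Decode the IER_CM fields listed in IER_CM_FIELDS."""
    return {name: extract(value, hi, lo)
            for name, (hi, lo) in IER_CM_FIELDS.items()}
```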
  • Page 372: Interrupt Summary Register (ISUM)

    D.7 Interrupt Summary Register (ISUM) The Interrupt Summary Register (ISUM) is a read-only register that records all pending hardware, software, and AST interrupt requests that have their corresponding enable bit set. If a new interrupt (hardware, serial line, crd, or performance counters) occurs simultaneously with an ISUM read, the ISUM read returns zeros.
  • Page 373: ISUM Register Fields

    Table D–6 ISUM Register Fields Name Extent Type Description Reserved <63:39> EI<5:0> <38:33> External Interrupts <32> Serial Line Interrupt <31> Corrected Read Error Interrupts PC<1:0> <30:29> Performance Counter Interrupts PC0 when PC<0> is set. PC1 when PC<1> is set. SI<15:1> <28:14>...
  • Page 374: PAL Base Register (PAL_BASE)

    PAL_BASE[43:15] LK99-0027A Table D–7 PAL_BASE Register Fields Name Extent Type Description Reserved <63:44> RO, 0 Reserved for COMPAQ. PAL_BASE <43:15> Base physical address for PALcode. Reserved <14:0> RO, 0 Reserved for COMPAQ. D-14 ES45 Service Guide...
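Because PAL_BASE occupies bits <43:15> and the remaining bits are reserved and read as zero, the PALcode base physical address is the register value with everything outside that field masked off. A minimal Python sketch, with hypothetical names:

```python
# Mask selecting bits <43:15>, per Table D-7.
PAL_BASE_MASK = ((1 << 44) - 1) & ~((1 << 15) - 1)


def palcode_base(pal_base_reg):
    """Return the PALcode base physical address held in PAL_BASE<43:15>."""
    return pal_base_reg & PAL_BASE_MASK
```

A consequence of the field starting at bit 15 is that the base address is always a multiple of 32 Kbytes (2^15 bytes).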
  • Page 375 D.9 Ibox Control Register (I_CTL) The Ibox Control Register (I_CTL) is a read-write register that controls various Ibox functions. Its contents are cleared by a chip reset. SEXT(VPTB[47]) VPTB[47:30] CHIP_ID[5:0] BIST_FAIL TB_MB_EN MCHK_EN ST_WAIT_64K PCT1_EN PCT0_EN SINGLE_ISSUE_H VA_FORM_32 VA_48 SL_RCV SL_XMIT BP_MODE[1:0] SBE[1:0]...
  • Page 376 Table D–8 I_CTL Register Fields Name Extent Type Description SEXT(VPTB<47>) <63:48> RW,0 Sign extended VPTB<47>. VPTB<47:30> <47:30> RW,0 Virtual Page Table Base. CHIP_ID<5:0> <29:24> This is a read-only field that supplies the revision ID number for the EV68CB/EV68DC part. EV68CB/EV68DC pass 2.3 ID is 010111.
  • Page 377 Table D–8 I_CTL Register Fields (Continued) Name Extent Type Description PCT0_EN <18> RW,0 Enable performance counter #0. If this bit is one, the performance counter will count if EITHER the system (SPCE) or process (PPCE) performance counter enable is set. SINGLE_ISSUE_H <17>...
  • Page 378: I_CTL Register Fields

    Table D–8 I_CTL Register Fields (Continued) Name Extent Type Description SL_RCV <14> When in native mode, any transition on SL_RCV, driven from the SromData_H pin, results in a trap to the PALcode interrupt handler. When in PALmode, all interrupts are blocked. The interrupt routine then begins sampling SL_RCV under a software timing loop to input as much data as needed, using the chosen serial line protocol.
  • Page 379 Table D–8 I_CTL Register Fields (Continued) Name Extent Type Description SPE<2:0> <5:3> RW,0 Super Page Mode Enable. Identical to the SPE bits in the Mbox M_CTL SPE<2:0>. IC_EN<1:0> <2:1> RW,3 Icache Set Enable. At least one set must be enabled. The entire cache may be enabled by setting both bits.
  • Page 380: Process Context Register (PCTX)

    D.10 Process Context Register (PCTX) The process context register (PCTX) contains information associated with the context of a process. Any combination of the bit fields within this register may be written with a single HW_MTPR instruction.
  • Page 381 Table D–9 PCTX Register Fields Name Extent Type Description Reserved <63:47> ASN<7:0> <46:39> Address space number. Reserved <38:13> ASTRR<3:0> <12:9> AST request register—used to request AST interrupts in each of the four processor modes. To generate a particular AST interrupt, its corresponding bits in ASTRR and ASTER must be set, along with the ASTE bit in IER.
  • Page 382: PCTX Register Fields

    Table D–9 PCTX Register Fields (Continued) Name Extent Type Description <2> RW,1 Floating-point enable. If clear, floating-point instructions generate FEN exceptions. This bit is set by hardware on reset. PPCE <1> Process performance counting enable. Enables performance counting for an individual process with counters PCTR0 or PCTR1, which are enabled by setting PCT0_EN or PCT1_EN, respectively.
  • Page 383: 21274 Cchip Miscellaneous Register (MISC)

    D.11 21274 Cchip Miscellaneous Register (MISC) This register is designed so that there are no read side effects, and that writing a 0 to any bit has no effect. Therefore, when software wants to write a 1 to any bit in the register, it need not be concerned with read-modify-write or the status of any other bits in the register.
  • Page 384 Table D–10 21274 Cchip Miscellaneous Register Fields Initial Name Bits Type State Description <63:44> MBZ, RAZ Reserved. DEVSUP <43:40> <39:32> Cchip revision reads as 16 <31:29> NXM source—Device that caused the NXM. Unpredictable if NXM not set. 0 = CPU0 1 = CPU1 2 = CPU2 3 = CPU3...
  • Page 385 Table D–10 21274 Cchip Miscellaneous Register Fields (Continued) Initial Name Bits Type State Description IPINTR <11:8> R, W1C Interprocessor interrupt pending— one bit per CPU. ITINTR <7:4> R, W1C Interval timer interrupt pending— one bit per CPU. <3:2> MBZ, RW Reserved.
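The write-1-to-clear (W1C) behavior of IPINTR and ITINTR means a single pending bit can be cleared by writing a value with only that bit set, since zeros have no effect. The Python sketch below models this under one labeled assumption: that the lowest bit of each four-bit field corresponds to CPU0. All names are hypothetical, and the model is for reasoning about values, not for driving hardware.

```python
IPINTR_LSB = 8   # IPINTR occupies bits <11:8>
ITINTR_LSB = 4   # ITINTR occupies bits <7:4>


def w1c_mask(field_lsb, cpu):
    """Build the value to write to clear one CPU's pending bit.

    Assumption: bit field_lsb + n belongs to CPU n (n = 0..3).
    """
    if cpu not in range(4):
        raise ValueError("cpu must be 0..3")
    return 1 << (field_lsb + cpu)


def apply_w1c(current, written, w1c_bits):
    """Model a write to W1C bits: ones clear, zeros leave bits unchanged."""
    return current & ~(written & w1c_bits)
```

Writing `w1c_mask(IPINTR_LSB, 2)` in this model clears only bit 10 and leaves the other pending bits as they were, which is why no read-modify-write sequence is needed.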
  • Page 386: 21274 Cchip CPU Device Interrupt Request Register (DIRn, n=0,1,2,3)

    D.12 21274 Cchip CPU Device Interrupt Request Register (DIRn, n=0,1,2,3) Register n applies to CPUn. These registers indicate which interrupts are pending to the CPUs. If a raw request bit is set and the corresponding mask bit is set, then the corresponding bit in this register will be set and the appropriate CPU will be interrupted.
  • Page 387: 21274 Array Address Registers (AAR0–AAR3)

    D.13 21274 Array Address Registers (AAR0–AAR3) The Array Address Registers define the base address and size for each memory array. Table D–12 21274 Array Address Register (AAR) Field Bits Type Init Description <63:35> MBZ,RAZ Reserved. ADDR <34:24> Base address – Bits <34:24> of the physical byte address of the first byte in the array.
  • Page 388 Table D–12 21274 Array Address Register (AAR) (Continued) Field Bits Type Init Description <11:10> MBZ,RAZ Reserved. <9> Double (Twice)-split array <8> Split array. <7:4> MBZ,RAZ Reserved. ROWS <3:2> Number of row bits in the SDRAMs. Value Number of Bits Reserved BNKS <1:0>...
  • Page 389: Pchip System Error Register (SERROR)

    D.14 Pchip System Error Register (SERROR) This register is used for logging system errors. When system error bits <4, 2:0> are set, this entire register is frozen. Only the NXIO bit and the LOST bit can be set after that. All other values will be held until bits <2:1>...
  • Page 390 Table D–13 Pchip System Error Register (Continued) Field Bits Type Init Description <51:47> Reserved ADDR <46:15> Address of the erroneous quadword <14:5> Reserved LOST_CRE <4> R,W1C Lost a correctable ECC error because it was detected after this register was locked. c_err is asserted as long as this bit is set.
  • Page 391: Pchip A/G PCI Error Register (GPERROR, APERROR)

    D.15 Pchip A/G PCI Error Register (GPERROR, APERROR) This register is used for logging PCI errors on the GPCI or APCI buses respectively. The GPCI and APCI registers are identical. If any of bits <11:2> are set, then this entire register is frozen and the Pchip output signal h_err is asserted. Only bits <1:0>...
  • Page 392: Pchip Error Register

    Table D–14 Pchip Error Register Field Bits Type Init Description <63:56> Reserved <55:52> PCI command Value Command 0000 Interrupt Ackn 0001 Special Cycle 0010 I/O Read 0011 I/O Write 0100 Reserved 0101 Reserved 0110 Memory Read 0111 Memory Write 1000 Reserved 1001 Reserved...
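The PCI command field in bits <55:52> can be pulled out of a logged error-register value and matched against the encodings above. The Python sketch below includes only the codes visible in this excerpt (0000 through 1001); later codes return `None` rather than a guessed name. The function and table names are hypothetical.

```python
# Encodings copied from the visible part of Table D-14.
PCI_COMMANDS = {
    0b0000: "Interrupt Acknowledge",
    0b0001: "Special Cycle",
    0b0010: "I/O Read",
    0b0011: "I/O Write",
    0b0100: "Reserved",
    0b0101: "Reserved",
    0b0110: "Memory Read",
    0b0111: "Memory Write",
    0b1000: "Reserved",
    0b1001: "Reserved",
}


def pci_command(error_reg):
    """Return (code, name) for the PCI command in bits <55:52>.

    name is None for codes not listed in this excerpt.
    """
    code = (error_reg >> 52) & 0xF
    return code, PCI_COMMANDS.get(code)
```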
  • Page 393 Table D–14 Pchip Error Register (Continued) Field Bits Type Init Description ADDR <46:14> Contains PCI address bits <34:02> <13:11> Reserved IPTPW <10> R,W1C Invalid peer-to-peer write. IPTPR <9> R,W1C Invalid peer-to-peer read. <8> R,W1C No devsel received as a PCI master. <7>...
  • Page 394 Table D–14 Pchip Error Register (Continued) Field Bits Type Init Description DCRTO <3> R,W1C Delayed completion Retry timeout as PCI target. PERR <2> R,W1C Pchip received a p_err_l assertion on data it sent to the PCI. If the command logged in bits <55:52> of this register is a read, the PCI transaction that encountered the error was a DMA Read or a PTP read on the source bus and the...
  • Page 395: Pchip AGP Error Register (AGPERROR)

    D.16 Pchip AGP Error Register (AGPERROR) The register is used for logging AGP errors. If any of bits <6:4, 0> are set, then this entire register is frozen and the Pchip output signal h_err is asserted. Only bit <0> can be set after that. All other values will be held until bits <6:4>...
  • Page 396 Table D–15 Pchip AGP Error Register (Continued) Field Bits Type Init Description MWIN <49> Monster Window hit <48> <47> Reserved ADDR <46:15> AGP address <34:3> corresponding to the erroneous quadword. <14:7> Reserved NOWINDOW <6> R,W1C An incoming AGP address did not match the Window registers.
  • Page 397: Dpr Registers For 680 Correctable Machine Check Logout Frames

    D.17 DPR Registers for 680 Correctable Machine Check Logout Frames DPR Locations A0:A9 represent the information that the console will read when a 680 machine check logout frame is loaded. They provide the interrupt information obtained by the RMC through the LM78 sensors.
  • Page 398 Table D–16 DPR Locations A0:A9 (Continued) Location Description If bit is set the associated fault is active. Bit 0 CPU0_VCORE out of tolerance CPU0_VIO out of tolerance CPU1_VCORE out of tolerance CPU1_VIO out of tolerance PCI backplane LM78 1 is overtemp Not used Fan 4 fault Fan 5 fault...
  • Page 399 Table D–16 DPR Locations A0:A9 (Continued) Location Description These bits indicate a door has been opened. Bit 0 unused CPU door is open Fan door is open PCI door is open System CPU door is open System fan door is open System PCI door is open Temperature Warning Mask Bit 0...
  • Page 400: Dpr Power Supply Status Registers

    D.18 DPR Power Supply Status Registers The RMC reads nine bytes of information from each of the three power supplies. The first byte is read from an I/O expander port, the second four bytes and the last four bytes are read from the A–D converter. Table D–17 Nine Bytes Read from Power Supply DPR Location Definition...
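The nine-byte layout described above (one byte from the I/O expander port, then two groups of four bytes from the A–D converter) can be unpacked as in the Python sketch below. The meaning of the individual bytes is given in Table D–17 and is not modeled here; the function name and dictionary keys are hypothetical.

```python
def split_power_supply_bytes(data):
    """Split the nine bytes read from one power supply into three groups.

    Byte 0 comes from the I/O expander port; bytes 1-4 and 5-8 come
    from the A-D converter.
    """
    data = bytes(data)
    if len(data) != 9:
        raise ValueError("expected exactly nine bytes")
    return {
        "io_expander": data[0],
        "adc_first": data[1:5],
        "adc_last": data[5:9],
    }
```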
  • Page 401: Dpr 680 Fatal Registers

    D.19 DPR 680 Fatal Registers The RMC is powered by an auxiliary 5V supply that is independent from the system power subsystem. When any catastrophic failures (such as overtemperature failure) occur, this error state is captured as shown in Table D–18. The information is used to populate the console data log uncorrectable error frame in Environ_QW_8.
  • Page 402: Cpu And System Uncorrectable Machine Check Logout Frame

    D.20 CPU and System Uncorrectable Machine Check Logout Frame The SRM console builds the uncorrectable machine check logout frames and passes them to the OS error handlers. The OS error handlers further process and subsequently log the formatted error event into the system binary error log. Table D–19 CPU and System Uncorrectable Machine Check Logout Frame 48 47...
  • Page 403: Console Data Log Event Environmental Error Logout Frame (680 Uncorrectable)

    D.21 Console Data Log Event Environmental Error Logout Frame (680 Uncorrectable) Compaq Analyze uses the logout frame in Table D–20 for its decomposition of all 680 system environmental uncorrectable error frames. Table D–20 Console Data Log Event Environmental Error Logout...
  • Page 404: Cpu And System Correctable Machine Check Logout Frame

    D.22 CPU and System Correctable Machine Check Logout Frame The SRM console builds the correctable machine check logout frames and passes them to the OS error handlers. The OS error handlers further process and subsequently log the formatted error event into the system binary error log.
  • Page 405: Environmental Error Logout Frame (680 Correctable

    D.23 Environmental Error Logout Frame (680 Correctable) Table D–22 shows Environ_QW_1:7 and Environ_QW_8 error state capture information from locations A0:A9 and BD:BF, respectively. Table D–22 Environmental Error Logout Frame 56 55 48 47 40 39 32 31 24 23 16 15 0 Offset (Hex) Retryable/Second Error Flags Frame Size 0070)
  • Page 406: Platform Logout Frame Register Translation

    D.24 Platform Logout Frame Register Translation Compaq Analyze uses information from all logout frames for its decomposition of all error events. The error state bit definitions of all platform logout frame registers are shown in Table D–23. Table D–23 Bit Definition of Logout Frame Registers...
  • Page 407 Table D–23 Bit Definition of Logout Frame Registers (Continued) Register Identification Text Translation Description Field C_SYNDROME_0 <7:0>(Hex) Data Bit <7:0>(Hex)70 Data Bit (continued) C_SYNDROME_1 <7:0> Syndrome for upper quadword in octaword of victim that was scrubbed (same as specified above) C_STAT <4:0>...
  • Page 408: Bit Definition Of Logout Frame Registers

    Table D–23 Bit Definition of Logout Frame Registers (Continued) Register Identification Bit Field Text Translation Description C_ADDR <42:6> Address of last reported ECC or parity error. If C_STAT<4:0> = 05(Hex) then only C_ADDR<19:6> are valid. I_STAT <63:41> Reserved <40> ProfileMe Mispredict Trap <39>...
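The C_ADDR validity rule above amounts to choosing between two bit masks. The following is a minimal sketch, assuming both registers are available as plain integers. The function names are hypothetical, and only the single C_STAT condition stated in the table is handled.

```python
def field_mask(high: int, low: int) -> int:
    """Return a mask covering bits <high:low>, inclusive."""
    return ((1 << (high - low + 1)) - 1) << low


def valid_c_addr(c_addr: int, c_stat: int) -> int:
    """Keep only the C_ADDR bits that the table says are valid."""
    status = c_stat & 0x1F              # C_STAT<4:0>
    if status == 0x05:
        # Only C_ADDR<19:6> are valid for this status code.
        return c_addr & field_mask(19, 6)
    # Otherwise the full C_ADDR<42:6> field applies.
    return c_addr & field_mask(42, 6)
```

The field is masked in place rather than shifted down, so the result can be compared directly with a physical address.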
  • Page 409 Table D–23 Bit Definition of Logout Frame Registers (Continued) Register Identification Bit Field Text Translation Description EXC_ADDR <0> Set = exception or interrupt occurred in PAL mode <63:2> Contains the PC address of the instruction that would have executed if the error interrupt did not occur. IER_CM <4:3>...
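The two EXC_ADDR fields listed above can be separated with a pair of masks. This is a sketch under the assumption that the register is available as a 64-bit integer and that the PC is wanted in place (low two bits cleared) rather than shifted. The function name is hypothetical.

```python
def decode_exc_addr(exc_addr: int) -> dict:
    """Split EXC_ADDR into the PAL mode flag <0> and the PC field <63:2>."""
    return {
        "pal_mode": bool(exc_addr & 0x1),         # <0> set = PAL mode
        "pc": exc_addr & 0xFFFFFFFFFFFFFFFC,      # <63:2>, kept in place
    }
```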
  • Page 410 Table D–23 Bit Definition of Logout Frame Registers (Continued) Register Identification Bit Field Text Translation Description I_CTL <2:1> 01(Bin) and 10(Bin) for Icache set 1 or 2 enabled, respectively <7:6> 01(Bin) and 10(Bin) for R8-R11 & R24-R27 and R4-R7 & R20- R23 are used for PAL shadow registers, respectively <13>...
  • Page 411 Table D–23 Bit Definition of Logout Frame Registers (Continued) Bit Field Text Translation Description MISC <43:40> Suppress IRQ1 interrupts to 1(Hex) for CPU0, 2(Hex) for CPU1, 4(Hex) for CPU2, and 8(Hex) for CPU3 Cchip <39:32> Cchip Revision Level : 00-07(Hex) for C2, 08-0F(Hex) for C4 <31:29>...
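The two MISC fields shown above can be extracted with shifts and masks. The following is a minimal sketch. The function name and result keys are hypothetical, and treating <43:40> as a bit mask that may name more than one CPU is an assumption, since the table lists only the single-CPU values.

```python
def decode_misc(misc: int) -> dict:
    """Decode MISC<43:40> (IRQ1 suppression) and MISC<39:32> (Cchip revision)."""
    irq1_field = (misc >> 40) & 0xF     # 1 = CPU0, 2 = CPU1, 4 = CPU2, 8 = CPU3
    revision = (misc >> 32) & 0xFF      # Cchip revision level

    if revision <= 0x07:
        family = "C2"                   # 00-07 (hex)
    elif revision <= 0x0F:
        family = "C4"                   # 08-0F (hex)
    else:
        family = "unknown"              # not covered by the table

    return {
        "irq1_suppressed_cpus": [cpu for cpu in range(4) if irq1_field & (1 << cpu)],
        "cchip_revision": revision,
        "cchip_family": family,
    }
```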
  • Page 412 Table D–23 Bit Definition of Logout Frame Registers (Continued) Bit Field Text Translation Description DIRx <63> Internal Cchip asynchronous error (i.e., NXM) (IRQ0) <62> P0_Pchip error (IRQ0) <61> P1_Pchip error (IRQ0) <60> P2_Pchip error (future designs) (IRQ0) <59> P3_Pchip error (future designs) (IRQ0) <58>...
  • Page 413 Table D–23 Bit Definition of Logout Frame Registers (Continued) Register Identification Bit Field Text Translation Description P0 & 1_ERROR <63:56> ECC Syndrome of CRE or UECC error - Same as EV68. When CRE or UECC failing transaction: 0000(Bin) = <55:52> DMA Read;...
  • Page 414 Table D–23 Bit Definition of Logout Frame Registers (Continued) Register Identification Bit Field Text Translation Description SMIR <7> Inverted Sys_Rst = System is being reset <6> Inverted PCI_Rst1 = PCI Bus #1 is in reset (Environ_QW_1) <5> Inverted PCI_Rst0 = PCI Bus #0 is in reset <4>...
  • Page 415 Table D–23 Bit Definition of Logout Frame Registers (Continued) Register Identification Bit Field Text Translation Description System_PS/Temp/ <0> Set = PS +3.3V out of tolerance Fan_Fault_ <1> Set = PS +5V out of tolerance LM78_ISR <2> Set = PS +12V out of tolerance <3>...
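The first three bits of this register translate directly into text, which can be sketched as a lookup table. This example covers only bits <0> through <2>, the ones whose meanings are visible above. The names `PS_FAULT_BITS` and `decode_ps_faults` are hypothetical.

```python
# Bit number -> text translation, taken from the rows shown in Table D-23.
PS_FAULT_BITS = {
    0: "PS +3.3V out of tolerance",
    1: "PS +5V out of tolerance",
    2: "PS +12V out of tolerance",
}


def decode_ps_faults(register: int) -> list:
    """Return the text translation for every fault bit that is set."""
    return [text for bit, text in PS_FAULT_BITS.items() if register & (1 << bit)]
```

The remaining bits of the register would be added to the dictionary from the full table.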
  • Page 416 Table D–23 Bit Definition of Logout Frame Registers (Continued) Register Identification Bit Field Text Translation Description <47> Set = Power supply AC input high limit warning <63:48> Unused System_Doors <0> Unused <1> Set = System CPU door is open (Environ_QW_5) <2>...
  • Page 417 Table D–23 Bit Definition of Logout Frame Registers Register Identification Bit Field Text Translation Description Fatal_Power_Down_Codes <0> Set = Power Supply 0 AC input fail <1> Set = Power Supply 1 AC input fail (Environ_QW_8) <2> Set = Power Supply 2 AC input fail <3:7>...
  • Page 419: Appendix E Isolating Failing Dimms

    Appendix E Isolating Failing DIMMs This appendix explains how to manually isolate a failing DIMM from the failing address and failing data bits. It also covers how to isolate single-bit errors. The following topics are covered: • Information for Isolating Failures •...
  • Page 420: Information For Isolating Failures

    Information for Isolating Failures Table E–1 lists the information needed to isolate the failure. See Appendix D for the register table for the Array Address Registers (AARs). The failing address and failing data can come from a variety of different locations such as the SROM serial line, SRM screen displays, the SRM event log, and errors detected by the 21264 (EV68) chip.
  • Page 421: Dimm Isolation Procedure

    DIMM Isolation Procedure Use the procedure in this section to isolate the failing DIMM. 1. Find the failing array by using the failing address and the Array Address Registers (AARs; see Appendix D). Use the AAR base address and size to create an address range to compare against the failing address.
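Step 1 is a range comparison, which can be sketched as follows. The sketch assumes the base address and size of each array have already been decoded from the AARs into byte values; the decoding itself is defined in Appendix D and is not shown. The function name and the example memory layout are hypothetical.

```python
def find_failing_array(failing_address: int, arrays: dict):
    """Return the number of the array whose address range holds the failing address.

    arrays maps an array number to a (base_address, size_in_bytes) tuple.
    Returns None when no range matches.
    """
    for number, (base, size) in sorted(arrays.items()):
        # An array of size 0 is treated as not populated.
        if size and base <= failing_address < base + size:
            return number
    return None


MB = 1024 * 1024
# Illustrative layout only: two 512 MB arrays placed back to back.
example_arrays = {0: (0, 512 * MB), 1: (512 * MB, 512 * MB)}
print(find_failing_array(0x20001000, example_arrays))
```

This covers only step 1; the later steps refine the result to a set and a DIMM.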
  • Page 422 3. After finding the real array, determine whether it is the lower array set or the upper array set. Use DPR locations 80, 82, 84, and 86 listed in Table E–1. Table E–4 shows the description of these locations. Table E–4 Description of DPR Locations 80, 82, 84, and 86 Location Description Array 0 (AAR 0) Configuration...
  • Page 423 4. Use the following table to determine the proper set. Bits<27,28,29,30,31,32> are from the failing address. Array Configuration Type Bits <7:4> from DPR Size 4 & 5 D & F 256MB Lower Set Upper Set Bit <27> == 0 – Lower Set, 1– Upper Set 512MB Lower Set Upper Set...
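Where the table selects the set from a single failing-address bit, the test is a one-bit extraction. The sketch below shows only that test, using the 256MB row (bit <27>) as the example. It does not model the configuration-type columns of the table, and the function name is hypothetical.

```python
def set_from_address_bit(failing_address: int, bit: int) -> str:
    """Return 'lower' when the chosen address bit is 0 and 'upper' when it is 1."""
    return "upper" if (failing_address >> bit) & 1 else "lower"


# 256MB row: bit <27> of the failing address decides the set.
print(set_from_address_bit(0x08000000, 27))
```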
  • Page 424 Table E–5 Failing DIMM Lookup Table Array 0 Array 1 Array 2 Array 3 Data Lower Upper Lower Upper Lower Upper Lower Upper Bits
  • Page 425: Failing Dimm Lookup Table

    Table E–5 Failing DIMM Lookup Table (Continued) Array 0 Array 1 Array 2 Array 3 Data Lower Upper Lower Upper Lower Upper Lower Upper Bits
  • Page 435 Table E–5 Failing DIMM Lookup Table (Continued) Array 0 Array 1 Array 2 Array 3 Check Lower Upper Lower Upper Lower Upper Lower Upper Bits
  • Page 437: Ev68 Single-Bit Errors

    EV68 Single-Bit Errors The procedure for isolating a single-bit error down to the set of DIMMs is very similar to the procedure described in the previous sections. However, you cannot isolate down to a specific data or check bit. The 21264 (EV68) chip detects and reports a C_ADDR<42:6>...
  • Page 438 Table E–6 Syndrome to Data Check Bits Table (Continued) Syndrome C_Syndrome 0 C_Syndrome 1 Data Bit 14 or 142 Data Bit 78 or 206 Data Bit 15 or 143 Data Bit 79 or 207 Data Bit 16 or 144 Data Bit 80 or 208 Data Bit 17 or 145 Data Bit 81 or 209 Data Bit 18 or 146...
  • Page 439 Table E–6 Syndrome to Data Check Bits Table (Continued) Syndrome C_Syndrome 0 C_Syndrome 1 Data Bit 46 or 174 Data Bit 110 or 238 Data Bit 47 or 175 Data Bit 111 or 239 Data Bit 48 or 176 Data Bit 112 or 240 Data Bit 49 or 177 Data Bit 113 or 241 Data Bit 50 or 178...
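The rows of Table E–6 shown above follow a regular pattern: each C_Syndrome 0 entry names a data bit and that bit plus 128, and the matching C_Syndrome 1 entry is offset by a further 64. The sketch below reproduces only that arithmetic. The mapping from a syndrome value to its row is defined by the table and is not shown, the function name is hypothetical, and the 0 to 63 range check is an assumption.

```python
def candidate_data_bits(row_bit: int, syndrome_register: int) -> tuple:
    """Return the two candidate data bits for a Table E-6 row.

    row_bit is the lower data bit in the row's C_Syndrome 0 column.
    syndrome_register is 0 for C_Syndrome 0 or 1 for C_Syndrome 1.
    """
    if not 0 <= row_bit <= 63:
        raise ValueError("row_bit is assumed to lie in 0..63")
    if syndrome_register not in (0, 1):
        raise ValueError("syndrome_register must be 0 or 1")
    low = row_bit + 64 * syndrome_register
    return (low, low + 128)


print(candidate_data_bits(14, 0))
print(candidate_data_bits(14, 1))
```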
