Summary of Contents for Digital Equipment DIGITAL Ultimate Workstation 533
AlphaServer 1200
DIGITAL Ultimate Workstation 533
Service Manual
PRELIMINARY
Order Number: EK–1200A–SV.A01
This manual is for anyone who services an AlphaServer/AlphaStation system. It includes troubleshooting information, configuration rules, and instructions for the removal and replacement of field-replaceable units. Digital Equipment Corporation...
The software, if any, described in this document is furnished under a license and may be used or copied only in accordance with the terms of such license. No responsibility is assumed for the use or reliability of software or equipment that is not supplied by Digital Equipment Corporation or its affiliated companies.
Contents
Preface  x
Chapter 1 Overview
  1200 System  1-2
  Control Panel and Drives  1-4
  System Consoles  1-6
  System Architecture  1-8
  CPU Types  1-10
  Memory  1-12
  Memory Addressing  1-14
  System Motherboard  1-16
  1.8.1 System Bus (Backplane)  1-18
  1.8.2 System Bus to PCI Bus Bridge ...
  Troubleshooting Power Problems  3-4
  Running Diagnostics - Test Command  3-6
  Releasing Secure Mode  3-7
  Testing an Entire System  3-8
  3.5.1 Testing Memory  3-10
  3.5.2 Testing PCI  3-12
  Other Useful Console Commands  3-14
Chapter 4 Error Logs
  Using Error Logs  4-2
  4.1.1 Hard Errors ...
Chapter 5 Error Registers
  External Interface Status Register - EI_STAT  5-2
  5.1.1 External Interface Address Register - EI_ADDR  5-6
  5.1.2 MC Error Information Register 0 (MC_ERR0 - Offset = 800)  5-8
  5.1.3 MC Error Information Register 1 (MC_ERR1 - Offset = 840)  5-9
  5.1.4 CAP Error Register (CAP_ERR - Offset = 880) ...
  Updating Firmware from AlphaBIOS  A-24
  Upgrading AlphaBIOS  A-25
  Hard Disk Partitioning  A-26
  A.7.1 Hard Disk Error Conditions  A-26
  A.7.2 System Partitions  A-27
  A.7.3 How AlphaBIOS Works with System Partitions  A-28
  Using the Halt Button  A-29
  Halt Assertion  A-30
Appendix B SRM Console Commands and Environment Variables...
Preface Intended Audience This manual is written for the customer service engineer. Document Structure This manual uses a structured documentation design. Topics are organized into small sections for efficient online and printed reference. Each topic begins with an abstract, followed by an illustration or example, and ends with descriptive text. This manual has six chapters and three appendixes, as follows: •...
Documentation Titles Table 1 lists the books in the AlphaServer 1200 documentation set.
Table 1 AlphaServer 1200 Documentation
Title                                          Order Number
User and Installation Documentation Kit        QZ–011AA–GZ
  AlphaServer 1200 User’s Guide                EK–AS120–UG
  AlphaServer 1200 Basic Installation Guide    EK–AS120–IG
Service Information
  AlphaServer 1200 Service Manual              EK–AS1200–SV...
Chapter 1 System Overview This chapter introduces the DIGITAL AlphaServer 1200 and the DIGITAL AlphaStation 1200 systems. These systems are available in cabinets or pedestals. Pedestal systems contain a maximum of two CPUs, up to 2 Gbytes of memory, and 6 PCI I/O slots or 5 PCI I/O slots and 1 EISA/ISA slot.
1200 System The 1200 system has up to two CPU modules and up to 2 Gbytes of memory. A single fast wide SCSI StorageWorks shelf provides storage. The system is ready for the next generation of SCSI drives. Figure 1-1 1200 System ...
The numbered callouts in Figure 1-1 refer to components of the system. System card cage, which holds the system motherboard and the CPU, and ² memory, and system I/O. PCI/EISA section of the system card cage. ³ Server Control panel assembly, which includes the control panel, the LCD ´...
Control Panel and Drives The control panel includes the On/Off, Halt, and Reset buttons and an LCD display. Figure 1-3 Control Panel Assembly (labels: CD-ROM, floppy, OCP display). OCP display. The OCP display is a 16-character LCD that indicates status during power-up and self-test.
³ Halt button. When the Halt button is pressed, the result depends on the state of the machine:
Machine State                Result of Pressing the Halt Button
OpenVMS running/hung         Simple halt. The SRM console runs.
DIGITAL UNIX running/hung    Simple halt.
System Consoles There are two console programs: the SRM console and the AlphaBIOS console. SRM Console Prompt On systems running the DIGITAL UNIX or OpenVMS operating system, the following console prompt is displayed after system startup messages are displayed, or whenever the SRM console is invoked: P00>>>...
SRM Console The SRM console is a command-line interface that is used to boot the DIGITAL UNIX and OpenVMS operating systems. It also provides support for examining and modifying the system state and configuring and testing the system. The SRM console can be run from a serial terminal or a graphics monitor.
System Architecture Alpha microprocessor chips are used in these systems. The CPU, memory, and I/O module(s) are connected to the system motherboard. Figure 1-4 Architecture Diagram (labels include: memory pair; system bus with a 128-bit data bus plus 16 ECC bits and a 40-bit command/address bus; PCI bus 0; PCI bus 1) ...
AlphaServer 1200 systems use the Alpha chip for the CPU. The CPU, memory, and I/O devices connect to the system motherboard. The system motherboard contains:
• the system bus
• two system bus to PCI bus chip sets that bridge two PCI buses to the system bus
•...
CPU Types There are several CPU variants, differentiated by CPU speed. Figure 1-5 CPU Module Placement (labels include CPU 0, CPU 1, MEM L, MEM H, switchpack, LEDs, PCI bridges, internal SCSI connector, and PCI 0 slots 2, 3, and 4) ...
Alpha Chip Composition The Alpha chip is built with state-of-the-art chip technology, has 9.3 million transistors, consumes 50 watts of power, and is air cooled (a fan is mounted on the chip). The default cache policy is write-back, and when the module has an external cache, that cache is also write-back.
Memory Memory consists of two riser cards and up to eight pairs of DIMMs. Each riser card holds one of the two DIMMs in each pair. There are two DIMM variants: a 32MB version and a 128MB version. Figure 1-6 Memory Placement ...
Memory Variants Memory consists of two riser cards supporting 8 DIMM pairs. There are two DIMM variants: a 32MB version and a 128MB version. Maximum memory using 32MB DIMMs is 512MB, and maximum memory using 128MB DIMMs is 2GB. All memory is synchronous DRAM.
Option...
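The maximum-memory figures come from 8 pairs × 2 DIMMs × DIMM size. The Python sketch below checks that arithmetic. The function and constant names are hypothetical, and the sketch only encodes the limits stated above (at most 8 pairs; 32MB or 128MB DIMMs).

```python
# Sketch: total memory from the DIMM size used in each pair.
# Limits come from the text above: up to 8 pairs, 32MB or 128MB DIMMs.
VALID_DIMM_MB = (32, 128)
MAX_PAIRS = 8


def total_memory_mb(pair_dimm_sizes_mb):
    """Return total memory in MB; the list holds the DIMM size of each pair."""
    if len(pair_dimm_sizes_mb) > MAX_PAIRS:
        raise ValueError("at most 8 DIMM pairs are supported")
    for size in pair_dimm_sizes_mb:
        if size not in VALID_DIMM_MB:
            raise ValueError(f"unsupported DIMM size: {size} MB")
    # Each pair contributes two DIMMs, one on each riser card.
    return sum(2 * size for size in pair_dimm_sizes_mb)


print(total_memory_mb([32] * 8))    # 512
print(total_memory_mb([128] * 8))   # 2048 (2GB)
```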
Memory Addressing Memory addressing in these systems is fixed regardless of the size of the DIMMs. The address of a DIMM pair is fixed by the slot in which the pair is placed. The starting address of the pair in each riser card slot falls on a 512 MB boundary.
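One reading of "fixed per slot, on a 512 MB boundary" is that the pair in slot n starts at n × 512 MB. The Python sketch below works under that assumption only, and the manual does not give the address map explicitly. The names are hypothetical.

```python
# Hypothetical sketch: assumes the DIMM pair in slot n starts at n * 512 MB.
SLOT_STRIDE = 512 * 1024 * 1024  # 512 MB in bytes
NUM_SLOTS = 8                    # eight DIMM pair slots (MEM0-MEM7)


def pair_base_address(slot):
    """Return the assumed physical starting address of the pair in `slot`."""
    if not 0 <= slot < NUM_SLOTS:
        raise ValueError("slot must be 0-7")
    return slot * SLOT_STRIDE


for slot in range(NUM_SLOTS):
    print(f"slot {slot}: 0x{pair_base_address(slot):09X}")
```

Under this assumption, a small pair leaves an unused gap below the next 512 MB boundary. That follows from addressing being independent of DIMM size.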
The rules for addressing memory are as follows:
1. A memory pair consists of 2 DIMMs of the same size.
2. Memory pairs in riser cards may be of different sizes.
3. The memory pair in slot 0 must be the largest of all memory pairs. Other memory pairs may be as large, but none may be larger.
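The three rules can be checked mechanically before power-up. The Python sketch below does this. The data layout (slot number mapped to the DIMM sizes on the MEM L and MEM H riser cards) is a hypothetical representation. It also assumes that slot 0 must be populated, because rule 3 refers to its pair.

```python
# Sketch of the addressing rules. Hypothetical layout:
# pairs maps slot -> (DIMM size on riser MEM L, DIMM size on riser MEM H), in MB.
# Assumption: slot 0 must be populated, since rule 3 refers to its pair.


def check_memory_config(pairs):
    """Return a list of rule violations; an empty list means the layout is legal."""
    errors = []

    # Rule 1: a memory pair is two DIMMs of the same size.
    for slot, (size_l, size_h) in sorted(pairs.items()):
        if size_l != size_h:
            errors.append(f"slot {slot}: DIMMs differ ({size_l} MB vs {size_h} MB)")

    # Rule 2 allows pairs of different sizes, so nothing is checked for it.

    # Rule 3: no pair may be larger than the pair in slot 0.
    if 0 not in pairs:
        errors.append("slot 0 is empty")
        return errors
    slot0_total = sum(pairs[0])
    for slot, (size_l, size_h) in sorted(pairs.items()):
        if size_l + size_h > slot0_total:
            errors.append(f"slot {slot}: pair is larger than the slot 0 pair")
    return errors


print(check_memory_config({0: (128, 128), 1: (32, 32)}))  # []
print(check_memory_config({0: (32, 32), 1: (128, 128)}))
```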
System Motherboard The system motherboard contains five major logic sections, each performing a major system function. Figure 1-8 System Motherboard (labels include CPU 0, CPU 1, MEM L, MEM H, power control logic, CPU and memory backplane, server control, and system bus logic) ...
The five sections on the system motherboard are:
• The system bus, or the CPU and memory backplane.
• The power control logic.
• The server control logic.
• The system bus to PCI bus bridges.
• The PCI backplane, containing two PCI buses, an EISA/ISA bus, a built-in CD-ROM controller, and an XBUS with several devices integral to the system.
1.8.1 System Bus (Backplane) The system bus consists of a 40-bit command/address bus, a 128-bit plus ECC data bus, and several control signals and clocks. The system bus is part of the system motherboard. Figure 1-9 System Bus Block Diagram ...
The system bus consists of a 40-bit command/address bus, a 128-bit plus ECC data bus, and several control signals, clocks, and a bus arbiter. The bus requires that all CPUs have the same high-speed oscillator providing the clock to the Alpha chip. The 1200 system bus connects up to two CPUs, up to eight DIMM memory pairs on two riser cards, and two I/O bus bridges.
1.8.2 System Bus to PCI Bus Bridge The bridge is the physical interconnect between the system bus and the PCI bus. Figure 1-10 System Bus to PCI Bus Bridge Block Diagram ...
The system bus to PCI bus bridge module converts system bus commands and data addressed to I/O space to PCI commands and data; and converts PCI bus commands and data addressed to system memory or CPUs to system bus commands and data. The bridge has two major components: •...
1.8.3 PCI I/O Subsystem The I/O subsystem consists of two 64-bit PCI buses. One has an embedded EISA/ISA bridge and three PCI option slots; the other has a built-in CD-ROM controller and three PCI option slots. Figure 1-11 PCI Block Diagram ...
Table 1-1 PCI Motherboard Slot Numbering
Slot  PCI0                       PCI1
1     PCI to EISA/ISA bridge     Internal CD-ROM controller
2     PCI slot                   PCI slot
3     PCI slot                   PCI slot
4     PCI slot                   PCI slot
The logic for both PCI buses is on the PCI motherboard.
• PCI0 is a 64-bit bus with a built-in PCI to EISA/ISA bus bridge.
1.8.4 Remote Control Logic A section of the motherboard provides remote control operation of the system. A four-switch switchpack controls use of remote control. Figure 1-12 Remote Control Logic ...
The 1200 system allows both local and remote control. A set of switches enables or disables remote control. The switches and their functions are:
Switch  Condition   Function
1       EN RCM      Allows remote system control
                    Does not allow remote system control
2       Modem Off   Disables the RCM modem port
                    Enables the RCM modem port...
1.8.5 Power Control Logic The power control section of the motherboard controls power sequencing and monitors power supply voltage, system temperature, and fans. Figure 1-13 Power Control Logic (power control logic section of the system motherboard) ...
The power control logic performs these functions:
• Controls power sequencing.
• Monitors the combined output of the power supplies and shuts down power if it is not in range.
• Monitors system temperature and powers down the system 30 seconds after it detects that the internal temperature of the system is above the value of the environment variable over_temp.
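The over-temperature rule is a delayed shutdown. The Python sketch below models it over a series of temperature samples. The sample format, the use of degrees C, and the function name are assumptions. The sketch also assumes that a temperature drop during the 30-second delay does not cancel the shutdown, which the manual does not state either way.

```python
# Hypothetical model of the over-temperature rule: power is removed
# 30 seconds after the temperature is first seen above over_temp.
# Assumptions: samples are (seconds, degrees C) in time order, and a later
# drop in temperature does not cancel the pending shutdown.
SHUTDOWN_DELAY_S = 30


def over_temp_power_off_time(samples, over_temp):
    """Return the time (s) at which power would be removed, or None."""
    for time_s, temp_c in samples:
        if temp_c > over_temp:
            return time_s + SHUTDOWN_DELAY_S
    return None


print(over_temp_power_off_time([(0, 40), (10, 52), (20, 61)], over_temp=50))  # 40
```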
Power Circuit and Cover Interlock Power is distributed throughout the system. The power circuit can be broken mechanically by the On/Off switch or the cover interlock, or remotely through the RCM. Figure 1-14 Power Circuit Diagram (labels include power supply, cover interlock, On/Off push button, switchpack, DC_ENABLE_L) ...
Figure 1-14 shows the distribution of power throughout the system. An open in the circuit, the RCM signal RCM_DC_EN_L, or a power fault detected by a power supply interrupts DC power to the system. Opens can be caused by the On/Off button or the cover interlock.
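The interlock behavior above reduces to a single condition: DC power stays on only if nothing in the chain interrupts it. The Python sketch below expresses that condition. The field names are hypothetical. The RCM input is treated as a plain "enables DC" flag because the active level of RCM_DC_EN_L is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class PowerInputs:
    on_off_closed: bool     # On/Off button in the "on" position
    interlock_closed: bool  # cover in place, interlock closed
    rcm_enables_dc: bool    # RCM_DC_EN_L in its enabling state (polarity not modeled)
    supply_fault: bool      # a power supply has detected a power fault


def dc_power_applied(p: PowerInputs) -> bool:
    """Any open, an RCM power-off, or a supply fault removes DC power."""
    return (p.on_off_closed and p.interlock_closed
            and p.rcm_enables_dc and not p.supply_fault)


normal = PowerInputs(True, True, True, False)
cover_off = PowerInputs(True, False, True, False)
print(dc_power_applied(normal), dc_power_applied(cover_off))  # True False
```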
1.10 Power Supply Two power supplies provide system power. The power system is described in detail in Chapter 4. Figure 1-15 Back of Power Supply and Location (labels include Power Supply 0, Power Supply 1, current share, +5V/return, +12V/return, +3.4V/return, and miscellaneous signal connectors) ...
Description Two power supplies provide system power. Redundant power is not available at this time. Each supply has 450 W output.
Power Supply Features
• 88–132 and 176–264 Vrms AC input
• 450 watts output. Output voltages are as follows:
Output Voltage Min.
1.11 Power Up/Down Sequence System power can be controlled manually by the On/Off button on the OCP or remotely through the RCM. The power-up/down sequence flow is shown below. Figure 1-16 Power Up/Down Sequence Flowchart (begins: apply AC power; Vaux on; On/Off button) ...
When AC power is applied to the system, Vaux (auxiliary voltage) is asserted. If the On/Off button is on, the PCM section of the motherboard senses Vaux and asserts DC_ENABLE_L, starting the power supplies. If there is a hard fault on power-up, the power supplies shut down immediately;...
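The power-up sequence can be read as an ordered list of events. The Python sketch below is a simplified, hypothetical model of it. The source text is truncated after the hard-fault case, so the sketch does not model any later steps or the power-down path.

```python
def power_up_events(on_off_on, hard_fault):
    """Return, in order, the power-up events described above (simplified model)."""
    events = ["AC applied", "Vaux asserted"]
    if not on_off_on:
        # Vaux is present, but DC power is not enabled until the button is on.
        return events
    events += ["PCM senses Vaux", "PCM asserts DC_ENABLE_L", "power supplies start"]
    if hard_fault:
        events.append("power supplies shut down immediately")
    return events


print(power_up_events(on_off_on=True, hard_fault=False))
```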
1.12 Maintenance Bus (I²C Bus) The I²C bus (referred to as the “I squared C bus”) is a small internal maintenance bus used to monitor system conditions scanned by the power control module, write the fault display, store error state, and track configuration information in the system.
Monitor The I²C bus monitors the state of system conditions scanned by the PC logic. There are two registers the PC logic writes data to:
• One records the state of the fans and power supplies and is latched when there is a fault.
1.13 StorageWorks The system supports up to seven 3½-inch StorageWorks drives. The 9.3 GByte drive is not supported internally. Figure 1-18 StorageWorks Drive Location (StorageWorks drives shelf) ...
The StorageWorks drives are to the right of the system cage. Up to seven drives fit into the shelf. The system is fitted with Fast Wide Ultra SCSI. Fast Wide SCSI has a maximum transfer rate of 20 Mbytes/s; the Ultra SCSI version doubles that rate to 40 Mbytes/s.
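Both figures follow from transfer clock × bus width. Wide SCSI uses a 16-bit (2-byte) data path. Fast SCSI transfers at 10 MHz, and Ultra SCSI doubles that to 20 MHz. A short Python check of the arithmetic (the function name is illustrative):

```python
# Peak parallel SCSI rate = transfer clock (MHz) x bus width (bytes) = Mbytes/s.
WIDE_BUS_BYTES = 2  # "wide" SCSI uses a 16-bit data bus


def scsi_peak_rate_mb_s(clock_mhz, width_bytes=WIDE_BUS_BYTES):
    return clock_mhz * width_bytes


print(scsi_peak_rate_mb_s(10))  # Fast Wide SCSI: 20
print(scsi_peak_rate_mb_s(20))  # Ultra (Fast-20) Wide SCSI: 40
```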
Chapter 2 Power-Up This chapter describes system power-up testing and explains the power-up displays. The following topics are covered:
• Control Panel
• Power-Up Sequence
• SROM Power-Up Test Flow
• SROM Errors Reported
• XSROM Power-Up Test Flow
• XSROM Errors Reported
•...
Control Panel The control panel display indicates the likely device when testing fails. Figure 2-1 Control Panel and LCD Display (example display: P0 TEST 11 CPU0)
• When the On/Off button LED is on, power is applied and the system is running. When it is off, the system is not running, but power may or may not be present.
Table 2-1 Control Panel Display
Field  Content      Display  Meaning
²      CPU number   P0–P1    CPU reporting status
³      Status       TEST     Tests are executing
                    FAIL     Failure has been detected
                    MCHK     Machine check has occurred
                    INTR     Error interrupt has occurred
´      Test number
µ...
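When you capture OCP readings in service notes, you can split them into the fields of Table 2-1. The Python sketch below parses a line such as "P0 TEST 11 CPU0". It treats the last field as the likely device named by the display (Figure 2-1). That treatment is an assumption, because the rest of the table is not shown. The test number is kept as a string since its radix is not stated.

```python
import re
from typing import NamedTuple, Optional

STATUS_MEANINGS = {
    "TEST": "Tests are executing",
    "FAIL": "Failure has been detected",
    "MCHK": "Machine check has occurred",
    "INTR": "Error interrupt has occurred",
}


class OcpStatus(NamedTuple):
    cpu: int
    status: str
    meaning: str
    test_number: str       # kept as text; radix not stated in the table
    device: Optional[str]  # assumed: trailing field names the likely device


_OCP_RE = re.compile(r"^P([01])\s+(TEST|FAIL|MCHK|INTR)\s+(\S+)(?:\s+(\S+))?$")


def parse_ocp(line: str) -> OcpStatus:
    m = _OCP_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized OCP display: {line!r}")
    cpu, status, test, device = m.groups()
    return OcpStatus(int(cpu), status, STATUS_MEANINGS[status], test, device)


print(parse_ocp("P0 TEST 11 CPU0"))
```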
Power-Up Sequence The console and most power-up tests reside on the I/O subsystem, not on the CPU or on any other module on the system bus. Figure 2-2 Power-Up Flow (labels include: power-up/reset; SROM code loaded into each CPU’s I-cache; XSROM tests execute; SRM console loaded into memory; SRM console tests) ...
XSROM code resides in sector 0 of FEPROM 0 on the XBUS. Sector 2 of FEPROM 0 contains a duplicate copy of the code and is used if sector 0 is corrupt. Code for sizing DIMM memory resides in sector 1 of FEPROM 0, along with the PALcode.
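The FEPROM 0 layout and its fallback can be summarized in a few lines. The sector size (sixteen 64-Kbyte sectors per FEPROM) is given later in this chapter. The Python sketch below is hypothetical. In particular, `sector0_is_good` stands in for whatever corruption check the firmware actually performs, which the manual does not describe.

```python
SECTOR_SIZE = 64 * 1024  # each FEPROM has sixteen 64-Kbyte sectors
NUM_SECTORS = 16

# FEPROM 0 sector assignments from the text above.
XSROM_PRIMARY_SECTOR = 0
PAL_AND_SIZING_SECTOR = 1  # DIMM sizing code and PALcode
XSROM_BACKUP_SECTOR = 2


def sector_offset(sector):
    """Byte offset of a sector within one FEPROM."""
    if not 0 <= sector < NUM_SECTORS:
        raise ValueError("sector must be 0-15")
    return sector * SECTOR_SIZE


def choose_xsrom_sector(sector0_is_good):
    """Use the copy in sector 0 unless it is corrupt; otherwise use sector 2."""
    return XSROM_PRIMARY_SECTOR if sector0_is_good else XSROM_BACKUP_SECTOR


print(hex(sector_offset(choose_xsrom_sector(False))))  # 0x20000
```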
For the console to run, the path from the CPU to the XSROM must be functional. The XSROM resides in FEPROM 0 on the XBUS, which is off the EISA bus, which is off PCI 0, which is off IOD 0. See Figure 2-4. This path is minimally tested by the SROM. Figure 2-4 Console Code Critical Path (1200 Block Diagram) ...
The SROM contents are loaded into each CPU’s I-cache and executed on power-up/reset. After testing the caches on each processor chip, the SROM code tests the path to the XSROM. Once this path is tested and deemed reliable, layers of the XSROM are loaded sequentially into the processor chip on each CPU.
SROM Power-Up Test Flow The SROM tests the CPU chip and the path to the XSROM. Figure 2-5 SROM Power-Up Test Flow (performed for each CPU; labels include initialize CPU chip, turn off CPU LED, initialize PCI-EISA bridge chip, D-cache and S-cache tests with HANG errors, read TOY NVRAM) ...
The Alpha chip built-in self-test tests the I-cache at power-up and upon reset. Each CPU chip loads its SROM code into its I-cache and starts executing it. If the chip is partially functional, the SROM code continues to execute. However, if the chip cannot perform most of its functions, that CPU hangs and that CPU’s pass/fail LED remains off.
Table 2-2 lists the tests performed by the SROM.
Table 2-2 SROM Tests
Test Name                    Logic Tested
D-cache RAM March test       D-cache access, D-cache data, D-cache address logic
D-cache Tag RAM March test   D-cache tag store RAM, D-cache bank address logic
S-cache Data March test      S-cache RAM cells, S-cache data path, S-cache address ...
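A March test walks the address space in a fixed order, reading back the value expected from the previous pass and writing a new one. Because of this, stuck-at and many coupling faults show up as a failed read. The manual does not say which March variant the SROM uses. The Python sketch below uses the textbook March C- sequence on a simulated bit array, only as an illustration. `StuckAtMemory` is a hypothetical fault model.

```python
# Textbook March C- over a simulated bit array (not necessarily the SROM's variant).
UP, DOWN = "up", "down"

# Each element: (address order, value expected on read or None, value to write or None)
MARCH_C_MINUS = [
    (UP, None, 0),   # any order: w0
    (UP, 0, 1),      # up:   r0, w1
    (UP, 1, 0),      # up:   r1, w0
    (DOWN, 0, 1),    # down: r0, w1
    (DOWN, 1, 0),    # down: r1, w0
    (UP, 0, None),   # any order: r0
]


def march_test(mem, elements=MARCH_C_MINUS):
    """Return the address of the first failing read, or None if all reads pass."""
    n = len(mem)
    for order, expect, write in elements:
        addresses = range(n) if order == UP else range(n - 1, -1, -1)
        for addr in addresses:
            if expect is not None and mem[addr] != expect:
                return addr
            if write is not None:
                mem[addr] = write
    return None


class StuckAtMemory(list):
    """Simulated RAM in which one cell ignores writes (a stuck-at fault)."""

    def __init__(self, size, stuck_addr, stuck_value):
        super().__init__([0] * size)
        self.stuck_addr = stuck_addr
        self.stuck_value = stuck_value
        super().__setitem__(stuck_addr, stuck_value)

    def __setitem__(self, addr, value):
        if addr == self.stuck_addr:
            value = self.stuck_value
        super().__setitem__(addr, value)


print(march_test([0] * 64))                  # None: healthy array
print(march_test(StuckAtMemory(64, 10, 0)))  # 10
```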
SROM Errors Reported The SROM reports machine checks, pending interrupt/exception errors, and errors related to corruption of FEPROM 0. If an SROM error is fatal, the affected CPU hangs, and only the CPU self-test pass LED and/or the LEDs on the system motherboard indicate the failure. The CPU self-test pass LED is not visible, but the IOD0 and IOD1 pass LEDs are.
XSROM Power-Up Test Flow Once the SROM has completed its tests and verified the path to the FEPROM containing the XSROM code, it loads the first 8 Kbytes of XSROM into the primary CPU’s S-cache and jumps to it. Figure 2-6 XSROM Power-Up Flowchart (labels include: XSROM banner to OCP/console device; run memory tests) ...
After jumping to the primary CPU’s S-cache, the code then intentionally I-caches itself and is completely register based (no D-stream for stack or data storage is used). The only D-stream accesses are writes/reads during testing. Each FEPROM has sixteen 64-Kbyte sectors. The first sector contains B-cache tests, memory tests, and a fail-safe loader.
Table 2-4 Memory Tests
Test Name:     Memory Data test
Logic Tested:  Data path to and from memory; data path on memory and RAMs
Description:   Test floats 1 and 0 across data and check bit data lines. Errors are reported for each DIMM memory card, from MEM0_L to MEM7_H.
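"Floating" a 1 and a 0 across the lines is commonly done with walking-ones and walking-zeros patterns. Each bad line then shows up as a specific failing pattern. The Python sketch below generates both patterns. It assumes a width of 128 data lines plus 16 check bits, matching the system bus, which the table does not state for the memory test itself. It also lists the DIMM identifiers in the MEM0_L to MEM7_H form used in error reports.

```python
DATA_BITS = 128   # assumed: same width as the system bus data path
CHECK_BITS = 16   # ECC check-bit lines
WIDTH = DATA_BITS + CHECK_BITS


def walking_ones(width=WIDTH):
    """Yield patterns with a single 1 moving across all lines."""
    for bit in range(width):
        yield 1 << bit


def walking_zeros(width=WIDTH):
    """Yield patterns with a single 0 moving across all lines."""
    all_ones = (1 << width) - 1
    for bit in range(width):
        yield all_ones ^ (1 << bit)


def dimm_names(pairs=8):
    """DIMM identifiers as used in error reports: MEM0_L, MEM0_H, ..., MEM7_H."""
    return [f"MEM{n}_{riser}" for n in range(pairs) for riser in ("L", "H")]


print(dimm_names()[:2], dimm_names()[-1])
```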
XSROM Errors Reported The XSROM reports B-cache test errors and memory test errors. It also reports a warning if memory is illegally configured.
Example 2–1 XSROM Errors Reported at Power-Up
B-cache Error (CPU Error)
TEST ERR on cpu0        #CPU running the test
cpu0 err# tst#...
Console Power-Up Tests Once the SRM console is loaded, it tests each IOD further. Table 2-5 describes the IOD power-up tests, and Table 2-6 describes the PCI power-up tests.
Table 2-5 IOD Tests
Test #  Test Name             Description
        IOD CSR Access test   Read and write all CSRs in each IOD.
Table 2-6 PCI Motherboard Tests
Test Name             Diagnostic Name   Description
PCEB                  pceb_diag         Tests the PCI to EISA bridge chip
                      esc_diag          Tests the EISA system controller
8K NVRAM              nvram_diag        Tests the NVRAM
Real-Time Clock       ds1287_diag       Tests the real-time clock chip
Keyboard and mouse    i8242_diag        Tests the keyboard/mouse chip...
Console Device Determination After the SROM and XSROM have completed their tasks, the SRM console program, as it starts, determines where to send its power-up messages. Figure 2-7 Console Device Determination Flowchart (branches on the console environment variable, graphics or serial; when it is serial, COM port 1 is enabled and messages are sent to it; the flow ends at the P00>>> prompt) ...
Console Device Options The console device can be either a serial terminal or a graphics monitor. Specifically: • A serial terminal connected to COM1 off the server control module. The terminal connected to COM1 must be set to 9600 baud. This baud rate cannot be changed.
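The decision in Figure 2-7 depends on one environment variable. The Python sketch below is a hypothetical model of it. The function name is illustrative. For any value other than serial or graphics, and for cases such as a missing graphics adapter, the sketch raises an error rather than guess, because the manual does not describe that behavior here.

```python
# Hypothetical model of the console device decision, keyed on the SRM
# `console` environment variable (serial or graphics).
def power_up_message_device(console_envar: str) -> str:
    setting = console_envar.strip().lower()
    if setting == "serial":
        # COM1 runs at a fixed 9600 baud; the rate cannot be changed.
        return "serial terminal on COM1 at 9600 baud"
    if setting == "graphics":
        return "graphics monitor"
    raise ValueError(f"unexpected console setting: {console_envar!r}")


print(power_up_message_device("serial"))
```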
Console Power-Up Display The entire power-up display prints to a serial terminal (if the console environment variable is set to serial), and parts of it print to the control panel display. The last several lines print to either a serial terminal or a graphics monitor.
² At power-up or reset, the SROM code on each CPU module is loaded into that module’s I-cache and tests the module. If all tests pass, the processor’s LED lights. If any test fails, the LED remains off and power-up testing terminates on that CPU.
Example 2–1 Power-Up Display (Continued)
· starting console on CPU 0
¸ sizing memory
    256 MB DIMM
    256 MB DIMM
    64 MB DIMM
    64 MB DIMM
  starting console on CPU 1
¹ probing IOD1 hose 1
    bus 0 slot 1 - NCR 53C810
    bus 0 slot 2 - DECchip 21041-AA
    bus 0 slot 3 - NCR 53C810
  probing IOD0 hose 0...
· The final primary CPU determination is made. The primary CPU loads PALcode and decompression code from the FEPROM on PCI 0 into its B-cache. The primary CPU then jumps to the PALcode to start the SRM console. The primary CPU prints a message indicating that it is running the console.
2.10 Fail-Safe Loader The fail-safe loader is a software routine that loads the SRM console image from a floppy diskette. Once the console is running, run LFU to update FEPROM 0 with a new image. NOTE: FEPROM 0 contains images of the SROM, XSROM, PAL, decompression, and SRM console code.
Chapter 3 Troubleshooting This chapter describes troubleshooting during power-up and booting, as well as diagnostics for AlphaServer/AlphaStation 1200 systems. The following topics are covered:
• Troubleshooting with LEDs
• Troubleshooting Power Problems
• Running Diagnostics - Test Command
• Other Useful Console Commands
Troubleshooting...
Troubleshooting with LEDs During power-up, reset, initialization, or testing, diagnostics are run on CPUs, memories, I/O bridges, and the PCI backplane and its embedded options. The following sections describe possible problems that can be identified by checking LEDs. The LEDs on the CPU modules cannot be seen; the only visible LEDs are on the system motherboard.
System Motherboard LEDs You can see the system motherboard LEDs by looking through the grate at the back of the machine. The normal state of the LEDs is shown in Figure 3-1.
• If the IOD0 or IOD1 LED is off, the corresponding system bus to PCI bus bridge has failed.
Troubleshooting Power Problems Power problems can occur before the system is up or while the system is running. If a system stops running, make a habit of checking the PCM. Power Problem List The system will halt for the following: 1.
If a Power Problem Occurs at Power-Up If the system has a power problem on a cold start, the motherboard LEDs and the OCP display will indicate a problem. On systems running DIGITAL UNIX or OpenVMS, the console will also indicate the problem. On systems running Windows NT, the console will not print an error message.
Running Diagnostics — Test Command The test command runs diagnostics on the entire system, CPU devices, memory devices, and the PCI I/O subsystem. The test command runs only from the SRM console. Ctrl/C stops the test. The console must NOT be secure. Example 3–1 Test Command Syntax P00>>>...
Releasing Secure Mode The console must not be secure for most SRM console commands to run. If the console is not secure, user mode console commands can be entered. See the system manager if the system is secure and you do not know the password.
Example 3–1 Releasing/Reestablishing Secure Mode
P00>>>login
Please enter password: xxxx...
Testing an Entire System A test command with no modifiers runs all exercisers for subsystems and devices on the system. The I/O devices tested are the supported boot devices. The test runs for 10 minutes.
Example 3–1 Sample Test Command
P00>>> test
Console is in diagnostic mode
System test, runtime 600 seconds
Type ^C to stop testing...
3.5.1 Testing Memory The test mem command tests individual memory devices or all memory. The test shown in Example 3–1 runs for 2 minutes.
Example 3–1 Sample Test Memory Command
P00>>> test memory
Console is in diagnostic mode
System test, runtime 120 seconds
Type ^C to stop testing
Starting background memory test, affinity to all CPUs..
3.5.2 Testing PCI The test pci command tests PCI buses and devices. The test runs for 2 minutes.
Example 3–1 Sample Test Command for PCI
P00>>> test pci*
Console is in diagnostic mode
System test, runtime 120 seconds
Type ^C to stop testing
Configuring all PCI buses..
Other Useful Console Commands There are several console commands that help diagnose the system. The show power command can be used to identify power, temperature, and fan faults.
Example 3–1 Show Power
P00>>>show power
                    Status
Power Supply 0      good
Power Supply 1      good
System Fans         good...
The show fru command lists all FRUs in the system.
Example 3–3 Show FRU
P00>>>show fru
Digital Equipment Corporation
AlphaServer 1200
Console V5.0-2  OpenVMS PALcode V1.19-12, Digital UNIX PALcode V1.21-20
Module              Part #   Type   Name   Serial #
System Motherboard...
Chapter 4 Error Logs This chapter provides information on troubleshooting with error logs. The following topics are covered:
• Using Error Logs
• Using DECevent
• Error Log Examples and Analysis
• Troubleshooting IOD-Detected Errors
• Double Error Halts and Machine Checks While in PAL Mode
Error registers are described in Chapter 5.
Using Error Logs Error detection is performed by CPUs, the IOD, and the EISA to PCI bus bridge. (IOD is the acronym used by software to refer to the system bus to PCI bus bridge.) Figure 4-1 Error Detector Placement (labels include memory, CPU module, system bus) ...
Lines Protected                       Device
ECC protected
  System bus data lines               IOD on every transaction, CPU when using the bus
  B-cache                             IOD on every transaction, CPU when using the bus
Parity protected
  System bus command/address lines    IOD on every transaction, CPU when using the bus
  Duplicate tag store                 IOD on every transaction, CPU when using the bus...
4.1.1 Hard Errors There are two categories of hard errors:
• System-independent errors detected by the CPU. These errors are processor machine checks, handled as MCHK 670 interrupts, and are:
  - Internal EV5 or EV56 cache errors
  - CPU B-cache module errors
•...
4.1.3 Error Log Events Several different events are logged by OpenVMS and DIGITAL UNIX. Windows NT does not log errors in this fashion.
Table 4-1 Types of Error Log Events
Error Log Event   Description
MCHK 670          Processor machine checks. These are synchronous errors that inform precisely what happened at the time the error occurred.
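When you sort many logged events, a small lookup of the machine-check types covered in this chapter can help triage. The Python sketch below is a hypothetical summary. Its descriptions paraphrase sections 4.1.1 and 4.3, and it is not an exhaustive list of event types the operating systems log.

```python
# Hypothetical triage table for the machine-check events discussed in this
# chapter; descriptions paraphrase sections 4.1.1 and 4.3.
MCHK_EVENTS = {
    670: "processor machine check detected by the CPU (synchronous), e.g. an "
         "internal EV5/EV56 cache error or a CPU B-cache module error",
    660: "IOD-detected hard error (system bus or PCI error) or Dtag parity error",
}


def describe_mchk(vector: int) -> str:
    text = MCHK_EVENTS.get(vector, "not covered by this summary")
    return f"MCHK {vector}: {text}"


print(describe_mchk(660))
```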
Using DECevent DECevent produces bit-to-text ASCII reports derived from system event entries or user-supplied event logs. The format of the reports is determined by commands, qualifiers, parameters, and keywords appended to the command. The maximum command line length is 255 characters. DECevent allows you to do the following: •...
4.2.1 Translating Event Files To produce a translated event report using the default event log file, SYS$ERRORLOG:ERRLOG.SYS, enter the following command:
OpenVMS        $ DIAGNOSE
DIGITAL UNIX   > dia -a
The DIAGNOSE command allows DECevent to use built-in defaults. This command produces a full report, directed to the terminal screen, from the input event file, SYS$ERRORLOG:ERRLOG.SYS.
To reverse the order of the input events:
OpenVMS        $ DIAGNOSE/TRANSLATE/REVERSE
DIGITAL UNIX   > dia -R
These commands reverse the order in which events are displayed. The default order is forward chronological.
4.2.2 Filtering Events The /INCLUDE and /EXCLUDE qualifiers allow you to filter input event log files. The /INCLUDE qualifier is used to create output for devices named in the command.
Use the /BEFORE and /SINCE qualifiers to select events before or after a certain date and time.
OpenVMS        $ DIAGNOSE/TRANSLATE/BEFORE=15-JAN-1996:10:30:00
               $ DIAGNOSE/TRANSLATE/SINCE=15-JAN-1996:10:30:00
DIGITAL UNIX   > dia -t s:15-jan-1996 e:20-jan-1996
If no time is specified, the default time is 00:00:00, and all events for that day are selected.
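When you script report runs, it helps to generate the OpenVMS command text from real timestamps and to enforce the 255-character limit stated above. The Python sketch below builds the string only. It uses the qualifiers shown in this section and the DD-MMM-YYYY:HH:MM:SS form from the examples. Zero-padding the day is an assumption. The sketch does not check whether DECevent accepts every combination of qualifiers.

```python
from datetime import datetime

MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")
MAX_COMMAND_LENGTH = 255  # DECevent command line limit


def vms_timestamp(dt):
    """Format as DD-MMM-YYYY:HH:MM:SS (day zero-padded: an assumption)."""
    return (f"{dt.day:02d}-{MONTHS[dt.month - 1]}-{dt.year:04d}:"
            f"{dt.hour:02d}:{dt.minute:02d}:{dt.second:02d}")


def diagnose_command(since=None, before=None, reverse=False):
    """Build a DIAGNOSE/TRANSLATE command string (text only; nothing is run)."""
    cmd = "DIAGNOSE/TRANSLATE"
    if since is not None:
        cmd += "/SINCE=" + vms_timestamp(since)
    if before is not None:
        cmd += "/BEFORE=" + vms_timestamp(before)
    if reverse:
        cmd += "/REVERSE"
    if len(cmd) > MAX_COMMAND_LENGTH:
        raise ValueError("command exceeds 255 characters")
    return cmd


print(diagnose_command(since=datetime(1996, 1, 15, 10, 30)))
```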
4.2.3 Selecting Alternative Reports Table 4-2 describes the DECevent report formats. Report formats are mutually exclusive; no combinations are allowed. The default format is /Full.
Table 4-2 DECevent Report Formats
Format    Description
/Full     Translates all available information for each event
/Brief    Translates key information for each event
/Terse...
Error Log Examples and Analysis The following sections provide examples and analysis of error logs.
4.3.1 MCHK 670 CPU-Detected Failure The error log in Example 4–1 shows the following:
² CPU1 logged the error in a system with two CPUs.
³...
Example 4–1 MCHK 670
Logging OS                    2. DIGITAL UNIX
System Architecture           2. Alpha
Event sequence number
Timestamp of occurrence       04-APR-1997 17:20:04
Host name                     whip16
² System type register        x00000016  AlphaStation 4000/1200 Series
Number of CPUs (mpnum)        x00000002
CPU logging event (mperr)     x00000001
Event validity                1.
TEST_STATUS_H Pin Asserted
Icache Par Err Stat Reg       x00000000
Dcache Par Err Stat Reg       x00000000
Virtual Address Reg           xFFFFFFFE8F63BD38
Memory Mgmt Flt Sts Reg       x000000000166D1
   Ref which caused err was a write
   Ref resulted in DTB miss
   RA Field                   x0000000000001B
   Opcode Field               x0000000000002C
Scache Address Reg...
4.3.2 MCHK 670 CPU and IOD-Detected Failure The error log in Example 4–1 shows the following:
² CPU1 logged the error in a system with two CPUs.
³ The External Interface Status Register logged an uncorrectable ECC error during a D-ref fill. (When a CPU chip does not find data it needs to perform a task in any of its caches, it requests data from off the chip to fill its D-cache.
Example 4–1 MCHK 670 CPU and IOD-Detected Failure
Logging OS                    2. DIGITAL UNIX
System Architecture           2. Alpha
Event sequence number
Timestamp of occurrence       08-APR-1997 11:27:55
Host name                     whip16
² System type register        x00000016  AlphaStation 4000/1200 Series
Number of CPUs (mpnum)        x00000002
CPU logging event (mperr)     x00000001
Event validity...
Dcache Par Err Stat Reg       x00000000
Virtual Address Reg           x00000001407D6000
Memory Mgmt Flt Sts Reg       x00000000011A10
   Ref resulted in DTB miss
   RA Field                   x0000000008
   Opcode Field               x00000000000023
Scache Address Reg            xFFFFFF00000254BF
Scache Status Reg             x00000000
Bcache Tag Address Reg        xFFFFFF80286F7FFF
   External cache hit
   Parity for ds and v bits
   Cache block dirty...
¶ Device Id                   x0000003F
   MC error info valid
´ CAP Error Register          xC0000000
   Uncorrectable ECC err det by MDPB
   MC error info latched
PCI Bus Trans Error Adr       x000003FD
MDPA Status Register          x00000000
   MDPA Chip Revision         x00000000
MDPA Error Syndrome Reg       x00000000
   Cycle 0 ECC Syndrome       x00000000
   Cycle 1 ECC Syndrome       x00000000...
4.3.3 MCHK 670 Read Dirty CPU-Detected Failure The error log in Example 4–1 shows the following: CPU0 logged the error in a system with two CPUs. The External Interface Status Register records an uncorrectable ECC error from the system (bit <30> set). ...
Example 4–1 MCHK 670 Read Dirty Failure
Logging OS                    2. DIGITAL UNIX
System Architecture           2. Alpha
Event sequence number
Timestamp of occurrence       08-APR-1997 10:20:37
Host name                     sect06
System type register          x00000016  AlphaStation 4x00
Number of CPUs (mpnum)        x00000002
CPU logging event (mperr)     x00000000
Event validity                1.
PAL Shadow Registers Enabled.
Correctable Error Interrupts Enabled.
ICACHE BIST (Self Test) Was Successful.
TEST_STATUS_H Pin Asserted
Icache Par Err Stat Reg       x0000000000000000
Dcache Par Err Stat Reg       x0000000000000000
Virtual Address Reg           x0000000000044000
Memory Mgmt Flt Sts Reg       x0000000000005D10
   If Err, Reference Resulted in DTB Miss
   Fault Inst RA Field:       x0000000000000014...
Check ALL Transactions for Errors
Use MC_BMSK for 16 Byte Align Blk Mem Wrt
Wrt PEND_NUM Threshold:
RD_TYPE Memory Prefetch Algorithm: Short
RL_TYPE Mem Rd Line Prefetch Type: Medium
RM_TYPE Mem Rd Multiple Cmd Type: Long
ARB_MODE Arbitration: MC-PCI Priority Mode
Mem Host Address Ext Reg      x00000000
   HAE Sparse Mem Adr<31:27>...
Mem Host Address Ext Reg      x00000000
   HAE Sparse Mem Adr<31:27>  x00000000
IO Host Adr Ext Register      x00000000
   PCI Upper Adr Bits<31:25>  x00000000
Interrupt Ctrl Register       x00000003
   Write Device Interrupt Info Struct: Enabled
Interrupt Request             x00800001
   Interrupts asserted        x00000001  Hard Error
Interrupt Mask0 Register      x00C50001
Interrupt Mask1 Register      x00000000...
4.3.4 MCHK 660 IOD-Detected Failure (System Bus Error) The error log in Example 4–1 shows the following: CPU0 logged the error in a system with two CPUs. The External Interface Status Register does not record an error. Both IOD CAP Error Registers logged an error.
Page 116
Example 4–1 MCHK 660 IOD-Detected Failure (System Bus Error) Logging OS 2. DIGITAL UNIX System Architecture 2. Alpha Event sequence number Timestamp of occurrence 04-APR-1996 17:20:04 Host name whip16 System type register x00000016 AlphaStation 4000 Number of CPUs (mpnum) x00000002 CPU logging event (mperr) x00000000 Event validity...
Page 117
RA Field x0000000006 Opcode Field x00000000000029 Scache Address Reg xFFFFFF0000024EAF Scache Status Reg x00000000 Bcache Tag Address Reg xFFFFFF80FFED6FFF Parity for ds and v bits Cache block dirty Cache block valid Tag address<38:20> is x00000000000FFE Ext Interface Address Reg xFFFFFF00FC00000F Fill Syndrome Reg x0000000000C5D2 ...
4.3.5 MCHK 660 IOD-Detected Failure (PCI Error) The error log in Example 4–1 shows the following: CPU0 logged the error in a system with two CPUs. The MCHK 660 register gives the reason for the machine check as an IOD-detected hard error or a Dtag Parity Error (if cached CPU) ...
Page 121
Example 4–1 MCHK 660 IOD-Detected Failure (PCI Error) Timestamp of occurrence 19-AUG-1997 12:53:41 Host name sect04 System type register x00000016 Alpha 4000/1200 Series Number of CPUs (mpnum) x00000002 CPU logging event (mperr) x00000000 Event validity 1. O/S claims event is valid Event severity 1.
Page 122
Timeout Counter Bit Clear. IBOX Timeout Counter Enabled. Floating Point Instructions will Cause FEN Exceptions. PAL Shadow Registers Enabled. Correctable Error Interrupts Enabled. ICACHE BIST (Self Test) Was Successful. TEST_STATUS_H Pin Asserted Icache Par Err Stat Reg x0000000000000000 Dcache Par Err Stat Reg x0000000000000000 Virtual Address Reg x0000000140008000...
Page 123
Bridge ACCEPTS 64 Bit Data Transactions PCI Address Parity Check: Enabled MC Bus CMD/Addr Parity Check: Enabled MC Bus NXM Check: Enabled Check ALL Transactions for Errors Use MC_BMSK for 16 Byte Align Blk Mem Wrt Wrt PEND_NUM Threshold: RD_TYPE Memory Prefetch Algorithm: Short RL_TYPE Mem Rd Line Prefetch Type: Medium RM_TYPE Mem Rd Multiple Cmd Type: Long...
Page 124
IO Host Adr Ext Register x00000000 PCI Upper Adr Bits<31:25> x00000000 Interrupt Ctrl Register x00000003 Write Device Interrupt Info Struct:Enabled Interrupt Request x00800000 Interrupts asserted x00000000 Hard Error Interrupt Mask0 Register x00C50111 Interrupt Mask1 Register x00000000 MC Error Info Register 0 xE0000000 MC Bus Trans Addr<31:4>: E0000000 MC Error Info Register 1...
Page 125
Interrupt P2 Min Gnt Max Lat CONFIG Address x000000FBC0001000 Slot or Device Number: 2 Device and Vendor ID x10201077 QLogic ISP_1020 Vendor ID: x102B (QLogic) Device ID: x00001020 Command Register x0107 I/O Space Accesses Response: Enabled Memory Space Accesses Response: Enabled PCI Bus Master Capability: Enabled...
Page 126
DETECTED PARITY ERROR:This Device Detected Revision ID Device Class Code x010400 Mass Storage: RAID Controller Cache Line S Latency T. Header Type Single Function Device Bist Base Address Register 1 x00101000 Base Address Register 2 x0412A000 Base Address Register 3 x00000000 Base Address Register 4 x00000000...
4.3.6 MCHK 630 Correctable CPU Error The error log in Example 4–1 shows the following: CPU0 logged the error in a system with two CPUs. During a D-ref fill, the External Interface Status Register shows no error but states that the “data source is b-cache.” (When a CPU chip does not find data it needs to perform a task in any of its caches, it requests data from off the chip to fill its D-cache.
Page 129
Example 4–1 MCHK 630 Correctable CPU Error Logging OS 2. DIGITAL UNIX System Architecture 2. Alpha 4000/1200 Series Event sequence number 415. Timestamp of occurrence 15-JUN-1997 14:56:30 Host name whip16 System type register x00000016 AlphaStation 4x00 Number of CPUs (mpnum) x00000002 CPU logging event (mperr) x00000000 Event validity...
4.3.7 MCHK 620 Correctable Error The MCHK 620 error is a correctable error detected by the IOD. The error log in Example 4–1 shows the following: CPU0 logged the error in a system with two CPUs. The External Interface Status Register is not valid. ...
Page 131
Example 4–1 MCHK 620 Correctable Error Logging OS 2. DIGITAL UNIX System Architecture 2. Alpha Event sequence number Timestamp of occurrence 28-JUN-1996 19:45:42 Host name sect06 System type register x00000016 AlphaStation 4x00 Number of CPUs (mpnum) x00000002 CPU logging event (mperr) x00000000 Event validity 1.
Page 132
MC error info latched MDPA Status Register x00000000 MDPA Status Register Data Not Valid MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid MDPB Status Register x00000000 MDPB Status Register Data Not Valid MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid PALcode Revision Palcode Rev: 0.0-1...
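The four machine-check types described in Sections 4.3.3 through 4.3.7 differ mainly in which unit detected the error and whether it was corrected. The following is a minimal Python triage sketch built only from those section headings. `MchkInfo`, `MCHK_TYPES`, and `triage` are hypothetical names, not an operating-system or PALcode interface, and the codes are kept as strings so the sketch makes no claim about how the MCHK numbers are encoded.

```python
from typing import NamedTuple


class MchkInfo(NamedTuple):
    detector: str      # "CPU" or "IOD", per the section headings
    correctable: bool  # True for the correctable-error cases
    summary: str


# Assembled from the headings of Sections 4.3.3 through 4.3.7 (hypothetical helper).
MCHK_TYPES = {
    "670": MchkInfo("CPU", False, "CPU-detected failure (4.3.3 shows a read-dirty case)"),
    "660": MchkInfo("IOD", False, "IOD-detected failure (system bus or PCI error)"),
    "630": MchkInfo("CPU", True, "Correctable CPU error"),
    "620": MchkInfo("IOD", True, "Correctable error detected by the IOD"),
}


def triage(mchk_code: str) -> MchkInfo:
    """Look up an MCHK code string such as "660"."""
    if mchk_code not in MCHK_TYPES:
        raise ValueError(f"MCHK {mchk_code} is not covered by Sections 4.3.3-4.3.7")
    return MCHK_TYPES[mchk_code]


if __name__ == "__main__":
    info = triage("660")
    print(info.detector, info.correctable)  # IOD False
```

For the IOD-detected cases, troubleshooting starts with the CAP Error Registers on both bridges.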
Troubleshooting IOD-Detected Errors Step 1 Read the CAP Error Registers on both PCI bridges (F9E0000880 and FBE0000880). If one or both of these registers shows an error, match the register contents with the data pattern and perform the action indicated. Table 4-3 CAP Error Register Data Pattern Data Pattern Most Likely Cause...
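Step 1 gives the CAP_ERR addresses of the two bridges, and Chapter 5 gives the offsets of the other error CSRs (MC_ERR0 = 800, MC_ERR1 = 840, CAP_ERR = 880, PCI_ERR1 = 1040). The following is a minimal Python sketch of computing each register's address. It assumes that the offsets are hexadecimal and that each bridge's CSR base equals its Step 1 address minus the CAP_ERR offset. The labels `IOD0` and `IOD1` are hypothetical because the text calls them only "both PCI bridges".

```python
# Base addresses inferred from the Step 1 CAP_ERR addresses (F9E0000880 and
# FBE0000880) minus the CAP_ERR offset; the derivation is an assumption.
IOD_CSR_BASE = {
    "IOD0": 0xF9E0000000,  # hypothetical label
    "IOD1": 0xFBE0000000,  # hypothetical label
}

# Register offsets from Chapter 5, read as hexadecimal (assumption).
CSR_OFFSET = {
    "MC_ERR0": 0x800,
    "MC_ERR1": 0x840,
    "CAP_ERR": 0x880,
    "PCI_ERR1": 0x1040,
}


def csr_address(iod: str, register: str) -> int:
    """Physical address of an error CSR on the given bridge."""
    return IOD_CSR_BASE[iod] + CSR_OFFSET[register]


if __name__ == "__main__":
    for iod in IOD_CSR_BASE:
        print(iod, f"CAP_ERR at {csr_address(iod, 'CAP_ERR'):010X}")
```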
4.4.1 System Bus ECC Error Step 2 Read the MC_ERR1 register and match the contents with the data pattern. Perform the action indicated. Table 4-4 System Bus ECC Error Data Pattern MC_ERR1 Data Pattern Most Likely Cause Action for Memory Read 1000 0000 0000 xxxx xxxx 10xx 0xxx xxxx Bad nondirty data from Go to Step 10...
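The data patterns in Tables 4-4 through 4-6 list register bits <31:0> from most significant to least significant in groups of four, where "x" marks a don't-care bit. The following is a minimal Python sketch of testing an MC_ERR1 value against one such pattern. The function name `matches` is hypothetical, and the only pattern used is the first "Memory Read" row of Table 4-4 shown above.

```python
def matches(pattern: str, value: int) -> bool:
    """True if a 32-bit register value fits a data pattern from the tables.

    The pattern lists bits <31:0>, most significant first; '1' and '0' must
    match exactly and 'x' is a don't-care bit. Spaces are ignored.
    """
    bits = pattern.replace(" ", "")
    if len(bits) != 32 or set(bits) - set("01x"):
        raise ValueError(f"not a 32-bit pattern: {pattern!r}")
    for pos, ch in enumerate(bits):
        bit = (value >> (31 - pos)) & 1
        if ch != "x" and bit != int(ch):
            return False
    return True


# First "Memory Read" row of Table 4-4: bad nondirty data, go to Step 10.
BAD_NONDIRTY_READ = "1000 0000 0000 xxxx xxxx 10xx 0xxx xxxx"

if __name__ == "__main__":
    print(matches(BAD_NONDIRTY_READ, 0x80000800))  # True
    print(matches(BAD_NONDIRTY_READ, 0x80000C00))  # False: bit <10> must be 0
```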
4.4.2 System Bus Nonexistent Address Error Step 3 Determine which node (if any) should have responded to the command/address identified in MC_ERR1. Perform the action indicated. Table 4-5 System Bus Nonexistent Address Error Troubleshooting MC_ERR1 Data Pattern Most Likely Cause Action 1000 0000 000x xxxx xxxx xxxx 0xxx xxxx Software generated an MC...
4.4.3 System Bus Address Parity Error Step 4 Determine which node put the bad command/address identified in MC_ERR1 on the system bus. Perform the action indicated. Table 4-6 Address Parity Error Troubleshooting MC_ERR1 Data Pattern Most Likely Cause Action 1000 0000 000x xxx0 10xx xxxx xxxx xxxx Data sourced by MID = 2 Replace CPU0 1000 0000 000x xxx0 11xx xxxx xxxx xxxx...
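In the two Table 4-6 patterns shown, the only bits that differ are <15:14> ("10" for the row labeled MID = 2, "11" for the next row). The following Python sketch assumes that this field holds the MID of the node that sourced the bad command/address. That reading is an inference from the patterns, not a documented field definition. Only the MID = 2 action appears in this excerpt, so the sketch refers every other MID back to Table 4-6 and does not guess the action. `source_mid` and `parity_error_action` are hypothetical names.

```python
def source_mid(mc_err1: int) -> int:
    """Return MC_ERR1<15:14>, assumed here to be the sourcing node's MID."""
    return (mc_err1 >> 14) & 0b11


# Only the row fully shown in this excerpt; other MIDs: consult Table 4-6.
ACTION_BY_MID = {2: "Replace CPU0"}


def parity_error_action(mc_err1: int) -> str:
    mid = source_mid(mc_err1)
    return ACTION_BY_MID.get(mid, f"MID = {mid}: see Table 4-6")


if __name__ == "__main__":
    print(parity_error_action(0x80008000))  # Replace CPU0
```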
4.4.4 PIO Buffer Overflow Error (PIO_OVFL) Step 5 Enter the value of the CAP_CTRL register bits<19:16> (Actual_PEND_NUM) in the following formula. Compare the result as indicated in Table 4-7 to determine the most likely cause of the error. When an IOD is implicated in the analysis of the error, replace the IOD that captured the error in its CAP Error Register.
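Step 5 starts by pulling the four-bit Actual_PEND_NUM field out of CAP_CTRL. Neither the formula nor Table 4-7 appears in this excerpt, so the following Python sketch stops at the field extraction. `actual_pend_num` is a hypothetical name.

```python
def actual_pend_num(cap_ctrl: int) -> int:
    """Return CAP_CTRL<19:16> (Actual_PEND_NUM), the input to the Step 5 formula."""
    return (cap_ctrl >> 16) & 0xF


if __name__ == "__main__":
    print(actual_pend_num(0x00030000))  # 3
```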
4.4.5 Page Table Entry Invalid Error Step 6 This error is almost always a software problem. However, if the software is known to be good and the hardware is suspected, swap the IOD. 4.4.6 PCI Master Abort Step 7 Master aborts normally occur when the operating system is sizing the PCI bus. However, if the master abort occurs after the system is booted, read PCI_ERR1 and determine which PCI device should have responded to this PCI address.
4.4.9 Broken Memory Step 10 Refer to the following sections. For a Read Data Substitute Error (uncorrectable ECC error): When a read data substitute (RDS) error occurs, determine which memory module pair caused the error as follows: 1. Run the memory diagnostic to see if it catches the bad memory. If so, replace the memory module that it reports as bad.
3. When you have isolated the failing memory pair, determine which of the two modules is bad. (You cannot do this if the operating system is Windows NT.) Read the CPU Fill Syndrome (FILL_SYN) Register. If this register is nonzero, use the ECC syndrome bits in Table 4-8 to determine which module had the single-bit error.
4.4.10 Command Codes Table 4-9 shows the codes for transactions on the system bus and how they are affected by the commander in charge of the bus during the transaction. The command is a six-bit field in the command/address (bits <5:0>). The table translates all six bits to text, although the top two bits are not relevant for every command.
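Looking up a command in Table 4-9 starts with isolating bits <5:0> of the command/address. The following is a minimal Python sketch that returns the field as a six-character binary string for comparison with the table. Whether the top two bits matter for a given command still has to be read from the table itself. `command_code` is a hypothetical name.

```python
def command_code(command_address: int) -> str:
    """Return the six-bit command field <5:0> as a binary string."""
    return format(command_address & 0x3F, "06b")


if __name__ == "__main__":
    print(command_code(0x2A))  # 101010
```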
Double Error Halts and Machine Checks While in PAL Mode Two error cases require special attention. Neither double error halts nor machine checks taken while the machine is in PAL mode result in error log entries. Nevertheless, information is available that can help determine what error occurred.
4.5.2 Double Error Halt A double error halt occurs under the following conditions: • A machine check occurs. • PAL completes its tasks and returns control of the system to the operating system. • A second machine check occurs before the operating system completes its tasks. The machine returns to the console and displays the following message: halt code = 6 double error halt...
Chapter 5 Error Registers This chapter describes the registers used to hold error information. These registers include: • External Interface Status Register • External Interface Address Register • MC Error Information Register 0 • MC Error Information Register 1 • CAP Error Register •...
Page 152
External Interface Status Register - EI_STAT The EI_STAT register is a read-only register that is unlocked and cleared by any PALcode read. A read of this register also unlocks the EI_ADDR, BC_TAG_ADDR, and FILL_SYN registers, subject to some restrictions. The EI_STAT register is not unlocked or cleared by reset.
Fill data from B-cache or main memory could have correctable or uncorrectable errors in ECC mode. In parity mode, fill data parity errors are treated as uncorrectable hard errors. System address/command parity errors are always treated as uncorrectable hard errors, irrespective of the mode. The sequence for reading, unlocking, and clearing EI_STAT, EI_ADDR, BC_TAG_ADDR, and FILL_SYN is as follows: 1.
Page 154
Table 5-1 External Interface Status Register Name Bits Type Description COR_ECC_ERR <31> Correctable ECC Error. Indicates that fill data received from outside the CPU contained a correctable ECC error. EI_ES <30> External Interface Error Source. When set, indicates that the error source is fill data from main memory or a system address/command parity error.
Page 155
Table 5-1 External Interface Status Register (continued) Name Bits Type Description <63:36> All ones. SEO_HRD_ERR <35> Second External Interface Hard Error. Indicates that a fill from B-cache or main memory, or a system address/command received by the CPU, has a hard error while one of the hard error bits in the EI_STAT register is already set.
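The following Python sketch decodes the EI_STAT bits that Table 5-1 names in this excerpt: COR_ECC_ERR <31>, EI_ES <30>, SEO_HRD_ERR <35>, and the all-ones field <63:36>. The other rows of the table are not shown here, so the sketch does not decode them. `decode_ei_stat` is a hypothetical name. In the Section 4.3.3 log, for example, bit <30> is set.

```python
from typing import List, Tuple

# Bit positions taken from the Table 5-1 rows visible in this excerpt.
EI_STAT_FLAGS = {
    35: "SEO_HRD_ERR",  # second hard error while a hard-error bit was already set
    31: "COR_ECC_ERR",  # correctable ECC error in fill data from outside the CPU
    30: "EI_ES",        # source: main-memory fill or system address/command parity
}
# Bits <63:36> are documented to read as all ones (28 bits).
UPPER_ONES = ((1 << 28) - 1) << 36


def decode_ei_stat(value: int) -> Tuple[List[str], bool]:
    """Return the named flags that are set and whether bits <63:36> are all ones."""
    flags = [name for bit, name in EI_STAT_FLAGS.items() if (value >> bit) & 1]
    return flags, (value & UPPER_ONES) == UPPER_ONES


if __name__ == "__main__":
    flags, upper_ok = decode_ei_stat(UPPER_ONES | (1 << 30))
    print(flags, upper_ok)  # ['EI_ES'] True
```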
5.1.1 External Interface Address Register - EI_ADDR The EI_ADDR register contains the physical address associated with errors reported by the EI_STAT register. It is unlocked by a read of the EI_STAT Register. This register is meaningful only when one of the error bits is set. Address FF FFF0 0148 Access...
Table 5-2 Loading and Locking Rules for External Interface Registers (columns: Correctable Error, Uncorrectable Error, Second Hard Error, Load Register, Lock Register, Action When EI_STAT Is Read) Clears and unlocks possible all registers Clears and unlocks possible all registers Clears and unlocks all registers Clear bit (c) does...
5.1.2 MC Error Information Register 0 (MC_ERR0 - Offset = 800) The low-order MC bus (system bus) address bits are latched into this register when the system bus to PCI bus bridge detects an error event. If the event is a hard error, the register bits are locked.
Page 159
5.1.3 MC Error Information Register 1 (MC_ERR1 - Offset = 840) The high-order MC bus (system bus) address bits and error symptoms are latched into this register when the system bus to PCI bus bridge detects an error. If the event is a hard error, the register bits are locked. A write to clear symptom bits in the CAP Error Register unlocks this register.
Page 160
Table 5-4 MC Error Information Register 1 Initial Name Bits Type State Description VALID <31> Logical OR of bits <30:23> in the CAP_ERR Register. Set if MC_ERR0 and MC_ERR1 contain a valid address. Reserved <30:21> Dirty <20> Set if the system bus error was associated with a Read/Dirty transaction.
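The following Python sketch decodes the two MC_ERR1 fields named in Table 5-4. It is consistent with the first Table 4-4 row, whose "bad nondirty data" pattern requires bit <20> to be 0. The remaining fields are not shown in this excerpt and are not decoded. `decode_mc_err1` is a hypothetical name.

```python
def decode_mc_err1(value: int) -> dict:
    """Decode VALID <31> and Dirty <20> from an MC_ERR1 value.

    VALID: MC_ERR0 and MC_ERR1 contain a valid address.
    Dirty: the system bus error was associated with a Read/Dirty transaction.
    Bits <30:21> are reserved and ignored here.
    """
    return {
        "valid": bool((value >> 31) & 1),
        "dirty": bool((value >> 20) & 1),
    }


if __name__ == "__main__":
    print(decode_mc_err1(0x80100000))  # {'valid': True, 'dirty': True}
```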
1.1.1 CAP Error Register (CAP_ERR - Offset = 880) CAP_ERR is used to log information pertaining to an error detected by the CAP or MDP ASIC. If the error is a hard error, the register is locked. All bits, except the LOST_MC_ERR bit, are locked on hard errors.
Page 162
Table 5-5 CAP Error Register Initial Name Bits Type State Description MC_ERR VALID <31> Logical OR of bits <30:23> in this register. When set MC_ERR0 and MC_ERR1 are latched. RDSB <30> RW1C Uncorrectable ECC error detected by MDPB. Clear state in MDPB before clearing this bit.
Page 163
Table 5-5 CAP Error Register (continued) Initial Name Bits Type State Description LOST_MC_ERR <24> RW1C Set when an error is detected but not logged because the associated symptom fields and registers are locked with the state of an earlier error. PIO_OVFL <23>...
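Table 5-5 defines MC_ERR VALID <31> as the logical OR of the symptom bits <30:23>, and marks RDSB and LOST_MC_ERR as RW1C (write 1 to clear). The following is a minimal Python sketch of decoding a logged value and of modelling the clearing behavior. Decoding the value xC0000000 from the error log at the top of this section gives MC error info latched plus the uncorrectable ECC error detected by MDPB, which matches that log's text. `decode_cap_err` and `CapErrModel` are hypothetical. The sketch assumes that every symptom bit in <30:23> is RW1C, and it is not a driver for the real register. Section 5.1.3 also notes that clearing symptom bits unlocks MC_ERR1.

```python
from typing import List

MC_ERR_VALID = 1 << 31
SYMPTOMS = 0xFF << 23  # bits <30:23>; MC_ERR VALID <31> is their logical OR
CAP_ERR_FLAGS = {30: "RDSB", 24: "LOST_MC_ERR", 23: "PIO_OVFL"}


def decode_cap_err(value: int) -> List[str]:
    """Name the CAP_ERR bits from Table 5-5 that are set in a logged value."""
    names = ["MC_ERR_VALID"] if value & MC_ERR_VALID else []
    names += [name for bit, name in CAP_ERR_FLAGS.items() if (value >> bit) & 1]
    return names


class CapErrModel:
    """Toy model of write-1-to-clear (RW1C) behavior; not a hardware driver.

    Assumes every symptom bit in <30:23> is RW1C, as RDSB and LOST_MC_ERR are
    shown to be in Table 5-5.
    """

    def __init__(self, symptoms: int = 0) -> None:
        self.symptoms = symptoms & SYMPTOMS

    def read(self) -> int:
        # VALID is derived, so it drops as soon as the last symptom bit clears.
        return (MC_ERR_VALID if self.symptoms else 0) | self.symptoms

    def write(self, value: int) -> None:
        # A 1 clears the matching symptom bit; a 0 leaves it unchanged.
        # On real hardware, clear the MDPB state before clearing RDSB.
        self.symptoms &= ~(value & SYMPTOMS)


if __name__ == "__main__":
    print(decode_cap_err(0xC0000000))  # ['MC_ERR_VALID', 'RDSB']
    reg = CapErrModel(1 << 30)
    reg.write(1 << 30)
    print(hex(reg.read()))  # 0x0
```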
5.1.4 PCI Error Status Register 1 (PCI_ERR1 - Offset = 1040) PCI_ERR1 is used by the system bus to PCI bus bridge to log bus address <31:0> pertaining to an error condition logged in CAP_ERR. This register always captures PCI address <31:0>, even for a PCI DAC cycle. When the PCI_ERR_VALID bit in CAP_ERR is clear, the contents are undefined.
Chapter 6 Removal and Replacement This chapter describes removal and replacement procedures for field-replaceable units (FRUs). System Safety Observe the safety guidelines in this section to prevent personal injury. CAUTION: Wear an antistatic wrist strap whenever you work on a system. The AlphaServer cabinet system has a wrist strap connected to the frame at the front and rear.
FRU List Figure 6-1 shows the locations of FRUs in the system drawer, and Table 6-1 lists the part numbers of all field-replaceable units. Figure 6-1 System FRU Locations SCSI CD-ROM Disks OCP and Display CPUs and Memory Floppy Power Supplies PCI/EISA Options PKW0521-97...
Table 6-1 Field-Replaceable Unit Part Numbers
CPU Modules
B3007-AA  400 MHz CPU, 4 Mbyte cache
B3007-CA  533 MHz CPU, 4 Mbyte cache
Memory Modules
54-25084-DA  32 Mbyte DIMM (Synch)  20-47405-D3
54-25092-DA  128 Mbyte DIMM (Synch)  20-45619-D3
54-25149-01  Memory Riser Card
System Backplane, Display, and Support Hardware
54-25147-01  System motherboard...
Page 168
Table 6-1 Field-Replaceable Unit Part Numbers (continued)
Power Cords
BN26J-1K  North America, Japan 12V, 75 inches long
BN19H-2E  Australia, New Zealand, 2.5m long
BN19C-2E  Central Europe, 2.5m long
BN19A-2E  UK, Ireland, 2.5m long
BN19E-2E  Switzerland, 2.5m long
BN19K-2E  Denmark, 2.5m long
BN19Z-2E  Italy, 2.5m long
BN19S-2E...
Page 169
Table 6-1 Field-Replaceable Unit Part Numbers (continued) System Cables and Jumpers From 17-01495-01 Current share Current share Current share conn on cable conn on PS0 17-03970-02 Floppy signal Floppy conn Floppy cable (34 pin) on mbrd 17-03971-01 OCP signal OCP conn on OCP signal mbrd Twisted pair...
System Exposure The system has three sheet metal covers, one on top and one on each side. Removing the covers exposes the system card cage and the power/SCSI sections. Figure 6-2 Exposing the System (Top Cover, Release Latch) 6-6 Service Manual PRELIMINARY...
Page 171
Exposing the System Caution: Be sure the system On/Off button is in the “off” position before removing system covers. 1. Shut down the operating system. 2. Press the On/Off button to turn the system off. 3. Unlock and open the door that exposes the storage shelf. 4.
CPU Removal and Replacement CAUTION: Several different CPU modules work in these systems. Unless you are upgrading the system, be sure to replace the CPU you are removing with the same variant of CPU. Figure 6-3 Removing CPU Module WARNING: CPU modules and memory modules have parts that operate at high temperatures.
Page 173
Removal 1. Shut down the operating system and power down the system. 2. Expose the card cage side of the system. See Section 6.3. 3. Remove the memory riser card next to the CPU you are removing. See Section 6.6. 4.
CPU Fan Removal and Replacement Figure 6-4 Removing CPU Fan PKW-0516-97
Page 175
Removal 1. Follow the CPU Removal and Replacement procedure. 2. Unplug the fan from the module. 3. Remove the four Phillips head screws holding the fan to the Alpha chip’s heatsink. Replacement Reverse the above procedure. Verification If the system powers up, the CPU fan is working.
Memory Riser Card Removal and Replacement CAUTION: Several different memory modules work in these systems. Be sure to replace the broken module with the same variant. Figure 6-5 Removing Memory Riser Card WARNING: CPU modules and memory riser cards have parts that operate at high temperatures.
Page 177
Removal 1. Shut down the operating system and power down the system. 2. Expose the card cage side of the system. See Section 6.3. 3. There are two riser cards, one High and one Low. After you have determined which one to remove, unscrew the two retaining screws that secure the riser card to the card cage. 4.
DIMM Removal and Replacement Figure 6-6 Removing A DIMM from a Memory Riser Card DIMM Riser Card PKW0505B-97
Page 179
Removal 1. Shut down the operating system and power down the system. 2. Expose the card cage side of the system. See Section 6.3. 3. Remove the memory riser card that holds the broken DIMM. See Section 6.6. 4. Each slot connector on the riser card has prying/retaining levers. Press both levers in an arc away from the DIMM and gently pull the DIMM from the connector.
System Motherboard (54-25147-01) Removal and Replacement Figure 6-7 Removing System Motherboard Module Brace System motherboard PKW0518-97
Page 181
Removal 1. Shut down the operating system and power down the system. 2. Expose the card cage side of the system. See Section 6.3. 3. Remove both memory riser cards. 4. Remove all CPUs. 5. Remove all PCI and EISA options. 6.
PCI/EISA Option Removal and Replacement Figure 6-8 Removing PCI/EISA Option Slot Cover Screw Option Card IP00225 WARNING: To prevent fire, use only modules with current limited outputs. See National Electrical Code NFPA 70 or Safety of Information Technology Equipment, Including Electrical Business Equipment EN 60 950.
Page 183
Removal 1. Shut down the operating system and power down the system. 2. Expose the card cage side of the system. See Section 6.3. 3. To remove the faulty option: Disconnect any cables connected to the option. Remove any cables to other options that block access to the option you are removing.
6.10 Power Supply Removal and Replacement Figure 6-9 Removing Power Supply 4 rear screws 6/32 inch Power Supply 1 Power Supply 0 2 internal screws 3.5 mm PKW0517-97
Page 185
Removal 1. Shut down the operating system and power down the system. 2. Expose the card cage side of the system. See Section 6.3. 3. Unplug the power supply you are replacing. 4. Remove the four screws at the back of the system cabinet and the two screws at the back of the power supply that hold the power supply in place.
6.11 Power Harness Removal and Replacement Figure 6-10 Removing Power Harness To Floppy and Optional device To Motherboard Cable Clip To CD-ROM and To Power Supplies StorageWorks shelf Power Harness Current Share (70-31346-01) (17-01495-01) PKW0522-97
Page 187
Removal 1. Shut down the operating system and power down the system. 2. Expose both the card cage section and the power section of the system. See Section 6.3. 3. Remove the cable clip between the two sections of the system. 4.
6.12 System Fan Removal and Replacement Figure 6-11 Removing System Fan Cable to Fan 0 Cable to Fan 1 Fan 0 (17-31351-01) Module guides Fan 1 (17-31350-01) PKW0523-97
Page 189
Removal 1. Shut down the operating system and power down the system. 2. Expose the card cage side of the system. See Section 6.3. Removing Fan 0 3. Remove the CPU module(s). 4. Remove memory. 5. Unplug the twisted pair power cord from J5 on the motherboard and pass the cord through the sheet metal to the fan compartment.
6.13 Cover Interlock Removal and Replacement Figure 6-12 Removing Cover Interlock Interconnect switch PKW0519A-97
Page 191
Removal 1. Shut down the operating system and power down the system. 2. Expose the card cage side of the system. See Section 6.3. 3. Remove the CD-ROM. 4. Unplug the interlock switch’s pigtail cable from the cable it is connected to. 5.
6.14 Operator Control Panel Removal and Replacement Figure 6-13 Removing OCP PKW-0501A-97
Page 193
Removal 1. Shut down the operating system and power down the system. 2. Expose the card cage side of the system. See Section 6.3. 3. To remove the StorageWorks door: Open the door slightly and grab the left edge of the door with your left hand and the right edge of the door with your right hand.
6.15 CD-ROM Removal and Replacement Figure 6-14 Removing CD-ROM PKW0519-97
Page 195
Removal 1. Shut down the operating system and power down the system. 2. Expose the card cage side of the system. See Section 6.3. 3. Loosen the screw holding the CD-ROM bracket to the system (see Figure 6-14). 4. Detach both the power and signal connectors at the rear of the CD-ROM. 5.
6.16 Floppy Removal and Replacement Figure 6-15 Removing Floppy PKW0520-97
Page 197
Removal 1. Shut down the operating system and power down the system. 2. Expose the card cage side of the system. See Section 6.3. 3. Remove the two Phillips head screws holding the floppy in the system (see Figure 6-15). 4.
6.17 SCSI Disk Removal and Replacement Figure 6-16 Removing StorageWorks Disk PKW0501B-97
Page 199
Removal 1. Shut down the operating system and power down the system. 2. Open the front door exposing the StorageWorks disks. 3. Pinch the clips on both sides of the disk and slide it out of the shelf. Replacement Reverse the steps in the Removal procedure. Verification Power up the system (press the Halt button if necessary to bring up the SRM console).
Page 201
Removal 1. Shut down the operating system and power down the system. 2. Expose the card cage side of the system. See Section 6.3. 3. Remove the power and signal cables from the repeater on the side of the StorageWorks shelf. 4.
Page 203
Removal 1. Shut down the operating system and power down the system. 2. Expose the card cage side of the system. See Section 6.3. 3. Remove the power and signal cables from the repeater on the side of the StorageWorks shelf. 4.
Appendix A Running Utilities This appendix provides a brief overview of how to load and run utilities. The following topics are covered: • Running Utilities from a Graphics Monitor • Running Utilities from a Serial Terminal • Running ECU • Updating Firmware with LFU •...
Running Utilities from a Graphics Monitor Start AlphaBIOS and select Utilities from the menu. The next selection depends on the utility to be run. For example, to run ECU, select Run ECU from floppy. To run RCU, select Run Maintenance Program. Figure A-1 Running a Utility from a Graphics Monitor AlphaBIOS Setup F1=Help...
Running Utilities from a Serial Terminal Utilities are run from a serial terminal in the same way as from a graphics monitor. The menus are the same, but some keys are different. Table A-1 AlphaBIOS Option Key Mapping AlphaBIOS Key VTxxx Key Ctrl/A Ctrl/B...
Running ECU The EISA Configuration Utility (ECU) is used to configure EISA options on AlphaServer systems. The ECU can be run either from a graphics monitor or a serial terminal. 1. Start AlphaBIOS Setup. If the system is in the SRM console, issue the command alphabios.
Updating Firmware with LFU Start the Loadable Firmware Update (LFU) utility by issuing the lfu command at the SRM console prompt or by selecting Upgrade AlphaBIOS in the AlphaBIOS Setup screen. LFU is part of the SRM console. Example 6–1 Starting LFU from the SRM Console P00>>>...
You can start LFU from either the SRM console or the AlphaBIOS console. • From the SRM console, start LFU by issuing the lfu command. • From the AlphaBIOS console, select Upgrade AlphaBIOS from the AlphaBIOS Setup screen (see Figure A-2). A typical update procedure is: 1.
A.4.1 Updating Firmware from the Internal CD-ROM Insert the update CD-ROM, start LFU, and select cda0 as the load device. Example 6–1 Updating Firmware from the Internal CD-ROM ***** Loadable Firmware Update Utility ***** ² Select firmware load device (cda0, dva0, ewa0), or Press <return>...
Page 212
² Select the device from which firmware will be loaded. The choices are the internal CD-ROM, the internal floppy disk, or a network device. In this example, the internal CD-ROM is selected. ³ Select the file that has the firmware update, or press Enter to select the default file.
Page 213
Example 6–1 Updating Firmware from the Internal CD-ROM (Continued) ¶ UPD> update * WARNING: updates may take several minutes to complete for each device. · Confirm update on: AlphaBIOS [Y/(N)] y DO NOT ABORT! AlphaBIOS Updating to V6.40-1... Verifying V6.40-1... PASSED. Confirm update on: srmflash [Y/(N)] y DO NOT ABORT!
Page 214
¶ The update command updates the device specified or all devices. In this example, the wildcard indicates that all devices supported by the selected update file will be updated. · For each device, you are asked to confirm that you want to update the firmware.
A.4.2 Updating Firmware from the Internal Floppy Disk — Creating the Diskettes Create the update diskettes before starting LFU. See Section A.4.3 for an example of the update procedure. Table A-1 File Locations for Creating Update Diskettes on a PC Console Update Diskette I/O Update Diskette AS1200FW.TXT...
A.4.3 Updating Firmware from the Internal Floppy Disk — Performing the Update Insert an update diskette (see Section A.4.2) into the internal floppy drive. Start LFU and select dva0 as the load device. Example 6–1 Updating Firmware from the Internal Floppy Disk ***** Loadable Firmware Update Utility ***** Select firmware load device (cda0, dva0, ewa0), or ²...
Page 218
² Select the device from which firmware will be loaded. The choices are the internal CD-ROM, the internal floppy disk, or a network device. In this example, the internal floppy disk is selected. ³ Select the file that has the firmware update, or press Enter to select the default file.
Page 219
Example 6–1 Updating Firmware from the Internal Floppy Disk (Continued) µ UPD> update pfi0 WARNING: updates may take several minutes to complete for each device. ¶ Confirm update on: pfi0 [Y/(N)] y DO NOT ABORT! pfi0 Updating to 3.10... Verifying to 3.10... PASSED. ·...
µ The update command updates the device specified or all devices. ¶ For each device, you are asked to confirm that you want to update the firmware. The default is no. Once the update begins, do not abort the operation. Doing so will corrupt the firmware on the module. ·...
A.4.4 Updating Firmware from a Network Device Copy files to the local MOP server’s MOP load area, start LFU, and select ewa0 as the load device. Example 6–1 Updating Firmware from a Network Device ***** Loadable Firmware Update Utility ***** Select firmware load device (cda0, dva0, ewa0), or ²...
Page 222
Before starting LFU, download the update files from the Internet (see Preface). You will need the files with the extension .SYS. Copy these files to your local MOP server’s MOP load area. ² Select the device from which firmware will be loaded. The choices are the internal CD-ROM, the internal floppy disk, or a network device.
Page 223
Example 6–1 Updating Firmware from a Network Device (Continued) µ UPD> update * -all WARNING: updates may take several minutes to complete for each device. DO NOT ABORT! AlphaBIOS Updating to V6.40-1... Verifying V6.40-1... PASSED. DO NOT ABORT! kzpsa0 Updating to A11 ...
Page 224
µ The update command updates the device specified or all devices. In this example, the wildcard indicates that all devices supported by the selected update file will be updated. Typically, LFU requests confirmation before updating each console’s or device’s firmware. The -all option removes the update confirmation requests.
A.4.5 LFU Commands The commands summarized in Table A-2 are used to update system firmware. Table A-2 LFU Command Summary Command Function display Shows the system physical configuration. exit Terminates the LFU program. help Displays the LFU command list. lfu Restarts the LFU program. list Displays the inventory of update firmware on the selected device.
Page 226
display The display command shows the system physical configuration. Display is equivalent to issuing the SRM console command show configuration. Because it shows the slot for each module, display can help you identify the location of a device. exit The exit command terminates the LFU program, causes system initialization and testing, and returns the system to the console from which LFU was called.
Page 227
list The list command displays the inventory of update firmware on the CD-ROM, network, or floppy. Only the devices listed at your terminal are supported for firmware updates. The list command shows three pieces of information for each device: • Current Revision —...
Updating Firmware from AlphaBIOS Insert the CD-ROM or diskette with the updated firmware and select Upgrade AlphaBIOS from the main AlphaBIOS Setup screen. Use the Loadable Firmware Update (LFU) utility to perform the update. The LFU exit command causes a system reset. Figure A-3 AlphaBIOS Setup Screen AlphaBIOS Setup Display System Configuration...
A.6 Upgrading AlphaBIOS As new versions of Windows NT are released, it might be necessary to upgrade AlphaBIOS to the latest version. Additionally, as improvements are made to AlphaBIOS, it might be desirable to upgrade to take advantage of new AlphaBIOS features.
Hard Disk Partitioning The recommended partitioning of the first hard disk in your system is as follows: partition 1 should be 6 megabytes smaller than the total size of the drive (this large partition holds the operating system, applications, and data files), and partition 2 should be the remaining 6 megabytes (this small partition holds only the few files the computer needs to boot).
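The recommended layout is simple arithmetic: the second partition is always 6 megabytes, and the first partition gets the rest of the drive. The following is a minimal Python sketch that assumes the drive size is known in whole megabytes. `recommended_partitions` is a hypothetical name, and the sketch is not an AlphaBIOS function.

```python
SYSTEM_PARTITION_MB = 6  # small partition holding only the boot files


def recommended_partitions(disk_mb: int) -> tuple:
    """Return (partition 1 size, partition 2 size) in megabytes."""
    if disk_mb <= SYSTEM_PARTITION_MB:
        raise ValueError("disk is too small for the recommended layout")
    return disk_mb - SYSTEM_PARTITION_MB, SYSTEM_PARTITION_MB


if __name__ == "__main__":
    print(recommended_partitions(2048))  # (2042, 6)
```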
No Hard Disks Found When you start hard disk setup, if you receive a “No hard drives were found connected to your computer” message, it means that AlphaBIOS could not locate a hard drive. The likely conditions that cause this error are: •...
A.7.3 How AlphaBIOS Works with System Partitions If you are installing Windows NT for the first time, AlphaBIOS will determine that a system partition has not been defined when you select Install Windows NT in the AlphaBIOS Setup screen (see Figure A-1). When this occurs, AlphaBIOS searches for all FAT partitions on the system.
Using the Halt Button Use the Halt button to halt the DIGITAL UNIX or OpenVMS operating system when it hangs, clear the SRM console password, or force a halt assertion, as described in Section 3.12. Using Halt to Shut Down the Operating System You can use the Halt button if the DIGITAL UNIX or OpenVMS operating system hangs.
Halt Assertion A halt assertion allows you to disable automatic boots of the operating system so that you can perform tasks from the SRM console. Under certain conditions, you might want to force a “halt assertion.” A halt assertion differs from a simple halt in that the SRM console “remembers”...
Page 235
If you enter the RCM haltin command when Windows NT or AlphaBIOS is running, the interrupt is ignored. However, you can enter the RCM haltin command followed by the RCM reset command to force a halt assertion. Upon reset, the system powers up to the SRM console, but the SRM console does not load the AlphaBIOS console.
Appendix B SRM Console Commands and Environment Variables This appendix provides a summary of the SRM console commands and environment variables. The test command is described in Chapter 3 of this document. For complete reference information on the other SRM commands and environment variables, see the AlphaServer 1200 System User’s Guide.
Summary of SRM Console Commands
The SRM console commands are used to examine or modify the system state.
Table B-1 Summary of SRM Console Commands
Command       Function
alphabios     Loads and starts the AlphaBIOS console.
boot          Loads and starts the operating system.
clear envar   Resets an environment variable to its default value.
Table B-1 Summary of SRM Console Commands (Continued)
Command       Function
login         Turns off secure mode, enabling access to all SRM console commands during the current session.
              Displays information about the specified console command.
more          Displays a file one screen at a time.
prcache       Initializes and displays status of the PCI NVRAM.
B.1.1 Summary of SRM Environment Variables
Environment variables pass configuration information between the console and the operating system. Their settings determine how the system powers up, boots the operating system, and operates. Environment variables are set or changed with the set envar command and returned to their default values with the clear envar command.
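The set envar and clear envar behavior can be pictured as a table of current values backed by a table of defaults: set overwrites the current value, and clear copies the default back. The Python sketch below is a toy model under that assumption, not SRM firmware. The class name SrmEnvironment is hypothetical, and the default values in the example are illustrative placeholders; the real defaults are listed in the AlphaServer 1200 System User’s Guide.

```python
from typing import Dict

class SrmEnvironment:
    """Toy model of SRM environment-variable semantics (hypothetical, not firmware)."""

    def __init__(self, defaults: Dict[str, str]) -> None:
        self._defaults = dict(defaults)
        self._values = dict(defaults)  # every variable starts at its default

    def set(self, name: str, value: str) -> None:
        # Mirrors "set envar": change the current setting.
        # This sketch only tracks variables that have a known default.
        if name not in self._defaults:
            raise KeyError(f"unknown environment variable: {name}")
        self._values[name] = value

    def clear(self, name: str) -> None:
        # Mirrors "clear envar": return the variable to its default value.
        self._values[name] = self._defaults[name]

    def show(self, name: str) -> str:
        # Mirrors "show envar": report the current setting.
        return self._values[name]

# Illustrative defaults only; not taken from this manual.
env = SrmEnvironment({"auto_action": "halt", "os_type": "unix"})
env.set("auto_action", "boot")
env.clear("auto_action")  # auto_action is back to "halt"
```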
Table B-2 Environment Variable Summary (Continued)
Environment Variable   Function
memory_test            Specifies the extent to which memory will be tested. For DIGITAL UNIX systems only.
ocp_text               Overrides the default OCP display text with specified text.
os_type                Specifies the operating system and sets the appropriate console interface.
Recording Environment Variables
This worksheet lists all environment variables. Copy it and record the settings for each system. Use the show * command to list environment variable settings.
Table B-3 Environment Variables Worksheet
Environment Variable   System Name   System Name   System Name
auto_action
bootdef_dev
boot_osflags
...
Table B-3 Environment Variables Worksheet (Continued)
Environment Variable   System Name   System Name   System Name
pk*0_soft_term
sys_model_num
sys_serial_num
sys_type
tga_sync_green
tt_allow_login
Appendix C Managing the System Remotely
This appendix describes how to manage the system from a remote location using the Remote Console Manager (RCM). You can use the RCM from a console terminal at a remote location or from the local console terminal. Sections in this appendix are:
•...
C.1 RCM Overview
The remote console manager (RCM) monitors and controls the system remotely. The control logic resides on the system board. The RCM is a separate console from the SRM and AlphaBIOS consoles and is run from a serial console terminal or terminal emulator. Its command interface lets you reset, halt, and power the system on or off, regardless of the state of the operating system or hardware.
C.2 First-Time Setup
To set up the RCM to monitor a system remotely, connect the console terminal and modem to the ports at the back of the system, configure the modem port for dial-in, and then dial in.
Figure C-1 RCM Connections
C.2.1 Configuring the Modem
The RCM requires a Hayes-compatible modem. The control commands that the RCM sends to the modem are acceptable to a wide selection of modems. After selecting the modem, connect and configure it.
Qualified Modems
The modems that have been tested and qualified with this system are:
•...
C.2.2 Dialing In and Invoking RCM
To dial in to the RCM modem port, dial the modem, enter the modem password at the # prompt, and type the escape sequence. Use the hangup command to terminate the session. A sample dial-in dialog would look similar to the following:
Example 6–1 Sample Remote Dial-In Dialog
...
4. To terminate the modem connection, enter the RCM hangup command.
RCM> hangup
If the modem connection is terminated without using the hangup command, or if the line is dropped because of phone-line problems, the RCM detects the carrier loss and initiates an internal hangup command.
C.3 RCM Commands
The RCM commands given in Table C-1 are used to control and monitor a system remotely.
Table C-1 RCM Command Summary
Command     Function
alert_clr   Clears alert flag, stopping dial-out alert cycle
alert_dis   Disables the dial-out alert function
alert_ena   Enables the dial-out alert function
disable...
Command Conventions
• The commands are not case sensitive.
• A command must be entered in full.
• You can delete an incorrect command with the Backspace key before you press Enter.
• If you type a valid RCM command followed by extra characters and press Enter, the RCM accepts the correct command and ignores the extra characters.
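Taken together, these conventions amount to a case-insensitive prefix match: the typed text must begin with a complete command name, and anything after that name is discarded. The Python sketch below models that rule. The function name is hypothetical, the command list is only the partial set of RCM commands named in this appendix, and stripping surrounding whitespace is an assumption not stated in the manual.

```python
from typing import Optional

# Partial list: only the RCM command names that appear in this appendix.
RCM_COMMANDS = ("alert_clr", "alert_dis", "alert_ena", "disable", "enable",
                "haltin", "hangup", "poweron", "reset", "status")

def parse_rcm_command(line: str) -> Optional[str]:
    """Return the RCM command recognized in `line`, or None (hypothetical helper)."""
    text = line.strip().lower()  # commands are not case sensitive
    # A command must be entered in full: the input must start with a complete
    # command name. Any extra characters after that name are ignored.
    matches = [cmd for cmd in RCM_COMMANDS if text.startswith(cmd)]
    return max(matches, key=len) if matches else None

print(parse_rcm_command("HALTIN"))    # haltin
print(parse_rcm_command("resetxyz"))  # reset (extra characters ignored)
print(parse_rcm_command("pow"))       # None (abbreviations are not accepted)
```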
Two conditions must be met for the alert_enable command to work:
• A modem dial-out string must be entered from the system console.
• Remote access to the RCM modem port must be enabled with the enable command.
If the alert_enable command is entered when remote access is disabled, the following message is displayed:
*** error *** disable...
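The two preconditions can be checked independently, which makes it easy to report every unmet condition at once. The Python sketch below shows such a check. The RcmState fields and the function name are hypothetical, and the returned strings are descriptive text of our own, not the console’s actual error message (which is truncated in this excerpt).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RcmState:
    """Hypothetical snapshot of the settings that alert_enable depends on."""
    dialout_string: str = ""             # entered from the system console
    remote_access_enabled: bool = False  # turned on with the RCM enable command

def alert_enable_blockers(state: RcmState) -> List[str]:
    """List the unmet preconditions for alert_enable (empty list = ready)."""
    blockers = []
    if not state.dialout_string:
        blockers.append("no modem dial-out string has been entered")
    if not state.remote_access_enabled:
        blockers.append("remote access to the RCM modem port is disabled")
    return blockers
```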
The enable command can fail for the following reasons:
• No modem access password was set.
• The initialization string or the answer string might not be set properly. (See Section C.7.)
• The modem is not connected or is not working properly.
•...
haltin
The haltin command halts a managed system and forces a halt assertion. The haltin command is equivalent to pressing the Halt button on the control panel and holding it in. This command can be used at any time after system power-up to allow you to perform system management tasks.
poweron
The poweron command requests the RCM to power on the system. The poweron command is equivalent to pressing the On/Off button on the control panel to the on position. For the system power to come on, the following conditions must be met:
•...
The following events occur when the reset command is executed:
• The system restarts and the system console firmware reinitializes.
• The console exits RCM command mode and reconnects the serial terminal to the system COM1 serial port.
• The power-up messages are displayed, and then the console prompt or the operating system boot messages are displayed, depending on how the startup sequence has been defined.
The minimum password length is one character, followed by a carriage return. If only a carriage return is entered, the command fails with the message:
*** ERROR - illegal password ***
If you forget the password, you can enter a new password.
status
The status command displays the current state of the system sensors, as well as the current escape sequence and alarm information.
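The password rule reduces to a one-line length check once the terminating carriage return is removed. The Python sketch below applies it. The function name is hypothetical; the error text is the message quoted above. Only the minimum length is given in this excerpt, so the sketch enforces no maximum.

```python
from typing import Optional

ILLEGAL_PASSWORD_MSG = "*** ERROR - illegal password ***"

def validate_rcm_password(entry: str) -> Optional[str]:
    """Return None if the entry is acceptable, else the RCM error text (hypothetical helper)."""
    # The terminating carriage return is not part of the password itself.
    password = entry.rstrip("\r\n")
    if len(password) < 1:  # minimum length is one character
        return ILLEGAL_PASSWORD_MSG
    return None
```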
C.4 Dial-Out Alerts
When you are not monitoring the system remotely, you can use the RCM dial-out feature to notify you of a power failure within the system. When a dial-out alert is triggered, the RCM initializes the modem for dial-out, sends the dial-out string, hangs up the modem, and reconfigures the modem for dial-in.
Composing the Dial-Out String
Enter the set rcm_dialout command from the SRM console to compose the dial-out string. Use the show command to verify the string. See Example 6–2.
Example 6–2 Typical RCM Dial-Out Command
P00>>> set rcm_dialout "ATXDT9,15085553333,,,,,,5085553332#;"
P00>>> show rcm_dialout
rcm_dialout ATXDT9,15085553333,,,,,,5085553332#;...
Table C-3 Elements of the Dial-Out String
ATXDT   AT = Attention
        X = Forces the modem to dial “blindly” (not look for a dial tone). Enter X if the dial-out line modifies its dial tone when used for services such as voice mail.
        D = Dial
        T = Tone (for touch-tone)
,       Pause for 2 seconds...
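Because each comma in the dial-out string is a 2-second pause, the total delay in a string such as the one in Example 6–2 can be worked out by counting commas. The Python sketch below splits a string that begins with the ATXDT prefix into its digit groups and totals the pauses. The function name is hypothetical. The meanings of the trailing # and ; are cut off in this excerpt of Table C-3, so the sketch leaves them uninterpreted in the last group.

```python
from typing import Dict, List, Union

PREFIX = "ATXDT"              # Attention, blind dial, Dial, Tone (Table C-3)
PAUSE_SECONDS_PER_COMMA = 2   # "," = pause for 2 seconds

def describe_dialout(dialout: str) -> Dict[str, Union[List[str], int]]:
    """Split an ATXDT dial-out string into digit groups and total pause time (hypothetical helper)."""
    if not dialout.upper().startswith(PREFIX):
        raise ValueError("expected the dial-out string to start with ATXDT")
    body = dialout[len(PREFIX):]
    return {
        # Non-empty runs between commas; '#' and ';' are kept as-is.
        "groups": [g for g in body.split(",") if g],
        "total_pause_s": body.count(",") * PAUSE_SECONDS_PER_COMMA,
    }

info = describe_dialout("ATXDT9,15085553333,,,,,,5085553332#;")
print(info["groups"])         # ['9', '15085553333', '5085553332#;']
print(info["total_pause_s"])  # 14 (seven commas x 2 seconds)
```

In Example 6–2, the single comma after the 9 gives a 2-second pause, presumably to get an outside line. The six commas after the phone number add 12 seconds before the final digits are sent.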
C.5 Using the RCM Switchpack
The RCM operating mode is controlled by a switchpack on the system board. Use the switches to enable or disable certain RCM functions, if desired.
Figure C-2 Location of RCM Switchpack on System Board
Figure C-3 RCM Switches (Factory Settings)
Switch Name   Description
EN RCM        Enables or disables the RCM. The default is ON (RCM enabled). The OFF setting disables RCM.
MODEM OFF     Enables or disables the modem. The default is OFF (modem enabled).
RPD DIS       Enables or disables remote poweroff.
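The two fully documented switches have opposite polarity. For EN RCM, ON means enabled. For MODEM OFF, the switch name describes the ON position, so its default OFF setting leaves the modem enabled. The Python sketch below decodes those two switches. RPD DIS is omitted because its default is cut off in this excerpt. The function name and dictionary keys are hypothetical.

```python
from typing import Dict

def decode_rcm_switches(en_rcm_on: bool, modem_off_on: bool) -> Dict[str, bool]:
    """Translate switch positions into the functions they control (hypothetical helper)."""
    return {
        "rcm_enabled": en_rcm_on,           # EN RCM: ON enables the RCM
        "modem_enabled": not modem_off_on,  # MODEM OFF: OFF means the modem is enabled
    }

# Factory settings: EN RCM = ON, MODEM OFF = OFF, so both functions are enabled.
print(decode_rcm_switches(en_rcm_on=True, modem_off_on=False))
```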
Uses of the Switchpack
You can use the RCM switchpack to change the RCM operating mode or to disable the RCM altogether. The following are conditions under which you might want to change the factory settings.
• Switch 1 (EN RCM): Set this switch to OFF (disable) if you want to reset the baud rate of the COM1 port to a value other than the system default of 9600.
Resetting the RCM to Factory Defaults
You can reset the RCM to factory settings, if desired. You would need to do this if you forgot the escape sequence for the RCM. Follow the steps below.
1. Turn off the system.
2.
C.6 Troubleshooting Guide
Table C-4 is a list of possible causes and suggested solutions for symptoms you might see.
Table C-4 RCM Troubleshooting
Symptom: The local console terminal is not...
Possible Cause: Cables not correctly installed.
Suggested Solution: Check external cable installation.
Table C-4 RCM Troubleshooting (continued)
Symptom: RCM does not answer when the modem is called.
Possible Cause: Modem cables may be incorrectly installed.
Suggested Solution: Check modem phone lines and connections.
Possible Cause: RCM remote access is disabled.
Suggested Solution: Enable remote access. Set password and enable...
Table C-4 RCM Troubleshooting (continued)
Symptom: RCM installation is complete, but system does not power up.
Possible Cause: RCM Power Control: is set to DISABLE.
Suggested Solution: Invoke RCM and issue the poweron command.
Symptom: You reset the...
Possible Cause: AC power cords were not...
Suggested Solution: Refer to Section C.5.
C.7 Modem Dialog Details
This section is intended to help you reprogram your modem if necessary.
Default Initialization and Answer Strings
The modem initialization and answer command strings set at the factory for the RCM are:
Initialization string: AT&F0EVS0=0S12=50<cr>
Answer string: ATXA<cr>...
Initialization String Substitutions
The following modems require modified initialization strings.
Modem Model                                    Initialization String
Motorola 3400 Lifestyle 28.8                   at&f0e0v0x0s0=2
AT&T Dataport 14.4/FAX                         at&f0e0v0x0s0=2
Hayes Smartmodem Optima 288 V-34/V.FC + FAX    at&fe0v0x0s0=2
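When you compare the factory string with a substitute, it helps to pull out the S-register assignments (Sn=value), because those are where the strings differ most visibly. In the Hayes command set, register S0 sets the number of rings before the modem auto-answers. The Python sketch below extracts those assignments with a regular expression. The helper name is hypothetical, and the sketch does not interpret any other commands in the strings.

```python
import re
from typing import Dict

def s_registers(init_string: str) -> Dict[int, int]:
    """Extract Hayes S-register assignments (Sn=value) from an AT string (hypothetical helper)."""
    return {int(reg): int(val)
            for reg, val in re.findall(r"s(\d+)=(\d+)", init_string, re.IGNORECASE)}

print(s_registers("AT&F0EVS0=0S12=50<cr>"))  # {0: 0, 12: 50}  factory default
print(s_registers("at&f0e0v0x0s0=2"))        # {0: 2}          Motorola / AT&T substitute
```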