Sun Microsystems Sun Fire X4140 Diagnostics Manual

Sun Fire™ X4140, X4240, and X4440
Servers Diagnostics Guide
Sun Microsystems, Inc.
www.sun.com
Part No. 820-3067-11
August 2008, Revision A
Submit comments about this document at: http://www.sun.com/hwdocs/feedback

Summary of Contents for Sun Microsystems Sun Fire X4140

  • Page 1 Sun Fire™ X4140, X4240, and X4440 Servers Diagnostics Guide Sun Microsystems, Inc. www.sun.com Part No. 820-3067-11 August 2008, Revision A Submit comments about this document at: http://www.sun.com/hwdocs/feedback...
  • Page 2 This distribution may include materials developed by third parties. Sun, Sun Microsystems, the Sun logo, Java, Solaris, Sun Fire 4140, Sun Fire 4240, and Sun Fire 4440 are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.
  • Page 3: Table Of Contents

    Contents Preface vii Initial Inspection of the Server 1 Service Troubleshooting Flowchart 1 Gathering Service Information System Inspection 3 Troubleshooting Power Problems 3 Externally Inspecting the Server Internally Inspecting the Server 4 Using SunVTS Diagnostic Software 7 Running SunVTS Diagnostic Tests SunVTS Documentation 8 Diagnosing Server Problems With the Bootable Diagnostics CD 8 Requirements 8...
  • Page 4 Using the ILOM Service Processor GUI to View System Information 43 Making a Serial Connection to the SP 44 Viewing ILOM SP Event Logs Interpreting Event Log Time Stamps Viewing Replaceable Component Information Viewing Sensors Error Handling 53 Sun Fire X4140, X4240, and X4440 Servers Diagnostics Guide • August 2008...
  • Page 5 Handling of Uncorrectable Errors Handling of Correctable Errors Handling of Parity Errors (PERR) Handling of System Errors (SERR) Handling Mismatching Processors Hardware Error Handling Summary 64 Index 69 Contents...
  • Page 7: Preface

    The Sun Fire X4140, X4240, and X4440 Servers Diagnostics Guide contains information and procedures for using available tools to diagnose problems with the servers.
    Before You Read This Document
    It is important that you review the safety guidelines in the Sun Fire X4140, X4240, and X4440 Safety and Compliance Guide.
  • Page 8: Related Documentation

    Related Documentation The document set for the Sun Fire X4140, X4240, and X4440 Servers is described in the Where To Find Sun Fire X4140, X4240, and X4440 Servers Documentation sheet that is packed with your system. You can also find the documentation at http://docs.sun.com.
  • Page 9: Web Sites

    Typographic Conventions
    Typeface: Meaning
    AaBbCc123: The names of commands, files, and directories; onscreen computer output
    AaBbCc123: What you type, when contrasted with onscreen computer output
    AaBbCc123: Book titles, new words or terms, words to be emphasized. Replace command-line variables with real names or values.
    * The settings on your browser might differ from these settings.
    Third-Party Web Sites...
  • Page 10: Sun Welcomes Your Comments

    You can submit your comments by going to: http://www.sun.com/hwdocs/feedback
    Please include the title and part number of your document with your feedback: Sun Fire X4140, X4240, and X4440 Servers Diagnostics Guide, part number 820-3067-11
  • Page 11: Initial Inspection Of The Server

    CHAPTER
    Initial Inspection of the Server
    This chapter includes the following topics:
    ■ “Service Troubleshooting Flowchart” on page 1
    ■ “Gathering Service Information” on page 2
    ■ “System Inspection” on page 3
    Service Troubleshooting Flowchart
    Use the following flowchart as a guideline for using the subjects in this book to troubleshoot the server.
  • Page 12: Gathering Service Information

    4. Check for potential device conflicts before you add a new device.
    5. Check for version dependencies, especially with third-party software.
    Refer to this section: “Using the ILOM Service Processor GUI to View System Information”...
  • Page 13: System Inspection

    System Inspection Controls that have been improperly set and cables that are loose or improperly connected are common causes of problems with hardware components. Troubleshooting Power Problems If the server will power on, skip this section and go to ■ Server”...
  • Page 14: Internally Inspecting The Server

    Power/OK LED is flashing. To completely power off the server, you must disconnect the AC power cords from the back panel of the server.
    FIGURE 1-1 X4140 Server Front Panel (callouts: Locate Button/LED, Power Button)
  • Page 15 FIGURE 1-2 X4440 Server Front Panel (callouts: Locate Button/LED, Power Button)
    2. Remove the server cover. For instructions on removing the server cover, refer to your server’s service manual.
    3. Inspect the internal status indicator LEDs. These can indicate component malfunction. For the LED locations and descriptions of their behavior, see Indicator LEDs” on page...
  • Page 16 10. If the problem with the server is not evident, you can obtain additional information by viewing the power-on self-test (POST) messages and BIOS event logs during system startup. Continue with “Viewing Event Logs” on page 21.
  • Page 17: Using Sunvts Diagnostic Software

    CHAPTER
    Using SunVTS Diagnostic Software
    This chapter contains information about the SunVTS™ diagnostic software tool.
    Running SunVTS Diagnostic Tests
    The servers are shipped with a Bootable Diagnostics CD that contains the Sun Validation Test Suite (SunVTS) software. SunVTS provides a comprehensive diagnostic tool that tests and validates Sun hardware by verifying the connectivity and functionality of most hardware controllers and devices on Sun platforms.
  • Page 18: Sunvts Documentation

    ■ To use the diagnostics CD you must have a keyboard, mouse, and monitor attached to the server on which you are performing diagnostics, or available through a remote KVM.
  • Page 19: Using The Bootable Diagnostics Cd

    Using the Bootable Diagnostics CD To use the diagnostics CD to perform diagnostics: 1. With the server powered on, insert the CD into the DVD-ROM drive. 2. Reboot the server, and press F2 during the start of the reboot so that you can change the BIOS setting for boot-device priority.
  • Page 20 To save the log files, you must save them to a removable media device or FTP them to another system.
  • Page 21: Troubleshooting Dimm Problems

    CHAPTER
    Troubleshooting DIMM Problems
    This chapter describes how to detect and correct problems with the server’s dual inline memory modules (DIMMs). It includes the following sections:
    ■ “DIMM Population Rules” on page 11
    ■ “DIMM Replacement Policy” on page 12...
  • Page 22: Dimm Replacement Policy

    2. During reboot, the BIOS checks the Machine Check registers and determines that the previous reboot was due to an uncorrectable error (UCE), then reports this in POST after the memtest stage:
    A Hypertransport Sync Flood occurred on last boot
  • Page 23 3. The BIOS reports this event in the service processor’s system event log (SEL) as shown in the sample IPMItool output below:
    ipmitool -H 10.6.77.249 -U root -P changeme -I lanplus sel list
    8 | 09/25/2007 | 03:22:03 | System Boot Initiated #0x02 | Initiated by warm reset | Asserted
    9 | 09/25/2007 | 03:22:03 | Processor #0x04 | Presence detected | Asserted
    a | 09/25/2007 | 03:22:03 | OEM #0x12 |...
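The SEL listing above is pipe-delimited, so it can be split into fields for filtering or reporting. The following is a minimal sketch in Python, not part of the Sun tooling. It assumes each complete line has exactly six fields and that the first field is a hexadecimal record ID (the sample shows `a` following `9`); the `SelEntry` class and `parse_sel_line` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SelEntry:
    record_id: int    # assumed hexadecimal in the listing ("a" follows "9")
    date: str
    time: str
    sensor: str
    description: str
    state: str


def parse_sel_line(line: str) -> SelEntry:
    """Split one line of `sel list` output on the '|' separators."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    rec_id, date, time, sensor, description, state = fields
    return SelEntry(int(rec_id, 16), date, time, sensor, description, state)


# The two complete lines from the sample output in the guide.
SAMPLE = """\
8 | 09/25/2007 | 03:22:03 | System Boot Initiated #0x02 | Initiated by warm reset | Asserted
9 | 09/25/2007 | 03:22:03 | Processor #0x04 | Presence detected | Asserted
"""

entries = [parse_sel_line(line) for line in SAMPLE.splitlines() if line.strip()]
for entry in entries:
    print(entry.record_id, entry.sensor, "->", entry.description)
```

In practice the text would come from running the `ipmitool` command shown in the guide against your own service processor address and credentials; the parsing step stays the same.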
  • Page 24: Correctable Dimm Errors

    TABLE 3-1 describes the contents of the display.
    Solaris FMA reports and (sometimes) retires memory with correctable Error Correction Code (ECC) errors. See your Solaris Operating System documentation for details. Use the command:
    fmdump -eV...
  • Page 25: Bios Dimm Error Messages

    to view ECC errors.
    ■ Linux:
    ■ The HERD utility can be used to manage DIMM errors in Linux. See the x64 Servers Utilities Reference Manual for details.
    ■ If HERD is installed, it copies messages from /dev/mcelog to /var/log/messages.
    ■ If HERD is not installed, a program called mcelog copies messages from...
  • Page 26 See FIGURE 3-1 for the locations of DIMMs and LEDs on the motherboard. See FIGURE 3-2 for the locations of DIMMs and LEDs on the mezzanine board.
  • Page 27 FIGURE 3-1 DIMMs and LEDs on Motherboard
  • Page 28: Isolating And Correcting Dimm Ecc Errors

    1. If you have not already done so, shut down your server to standby power mode and remove the cover.
    2. Inspect the installed DIMMs to ensure that they comply with the “DIMM Population Rules” on page 11.
  • Page 29 3. Press the PRESS TO SEE FAULT button, and inspect the DIMM fault LEDs. See FIGURE 3-1 and FIGURE 3-2. A flashing LED identifies a component with a fault.
    ■ For correctable errors (CEs), the LEDs correctly identify the DIMM where the errors were detected...
  • Page 30 11. Power on the server and run the diagnostics test again.
    12. Review the log file. If the tests identify the same error, the problem is in the CPU, not the DIMMs.
  • Page 31: Event Logs And Post Codes

    APPENDIX
    Event Logs and POST Codes
    This appendix contains information about the BIOS event log, the BMC system event log, the power-on self-test (POST), and console redirection. It contains the following sections:
    ■ “Viewing Event Logs” on page 21...
  • Page 32 BIOS Setup screen (partial). Menu tabs: Boot, Security, Chipset. Items: PCI Express Configuration, Remote Access Configuration, USB Configuration. Help text: Configure CPU. v02.61 (C)Copyright 1985-2006, American Megatrends, Inc.
  • Page 33 b. From the Advanced Settings screen, select Event Log Configuration. The Advanced Menu Event Logging Details screen is displayed.
    Event Logging Details screen (partial). Items: View Event Log, Mark all events as read, Clear Event Log. v02.61 (C)Copyright 1985-2006, American Megatrends, Inc.
  • Page 34 5. If the problem with the server is not evident, continue with “Using the ILOM Service Processor GUI to View System Information” on page 43, or “Viewing ILOM SP Event Logs” on page 45.
    Screen text (partial): View all events in the... Working. It will take up to 60 Seconds approx....
  • Page 35: Power-On Self-Test (Post)

    Power-On Self-Test (POST) The system BIOS provides a rudimentary power-on self-test. The basic devices required for the server to operate are checked, memory is tested, the LSI 1064 disk controller and attached disks are probed and enumerated, and the two Intel dual Gigabit Ethernet controllers are initialized.
  • Page 36: Redirecting Console Output

    ■ User Name: root
    ■ Password: changeme
    The Sun Integrated Lights Out Manager main GUI screen is displayed.
    8. Click the Remote Control tab.
    9. Click the Redirection tab.
  • Page 37 10. Set the color depth for the redirection console at either 6 or 8 bits.
    11. Click the Start Redirection button.
    12. When you are prompted for a user name and password, type the following:
    ■ User Name: root
    ■ Password: changeme...
  • Page 38: Changing Post Options

    BIOS Setup Boot menu (partial). Menu tabs: Boot, Security, Chipset. Items: Boot Device Priority, Hard Disk Drives, CD/DVD Drives. Help text: Configure Settings. v02.61 (C)Copyright 1985-2006, American Megatrends, Inc.
  • Page 39 3. Select Boot Settings Configuration. The Boot Settings Configuration screen is displayed.
    Boot Settings Configuration screen (partial). Items: Quick Boot, Quiet Boot, AddOn ROM Display Mode, Bootup Num-Lock, Wait For 'F1' If Error, Interrupt 19 Capture. v02.61 (C)Copyright 1985-2006, American Megatrends, Inc....
  • Page 40 ■ Default Boot Order – The letters in the brackets represent the boot devices. To see the letters defined, position your cursor over the field and read the definition in the right side of the screen.
  • Page 41: Post Codes

    POST Codes
    TABLE A-1 contains descriptions of each of the POST codes, listed in the same order in which they are generated. These POST codes appear as a four-digit string that is a combination of two-digit output from primary I/O port 80 and two-digit output from secondary I/O port 81....
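To make the four-digit format concrete, here is a minimal Python sketch that splits such a string into its two two-digit parts. The excerpt above does not say which pair comes first, so the ordering is an explicit parameter and the default is only an assumption; check TABLE A-1 in the full guide before relying on it. The function name `split_post_code` and the sample value `00d0` are hypothetical.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def split_post_code(code: str, port80_first: bool = True) -> dict:
    """Split a four-digit POST code into its port 80 and port 81 parts.

    port80_first is an assumption: the excerpt does not state which
    two digits come from which I/O port.
    """
    code = code.strip()
    if len(code) != 4 or any(ch not in HEX_DIGITS for ch in code):
        raise ValueError(f"expected four hexadecimal digits, got {code!r}")
    first, second = code[:2].upper(), code[2:].upper()
    if port80_first:
        return {"port80": first, "port81": second}
    return {"port80": second, "port81": first}


# Hypothetical code value, shown with both possible orderings.
print(split_post_code("00d0"))
print(split_post_code("00d0", port80_first=False))
```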
  • Page 42 Preparing CPU for booting to OS by copying all of the context of the BSP to all application processors present. NOTE: APs are left in the CLI HLT state.
  • Page 43: Post Code Checkpoints

    POST Code Checkpoints
    The POST code checkpoints are the largest set of checkpoints during the BIOS preboot process. TABLE A-2 describes the checkpoints that occur during the POST portion of the BIOS. These two-digit checkpoints are the output from primary I/O port 80.
    TABLE A-2 POST Code Checkpoints
    Post Code Description
    Disable NMI, Parity, video for EGA, and DMA controllers....
  • Page 44 Programming the memory hole or any kind of implementation that needs an adjustment in system RAM size if required. Updates CMOS memory size from memory found in memory test. Allocates memory for Extended BIOS Data Area from base memory.
  • Page 45 TABLE A-2 POST Code Checkpoints (Continued)
    Post Code Description
    Initializes NUM-LOCK status and programs the KBD typematic rate. Initialize Int-13 and prepare for IPL detection. Initializes IPL devices controlled by BIOS and option ROMs. Initializes remaining option ROMs. Generate and write contents of ESCD in NVRam. Log errors encountered during POST.
  • Page 46 OEM POST Error. This range is reserved for chipset vendors and system manufacturers. The error associated with this value may be different from one platform to the next.
  • Page 47: Status Indicator Leds

    APPENDIX
    Status Indicator LEDs
    This appendix contains information about the locations and behavior of the LEDs on the server. It describes the external LEDs that can be viewed on the outside of the server and the internal LEDs that can be viewed only with the main cover removed.
  • Page 48: Front Panel Leds

    Power Supply Fail: Amber
    AC OK: Green
    Locator LED Button
    Rear PS LED: (Amber) Power supply fault
    System Over Temperature LED: (Amber)
    Top Fan LED: (Amber) Service action required on fan(s)...
  • Page 49: Hard Drive Leds

    Hard Drive LEDs
    FIGURE B-3 Hard Drive LEDs
    Figure Legend
    Ready to remove LED: Blue – Service action is allowed
    Fault LED: Amber – Service action is required
    Status LED: Green – Blinks when data is being transferred
    Internal Status Indicator LEDs
    The server has internal status indicators on the motherboard and on the mezzanine board.
  • Page 50 Note – The mezzanine board, when present, obscures part of the motherboard, including the LEDs. The Motherboard Fault LED indicates that one or more of the LEDs on the motherboard is active.
    FIGURE B-4 DIMMs and LEDs on Motherboard
  • Page 51 FIGURE B-5 DIMMs and LEDs on Mezzanine Board
  • Page 53: Using The Ilom Service Processor Gui To View System Information

    APPENDIX
    Using the ILOM Service Processor GUI to View System Information
    This appendix contains information about using the Integrated Lights Out Manager (ILOM) Service Processor (SP) GUI to view monitoring and maintenance information for your server.
  • Page 54: Making A Serial Connection To The Sp

    Continue with the following procedures:
    ■ “Viewing ILOM SP Event Logs” on page 45
    ■ “Viewing Replaceable Component Information” on page 48
    ■ “Viewing Sensors” on page 50
  • Page 55: Viewing Ilom Sp Event Logs

    Viewing ILOM SP Event Logs Events are notifications that occur in response to some actions. The IPMI system event log (SEL) provides status information about the server’s hardware and software to the ILOM software, which displays the events in the ILOM web GUI. To view event logs: 1.
  • Page 56 ■ BIOS-generated events. These events relate to error messages generated in the BIOS.
    ■ System management software events. These events relate to events that occur within the ILOM software.
  • Page 57: Interpreting Event Log Time Stamps

    After you have selected a category of event, the Event Log table is updated with the specified events. The fields in the Event Log are described in TABLE C-1.
    TABLE C-1 Event Log Fields
    Field: Description
    Event ID: The number of the event, in sequence from number 1.
    Time Stamp: The day and time the event occurred.
  • Page 58: Viewing Replaceable Component Information

    When you first try to access the ILOM Service Processor, you are prompted to type the default user name and password. The default user name and password are:
    Default user name: root
    Default password: changeme
  • Page 59 2. From the System Information tab, select Components. The Replaceable Component Information page is displayed. See FIGURE C-2.
    FIGURE C-2 Replaceable Component Information Page
    3. Select a component from the drop-down list. Information about the selected component is displayed.
    4. If the problem with the server is not evident after viewing replaceable component information, continue with...
  • Page 60: Viewing Sensors

    Default user name: root
    Default password: changeme
    2. From the System Monitoring tab, select Sensor Readings. The Sensor Readings page is displayed. See FIGURE C-3.
  • Page 61 FIGURE C-3 Sensor Readings Page
    3. Click the Refresh button to update the sensor readings to their current status.
    4. Click a sensor to display its thresholds. A display of properties and values appears. See the example in FIGURE C-4.
  • Page 62: Information, Continue With "Running Sunvts Diagnostic Tests" On Page

    FIGURE C-4 Sensor Details Page
    5. If the problem with the server is not evident after viewing sensor readings information, continue with “Running SunVTS Diagnostic Tests”.
  • Page 63: Error Handling

    APPENDIX
    Error Handling
    This appendix contains information about how the servers process and log errors. See the following sections:
    ■ “Handling of Uncorrectable Errors” on page 53
    ■ “Handling of Correctable Errors” on page 56...
  • Page 64 ■ The BIOS skips the faulty DIMM on the next POST memory test.
    ■ The BIOS reports available memory, excluding the faulty DIMM pair.
    FIGURE D-1 shows an example of a DMI log screen from the BIOS Setup page.
  • Page 65 FIGURE D-1 DMI Log Screen, Uncorrectable Error
  • Page 66: Handling Of Correctable Errors

    ■ Solaris support provides full self-healing and automated diagnosis for the CPU and Memory subsystems.
    ■ FIGURE D-2 shows an example of a DMI log screen from the BIOS Setup page:
  • Page 67 FIGURE D-2 DMI Log Screen, Correctable Error
    ■ If during any stage of memory testing the BIOS finds itself incapable of reading/writing to the DIMM, it takes the following actions:
    ■ The BIOS disables the DIMM as indicated by the Memory Decreased message...
  • Page 68 EXAMPLE D-1 DMI Log Screen, Correctable Error, Memory Decreased
  • Page 69: Handling Of Parity Errors (Perr)

    Handling of Parity Errors (PERR)
    This section lists facts and considerations about how the server handles parity errors (PERR).
    ■ The handling of parity errors works through non-maskable interrupts (NMIs).
    ■ During BIOS POST, the NMI is logged in the DMI and the SP SEL. See the...
  • Page 70 Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?
    Aug 5 05:15:00 d-mpk12-53-159 kernel: Dazed and confused, but trying to continue
    Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?
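Messages like the ones in this sample can be picked out of a Linux system log with a short script. The sketch below is a hedged Python illustration, not a Sun utility: it assumes syslog-style lines of the form shown above (month, day, time, host, `kernel:`), and it matches only the two message texts that appear in the sample. The names `NMI_MARKERS`, `LINE_RE`, and `find_nmi_lines` are hypothetical.

```python
import re

# Message texts taken from the sample above; other kernels may word these differently.
NMI_MARKERS = (
    "Dazed and confused, but trying to continue",
    "Do you have a strange power saving mode enabled?",
)

# Assumed syslog layout: "Mon D HH:MM:SS host kernel: message".
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+kernel:\s+(?P<msg>.*)$"
)


def find_nmi_lines(lines):
    """Return (timestamp, host, message) for each line carrying a known marker."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and any(marker in match.group("msg") for marker in NMI_MARKERS):
            hits.append((match.group("stamp"), match.group("host"), match.group("msg")))
    return hits


SAMPLE_LOG = """\
Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?
Aug 5 05:15:00 d-mpk12-53-159 kernel: Dazed and confused, but trying to continue
Aug 5 05:15:00 d-mpk12-53-159 kernel: Do you have a strange power saving mode enabled?
"""

hits = find_nmi_lines(SAMPLE_LOG.splitlines())
for stamp, host, msg in hits:
    print(stamp, host, msg)
```

To scan a real log, pass the lines of the log file (for example, /var/log/messages on systems that use it) to `find_nmi_lines` instead of the embedded sample.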
  • Page 71: Handling Of System Errors (Serr)

    Note – The Linux system reboots, but does not inform the BIOS of this incident.
    Handling of System Errors (SERR)
    This section lists facts and considerations about how the server handles system errors (SERR).
    ■ System error handling works through the HyperTransport Synch Flood Error...
  • Page 72 SEL record fields (values only): 04, Critical Interrupt, 00, Sensor-specific Discrete, Assertion Event, 05ffff
    FIGURE D-5 DMI Log Screen with Error
  • Page 73: Handling Mismatching Processors

    Handling Mismatching Processors
    This section lists facts and considerations about how the server handles mismatching processors.
    ■ The BIOS performs a complete POST.
    ■ The BIOS displays a report of any mismatching CPUs, as shown in the following example:
    AMIBIOS(C)2003 American Megatrends, Inc.
    BIOS Date: 08/10/05 14:51:11 Ver: 08.00.10
    CPU : AMD Opteron(tm) Processor 254, Count : 3, CPU Revision, CPU0 : E4, CPU1 : E6...
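The revision list in the BIOS report can be checked mechanically. The following Python sketch extracts the `CPUn : revision` pairs from a captured report line and flags the case where they differ. It assumes that differing revision strings are what the report means by a mismatch, which the excerpt suggests (E4 versus E6) but does not state outright; `cpu_revisions` and `revisions_mismatch` are hypothetical helper names.

```python
import re

# Matches pairs such as "CPU0 : E4"; the bare "CPU : AMD Opteron..." prefix is skipped
# because it has no CPU index.
REV_RE = re.compile(r"CPU(\d+)\s*:\s*([0-9A-Za-z]+)")


def cpu_revisions(report: str) -> dict:
    """Map CPU index to the revision string found in the report text."""
    return {int(index): rev for index, rev in REV_RE.findall(report)}


def revisions_mismatch(report: str) -> bool:
    """True when more than one distinct revision appears (assumed mismatch rule)."""
    return len(set(cpu_revisions(report).values())) > 1


# The visible part of the example report line from the guide.
SAMPLE_REPORT = (
    "CPU : AMD Opteron(tm) Processor 254, Count : 3, "
    "CPU Revision, CPU0 : E4, CPU1 : E6"
)

print(cpu_revisions(SAMPLE_REPORT))
print(revisions_mismatch(SAMPLE_REPORT))
```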
  • Page 74: Hardware Error Handling Summary

    Error: BIOS POST failure
    Description: Server BIOS does not pass POST.
    Handling: The SP controls the system reset, so the system may power on, but will not come out of reset.
  • Page 75 TABLE D-1 Hardware Error Handling Summary (Continued)
    Error: Single-bit DRAM ECC error
    Description: With ECC enabled in the BIOS Setup, the CPU detects and corrects a single-bit error on the DIMM interface.
    Error: Single four-bit DRAM error
    Description: With CHIP-KILL enabled in the BIOS Setup, the CPU detects and corrects for the failure of a...
  • Page 76 Handling: Sync floods on HyperTransport links, the machine resets itself, and error information gets retained through reset.
    Error: Single fan failure
    Description: Fan failure is detected by reading tach signals.
  • Page 77 TABLE D-1 Hardware Error Handling Summary (Continued)
    Error: Multiple fan failure
    Description: Fan failure is detected by reading tach signals.
    Error: Single power supply failure
    Description: When any of the AC/DC PS_VIN_GOOD or PS_PWR_OK signals are deasserted.
    Error: DC/DC power converter failure
    Description: POWER_GOOD signal is deasserted from the DC/DC converters.
  • Page 79: Index

    Index
    BIOS
      changing POST options, 28
      event logs, 21
      POST code checkpoints, 33
      POST codes, 31
      POST overview, 25
      redirecting console output for POST, 26
    Bootable Diagnostics CD, 8
    comments and suggestions, x
    component inventory
      viewing with ILOM SP GUI, 48
    console output, redirecting, 26
    correctable errors, handling, 56
    diagnostic software...
  • Page 80 ILOM SP GUI, 50
    serial connection to ILOM SP, 44
    SERR, 61
    Service Processor system event log, See SP SEL
    service visit information, gathering, 2
    shutdown procedure, 4...
