Lenovo System x3850 X6 Quick Start Manual page 30

When a QPI link fails down to half width, the IMM logs two error messages, for example
"Sensor CPU 1 QPILink has transitioned to critical from a less severe state" and "Sensor
CPU 2 QPILink has transitioned to critical from a less severe state." Figure 34 shows that if
the IMM flags CPU1 and CPU2, QPI Port1 has an issue. When the IMM reports this kind
of error, the issue can be in one of several places:
– CPU1 or CPU2
– CPU socket 1 or 2
– CPU book 1 or 2
– Midplane
The IMM and the hardware cannot detect where the QPI link is broken. They can only see that
the receive side (RX) is not at full width. This is why the IMM must report both CPUs on the
link when it goes to half width.
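As an illustration of how the paired messages might be picked out of an exported event log, the following minimal Python sketch collects the CPU numbers flagged with a QPILink event. The message pattern is assumed from the two example messages quoted above, and the function name cpus_flagged_for_qpi is hypothetical; actual IMM log exports may be formatted differently.

```python
import re

# Pattern assumed from the two example messages quoted in this guide;
# real IMM log exports may use a different layout.
QPI_LINK_RE = re.compile(r"Sensor CPU (\d+) QPILink has transitioned to critical")


def cpus_flagged_for_qpi(log_lines):
    """Return the sorted CPU numbers that have a QPILink critical event."""
    cpus = set()
    for line in log_lines:
        match = QPI_LINK_RE.search(line)
        if match:
            cpus.add(int(match.group(1)))
    return sorted(cpus)


example = [
    "Sensor CPU 1 QPILink has transitioned to critical from a less severe state",
    "Sensor CPU 2 QPILink has transitioned to critical from a less severe state",
]
print(cpus_flagged_for_qpi(example))
```

Two CPUs in the result indicate one degraded link between them, not two independent CPU failures.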
To isolate which FRU is defective, you need to complete a series of tests:
a. Check for bent pins on the midplane to avoid a situation where this one hardware
problem cascades into additional hardware failures. Perform this check before starting
the defective FRU isolation process.
b. Swap one of the failing books with a good book in the same system.
c. If the IMM still calls out the same two CPU books, swap out the other failing book.
d. If the failure follows the book, the fault is in the CPU or the CPU book.
e. If the failure follows the CPU book slot, the issue is with the midplane.
f. After you have isolated the error to a single book, swap CPUs with one of the good
books.
g. If the error follows the CPU, replace the CPU. If it follows the book, replace the book.
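The decision logic in steps b through g can be summarized in a short Python sketch. This is only an illustration of the procedure as written here, not a Lenovo tool; the function name isolate_fru and its parameters are hypothetical, and the observations must still come from physically swapping parts and rechecking the IMM log.

```python
from typing import Optional


def isolate_fru(failure_follows_book: bool,
                failure_follows_cpu: Optional[bool] = None) -> str:
    """Map swap-test observations to the suspect FRU.

    failure_follows_book: True if the error moved with the CPU book after
        the book swap, False if it stayed with the CPU book slot.
    failure_follows_cpu: result of the later CPU swap, or None if that
        swap has not been performed yet.
    """
    if not failure_follows_book:
        # The error stayed with the slot, so the midplane is suspect.
        return "midplane"
    if failure_follows_cpu is None:
        # Only narrowed down to one book so far; the CPU swap is still needed.
        return "CPU or CPU book"
    return "CPU" if failure_follows_cpu else "CPU book"


print(isolate_fru(failure_follows_book=True, failure_follows_cpu=False))
```

The sketch assumes the bent-pin check in step a has already been done, since it has to happen before any swapping starts.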
To date, most QPI errors have been related to damaged CPU socket pins. To inspect the pins,
look at the socket while moving the CPU book to get a view of the sockets from many angles.
Look for non-uniform light reflection off the pins. Pin damage can be subtle and hard to see
even under a microscope.
Lenovo System x3850 X6 and x3950 X6 Quick Start Guide
