IBM RS/6000 SP Problem Determination Manual page 138

2. All board patterns
3. Half board pattern
4. All the nodes are reporting a problem
5. All of the switch-to-switch connections are reporting a problem
For the most part, if an internal port reports an error, the FRU is the card
that contains that port. You may discern patterns that indicate clock
problems, but you will quickly deduce that the FRU is the same because the
clock problem is a broken driver on the switch card.
There should be very few errors that occur on the whole board, because
once you get enough errors, you cannot get to the other parts of the board to
see if they are in error.
When you see a half board with a problem, check to see if it matches the
clock tree. The majority of these patterns of error are clock-related.
If all of the nodes are reporting an error, it is usually a -1, -7, -16, or -19
error. These quite often point to clock problems or power sequence problems,
which result from nodes being on the incorrect clock.
A -1 on all of the nodes on a board occurs only on boards other than the one
to which the primary node is attached. The primary node breaks the pattern on
its own board, because if you have gotten as far as generating an out.top
file, the primary node is operational and reachable.
With a -1 on all the nodes, you will probably see -2, -3, or -4 on the ports that
connect to other switches. What this indicates is that the failing switch board
is not on the same clock as the primary node's switch board. This can
happen because:
a. The clock tree was not set up properly by issuing an Eclock with the
correct Eclock topology file.
b. The clock to this board is broken in this switch assembly, in the cable, or
in the switch that sources this board's clock.
c. If the primary node's board is the only one that is tunable, it may be set
to the incorrect clock.
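
The pattern described above (a -1 on every node port of a board, together with
-2, -3, or -4 on its switch-to-switch ports) can be sketched as a simple check.
This is a minimal illustration only: the function name and the idea of passing
the codes in as plain lists of integers are assumptions made for the example,
not part of any SP tool, and the out.top file is not parsed here.

```python
# Hypothetical helper: the inputs are assumed to be error codes already
# collected by hand for one switch board. Nothing here reads out.top.
SWITCH_LINK_CODES = {-2, -3, -4}


def looks_off_primary_clock(node_codes, link_codes):
    """Return True when every node port reports -1 and at least one
    switch-to-switch port reports -2, -3, or -4."""
    if not node_codes:
        return False  # no node ports to judge
    all_nodes_minus_one = all(code == -1 for code in node_codes)
    links_affected = any(code in SWITCH_LINK_CODES for code in link_codes)
    return all_nodes_minus_one and links_affected


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(looks_off_primary_clock([-1, -1, -1, -1], [-2, 0, -4, 0]))
```

A True result only suggests that the board may not be on the same clock as the
primary node's switch board; causes a, b, and c above still have to be checked
one by one.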
A -16 can indicate a poor power-on sequence. For example, if the nodes in a
frame are powered on before the switch, all the adapters will be set to their
internal clock (the on-card oscillator). You can discover this by looking in the
dtbx.trace file. Another example is that the switch was powered on before
the nodes, but the clock was not properly set. When the proper clock is
selected, the clock is momentarily interrupted, which causes the adapter to
have problems.
-7 and -19 both point to daemon time-out problems. The -7 occurs when the
primary node has initialized a node's adapter and is waiting for that node's
daemon to respond to communication that asks what node it thinks it is. The
-19 occurs after the whole switch has been tuned and the route table is being
distributed to the nodes. In this case, the daemon on the receiving node is
not responding. These -7 and -19 errors can occur more easily in large
systems than in small systems, because large networks will naturally eat
more into the time-out period during normal operation than a small network
will.
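
As a quick reference, the four all-node codes discussed above can be collected
in a small lookup. This is a sketch only: the wording of each entry is a
paraphrase of this section, and the function name and the dictionary of node
labels are assumptions made for the example.

```python
# Hypothetical summary table; the descriptions paraphrase this section only.
LIKELY_CAUSES = {
    -1: "board is probably not on the same clock as the primary node's board",
    -7: "daemon time-out: node's daemon did not answer the primary node",
    -16: "poor power-on sequence, or clock changed after adapters came up",
    -19: "daemon time-out: node's daemon not responding to route table",
}


def summarize(node_codes):
    """node_codes maps a node label (any string) to its error code (int)."""
    if not node_codes:
        return "no nodes reported"
    codes = set(node_codes.values())
    if len(codes) != 1:
        return "mixed codes; no single all-node pattern"
    code = next(iter(codes))
    return LIKELY_CAUSES.get(code, "not covered in this section")


if __name__ == "__main__":
    # Node labels and codes below are made up for illustration.
    print(summarize({"node1": -19, "node2": -19}))
```

The lookup is only a starting point for diagnosis; the same code can have
other causes that this section does not cover.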
When all of the switch-to-switch connections are reporting a problem, it is
quite often a clock or power problem. It can be on the board, the clock card,
the clock source cable, the power card, or the power cable.
This soft copy for use by IBM employees only.
