Troubleshooting The Infiniband Network - HP IBRIX X9000 Installation Manual


/lib/modules/2.6.18-194.el5/updates/kernel/net/sunrpc/auth_gss/auth_rpcgss.ko
/lib/modules/2.6.18-194.el5/updates/kernel/fs/exportfs/exportfs.ko
3. Rename each of the above files by appending the .ofed suffix (/path/name.ofed). For example:
mv /lib/modules/2.6.18-194.el5/updates/kernel/fs/nfs/nfs.ko \
/lib/modules/2.6.18-194.el5/updates/kernel/fs/nfs/nfs.ko.ofed
4. Rebuild the module dependencies with the depmod -a command, then reboot the nodes. A reboot is
necessary for the changes to take effect.
depmod -a
reboot
5. Execute the following commands on each node to ensure that the modules are loaded on
startup:
chkconfig openibd on
service openibd start
6. This step is needed only if you have unmanaged InfiniBand switches in your network. If the
subnet manager runs on managed switches, skip this step.
The Subnet Manager opensmd must be running on at least one file serving node. Run the
command /usr/sbin/sminfo as root to determine whether opensmd is running on the
IB network.
If opensmd is not running, issue the following commands:
chkconfig opensmd on
service opensmd start
7. Verify the status of the HCA.
NOTE:
Run the following checks:
* ofed_info
* ibstat
* ibclearcounters
* ibdiagnet -lw 4x -ls 10 -r
8. Verify that the link is up and the state is active. If the state is initializing, there is no
subnet manager running on the fabric. See step 6.
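
The check in step 8 can be scripted by parsing ibstat output. The following is a minimal sketch, assuming ibstat reports each port with a "State:" line and a "Physical state:" line. The function name check_ib_link is hypothetical. It reads the ibstat output from standard input, so it can also be tried against saved output.

```shell
#!/bin/bash
# Hypothetical helper: read ibstat output on stdin and report link health.
# Returns 0 if every port is Active with the physical link up, 1 otherwise.
check_ib_link() {
    awk '
        BEGIN { bad = 0; ports = 0 }
        # Logical port state, for example "State: Active"
        /^[ \t]*State:/ {
            ports++
            if ($2 != "Active") {
                bad = 1
                printf "port %d state is %s\n", ports, $2
                if ($2 == "Initializing")
                    print "no subnet manager running on the fabric; see step 6"
            }
        }
        # Physical port state, for example "Physical state: LinkUp"
        /^[ \t]*Physical state:/ {
            if ($3 != "LinkUp") {
                bad = 1
                printf "port %d physical state is %s\n", ports, $3
            }
        }
        END {
            if (ports == 0) { print "no ports found"; exit 1 }
            exit bad
        }
    '
}

# Example (on a file serving node):
# ibstat | check_ib_link && echo "link is up and active"
```

The function prints nothing when all ports are healthy, so its exit status alone can drive a larger health-check script.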

Troubleshooting the InfiniBand network

Force connected mode for a file serving node. The setting is under /sys/class/net/ib0:
echo connected > /sys/class/net/ib0/mode
ifconfig ib0 mtu 65520
NOTE:
For Windows WinOF (OFED) IB client connectivity, check Windows Sockets Direct
(WSD). It must be enabled on Windows.
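
After forcing connected mode on ib0, the result can be confirmed by reading the interface attributes back. The following is a minimal sketch, assuming the interface directory exposes mode and mtu files. The function name check_connected_mode and its optional directory argument are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: confirm that an IPoIB interface is in connected mode
# with the 65520-byte MTU. The first argument is the interface directory
# (default /sys/class/net/ib0).
# Returns 0 on a match, 1 on a mismatch, 2 if an attribute cannot be read.
check_connected_mode() {
    local dir="${1:-/sys/class/net/ib0}"
    local mode mtu
    mode=$(cat "$dir/mode" 2>/dev/null) || { echo "cannot read $dir/mode"; return 2; }
    mtu=$(cat "$dir/mtu" 2>/dev/null) || { echo "cannot read $dir/mtu"; return 2; }
    if [ "$mode" = "connected" ] && [ "$mtu" = "65520" ]; then
        echo "ok: mode=$mode mtu=$mtu"
        return 0
    fi
    echo "mismatch: mode=$mode mtu=$mtu (expected connected, 65520)"
    return 1
}

# Example:
# check_connected_mode /sys/class/net/ib0
```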
Troubleshoot physical errors (logical, sim errors, and so on). Note the following:
Use ibstat to check errors on InfiniBand nodes.
Use ibclearcounters to reset the error counters, then watch for new increments.
Check /sys/class/infiniband/mthca0/ports/1/counters.
symbol_error and port_rcv_errors indicate physical hardware failures.
If you are using a host-based SM, by default it is tied to Port 1 of the HCA.
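
Watching for counter increments can be scripted against the counters directory named above. The following is a minimal sketch, assuming the directory holds one file per counter, each containing a single integer. The function names snapshot_ib_counters and report_ib_increments, and the snapshot file path in the example, are hypothetical.

```shell
#!/bin/bash
# Default location named in the text above; pass another directory for
# a different HCA or port.
IB_COUNTERS=/sys/class/infiniband/mthca0/ports/1/counters

# Hypothetical helper: print "name value" for every counter file.
snapshot_ib_counters() {
    local dir="${1:-$IB_COUNTERS}" f
    for f in "$dir"/*; do
        if [ -f "$f" ]; then
            echo "$(basename "$f") $(cat "$f")"
        fi
    done
}

# Hypothetical helper: compare a saved snapshot (first argument) with the
# current values and print each counter that has increased.
report_ib_increments() {
    local snapshot="$1" dir="${2:-$IB_COUNTERS}"
    local name before after
    while read -r name before; do
        after=$(cat "$dir/$name")
        if [ "$after" -gt "$before" ]; then
            echo "$name increased: $before -> $after"
        fi
    done < "$snapshot"
}

# Example:
# snapshot_ib_counters > /tmp/ib_counters.before
# sleep 60
# report_ib_increments /tmp/ib_counters.before
```

Keeping the snapshot and the comparison as separate functions lets the interval between them span a test run or a cable reseat rather than a fixed sleep.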