We acknowledge the right of proprietors of trademarks mentioned in this book. The information in this document is subject to change without notice. Bull will not be liable for errors contained herein, or for incidental or consequential damages in connection with the use of this material.
Contents
Safety notices . . . . . . . . . . . . . . . . . . . . ix
High-performance computing clusters using InfiniBand hardware . . . . . . . . 1
Clustering systems by using InfiniBand hardware .
Planning Fast Fabric Toolset . . . . . . . . . . . . . . 63
Planning for fabric management server . . . . . . . . . . 64
Planning event monitoring with QLogic and management server . . 66
Planning event monitoring with xCAT on the cluster management server . . 66
Planning to run remote commands with QLogic from the management server . .
Checking InfiniBand configuration in AIX . . . . . . . . . 215
Checking system configuration in AIX . . . . . . . . . . 217
Verifying the availability of processor resources . . . . . . 217
Verifying the availability of memory resources . . . . . . . 217
Checking InfiniBand configuration in Linux . .
Example PortRcvRemotePhysicalErrors analyses . . . . . . . 262
Interpreting security errors . . . . . . . . . . . . . . 264
Diagnose a link problem based on error counters . . . . . . 264
Error counter details . . . . . . . . . . . . . . . . 265
Categorizing Error Counters . . . . . . . . . . . . . . 265
Link Integrity Errors . . . . . . . . . . . . . . . . 266
LinkDownedCounter .
Power Systems: High performance clustering
Safety notices
Safety notices may be printed throughout this guide:
- DANGER notices call attention to a situation that is potentially lethal or extremely hazardous to people.
- CAUTION notices call attention to a situation that is potentially hazardous to people because of some existing condition.
DANGER
When working on or around the system, observe the following precautions:
Electrical voltage and current from power, telephone, and communication cables are hazardous. To avoid a shock hazard:
- Connect power to this unit only with the IBM provided power cord. Do not use the IBM provided power cord for any other product.
Observe the following precautions when working on or around your IT rack system:
- Heavy equipment: personal injury or equipment damage might result if mishandled.
- Always lower the leveling pads on the rack cabinet.
- Always install stabilizer brackets on the rack cabinet.
- To avoid hazardous conditions due to uneven mechanical loading, always install the heaviest devices in the bottom of the rack cabinet.
CAUTION:
Removing components from the upper positions in the rack cabinet improves rack stability during relocation. Follow these general guidelines whenever you relocate a populated rack cabinet within a room or building:
- Reduce the weight of the rack cabinet by removing equipment starting at the top of the rack cabinet.
(L003) All lasers are certified in the U.S. to conform to the requirements of DHHS 21 CFR Subchapter J for class 1 laser products. Outside the U.S., they are certified to be in compliance with IEC 60825 as a class 1 laser product.
CAUTION: Data processing environments can contain equipment transmitting on system links with laser modules that operate at greater than Class 1 power levels. For this reason, never look into the end of an optical fiber cable or open receptacle. (C027) CAUTION: This product contains a Class 1M laser.
High-performance computing clusters using InfiniBand hardware
You can use this information to guide you through the process of planning, installing, managing, and servicing high-performance computing (HPC) clusters that use InfiniBand hardware. This information serves as a navigation aid through the publications, produced by IBM or other vendors, that are required to install the hardware units, firmware, operating system, software, or applications.
Table 1. High-level view of the cluster implementation process and associated information (continued) Content Description “Planning installation flow” on page 68 Provides guidance in how the various tasks relate to each other and who is responsible for the various planning tasks for the cluster.
- "Cluster software and firmware information resources" on page 5

General cluster information resources
The following table lists general cluster information resources:
Table 2. General cluster resources (Component | Document | Plan | Install | Manage and service)
IBM Cluster Information | This document
IBM Clusters with the InfiniBand Switch | IBM Clusters with the InfiniBand Switch readme file, http://www14.software.ibm.com/webapp/set2/...
Table 3. Cluster hardware information resources (continued) Component Document Plan Install Manage and service Logical partitioning Logical Partitioning Guide for all systems Install Instructions for IBM LPAR on System i and System P ® BladeCenter JS22 Planning, Installation, and Service Guide and JS23 IBM GX HCA Custom Installation Instructions, one for each...
Table 4. Cluster management software resources (continued) Component Document Plan Install Manage and service QLogic InfiniServ InfiniServ Fabric Access Software Users Guide Stack http://filedownloads.qlogic.com/files/driver/ 68069/ QLogic_OFED+_Users_Guide_Rev_C.pdf QLogic Open Fabrics QLogic OFED+ Users Guide Enterprise http://filedownloads.qlogic.com/files/driver/ Distribution (OFED) 68069/ Stack QLogic_OFED+_Users_Guide_Rev_C.pdf Hardware Installation and Operations Guide for the HMC Management...
Table 5. Cluster software and firmware information resources (continued) Component Document Plan Install Manage and service IBM HPC Clusters GPFS: Concepts, Planning, and Installation Guide Software GPFS: Administration and Programming Reference GPFS: Problem Determination Guide GPFS: Data Management API Guide ®...
Figure 2. Main components in fabric data flow The following figure shows the high-level software architecture. Figure 3. High-level software architecture The following figure shows a simple InfiniBand configuration illustrating the tasks, the software layers, the windows, and the hardware. The host channel adapter (HCA) shown is intended to be a single HCA card with four physical ports.
Figure 5. Two-port GX or GX+ host channel adapter
A four-port HCA has two chips with a total of four logical switches, two in each chip. The logical structure affects how the HCA is represented to the Subnet Manager. On each port, each logical switch and LHCA represents a separate InfiniBand node to the Subnet Manager.
Because each GUID in a network must be unique, the IBM HCA is assigned subsequent GUIDs by the firmware. You can choose the offset that is used for the LHCA. This information is also stored in the logical partition profile on the HMC. Therefore, when an HCA is replaced, each logical partition profile must be manually updated with the new HCA GUID information.
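As a rough illustration of the GUID scheme described above (the function name and the sample base value are assumptions for illustration; real 64-bit GUIDs are assigned by the HCA firmware):

```python
def lhca_guid(base_guid, offset):
    # Every GUID on the fabric must be unique, so each LHCA gets a
    # distinct GUID derived from the HCA's base GUID plus a chosen offset.
    return base_guid + offset

base = 0x0002550010FA3300  # hypothetical base GUID of a replacement HCA
guids = [lhca_guid(base, off) for off in range(1, 5)]
assert len(set(guids)) == 4  # all four LHCA GUIDs are distinct
```

After an HCA replacement, each logical partition profile would be updated with GUIDs derived this way from the new card's base GUID.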
Host channel adapter statistics counter: The statistics counters in the IBM GX host channel adapters (HCAs) are only available with HCAs in System p (POWER6) servers. You can query the counters using Performance Manager functions with the Fabric Viewer and the fast fabric iba_report command.
Table 10. Cables for high-performance computing configurations Comments(feature codes listed in order System or use Cable type Connector type Length - m (ft) Source respective to length) POWER6 4x DDR, copper QSFP - CX4 6 m (passive, 26 QLogic 9125-F2A awg), 10 m (active, 26 awg),...
Related concepts “Management subsystem function overview” on page 13 This information provides an overview of the servers, consoles, applications, firmware, and networks that comprise the management subsystem function. POWER Hypervisor ™ The POWER Hypervisor provides an abstraction layer between the hardware and firmware and the operating system instances.
Related concepts “Management subsystem function overview” This information provides an overview of the servers, consoles, applications, firmware, and networks that comprise the management subsystem function. Management subsystem function overview This information provides an overview of the servers, consoles, applications, firmware, and networks that comprise the management subsystem function.
QLogic provides the following switch and fabric management tools:
- Fabric Manager (from level 4.3 onward, part of the QLogic InfiniBand Fabric Suite (IFS); previously in its own package)
- Fast Fabric Toolset (from level 4.3 onward, part of QLogic IFS; previously in its own package)
- Chassis Viewer
- Switch command-line interface...
Management subsystem overview
The management subsystem in the System p HPC Cluster solution using an InfiniBand fabric loosely integrates the typical IBM System p HPC cluster components with the QLogic components. The management subsystem can be viewed from several perspectives, including:
- Host views
- Networks
- Functional components...
The preceding figure illustrates the use of a host-based Subnet Manager (HSM), rather than an embedded Subnet Manager (ESM) running on a switch. The HSM is used because switches have limited compute resources for running an ESM. If you are using an ESM, the Fabric Manager runs on the switches. The servers are monitored and serviced in the same fashion as for any IBM Power Systems cluster.
Table 12. Management subsystem server, consoles, and workstations (continued)
Service laptop: hosts the serial interface to the switch; server type: laptop; connectivity: RS-232 to the switch; users: switch service provider and system administrator. Note: This is not provided by IBM as part of the...
Table 14. Fabric manager overview
Description: The fabric manager performs the following basic operations:
- Discovers fabric devices
- Configures the fabric
- Monitors the fabric
- Reconfigures the fabric on failure
- Reports problems
The fabric manager has several management interfaces that are used to manage an InfiniBand network.
Table 15. HMC overview (continued) Details How to access Use the HMC console located near the system. There is generally a single keyboard and monitor with a console switch to access multiple HMCs in a rack (if there is a need for multiple HMCs).
Server Operating system: The operating system is the interface with the device drivers. The following table provides an overview of the operating system. Table 18. Operating system overview Operating system details More information Description The operating system is the interface for the device drivers. Documentation Operating system users guide When to use...
Table 20. Fast Fabric Toolset overview (continued) Fast Fabric Toolset Details Documentation Fast Fabric Toolset Users Guide When to use These tools can be used during installation to search for problems. These tools can also be used for health checking when you have degraded performance. Host Fabric management server How to access...
Table 22. Fabric viewer overview (continued) Fabric viewer Details Host Any Linux or Microsoft Windows host. Typically, these hosts would be one of the following items. v Fabric management server v System administrator or operator workstation How to access Start the graphical user interface (GUI) from the server on which you install the fabric viewer, or use a remote window access to start it.
Table 24. Management subsystem networks overview (continued) Type of network Details Public network A local site Ethernet network. Typically this network is attached to the xCAT/MS and Fabric Management Server. Some sites might choose to put the cluster VLAN on the public network.
Figure 7. Vendor log flow to xCAT event management Supported components in an HPC cluster High-performance computing (HPC) clusters are implemented using components that are approved and supported by IBM. For details, see “Cluster information resources” on page 2. The following table indicates the components or units that are supported in an HPC cluster as of Service Pack 10.
Table 25. Supported HPC components (continued)
Operating system:
- AIX 5L: AIX 5.3 at Technology Level 5300-12 with Service Pack 1 (AIX 5.3 is for POWER6 only)
- AIX 6.1: POWER6: AIX Version 6.1 with the 6100-01 Technology Level with Service Pack 1; POWER7: AIX 6L Version 6.1 with the 6100-04 Technology Level with...
Table 25. Supported HPC components (continued)
Hardware Management Console (HMC):
- POWER6: V7R3.5.0M0 HMC with fixes MH01194, MH01197, MH01204, and V7R3.5.0M1 HMC with MH01212 (HMC build level: 20100301.1)
- POWER7: V7R7.1.1 HMC with Fix pack AL710_03

Cluster planning
Plan a cluster that uses InfiniBand technologies for the communications fabric.
The “Cluster planning overview” can be used as a road map through the planning process. If you read through the Cluster planning overview without following the links, you gain an understanding of the overall cluster planning strategy. Then you can follow the links that direct you through the different procedures to gain an in-depth understanding of the cluster planning process.
10. For more hints and tips on installation planning, see "Planning aids" on page 75.
If you have completed all the previous steps, you can plan in more detail by using the planning worksheets provided in "Planning worksheets" on page 76. When you are ready to install the components with which you plan to build your cluster, review information in readme files and online information related to the software and firmware.
Table 27 lists the minimum levels of software and firmware that are associated with an InfiniBand cluster. Table 27. Minimum levels of software and firmware associated with an InfiniBand cluster Software Minimum level AIX 5L(TM) AIX 5L Version 5.3 with the 5300-12 Technology Level with Service Pack 1 AIX 6L(TM) AIX 6L Version 6.1 with the 6100-03 Technology Level with Service Pack 1...
Server planning relative to the fabric requires decisions on the following items.
Table 28. Server types in an HPC cluster
- Compute: Compute servers primarily perform computation and the main work of applications. Typical models: 9125-F2A, 8236-E8C.
- Storage: Storage servers provide connectivity between the InfiniBand fabric and the... Typical models: 8203-E4A, 8204-E8A, 9125-F2A, 8236-E8C.
1. The types and numbers of servers. See "Server planning" on page 29 and "Server types" on page 29.
2. The number of HCA connections in the servers.
3. The number of InfiniBand subnets.
4. The size and number of switches in each InfiniBand subnet. Do not confuse InfiniBand subnets with IP subnets.
numbered leaf modules. Finally, if there are frames with fewer than 12 nodes, try to connect them such that the servers in the same frame are all connected to the same leaf.
- If you only require 4 HCA connections from the servers, for increased availability, you might want to distribute them across two HCA cards and use only every other port on each card.
IO servers require enough fabric connectivity to ensure enough bandwidth between fabrics. Previous implementations using IO servers have used the 9125-F2A to permit up to four connections to one fabric and four connections to another.
Example configurations using only 9125-F2A servers:
This information provides possible configurations that use only 9125-F2A servers.
The following example has (240) 9125-F2As in 20 frames with 8 HCA connections in 8 InfiniBand subnets. You can calculate connections as shown in the following example:
Leaf number = frame number
Leaf connector number = server number in frame
Server number = leaf connector number
Frame number = frame number
HCA number = C(65 + Integer((switch - 1) / 4))
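The mapping above can be sketched in code. This is an illustrative helper only (the function names and the endpoint string formats are assumptions, not part of the QLogic or IBM tooling):

```python
def hca_number(switch):
    """HCA number = C(65 + Integer((switch - 1) / 4)).

    Switches 1-4 map to HCA C65, switches 5-8 to HCA C66.
    """
    return "C{}".format(65 + (switch - 1) // 4)

def cable_connection(frame, server_in_frame, switch):
    """Return (switch-side, server-side) endpoints for one cable.

    Leaf number = frame number; leaf connector number = server number in frame.
    """
    switch_end = "L{}-C{}".format(frame, server_in_frame)
    server_end = hca_number(switch)
    return switch_end, server_end

# Server 3 in frame 2, cabled to subnet switch 5:
print(cable_connection(2, 3, 5))  # ('L2-C3', 'C66')
```

A sketch like this can also generate a full cabling worksheet by looping over frames, servers, and switches.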
Table 30. Example topology -> (240) 9125-F2As in 20 frames with 8 HCA connections in 8 InfiniBand subnets (continued) Frame Server Connector Switch Connector 2 (C66) L2-C1 1 (C65) L2-C2 1 (C65) L2-C2 1 (C65) L2-C2 1 (C65) L2-C2 2 (C66) L2-C2 2 (C66) L2-C2...
Table 30. Example topology -> (240) 9125-F2As in 20 frames with 8 HCA connections in 8 InfiniBand subnets (continued) Frame Server Connector Switch Connector 1 (C65) L20-C12 2 (C66) L20-C12 2 (C66) L20-C12 2 (C66) L20-C12 2 (C66) L20-C12 Fabric management server 1 Port 1 L21-C1 Fabric management server 1...
Table 31. Example topology -> (120) 9125-F2As in 10 frames with 8 HCA connections in 4 InfiniBand subnets (continued) Frame Server Connector Switch Connector 1 (C65) L1-C1 1 (C65) L1-C1 1 (C65) L1-C1 2 (C66) L13-C1 2 (C66) L13-C1 2 (C66) L13-C1 2 (C66) L13-C1...
Table 31. Example topology -> (120) 9125-F2As in 10 frames with 8 HCA connections in 4 InfiniBand subnets (continued) Frame Server Connector Switch Connector 2 (C66) L14-C2 2 (C66) L14-C2 Continue through to the last server in the frame 1 (C65) L2-C12 1 (C65) L2-C12...
Table 31. Example topology -> (120) 9125-F2As in 10 frames with 8 HCA connections in 4 InfiniBand subnets (continued) Frame Server Connector Switch Connector Fabric management server 1 Port 2 L11-C1 Fabric management server 1 Port 1 L11-C1 Fabric management server 1 Port 2 L11-C1 Fabric management server 2...
Table 32. Example topology -> (120) 9125-F2As in 10 frames with 4 HCA connections in 4 InfiniBand subnets (continued) Frame Server Connector Switch Connector 2 (C66) L2-C1 2 (C66) L2-C1 1 (C65) L2-C2 1 (C65) L2-C2 2 (C66) L2-C2 2 (C66) L2-C2 Continue through to the last server in the frame 1 (C65)
The following is an example of (140) 9125-F2As in 10 frames connected to eight subnets. This requires 14 servers in a frame, and therefore a slightly different mapping of leaf to server is used instead of frame to leaf as in the previous examples. You can calculate connections as shown in the following example:
Leaf number = server number in frame
Leaf connector number = frame number
Table 33. Example topology -> (140) 9125-F2As in 10 frames with 8 HCA connections in 8 InfiniBand subnets (continued) Frame Server Connector Switch Connector 2 (C66) L1-C2 2 (C66) L1-C2 1 (C65) L2-C2 1 (C65) L2-C2 1 (C65) L2-C2 1 (C65) L2-C2 2 (C66) L2-C2...
Table 33. Example topology -> (140) 9125-F2As in 10 frames with 8 HCA connections in 8 InfiniBand subnets (continued) Frame Server Connector Switch Connector 1 (C65) L10-C10 1 (C65) L10-C10 2 (C66) L10-C10 2 (C66) L10-C10 2 (C66) L10-C10 2 (C66) L10-C10 Fabric management server 1 Port 1...
You can calculate connections as shown in the following example:
Leaf number = server number in frame
Leaf connector number = frame number
Server number = leaf number
Frame number = leaf connector number
HCA number = For 9125-F2A -> C65 for switch 1-4; C66 for switch 5-8
HCA port = (Remainder of ((switch –
Table 34. Example topology -> (140) 9125-F2As in 10 frames with 8 HCA connections in 8 InfiniBand subnets (continued) Frame Server Connector Switch Connector 1 (C65) L2-C2 1 (C65) L2-C2 1 (C65) L2-C2 2 (C66) L2-C2 2 (C66) L2-C2 2 (C66) L2-C2 2 (C66) L2-C2...
Table 34. Example topology -> (140) 9125-F2As in 10 frames with 8 HCA connections in 8 InfiniBand subnets (continued) Frame Server Connector Switch Connector 2 (C66) L10-C10 2 (C66) L10-C10 2 (C66) L10-C10 Frame of 8203-E4A servers 1 (C8) L1-C11 1 (C8) L1-C11 1 (C8)
There is a backup fabric management server in this example. For maximum availability, the backup is connected to a different leaf from the primary.
Configurations with IO router servers:
This information provides possible configurations using only 9125-F2A compute servers and 8203-E4A storage servers.
Figure 8. Example configuration with IO router servers If you are using 12x HCAs (for example, in a 8203-E4A server), you should review “Planning 12x HCA connections” on page 75, to understand the unique cabling and configuration requirements when using these adapters with the available 4x switches.
Record the cable connection information planned here in the “QLogic and IBM switch planning worksheets” on page 83, for switch port connections and in a “Server planning worksheet” on page 81, for HCA port connections. Planning InfiniBand network cabling and configuration ends here. Planning QLogic or IBM Machine Type InfiniBand switch configuration You can plan for QLogic or IBM Machine Type InfiniBand switch configurations by using QLogic planning resources including general planning guides and planning guides specific to the model being...
- Review the 9240 Users Guide to ensure that you understand which spine slots are used for managed spines. Slots 1, 2, 5, and 6 are used for managed spines. The numbering of spines 1 through 3 is from bottom to top. The numbering of spines 4 through 6 is from top to bottom.
- The total number of management Ethernet addresses is driven by the switch model.
the recipient of the remote logs from the switch. You can only direct logs from a switch to a single remote host (xCAT/MS). “Set up remote logging” on page 112 provides the procedure that is used for setting up remote logging in the cluster. The information planned here can be recorded in a “QLogic and IBM switch planning worksheets”...
Table 36. MTU settings (continued) Cluster type Cluster composition by HCA Switch and SM settings IP MTU Homogeneous ConnectX HCA only (System p Chassis MTU = 2 K (4) HCAs blades) Broadcast MTU = 2 K (4) BC rate = 10 GB (3) for SDR switches, or 20 GB (6) for DDR switches Heterogeneous GX++ DDR HCA in 9125-F2A...
Typically, all but the lowest order byte of the GID-prefix is kept constant, and the lowest byte is the number for the subnet. The numbering scheme typically begins with 0 or 1. The configuration settings for fabric managers can be recorded in the “QLogic fabric management worksheets”...
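The subnet-numbering convention described above can be illustrated with a small helper (a sketch only; 0xFE80... is the standard default GID prefix, but the per-subnet numbering scheme is site-specific):

```python
def gid_prefix_for_subnet(base_prefix, subnet_number):
    # Keep all but the lowest-order byte of the GID prefix constant;
    # the lowest byte becomes the subnet number (typically starting at 0 or 1).
    return (base_prefix & ~0xFF) | subnet_number

base = 0xFE80000000000000
for n in range(1, 5):
    print(hex(gid_prefix_for_subnet(base, n)))
# 0xfe80000000000001 through 0xfe80000000000004
```

Each Fabric Manager instance would then be configured with the prefix for the subnet it manages.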
When using RSCT, there are restrictions to how you can configure Internet Protocol (IP) subnet addressing in a server attached to an InfiniBand network. Note: RSCT is no longer required for IBM Power HPC Clusters. This topic is for clusters that still rely on RSCT for InfiniBand network status monitoring.
- If there is a BPC for power distribution, as in a 24-inch frame, it might provide a hub for the processors in the frame, permitting a single connection per frame to the service VLAN.
After you know the number of devices and cabling of your service and cluster VLANs, you must consider the device IP addressing.
If you have multiple HMCs and are using xCAT, the xCAT Management Server (xCAT/MS) is typically the DHCP server for the service VLAN. If the cluster VLAN is a public or local site network, it is possible that another server might be set up as the DHCP server. It is preferred that the xCAT Management Server be a stand-alone server.
Most details are available in the Fabric Manager and Fabric Viewer Users Guide from QLogic. This information highlights information from a cluster perspective. The Fabric Viewer is intended to be used as documented by QLogic. However, it is not scalable and thus would only be used in small clusters when necessary.
- If you use an embedded Subnet Manager, you might experience performance problems and outages if the subnet has more than 64 IBM GX+ or GX++ HCA ports attached to it. This is because of the limited compute power and memory available to run the embedded Subnet Manager in the switch, and because the IBM GX+ or GX++ HCAs, which can be virtualized, also present themselves as multiple logical devices.
HCA. Instance 1 manages the second subnet, which typically is on the second port of the first HCA. Instance 2 manages the third subnet, which typically is on the first port of the second HCA, and instance 3 manages the fourth subnet, which typically is on the second port of the second HCA.
- Plan for a backup Fabric Manager for each subnet.
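The typical instance-to-port pattern described above can be sketched as follows (an illustration of the convention only, not a QLogic API):

```python
def typical_port_for_instance(instance):
    # Fabric Manager instance i typically manages the subnet on
    # HCA (i // 2 + 1), port (i % 2 + 1):
    # instance 0 -> HCA 1 port 1, instance 1 -> HCA 1 port 2,
    # instance 2 -> HCA 2 port 1, instance 3 -> HCA 2 port 2.
    return instance // 2 + 1, instance % 2 + 1

assert typical_port_for_instance(0) == (1, 1)
assert typical_port_for_instance(3) == (2, 2)
```

The actual assignment is whatever you configure in the Fabric Manager configuration file; this is only the common convention for a four-subnet fabric management server.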
<Sm> <Start>1</Start> <!-- default SM startup for all instances --> . . . <!-- **************** Fabric Routing **************************** --> . . . <Lmc>2</Lmc> <!-- assign 2^lmc LIDs to all CAs (Lmc can be 0-7) --> . . . <!-- **************** IB Multicast **************************** --> <Multicast>...
</Fe> <!-- Common PM (Performance Manager) attributes --> <Pm> <Start>0</Start> <!-- default PM startup for all instances --> . . . </Pm> <!-- Common BM (Baseboard Manager) attributes --> <Bm> <Start>0</Start> <!-- default BM startup for all instances --> . . . </Bm>...
. . . <Priority>0</Priority> <!-- 0 to 15, higher wins --> <ElevatedPriority>8</ElevatedPriority> <!-- 0 to 15, higher wins --> </Sm> . . . </Fm> Instance 2 of the FM. When editing the configuration file, it is recommended that you note the instance in a comment <!-- A single FM Instance/subnet -->...
. . .
</Fm>
</Config>
Plan for remote logging of Fabric Manager events:
- Plan to update /etc/syslog.conf (or the equivalent syslogd configuration file on your Fabric Management Server) to point syslog entries to the Systems Management server. This requires knowledge of the Systems Management Server's IP address.
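A minimal example of such a forwarding entry (the host name is a placeholder, and the exact selector syntax depends on which syslogd your distribution provides):

```
# /etc/syslog.conf on the fabric management server
# Forward all messages to the xCAT/MS (hypothetical host name) in
# addition to any local logging rules already present:
*.*    @xcatms.example.com
```

With this in place, the xCAT/MS can consolidate Fabric Manager log entries with the other vendor logs that feed xCAT event management.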
- You cannot use the message passing interface (MPI) performance tests because they are not compiled for the IBM System p or IBM Power Systems HPC clusters host stack.
- High-Performance Linpack (HPL) in the Fast Fabric Toolset is not applicable to IBM clusters.
- The Fast Fabric Toolset configuration must be set up in its configuration files.
- The 3550 is 1U high and supports two PCI Express (PCIe) slots. It can support a total of four subnets.
- Memory requirements
  - In the following bullets, a node is either a GX HCA port with a single logical partition, or a PCI-based HCA port.
v If you are updating from IFS 4 to IFS 5, then you can review the QLogic Fabric Management Users Guide to learn about the new /etc/sysconfig/qlogic_fm.xml in IFS 5, which replaces the /etc/sysconfig/iview_fm.config file. There are some attribute name changes, including the change from a flat text file to an XML format.
– Consider creating response scripts that are specialized to your environment. For example, you might want to email an account other than root with log entries. See RSCT and xCAT documentation for how to create such scripts and where to find the response scripts associated with Log event anytime, Email root anytime, and LogEventToxCATDatabase, which can be used as examples.
The configuration settings planned here can be recorded in the “xCAT planning worksheets” on page 89. Planning Remote Command Execution with QLogic from the xCAT/MS ends here. Frame planning After reviewing the server, fabric device, and the management subsystem information, you can review the frames in which to place all the devices.
Table 37. Installation responsibilities
Customer responsibilities:
- Install customer setup units (according to server model)
- Update system firmware
- Update InfiniBand switch software, including Fabric Management software
- If applicable, install and customize the fabric management server, including: –
Table 38. Hardware to install and who is responsible for the installation (continued) Hardware to install Who is responsible for the installation InfiniBand switches The switch manufacturer or its designee (IBM Business Partner) or another contracted organization is responsible for installing the switches. If the switches have an IBM machine type and model, IBM is responsible for them.
By breaking down the installation by major subsystem, you can see how to install the units in parallel, or how you might be able to perform some installation tasks for on-site units while waiting for other units to be delivered. It is important that you recognize the key points in the installation where you cannot proceed with one subsystem's installation task before completing the installation tasks in the other subsystem.
- Plan and set up DHCP ranges for each service VLAN.
Important: If these devices and associated services are not set up correctly before applying power to the base servers and devices, you might not be able to correctly configure and control cluster devices. Furthermore, if this is done out of sequence, the recovery procedures for this part of the cluster installation can be lengthy.
Connect switches to the cluster VLAN. If there is more than one VLAN, all switches must be attached to a single cluster VLAN, and all redundant switch Ethernet connections must be attached to the same network. Prerequisites for W3 are M3 and W2. Verify discovery of the switches.
Each organization can use a separate installation worksheet, and the worksheet can be completed by using the flow shown in Figure 11 on page 71. It is good practice for each individual and team participating in the installation to review the coordination worksheet ahead of time and identify their dependencies on other installers.
HPC applications results in four (4) LIDs for each port. The IBM MPI performance gain is realized particularly in the FIFO mode. Consult performance papers and IBM for information about the impact of LMC = 2 on RDMA. The default is to not use LMC = 2, and to use only the first of the 4 available LIDs.
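The relationship between the LMC value and the LIDs available to a port can be shown with a short sketch (illustrative only; actual LID assignment is performed by the Subnet Manager):

```python
def lids_for_port(base_lid, lmc):
    # With LID Mask Control (LMC), a port is assigned 2^lmc consecutive
    # LIDs, so LMC = 2 yields four LIDs per port for multi-path routing.
    return [base_lid + i for i in range(2 ** lmc)]

print(lids_for_port(0x10, 2))  # [16, 17, 18, 19]
```

With the default behavior described above, only the first of those LIDs is used; MPI multi-path exploitation uses the remaining ones.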
Table 41. Planning checklist (continued) Target Completed Step date date Ensure that you have planned for: v Servers v I/O devices v InfiniBand network devices v Frames or racks for servers, I/O devices and switches, and management servers v Service virtual local area network (VLAN), including: –...
Using the planning worksheets The planning worksheets do not cover every situation you might encounter (especially the number of instances of slots in a frame, servers in a frame, or I/O slots in a server). However, they can provide enough information upon which you can build a custom worksheet for your application. In some cases, you might find it useful to create the worksheets in a spreadsheet application so that you can fill out repetitive information.
Table 42. Sample Cluster summary worksheet (continued)
Cluster summary worksheet
Number and models of fabric management servers:
Number of service VLANs:
Service VLAN domains:
Service VLAN DHCP server locations:
Service VLAN InfiniBand switches static IP addresses: (not typical)
Service VLAN HMCs with static IP:
Service VLAN DHCP ranges:
Number of cluster VLANs:
Cluster VLAN security addressed: (yes/no/comments)
Table 43. Example: Completed cluster summary worksheet (continued)
Cluster summary worksheet
Switch partitions:
subnet 1 = FE:80:00:00:00:00:00:00 (egf11fm01)
subnet 2 = FE:80:00:00:00:00:00:01 (egf11fm02)
subnet 3 = FE:80:00:00:00:00:00:00 (egf11fm01)
subnet 4 = FE:80:00:00:00:00:00:01 (egf11fm02)
Number and types of frames (include systems, switches, management servers, Network Installation Management (NIM) servers (AIX), and distribution servers (Linux)):
(8) for 9125-F2A
(1) for switches and fabric management servers
You must know the quantity of each device type, including, server, switch, and bulk power assembly (BPA). For the slots, you can indicate the range of slots or drawers that the device populates. A standard method for naming slots can either be found in the documentation for the frames or servers, or you can choose to use EIA heights (1.75 in.) as a standard.
Table 46. Example: Completed frame and rack planning worksheet (2 of 3) Frame planning worksheet (2 of 3) Frame number or numbers: _______10______________ Frame machine type and model number: _____________________ Frame size: ____19___________ (19 in. or 24 in.) Number of slots: ______4_____________ Slots Slots Device type (server, switch, BPA)
Table 48. Sample Server planning worksheet
Server planning worksheet
Names: _____________________________________________
Types: ______________________________________________
Frame or frames / slot or slots: ____________________________
Number and type of HCAs: _________________________________
Number of LPARs or LHCAs: ____________________________________
IP addressing for InfiniBand: __________________
Partition with service authority: ____________________________________
IP addressing of service VLAN: _____________________________________________________
IP addressing of cluster VLAN: ________________________________________________
LPAR IP addressing: ____________________________________________________________
Table 49. Example: Completed server planning worksheet Server planning worksheet Names: __________egf01n01 – egf08n12_______________________ Types: _________9125-F2A____________________ Frame or frames/slot or slots: _______1-8/1-12_________________________________ Number and type of HCAs___(1) IBM GX+ per 9125-F2A____________________ Number of LPARs or LHCAs: ___1/4_________________________________ IP-addressing for InfiniBand: _______10.1.2.32-10.1.2.128 10.1.3.32-10.1.3.128 10.1.4.x 10.1.5.x___ Partition with service authority: ____________Yes________________________ IP-addressing of service VLAN: _10.0.1.32-10.1.1.128;...
It might also be useful to note the IBM location code for this HCA port. You can get the location code information specific to each server in the server documentation during the planning process. Or you can work with the IBM service representative at the time of the installation to make the correct notation of the IBM location code.
Table 50. Sample QLogic 24-port switch planning worksheet (continued) 24-port switch worksheet Planning worksheet for switches with more than 24 ports: Use these worksheets for planning switches with more than 24 ports (ones with leafs and spines). The first worksheet is for the overall switch chassis planning. The second worksheet is planning for each leaf.
Table 52. Sample: Planning worksheet for Director or core switch with more than 24 ports - leaf configuration Leaf _____ Leaf ____ Ports Connection Ports Connection The following worksheets are examples of the switch planning worksheets. Table 53. Example: Planning worksheet for Director or core switch with more than 24 ports Director or Core Switch (greater than 24 ports) (1 of 4) Switch Model: ____9140_________________________ Switch name: _____egsw01_______________________ (set by using setIBNodeDesc)
Table 54. Example: Planning worksheet for Director or core switch with more than 24 ports - leaf configuration (2 of 4) Leaf __1___ Leaf __2__ Ports Connection Ports Connection f01n01-C65-T1 f02n01-C65-T1 f01n02-C65-T1 f02n02-C65-T1 f01n03-C65-T1 f02n03-C65-T1 f01n04-C65-T1 f02n04-C65-T1 f01n05-C65-T1 f02n05-C65-T1 f01n06-C65-T1 f02n06-C65-T1 f01n07-C65-T1 f02n07-C65-T1 f01n08-C65-T1...
Table 56. Example: Planning worksheet for Director or core switch with more than 24 ports (continued) Switch Model: ____9140_________________________ Switch name: _____egsw04_______________________ (set by using setIBNodeDesc) xCAT Device/Node name:_______xCAT 123____________ Frame and slot: ____f10s04________________________ Chassis IP addresses: _________10.1.1.13___________________________________________ (9240 has 2 hemispheres) Spine IP addresses: _____slot1=10.1.1.19;...
Table 58. Example: Planning worksheet for Director or core switch with more than 24 ports - leaf configuration (continued) Leaf __7___ Leaf __8__ f07n05-C65-T4 f08n05-C65-T4 f07n06-C65-T4 f08n06-C65-T4 f07n07-C65-T4 f08n07-C65-T4 f07n08-C65-T4 f08n08-C65-T4 f07n09-C65-T4 f08n09-C65-T4 f07n10-C65-T4 f08n10-C65-T4 f07n11-C65-T4 f08n11-C65-T4 f07n12-C65-T4 f08n12-C65-T4 xCAT planning worksheets Use the xCAT planning worksheet to plan for your xCAT management servers.
Table 59. xCAT planning worksheet (continued) nodetype = FabricMS Node names or addresses of Fabric/MS: ___________________________________ Node groups for Fabric/MS: ____________________________________________ Primary Fabric/MS for data collection: The following worksheet is an example of a completed xCAT planning worksheet. Table 60. Example: Completed xCAT planning worksheet xCAT Planning Worksheet xCAT/MS Name: _______egxCAT01____________________________________ xCAT/MS IP addresses: service VLAN:___10.0.1.1 10.0.2.1________________ Cluster VLAN: __10.1.1.1___...
Table 61. xCAT event monitoring worksheet xCAT Event Monitoring worksheet syslog or syslog-ng or other: ___________________________________ Accept logs from IP address (0.0.0.0): ___________________________ (yes=default) Fabric management server logging: TCP or UDP? ___________ port: _______ (514 default) Fabric management server IP addresses: ________________________________ Switch logging is UDP protocol: port: __________________ (514 default) Switch chassis IP address: __________________________________________ ______________________________________________________________...
QLogic fabric management worksheets Use this worksheet to plan QLogic Fabric Management. This worksheet highlights information that is important for management subsystem integration in high-performance computing (HPC) clusters with an InfiniBand network. It is not intended to replace the planning instructions found in the QLogic Installation and Planning Guides. To plan thoroughly for QLogic Fabric Management, complete the following worksheets.
Table 64. Example: Completed General QLogic Fabric Management worksheet (continued) Host-based or embedded SM: _____Host-based____________________ LMC: __2___ (2 is preferred) MTU: Chassis: ___4096__________ Broadcast: ___4096___ MTU rate for broadcast: _____4096______ Fabric management server names and addresses on cluster VLAN: _____egf11fm01; egf11fm02__________________________ _____________________________________________________________________________________________ Embedded Subnet Manager Switches: ______Not applicable______________________________________...
Table 65. Embedded Subnet Manager worksheet (continued) Embedded Subnet Manager worksheet Tivoli Event Services Manager or HSM to be used? ___________ Notes: The following worksheet is used to plan fabric management servers. A separate worksheet can be filled out for each server. It is intended to highlight information that is important for management subsystem integration in HPC clusters with an InfiniBand network.
Table 66. Fabric management server worksheet (continued) Fabric management server worksheet (one for each server) Backup switch/Priority Backup switch/Priority Fast Fabric Toolset Planning Host-based or embedded SM? ___________________________________ (for FF_ALL_ANALYSIS) List of switch chassis: _________________________________________ __________________________________________________________ List of switches running embedded SM: (if applicable) _____________________________ ______________________________________________________________________ Subnet connectivity planning is in the previous Subnet Management planning worksheet.
Table 67. Example: Completed fabric management server worksheet (continued) Fabric management server worksheet (one for each server) Broadcast MTU (put rate in parentheses): 5 (4096) 5 (4096) 5 (4096) 5 (4096) node_appearance_msg_thresh: Primary switch/Priority Backup switch/Priority Backup switch/Priority Backup switch/Priority Fast Fabric Toolset Planning Host-based or embedded SM? _______Host-based________________________________ (for FF_ALL_ANALYSIS)
a. Complete “Site setup for power, cooling, and floor” on page 98 b. Complete “Installing and configuring the management subsystem” on page 98 c. Complete “Installing and configuring the cluster server hardware” on page 123 d. Complete “Installing the operating system and configuring the cluster servers” on page 127 e.
Table 68. Cluster expansion or partial installation determination (continued) Scenarios: Adding InfiniBand hardware to an existing cluster; Adding new servers to an existing InfiniBand network; Adding HCAs to an existing InfiniBand network; Adding a subnet to an existing InfiniBand network; Adding servers and a subnet to an existing cluster...
The Management subsystem installation and configuration encompass major tasks M1 through M4 as shown in Figure 11 on page 71. This is the most complex area of a high-performance computing (HPC) cluster installation. It is affected by, and affects, other areas (such as server installation and switch installation). Many tasks can be performed simultaneously, while others must be done in a particular order.
Tasks have two reference labels to help cross-reference them between figures and procedures. The first is from Figure 12 and the second is from Figure 11 on page 71. For example, E1 (M1) indicates task label E1 in Figure 12 and task label M1 in Figure 11 on page 71. Steps that have a shaded background are steps that are performed under “Installing and configuring vendor or IBM InfiniBand switches”...
Installing and configuring the management subsystem for a cluster expansion or addition The tasks for expanding an existing cluster are different from the tasks for a new installation. This information is used when you want to expand an existing cluster. If you are adding or expanding InfiniBand network capabilities to an existing cluster, then you might approach the management subsystem installation and configuration differently than with a new cluster installation.
Table 69. Impact of cluster expansions (continued) Scenario Effects Adding servers and a subnet to an existing InfiniBand v Cable to InfiniBand switches service subsystem network Ethernet ports v Cable to servers service subsystem Ethernet ports v Build operating system update mechanisms for new servers without removable media v Might require additional HMCs to accommodate the new servers.
– You have more than one HMC. – You have opted to install xCAT and CRHS in anticipation of future expansion. To install the HMC, complete the following steps. Note: Tasks have two reference labels to help cross-reference them between figures and procedures. The first is from Figure 12 on page 100 and the second is from Figure 11 on page 71.
6. H5 (M2) - Return to the HMC installation documentation and finish the installation and configuration procedures. However, do not attach the HMC cables to the service VLAN until instructed to do so in step 9 of this procedure. After finishing those procedures, continue with step 7. 7.
5. CM4 (M4) - Start the DHCP server on the xCAT/MS, or if applicable, on a separate DHCP server. Other installation tasks for servers and management consoles that require DHCP service from xCAT/MS are blocked until this step is complete. 6. It is a good practice to enter the configuration information for the server in its /etc/motd. Use the information from the “xCAT planning worksheets”...
The fabric management server provides the following two functions that are installed and configured in this procedure. v Host-based Fabric Manager function v Fast Fabric Toolset Note: This procedure is written from the perspective of installing a single fabric management server. Using the instructions in the Fast Fabric Toolset Users Guide, you can use the ftpall command to copy common configuration files from the first Fabric Management Server to other fabric management servers.
a. Configure the Fast Fabric Toolset according to the instructions in the Fast Fabric Toolset Users Guide. When configuring the Fast Fabric Toolset, consider the following application of Fast Fabric within high-performance computing (HPC) clusters. v The master node referred to in the Fast Fabric Toolset Users Guide is considered to be the Fast Fabric Toolset host in IBM HPC clusters.
d. Ensure that tcl and Expect are installed on the Fabric Management Server. They must be at least at the following levels; you can check by using the rpm -qa | grep expect and rpm -qa | grep tcl commands. v expect-5.43.0-16.2 v tcl-8.4.12-16.2 v For IFS 5, tcl-devel-8.4.12-16.2 is also required e.
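As a quick sketch of this level check, the following compares a sample `rpm -qa` listing against the required expect and tcl levels. The sample listing is embedded here for illustration only; in practice you would pipe the real `rpm -qa` output in.

```shell
# Sketch: verify that expect and tcl rpms are at the required levels.
# The "installed" variable stands in for real `rpm -qa` output (assumption).
required="expect-5.43.0 tcl-8.4.12"

installed="expect-5.43.0-16.2
tcl-8.4.12-16.2
tcl-devel-8.4.12-16.2"

for req in $required; do
  name=${req%-*}                      # package name, e.g. "expect"
  if echo "$installed" | grep -q "^${req}"; then
    echo "$name: OK"
  else
    echo "$name: below required level or missing"
  fi
done
```

On a real fabric management server, replace the sample listing with `installed=$(rpm -qa | egrep "expect|tcl")`.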
1) For MTU, use the value planned in “Planning maximum transfer unit (MTU)” on page 51: <MTU>4096</MTU> 2) For MTU rate, use the value planned in “Planning maximum transfer unit (MTU)” on page 51. The following example is for an MTU rate of 20 g. <Rate>20g</Rate> c.
2) Configure the name for the FM instance. You might use this name for referencing the instance. The FM also uses this name when creating log entries for this instance. The following example uses “ib0”. <Name>ib0</Name> <!-- also for logging with _sm, _fe, _pm, _bm appended --> 3) Configure the HCA in the fabric management server to be used to reach the subnet that is managed by this instance of FM.
Run iba_report against each port in the /etc/sysconfig/iba/ports file. For example: v iba_report -h 1 -p 1 | grep SW v iba_report -h 2 -p 2 | grep SW c. Verify correct security configuration for switches by ensuring that each switch has the required username/password enabled.
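The per-port reporting step above can be scripted. This sketch assumes the Fast Fabric ports file uses space-separated hca:port entries (verify the format against your Fast Fabric Toolset Users Guide); the commands are echoed here rather than executed, and a sample ports file stands in for /etc/sysconfig/iba/ports.

```shell
# Sketch: run iba_report for every HCA:port entry in the ports file.
# The "h:p" entry format is an assumption based on the Fast Fabric
# convention; commands are echoed instead of executed in this sketch.
portsfile=./ports.sample                     # normally /etc/sysconfig/iba/ports
printf '1:1 1:2 2:1 2:2\n' > "$portsfile"    # sample contents

for entry in $(grep -v '^#' "$portsfile"); do
  h=${entry%%:*}   # HCA number
  p=${entry##*:}   # port number
  echo "iba_report -h $h -p $p | grep SW"
done
```

Remove the echo (and the grep SW filter, if you want full output) to run the reports for real.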
This procedure ends here. Set up remote logging Remote logging to xCAT/MS helps you monitor clusters by consolidating logs to a central location. This procedure involves setting up remote logging from the following locations to the xCAT/MS. v To set up remote logging for a fabric management server, continue with step 2 in: For xCAT/MS: “Remote syslogging to an xCAT/MS”...
If the xCAT/MS is running the AIX operating system, go to Remote Syslogging and Event Management for xCAT on AIX. After finishing the event management setup, proceed to step 2 on page 117. If the xCAT/MS is running the Linux operating system, go to Remote Syslogging and Event Management for xCAT on Linux.
6) Wait approximately 2 minutes and check the /etc/syslog.conf file. The sensor might have placed the following line in the file. The default cycle for the sensor is to check the files every 60 seconds. The first time it runs, it recognizes that it must set up the syslog.conf file with the following entry: local6.notice /var/log/xcat/syslog.fabric.notices...
2) Log entries with a priority (severity) of INFO or lower are logged to the default location of /var/log/messages. i. Edit the /etc/syslog-ng/syslog-ng.conf file. ii. Add the following lines to the end of the file. # Fabric Notices from local6 into a FIFO/named pipe filter f_fabnotices { facility(local6) and level(notice, alert, warn, err, crit) and not filter(f_iptables); };...
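For reference, a complete syslog-ng fragment around the f_fabnotices filter might look like the following sketch. The destination pipe path matches the /var/log/xcat/syslog.fabric.notices named pipe used elsewhere in this procedure; the source name src is an assumption and must match the source statement already defined in your syslog-ng.conf.

```
# Sketch of a complete syslog-ng fragment (the source name "src" is an
# assumption; use the source defined in your syslog-ng.conf):
filter f_fabnotices { facility(local6) and level(notice, alert, warn, err, crit)
                      and not filter(f_iptables); };
destination fabnotices_fifo { pipe("/var/log/xcat/syslog.fabric.notices"
                              group(root) perm(0644)); };
log { source(src); filter(f_fabnotices); destination(fabnotices_fifo); };
```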
f. If you get an error back from monerrorlog indicating a problem with syslog, there is probably a typographical error in the /etc/syslog-ng/syslog-ng.conf file. The message includes syslog in the error message, similar to: monerrorlog: * syslog * Note: The * is a wildcard. 1) Look for the typographical error in the /etc/syslog-ng/syslog-ng.conf file by reviewing the previous steps that you have taken to edit the syslog-ng.conf file.
3) If you want to create any other response scripts, use a similar format for the startcondresp command after creating the appropriate response script. For details, refer to the xCAT Reference Guide and RSCT Reference Guide. Proceed to step 2. 2.
3) In either case, ensure that all priority logging levels with a severity above INFO are set to log. Verify this by using the logShowConfig command on the switch command line or by using the Chassis Viewer to look at the log configuration. If you must turn on INFO entries, use the following methods: v On the switch command line, use the logConfigure command and follow the instructions on screen.
v Use the procedure in “Problem with event management or remote syslogging” on page 226. Recall that you were using the logger command such that the Fabric Management Server would be the source of the log entry. f. Check the /var/log/xcat/syslog.fabric.info file and verify that both the Notice entry and the INFO entry are in the file.
Using syslog on RedHat Linux-based xCAT/MS: Use this procedure to set up syslog to direct log entries from the fabric management server and switches. Note: Do not use this procedure unless you were directed here from another procedure. If the version of Linux on the xCAT/MS uses syslog instead of syslog-ng, use the following procedure to set up syslog to direct log entries from the fabric management server and switches instead of the one documented in Remote Syslogging and Event Management for xCAT on Linux.
Note: The following method is just one of several methods by which you can set up remote command processing to a fabric management server. You can use any method that meets your requirements. For example, you can set up the Fabric Management Server as a node. By setting it up as a device rather than a node, you might find it easier to group it differently from the IBM servers.
# Note: the command output must be a numeric value in the last line. # e.g. # hello world! post-command=showLastRetcode -brief b. Add each switch to /etc/hosts: [IP address] [hostname] c. Ensure that you are using ssh for xdsh, and that you have run the command: chtab key=useSSHonAIX site.value=yes d.
To install and configure servers with management consoles, complete the following steps. M4 - Final configuration of management consoles: This procedure is performed in “Installing and configuring the cluster server hardware” during the steps associated with S3 and M4. The following procedure is intended to provide an overview of what is done in that procedure.
If you are adding or expanding InfiniBand network capabilities to an existing cluster by adding servers to the cluster, then you must approach the server installation and configuration a little differently than with a new cluster installation. The flow for server installation and configuration is based on a new cluster installation, but it indicates where there are variances for expansion scenarios.
– For POWER5: IBM Systems Information Center → Initial server setup. Procedures for installing the GX InfiniBand host channel adapters are also available in the IBM Systems Hardware Information Center; click IBM Systems Hardware Information Center → Installing hardware. b.
v For POWER5: IBM Systems Information Center → Initial server setup. Procedures for installing the GX InfiniBand host channel adapters are also available in the IBM Systems Hardware Information Center; click IBM Systems Hardware Information Center → Installing hardware. c.
Note: Typically, the IBM service representative's responsibility ends here for IBM service installed frames and servers. From this point forward, after the IBM service representative leaves the site, if any problem is found in a server, or with an InfiniBand link, a service call must be placed. The IBM service representative would recognize that the HCA link interface and InfiniBand cables have not been verified, and are not verified until the end of the procedure for InfiniBand network verification, which might be performed by either the customer or a non-IBM vendor.
Table 71. Effects on cluster installation when expanding existing clusters Scenario Effects Adding InfiniBand hardware to an existing cluster (switches v Configure the logical partitions to use the HCAs. and host channel adapters (HCAs)) v Configure HCAs for switch partitioning. Adding new servers to an existing InfiniBand network v Perform this procedure as if it were a new cluster installation.
2. S7 - After the servers are connected to the cluster VLAN, install and update the operating systems. If servers do not have removable media, you must use an AIX network installation management (NIM) server or Linux distribution server to load and update the operating systems. Note: In order to use ml0 with AIX 5.3, you must install the devices.common.IBM.sni.ml file set.
“Installing the fabric management server” on page 105. For embedded Subnet Managers, see “Installing and configuring vendor or IBM InfiniBand switches” on page 137. The subnet managers must be running before you start to configure the interfaces in the partitions. If the commands start failing and lsdev | grep ib reveals that devices are Stopped, it is likely that the subnet managers are not running.
v Verify that the following is set to -1: cat /sys/module/ib_ehca/parameters/nr_ports 5) On the management server, run updatenode for each partition: updatenode lpar otherpkgs,configiba. Set up DNS: If the xCAT management server provides DNS service, the following procedure can be used. 1) The IP address entries for IB interfaces in /etc/hosts on xCAT managed nodes should have the node short host name and the unique IB interface name in them.
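A sketch of generating such /etc/hosts entries follows. The node names and the 10.1.[subnet].[host] addressing scheme are assumptions taken from the example planning worksheets earlier in this document, not a fixed convention; substitute the ranges from your own server planning worksheet.

```shell
# Sketch: emit /etc/hosts lines pairing each node's short host name with
# a per-interface name (nodename-ibX). Addressing scheme is an assumption
# modeled on the example worksheets (10.1.2.32 and up, one subnet per ibX).
for node in 1 2; do
  for ib in 0 1 2 3; do
    echo "10.1.$((ib + 2)).$((node + 31)) egf01n0${node}-ib${ib} egf01n0${node}"
  done
done
```

The first emitted line is `10.1.2.32 egf01n01-ib0 egf01n01`; append the generated lines to /etc/hosts on the xCAT managed nodes.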
5. S7 - Verify InfiniBand adapter configuration a. If you are running a host-based Subnet Manager, to check multicast group creation, on the Fabric Management Server run the following commands. Remember that, for some commands, you must provide the HCA and port through which the Subnet Manager connects to the subnet. For IFS 5, complete the following steps: 1) Check for multicast membership.
ib3 65532 ib4* 65532 ib5 65532 ib6 65532 ib7 65532 ml0 65532 lo0 16896 lo0 16896 Note: If you have a problem where the MTU value is not 65532, you must follow the recover procedure in “Recovering ibX interfaces” on page 235. For Linux partitions: 1) Verify that the IPoIB process starts.
10.0.2.0 0.0.0.0 255.255.255.0 0 ib1 10.0.3.0 0.0.0.0 255.255.255.0 0 ib2 169.254.0.0 0.0.0.0 255.255.0.0 0 eth0 127.0.0.0 0.0.0.0 255.0.0.0 0 lo 0.0.0.0 9.114.28.126 0.0.0.0 0 eth0 6. After the servers are up and running, xCAT is installed, you can dsh/xdsh to the servers, and you have verified the adapter configuration, map the HCAs.
for i in `lsdev | grep Infiniband | awk '{print $1}' | egrep -v "iba|icm"`
do
echo $i
lsattr -El $i | egrep "super"
done
Note: To verify a single device (such as ib0), run the command lsattr -El ib0 | egrep "mtu|super"
1. Confirm that the rpms listed in the following table are installed by using the rpm command, as in the following example: [root on c697f1sq01][/etc/sysconfig/network] => rpm -qa | grep -i ofed Refer to the notes at the end of the table. The table indicates which libraries apply for Galaxy1/Galaxy2 HCAs versus Mellanox-based HCAs;...
libraries exist on the system. For a user who needs both these IB commands and the 64-bit libraries, install both the 32-bit and 64-bit library packages. 2. If the previous rpms have not yet been installed, do so now. Use the instructions from the documentation provided with RedHat.
Installing and configuring the InfiniBand switch Use this procedure to install and configure InfiniBand switches. It is possible to perform some of the tasks in this procedure by a method other than the one described. If you have other methods for configuring switches, you must review a few key points in the installation process that are related to the order and coordination of tasks and configuration settings that are required in a cluster environment.
v For QLogic switch command help, on the command-line interface (CLI), use the help <command name> command. Otherwise, the Users Guides provide information about the commands and identify the appropriate command in their procedural documentation. v For new InfiniBand switches, perform all the steps in the following procedure on the new InfiniBand switches.
simple query command or ping test to the switch. For example, the pingall command can be used as long as you point to the switch chassis and not the servers or nodes. 8. W5 - Verify that the switch code matches the latest supported level indicated in IBM Clusters with the InfiniBand Switch website referenced in “Cluster information resources”...
b. Set the broadcast MTU value according to the installation plan. See the switch planning worksheet or “Planning maximum transfer unit (MTU)” on page 51. c. If you have connected or will be connecting cables to 9125-F2A servers, configure the amplitude and pre-emphasis settings as indicated in the “Planning QLogic or IBM Machine Type InfiniBand switch configuration”...
4) For each port that is unique to a particular switch, run the ismPortSetDdrAmplitude command as shown above, but either log on to the switch or add the -H [switch chassis ip address] parameter to the cmdall command, so that it directs the command to the correct switch.
4) For each port that is unique to a particular switch, run the ismPortSetDdrPreemphasis command as shown above, but either log on to the switch or add the -H [switch chassis ip address] parameter to the cmdall command, so that it directs the command to the correct switch.
Cabling the InfiniBand network information for expansion If you are adding or expanding InfiniBand network capabilities in an existing cluster, then you might approach cabling the InfiniBand network differently than with a new cluster installation. The flow for cabling the InfiniBand network is based on a new cluster installation, but it indicates where there are variances for expansion scenarios.
IFS 5, use the qlogic_fm start command as directed in “Installing the fabric management server” on page 105. Contact the person installing the Fabric Management Server and indicate that the Fabric Manager might not be started on the Fabric Management Server. 7.
v If you find a problem with a link that might be caused by a faulty HCA or cable, contact your service representative for repair. v This is the final procedure in installing an IBM System p cluster with an InfiniBand network. The following procedure provides additional details that can help you perform the verification of your network.
d. After running the fabric verification tool, perform the checks recommended in “Fabric verification” on page 150. 3. After fixing the problems, run the Fast Fabric tool baseline health check one more time. This can be used to help monitor fabric health and diagnose problems. Use the /sbin/all_analysis -b command. 4.
b. Obtain or record the GUID index and capability settings in the logical partition profiles that use the HCA by using the following steps. 1) Go to the Systems Management window. 2) Select the Servers partition. 3) Select the server in which the HCA is installed. 4) Select the partition to be configured.
Note: If the following message occurs when you attempt to assign a new unique GUID, you might be able to recover from this error without the help of a service representative. A hardware error has been detected for the adapter U787B.001.DNW45FD-P1-Cx.
Verifying the installed InfiniBand network (fabric) in AIX Verify the installed InfiniBand network (fabric) after the InfiniBand network is installed. The GX adapters and the network fabric must be verified through the operating system. Use this procedure to check the status of a GX host channel adapter (HCA) by using the AIX operating system.
4. Perform verification by completing the following steps. a. Run the fabric verification application. b. Look for events revealing fabric problems. c. Run a health check. Repeat steps 3 on page 150 and 4 until no problems are found in the fabric. Fabric verification procedure Use this procedure for fabric verification.
Cluster Fabric Management Use this information to learn about the activities, applications, and tasks required for cluster fabric management. This information is more about theory and best practice than detailed procedures. Documents referenced in this section can be found in “Cluster information resources” on page 2. This chapter is broken into the following sections.
Remote logging and event management is used to consolidate logs and serviceable events from the many components in a cluster in one location - the xCAT Management Server (xCAT/MS). To set this up, see “Set up remote logging” on page 112. For more information about how to use this monitoring capability see “Monitoring fabric logs from the xCAT Cluster Management server”...
[Flattened table: for each event (Fabric M/S 1 recovers; administrator issues the restore priority command on Fabric M/S 2), the current priority of SM_0 on Fabric M/S 1, the current priority of SM_0 on Fabric M/S 2, and the current master.] QLogic fast fabric toolset The Fast Fabric Toolset is a suite of management tools from QLogic.
v It can query only subnets to which the fabric management server on which it is running is connected. If you have more than four subnets, you must work with at least two different Fabric Management Servers to get to all subnets. v You must update the chassis configuration file with the list of switch chassis in the cluster.
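As an illustration, the chassis configuration file can be generated from the switch planning worksheet. The /etc/sysconfig/iba/chassis path and one-hostname-per-line format follow the Fast Fabric FF_CHASSIS_FILE convention, which you should confirm in the Fast Fabric Toolset Users Guide; the switch names here come from the example worksheets.

```shell
# Sketch: build the Fast Fabric chassis file from the planning worksheet.
# Path and format are assumptions based on the FF_CHASSIS_FILE convention.
chassisfile=./chassis.sample           # normally /etc/sysconfig/iba/chassis
for sw in egsw01 egsw02 egsw03 egsw04; do
  echo "$sw"                           # one switch chassis name per line
done > "$chassisfile"

wc -l < "$chassisfile"                 # prints 4: one line per chassis
```

Once the file is in place, chassis-wide Fast Fabric commands (cmdall, pingall) can address every switch in the cluster without listing them on the command line.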
Table 76. Cluster fabric management tasks (continued) Task Reference Monitor for general problems “Monitoring the fabric for problems” Monitor for fabric-specific problems “Monitoring fabric logs from the xCAT Cluster Management server” Manually querying status of the fabric “Querying status” on page 174 Scripting to QLogic management tools and switches “Remotely accessing QLogic management tools and commands from xCAT/MS”...
If the Email root anytime response is enabled, then the fabric logs go to the root account. These might also be interpreted by using the “Table of symptoms” on page 187. If the LogEventToxCATDatabase response is enabled, then references to the fabric logs would be in the xCAT database.
v Periodically to monitor the fabric (For more information, see “Setting up periodic fabric health checking”): /sbin/all_analysis Note: The LinkDown counter in the IBM GX+/GX++ HCAs is reset as soon as the link goes down. This is part of the recovery procedure. While this is not optimal, the connected switch port's LinkDown counter provides an accurate count of the number of LinkDowns for the link.
threshold files must be generated based on the amount of time since the most recent clearing of link errors. Therefore, it is also important to create a cronjob (or some other method) to periodically clear port error counters, so that you can determine which threshold file to use at any given time that all_analysis, fabric_analysis, or iba_report -o errors is run.
PortXmitDiscards PortXmitConstraintErrors PortRcvConstraintErrors LocalLinkIntegrityErrors ExcessiveBufferOverrunErrors VL15Dropped Note: The PortRcvSwitchRelayErrors are commented out such that they are never reported. This is because of a known problem in the switch chip that causes this error counter to incorrectly increment. The preferred substitute for iba_mon.conf follows. You can create this by first renaming the default iba_mon.conf that is shipped with Fast Fabric to iba_mon.conf.original.
Threshold = (Threshold for 24 hours) * (Number of hours since last clear) / 24 However, the threshold used must never be lower than the minimum threshold for the error counter. Also, always round up to the next highest integer. Always set the threshold for PortRcvErrors equal to or less than PortRcvPhysicalRemoteErrors, because PortRcvErrors is incremented for PortRcvPhysicalRemoteErrors, too.
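The scaling rule above can be sketched as a small shell helper. The function name scaled_threshold is illustrative, not part of the Fast Fabric Toolset; it applies the formula, rounds up, and enforces the per-counter minimum exactly as described.

```shell
# Sketch: scale a 24-hour error-counter threshold to the hours elapsed
# since the last counter clear, rounding up and never going below a
# per-counter minimum (both rules from the text above).
scaled_threshold() {
  # $1 = 24-hour threshold, $2 = hours since last clear, $3 = minimum
  awk -v t24="$1" -v h="$2" -v min="$3" 'BEGIN {
    t = t24 * h / 24
    t = (t == int(t)) ? t : int(t) + 1   # round up to next integer
    if (t < min) t = min                 # enforce the minimum threshold
    print t
  }'
}

scaled_threshold 100 12 3   # 100/day at hour 12 -> prints 50
scaled_threshold 10 1 3     # ceil(10*1/24)=1, raised to minimum -> prints 3
```

The computed values would then be written into the appropriate iba_mon.conf.[time period] file for that hour.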
you must reference these files with the all_analysis script command; name them based on the time period in which they would be used, such as iba_mon.conf.[time period]. 3. Edit the file to update the symbol errors threshold to the value in Table 77 on page 161. For example, the following shows the default setting for SymbolErrorCounter and the setting for hour 12 in the file /etc/sysconfig/iba/iba_mon.conf.12.
The default port error counter thresholds are defined in the /etc/sysconfig/iba/iba_mon.conf file, which must be configured for each interval's threshold. Then, cronjobs must be set up that reference these configuration files. 1. Save the original file: cp -p /etc/sysconfig/iba/iba_mon.conf /etc/sysconfig/iba/iba_mon.conf.original 2.
15 * * * * /sbin/iba_reports -o errors -F "nodepat:SilverStorm*" -c /etc/sysconfig/iba/iba_mon.conf.low > [output directory]/errors.`/bin/date +"%Y%m%d_%H%M"`
Note: A more sophisticated method is to call a script that calculates the amount of time that has passed since the most recent error counter clear, so that you can call that script without the requirement to reference specific instances of iba_mon.conf.
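One way to sketch that more sophisticated method: record the time of the most recent counter clear in a timestamp file (an assumption; the toolset does not provide one, so your clearing cronjob would have to touch it) and derive which iba_mon.conf.[hours] file to reference. GNU date/touch options are assumed here.

```shell
# Sketch: pick the iba_mon.conf.<hours> file matching the hours elapsed
# since the last counter clear. The timestamp file is an assumption; the
# cronjob that clears counters would "touch" it after each clear.
stampfile=./last_clear.stamp
touch -d '13 hours ago' "$stampfile"   # simulate a clear 13 hours ago

now=$(date +%s)
last=$(date -r "$stampfile" +%s)       # mtime of the timestamp file
hours=$(( (now - last) / 3600 ))
[ "$hours" -lt 1 ] && hours=1          # clamp into the 1..24 file range
[ "$hours" -gt 24 ] && hours=24

conf="/etc/sysconfig/iba/iba_mon.conf.${hours}"
echo "iba_reports -o errors -c $conf"  # echoed, not executed, in this sketch
```

A cron entry could call this script hourly in place of the fixed iba_mon.conf.low reference shown above.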
– fabric*.errors - Record the location of the problem and see “Diagnosing link errors” on page 210 – chassis*.errors - Record the location of the problem and see “Table of symptoms” on page 187. – *.diff – indicates that there is a difference from the baseline to the latest health check run. See “Interpreting health check .diff files”...
latest/esm.*.diff - If the FF_ESM_CMDS file has been modified, review the changes in results for those additional commands. As necessary, correct the SM. After being corrected, rerun the health checks to look for further errors. If the change was expected and permanent, rerun a baseline when all other health check errors have been corrected.
latest/chassis.fwVersion.[changes|diff] - This file indicates the chassis firmware version has changed. If this was not an expected change, correct the chassis firmware before proceeding further. After being corrected, rerun the health checks to look for further errors. If the change was expected and permanent, rerun a baseline when all other health check errors have been corrected.
165 of 165 Fabric Links Checked

Links Expected but Missing, Duplicate in input or Incorrect:
159 of 159 Input Links Checked

Total of 6 Incorrect Links found
0 Missing, 6 Unexpected, 0 Misconnected, 0 Duplicate, 0 Different
-------------------------------------------------------------------------------

The following table summarizes possible issues found in .changes files:
Table 78.
Table 78. Possible issues found in health check .changes files (continued) Issue Description and possible actions Incorrect Link This applies only to links and indicates that a link is not connected properly. This must be fixed. It is possible to find miswires by examining all of the Misconnected links in the fabric.
Table 78. Possible issues found in health check .changes files (continued) Issue Description and possible actions Missing This indicates an item that is in the baseline is not in this instance of health check output. This might indicate a broken item or a configuration change that has removed the item from the configuration.
Table 78. Possible issues found in health check .changes files (continued) Issue Description and possible actions Port Attributes Inconsistent This indicates that the attributes of a port on one side of a link have changed, such as PortGuid, Port Number, Device Type, and others.
Table 78. Possible issues found in health check .changes files (continued) Issue Description and possible actions X mismatch: expected * found: * This indicates an aspect of an item has changed as compared to the baseline configuration. The aspect which changed and the expected and found values would be shown.
*** [line 1], [line 2] **** lines from the baseline file --- [line 1], [line 2] ---- lines from the latest file The first set of lines enclosed in asterisks (*) indicates which line numbers contain the lines from the baseline file that have been altered.
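The marker format can be reproduced with diff -C 1 on two small files (the link contents below are invented for illustration):

```shell
#!/bin/sh
# Reproduce the context-diff markers that the health check emits when it
# compares the baseline links file with the latest links file.
cat > /tmp/baseline.links <<'EOF'
Port 1 -> HCA-A port 1
Port 2 -> HCA-B port 1
Port 3 -> HCA-C port 1
EOF

cat > /tmp/latest.links <<'EOF'
Port 1 -> HCA-B port 1
Port 2 -> HCA-A port 1
Port 3 -> HCA-C port 1
EOF

# -C 1 keeps one line of context, as in the health check example above;
# diff exits nonzero when the files differ, so ignore its status here
diff -C 1 /tmp/baseline.links /tmp/latest.links || true
```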
You can see the swap in the previous example by charting out the differences in the following table. The logical switch 2 lines happen to be extraneous information for this example, because their connections are not shown by diff; this is a result of using -C 1.

Switch Port    Connected to HCA port in baseline    Connected to HCA port in latest...
Remotely accessing the Fabric Management Server from xCAT/MS You can access any command that does not require user interaction by issuing the following dsh command from the xCAT/MS, provided that you have set up remote command execution from the xCAT/MS to the fabric management server as described in "Set up remote command processing"...
If you want to access switch commands that require user responses, the standard technique is to write an Expect script to interface with the switch Command Line Interface (CLI). Neither xdsh on the xCAT/MS nor cmdall on the fabric management server supports interactive switch CLI access. You might want to remotely access switches to gather data or issue commands.
The fabric manager code updates are documented in the Fabric Manager Users Guide, but the following items must be considered for the fabric management server, which includes the host-based fabric manager and the Fast Fabric Toolset.
v The main document for fabric management server code updates is the QLogic OFED+ Users Guide.
v To determine the software package level on the fabric management server, use iba_config.
– Choose only the following options to install or upgrade:
  - OFED IB stack
  - QLogic IB tools
  - QLogic Fast Fabric
  - QLogic FM
Note: All of the above plus others are set to install by default. Clear all other selections on this screen AND on the next screen before selecting "P"...
v If you must update only the code on one switch, you can do this using the Chassis Viewer; see the Switch Users Manual. You must FTP the package to the server on which you are opening the browser to connect to the Chassis Viewer.
illustrate how iba_report might be used for detailed monitoring of cluster fabric resources. Much more detail is available in the QLogic Fast Fabric Users Guide. Table 80. Suggested iba_report parameters Parameter Description -d 10 This parameter provides extra detail that you would not see at the default detail level of 2.
Table 80. Suggested iba_report parameters (continued) Parameter Description Clears error and statistics counters. You might use it with –o none so that no counters are returned. Or, you might use –o errors to get error counters before clearing them, which is the preferred method. In order to ensure good performance of iba_report, anytime the “-C”...
iba_report –C –o none –F “nodepat:SilverStorm*” The previous query returns nothing, but it clears all of the port statistics on all switch chassis whose IB NodeDescription begins with the default “SilverStorm”. Cluster service Cluster service requires an understanding of how problems are reported, who is responsible for addressing service issues, and the procedures used to fix the problems.
Table 81. Fault reporting mechanisms (continued) Reporting Mechanism Description xCAT Event Management Fabric Log Used to monitor and consolidate Fabric Manager and switch error logs. This is located on the xCAT/MS in: /tmp/systemEvents or xCAT eventlog This log is part of the standard event management function.
Table 81. Fault reporting mechanisms (continued) Reporting Mechanism Description /var/log/messages on fabric management server This is the syslog on the fabric management server where host-based Subnet Manager logs are located. This is the log for the entire fabric management server, therefore, there might be entries in it from components other than Subnet Manager.
cause. The link event caused by the user is reported through remote logging to the xCAT/MS in /tmp/systemEvents. Without remote logging, you must have interrogated the Subnet Manager log. v Server hardware failures would be reported to SFP on the managing HMC and forwarded to xCAT SFP Monitoring.
1) If there is a switch internal error, determine the association based on whether the error is isolated to a particular port, leaf board, or the spine. 2) If there is an adapter error or server checkstop, determine the switch links to which they are associated.
Table 82. Descriptions of Tables of Symptoms (continued) Table Description Table 87 on page 191 All other events, including those reported by the operating system and users The following table is used for events reported in the xCAT/MS Fabric Event Management Log (/tmp/systemEvents on the xCAT/MS).
Table 83. xCAT/MS Fabric Event Management log symptoms (continued) Symptom Procedure or Reference Other exceptions on switch or HCA ports Contact your next level of support. If anything is done to change the hardware or software configuration for the fabric, use “Re-establishing Health Check baseline”...
Table 85. Fast Fabric Tools symptoms (continued)
Symptom: Health check file: fabric*comps.errors
Procedure or Reference:
1. Record the location of the errors.
2. See the Fast Fabric Toolset Users Guide for details.
3. If this refers to a port, see "Diagnosing link errors" on page 210; otherwise, see "Diagnosing and repairing switch component problems"...
Table 86. SFP table of symptoms Symptom Procedure Reference Any eventID or reference code Use the IBM system service information. Then use “Diagnosing and repairing IBM system problems” on page 213. The following table is used for any symptoms reported outside of the previously mentioned reporting mechanisms.
Table 88. Service Procedures
Task    Procedure
Special procedures
Restarting the cluster    "Restarting the cluster" on page 246
Restarting or powering off an IBM system    "Restarting or powering off an IBM system" on page 247
Getting debug data from switches and Subnet Managers    "Capturing data for fabric diagnosis"...
Table 88. Service Procedures (continued)
Task    Procedure
Repairing IBM systems    "Diagnosing and repairing IBM system problems" on page 213
Ping problems    "Diagnosing and recovering ping problems" on page 225
Recovering ibX interfaces    "Recovering ibX interfaces" on page 235
Not running at the required 4KB MTU    "Recovering to 4K maximum transfer units in the AIX"...
1. You must first have passwordless ssh set up between the fabric management server and all of the other fabric management servers and also between the fabric management server and the switches. Otherwise, a password prompt would appear and xdsh would not work. 2.
d. Copy the latest directory from the fabric management server to the xCAT/MS. For xCAT: xdcp [fabric management server] /var/opt/iba/analysis/latest <captureDir_onxCAT>/latest e. On the xCAT/MS, make a directory for the failed health check runs: mkdir <captureDir_onxCAT>/hc_fails f. To get all failed directories, use the xdcp (for xCAT) command. If you want to be more targeted, copy over the directories that have the required failure data.
4. By default, data would be captured to files in the ./uploads directory below the current directory when you run the command. 5. Get Health check data from: a. Baseline health check: /var/opt/iba/analysis/baseline b. Latest health check: /var/opt/iba/analysis/latest c. From failed health check runs: /var/opt/iba/analysis/<timestamp> Using script command to capture switch CLI output You can collect data directly from a switch command-line interface (CLI).
Mapping fabric devices This section describes how to map from a description, device name, or other logical naming convention to the physical location of an HCA or a switch. Mapping of switch devices is largely done by how they are named at installation and configuration time. The switch chassis parameter for this is the InfiniBand Device name.
With the HCA structure in mind, note that IBM HCA Node GUIDs are relative to the entire HCA. These Node GUIDs always end in "00"; for example, 00.02.55.00.00.0f.13.00. The final 00 changes for each port on the HCA. Note: If at all possible, during installation, it is advisable to issue a query to all servers to gather the HCA GUIDs ahead of time.
For xCAT: xdsh [nodegroup with all servers] -v ’ibstat -n | grep GUID | grep "[1st seven bytes of GUID]"’ You would have enough information to identify the physical HCA and port with which you are working. Once you know the server in which the HCA is populated, you can issue an ibstat –p to the server and get the information about exactly which HCA matches exactly the GUID that you have in hand.
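The "[1st seven bytes of GUID]" pattern is simply the Node GUID with its final byte removed, which matches every port GUID on the same physical HCA; for example:

```shell
#!/bin/sh
# Derive the grep pattern for an HCA: the first seven dot-separated
# bytes of its AIX-format Node GUID.
guid="00.02.55.00.00.0f.13.00"
prefix=`echo "$guid" | cut -d. -f1-7`
echo "$prefix"    # 00.02.55.00.00.0f.13
```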
This procedure applies to IBM GX HCAs. For more information about the architecture of IBM GX HCAs and logical switches within them, see "IBM GX+ or GX++ host channel adapter" on page 7. Note: This procedure has some steps that are specific to the operating system type (AIX or Linux). This has to do with querying the HCA device from the operating system.
From xCAT: xdsh [nodegroup with a list of AIX nodes] -v ’ibstat -p | grep -p "[1st seven bytes of GUID]" | grep iba’ Example results: >dsh -v -N AIXNodes ’ibstat -p | grep -p "00.02.55.00.10.3a.72" | grep iba’ c924f1ec10.ppd.pok.ibm.com: IB PORT 1 INFORMATION (iba0) c924f1ec10.ppd.pok.ibm.com: IB PORT 2 INFORMATION (iba0) d.
a. If the baseline health check has been run, use the following command. If it has not been run, use step 3b. grep –A 1 “0g *[GUID] *[port]” /var/opt/iba/analysis/baseline/fabric*links b. If the baseline health check has not been run, you must query the live fabric by using the following command.
>dsh -v -N AIXNodes ’ibstat -p | grep -p "00.02.55.00.10.3a.72" | grep iba’ c924f1ec10.ppd.pok.ibm.com: IB PORT 1 INFORMATION (iba0) c924f1ec10.ppd.pok.ibm.com: IB PORT 2 INFORMATION (iba0) v For Linux, use the following information: For xCAT: xdsh [nodegroup with Linux nodes] -v ’ibv_devinfo| grep –B1 "[1st seven bytes of GUID]" | grep ehca’ Example results: >dsh -v -N AIXNodes ’ibv_devinfo | grep –B1 "0002:5500:103a:72"...
b. If the baseline health check has not been run, you must query the live fabric by using the following command. iba_report -o links | grep -A 1 "0g *[switch GUID] *[switch port]" Example results: > grep -A 1 "0g *0x00066a00d90003d3 *11" /var/opt/iba/analysis/baseline/fabric*links 20g 0x00025500103a6602 1 SW IBM G2 Logical Switch 1...
This procedure ends here. Finding devices based on a known ib interface (ibX/ehcaX) Use this procedure if the ib interface number is known and the physical HCA port and attached physical switch port must be determined. This applies to IBM GX HCAs. For more information about the architecture of IBM GX HCAs and logical switches within them, see “IBM GX+ or GX++ host channel adapter”...
6. Log on to the fabric management server. 7. Translate the operating system representation of the logical HCA GUID to the subnet manager representation of the GUID. a. For AIX reported GUIDs, delete the dots: 00.02.55.00.10.24.d9.00 becomes 000255001024d900 b. For Linux reported GUIDs, delete the colons: 0002:5500:1024:d900 becomes 000255001024d900 8.
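The separator stripping in steps 7a and 7b can be done with tr (the GUID values are taken from the text above):

```shell
#!/bin/sh
# Normalize operating-system GUID formats to the Subnet Manager form
# by deleting the separator characters.
aix_guid="00.02.55.00.10.24.d9.00"     # AIX form, dot-separated
linux_guid="0002:5500:1024:d900"       # Linux form, colon-separated

echo "$aix_guid" | tr -d '.'     # 000255001024d900
echo "$linux_guid" | tr -d ':'   # 000255001024d900
```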
IBM GX HCA physical port mapping based on device number Use this information to find the IBM GX HCA physical port based on the iba device and logical switch number. Use the following table to find the IBM GX HCA physical port based on the iba device and logical switch number.
Table 92. QLogic log severities (continued)

Severity: Notice
Significance:
v Actionable events
v Can be a result of user action or actual failure
v Have severity level above Information and below Warning and Error
v Logged to xCAT event...
Example:
v Switch chassis management software rebooted
v FRU state changed from not-present to present
Oct 9 18:54:37 slot101:172.21.1.29;MSG:NOTICE|CHASSIS:SilverStorm 9024 GUID=0x00066a00d8000161|COND:#9999 This is a notice event test|FRU:Power Supply 1|PN:200667-000|DETAIL:This is an additional information about the event

Subnet Manager log format
The Subnet Manager logs information about the fabric. This includes events like link problems, device status from the fabric, and information regarding when it is sweeping the network.
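Both the chassis log entry shown earlier and the Subnet Manager entries use '|'-delimited fields, so they can be split mechanically; a sketch using awk on the chassis Notice sample:

```shell
#!/bin/sh
# Split a QLogic chassis log entry into its '|'-delimited fields.
line='MSG:NOTICE|CHASSIS:SilverStorm 9024 GUID=0x00066a00d8000161|COND:#9999 This is a notice event test|FRU:Power Supply 1|PN:200667-000|DETAIL:This is an additional information about the event'

# One field per output line: MSG, CHASSIS, COND, FRU, PN, DETAIL
echo "$line" | awk -F'|' '{ for (i = 1; i <= NF; i++) print $i }'
```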
Oct 10 13:14:37 slot 101:172.21.1.9; MSG:ERROR| SM:SilverStorm 9040 GUID=0x00066a00db000007 Spine 101, Chip A:port 0| COND:#99999 Link Integrity Error| NODE:SilverStorm 9040 GUID=0x00066a00db000007 Spine 101, Chip A:port 10:0x00066a00db000007 | LINKEDTO:9024 DDR GUID=0x00066a00d90001db:port 15:0x00066a00d90001db|DETAIL:Excessive Buffer Overrun threshold trap received.

Diagnosing link errors
This procedure is used to isolate link errors to a field replacement unit (FRU). Symptoms that lead to this procedure include:
Symptom    Reporting mechanism...
Check prescribed in step 18 on page 213 to ensure that you have returned the cluster fabric to the intended configuration. The only changes in configuration would be VPD information from replaced parts. 3. If you replace the managed spine for the switch chassis, you must redo the switch chassis setup for the switch as prescribed in “Installing and configuring vendor or IBM InfiniBand switches”...
a. Replace the cable. Before replacing the cable, check the manufacturer and part number to ensure that it is an approved cable. Approved cables are listed on the IBM Clusters with the InfiniBand Switch website referenced in "Cluster information resources" on page 2. b.
b. If the cable does not fix the problem, replace the HCA, and verify the fix by using the procedure in “Verifying link FRU replacements” on page 244. If the problem is fixed, go to step 18. c. If the HCA does not fix the problem, engage QLogic to work on the switch. When the problem is fixed, go to step 18.
3. If you see configuration changes, do one of the following steps. To determine the nature of the change, see "Health checking" on page 157. a. Look for a health check output file with the extension of .changes or .diff on the fabric management server, in one of the following directories: /var/opt/iba/analysis/latest or /var/opt/iba/analysis/[recent timestamp] b.
You must check that the following configuration parameters match the installation plan. A reference or setting for IBM System p and IBM Power Systems HPC Clusters is provided for each parameter that you can check.

Table 93. Health check parameters
Parameter    Reference
GID prefix...
For xCAT: xdsh [nodegroup with all nodes that had previously missing HCAs] –v “lsdev –Cc adapter | grep iba” c. If the HCA: v Is still not visible to the system, continue with the step 5 v Is visible to the system, continue with the procedure to verify that all HCAs are available to the LPARs 5.
16. Verify that the network interfaces are recognized as being up and available. The following command string must return no interfaces. If an interface is marked down, it returns the LPAR and ibX interface. For xCAT: xdsh [nodegroup with all nodes] –v '/usr/bin/lsrsrc IBM.NetworkInterface Name OpState | grep -p"resource"...
Note: Before you perform a memory service action, ensure that the memory was not unconfigured for a specific reason. If the network still has performance problems call your next level of support. 3. If no problems are found in SFP, perform any System Service Guide instructions for diagnosing unconfigured memory.
Verify all HCAs are available to the LPARs: 6. Run the following command to count the number of active HCA ports: For xCAT: xdsh [nodegroup with all nodes] -v "ibv_devinfo | grep PORT_ACTIVE" | wc -l Note: An HCA has two ports. 7.
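The count returned in step 6 can be sanity-checked against the expected total; a quick sketch (the node and adapter counts are invented for illustration):

```shell
#!/bin/sh
# Expected ACTIVE port count = nodes x HCAs per node x 2 ports per HCA
nodes=32
hcas_per_node=1
ports_per_hca=2
expected=`expr $nodes \* $hcas_per_node \* $ports_per_hca`
echo "$expected"    # 64
```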
The Verify HCAs procedure ends here.

Checking system configuration in Linux
You can check your system configuration with the Linux operating system.

Verifying the availability of processor resources
To verify the availability of processor resources, perform the following steps: 1. Run the following command: For xCAT: xdsh [nodegroup with all nodes] -v "grep processor /proc/cpuinfo"...
Checking multicast groups Use this procedure to check multicast groups for correct membership. To check multicast groups for correct membership, perform the following procedure: 1. If you are running a host-based Subnet Manager, to check multicast group creation, on the Fabric Management Server run the following commands.
In general, when HCA ports are swapped, they are swapped on the same HCA, or perhaps on HCAs within the same IBM server. Any more sophisticated swapping would likely be up for debate with respect to whether it is a switch port swap, an HCA port swap, or just a complete reconfiguration. You must reference the Fast Fabric Toolset Users Guide for details on health checking.
3. Look for fabric.X:Y.links.diff or fabric.X:Y.links.changes, where X is the HCA and Y is the HCA port on the fabric management server that is attached to the subnet. This helps you map directly to the subnet with the potential issue. 4.
, where [timestamp] is a timestamp after the timestamp for the operating system event, and for any errors found associated with the switch link recorded previously, run the procedure in “Interpreting error counters” on page 255. 2. Look for link errors reported by the fabric manager in /var/log/messages by searching on the HCA nodeGUID and the associated switch port information as recorded previously.
2. Look for fabric configuration problems by using the procedure in “Checking for fabric configuration and functional problems” on page 214. 3. Look for configuration problems in the IBM systems: Check for HCA availability, processor availability, and memory availability. a. For AIX LPARs, see: 1) “Checking InfiniBand configuration in AIX”...
This procedure ends here. Diagnosing application crashes Use this procedure to diagnose application crashes. Diagnosing application crashes with respect to the cluster fabric is similar to diagnosing performance problems as in “Diagnosing performance problems” on page 224. However, if you know the endpoints involved in the application crash, you can check the state of the routes between the two points to see if there might be an issue.
Symptom    Procedure
Event is not in the /tmp/systemEvents on the xCAT/MS    "Event not in xCAT/MS:/tmp/systemEvents"
Event is not in /var/log/xcat/syslog.fabric.notices on the xCAT/MS    "Event not in xCAT/MS: /var/log/xcat/syslog.fabric.notices" on page 228
Event is not in /var/log/xcat/syslog.fabric.info on the xCAT/MS    "Event not in xCAT/MS: /var/log/xcat/syslog.fabric.info"...
xCAT Config    Sensor    Condition    Response
xCAT on AIX and xCAT/MS is not a managed node    IBSwitchLogSensor    LocalIBSwitchLog    Log event anytime; Email root anytime (optional); LogEventToxCATDatabase (optional)
xCAT on AIX and xCAT/MS is a managed node    IBSwitchLogSensor    LocalIBSwitchLog    Log event anytime; Email root anytime (optional); LogEventToxCATDatabase...
If an expected event is not in the remote syslog file for notices on the xCAT/MS (/var/log/xcat/ syslog.fabric.notices), do the following procedure. Note: This assumes that you are using syslogd for syslogging. If you are using another syslog application, like syslog-ng, then you must alter this procedure to account for that. However, the underlying technique for debug remains the same.
logSyslogConfig –h [host] –p 514 –f 22 –m 1 v The xCAT/MS is the host IP address v The port is 514 (or other than that you have chosen to use) v The facility is local6 8. If the problem persists, then try restarting the syslogd on the xCAT/MS and also resetting the source's logging: a.
management server from which you want to receive logs. If you have a specific address named, ensure that the source of the log has an entry with its address. Switches use udp. Fabric management servers are configurable for tcp or udp. 4.
Note: This procedure assumes that you are using syslogd for syslogging. If you are using another syslog application, like syslog-ng, then you must alter this procedure to account for that. However, the underlying technique for debugging remains the same. 1. Log on to the fabric management server. 2.
stopcondresp <condition name> <response_name> 4. Delete all the xCAT related entries from the /etc/syslog.conf file. These entries are defined in "Set up remote logging" on page 112. The commented entry might not exist. # all local6 notice and above priorities go to the following file local6.notice /var/log/xcat/syslog.fabric.notices 5.
destination fabnotices_fifo { pipe("/var/log/xcat/syslog.fabric.notices" group(root) perm(0644)); }; log { source(src); filter(f_fabnotices); destination(fabnotices_fifo); }; 5. Ensure that the f_fabnotices filter remains in the /etc/syslog-ng/syslog-ng.conf file by using the following command. filter f_fabnotices { facility(local6) and level(notice, alert, warn, err, crit) and not filter(f_iptables); }; 6.
14. Check the /etc/syslog-ng/syslog-ng.conf configuration file to ensure that the appropriate entries were added by monerrorlog. Typically, the entries look similar to the following example. However, monerrorlog uses a different name from fabnotices_fifo in the destination and log entries. It uses a pseudo-random name that looks similar to fifonfJGQsBw.
If the ifconfig [ib interface] up command does not recover the ibX interface, you must completely remove and rebuild the interface by using the following commands:

rmdev -l [ibX]
chdev -l [ibX] -a superpacket=on -a state=up -a tcp_sendspace=524288 -a tcp_recvspace=524288 -a srq_size=16000
mkdev -l [ibX]

Recovering all of the ibX interfaces in an LPAR in the AIX
If you must recover all of the ibX interfaces in a server, it is probable that you must remove the interfaces...
    mkiba -A $iba -i $i -a $ib_addr -p 1 -P 1 -S up -m 255.255.255.0
done

# Re-create the ibX interfaces properly
# This assumes that the default p_key (0xffff) is being used for
# the subnet
for i in `lsdev | grep Infiniband | awk '{print $1}' | egrep -v "iba|icm"`
do
    chdev -l $i -a superpacket=on -a tcp_recvspace=524288 -a tcp_sendspace=524288 -a srq_size=16000 -a state=up
done
2. If these commands do not recover the ibX interface, check for any error messages from dmesg or in the /var/log/messages file, and perform the appropriate service associated with the error messages. 3. If the problem persists, contact your next level of support. Recovering all of the ibX interfaces in an LPAR in the Linux Use this procedure to recover all of the ibX interfaces in a logical partition in the Linux operating system.
. . . </Multicast> . . . </Sm> e. Start the Subnet Manager by using the following command: For IFS 5: /etc/init.d/qlogic_fm start If you are running an embedded Subnet Manager, complete the following steps: Note: These instructions are written for recovering a single subnet at a time. Log on to the switch command-line interface (CLI), or issue these commands from the fabric management server by using cmdall, or from the xCAT/MS by using xdsh.
for i in `lsdev | grep Infiniband | awk '{print $1}' | egrep -v "iba|icm"`
do
    echo $i
    lsattr -El $i | egrep " super"
done

Note: To verify a single device (such as ib0), use the lsattr -El ib0 | egrep "mtu|super" command.
0xff12401bffff0000:00000000ffffffff (c000)
qKey = 0x00000000 pKey = 0xFFFF mtu = 5 rate = 3 life = 19 sl = 0
0x00025500101a3300 F
0x00025500101a3100 F
0x00025500101a8300 F
0x00025500101a8100 F
0x00025500101a6300 F
0x00025500101a6100 F
0x0002550010194000 F
0x0002550010193e00 F
0x00066a00facade01 F

Recovering to 4K maximum transfer units in the Linux
Use this procedure if your cluster must be running with 4K maximum transfer units (MTUs), but it has already been installed and is not currently running at 4K MTU.
Log on to the switch CLI, or issue these commands from the Fabric Management Server by using cmdall, or from the xCAT/MS by using xdsh. If you use xdsh, use the parameters, -l admin --devicetype IBSwitch::Qlogic, as outlined in “Remotely accessing QLogic switches from the xCAT/MS”...
ib2 65532
ib3 65532
ib4* 65532
ib5 65532
ib6 65532
ib7 65532
ml0 65532
lo0 16896
lo0 16896

If you are running a host-based Subnet Manager, to check multicast group creation, on the fabric management server run the following commands. For IFS 5, use the following steps: 1) Check for multicast membership.
In many cases, it is acceptable to loop through all instances of the subnet manager on all fabric management servers to ensure that they are running under the original priority. Assuming you have four subnet managers running on a fabric management server, you would use the following command-line loop: for i in 0 1 2 3;...
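The loop is truncated above; a hedged sketch of its shape follows (the per-instance command is a placeholder shown as a dry run; substitute the priority-restore command for your IFS level):

```shell
#!/bin/sh
# Dry-run sketch: iterate over four subnet manager instances (0 through 3)
# and echo the per-instance action instead of issuing a real command.
for i in 0 1 2 3
do
    echo "instance $i: restore original subnet manager priority"
done
```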
7. Run the /sbin/iba_report –o errors command again. 8. If the link reports errors, the problem is not fixed. Otherwise, the problem is fixed. This procedure ends here. Return to the fault isolation procedure that sent you here Verifying repairs and configuration changes Use this procedure to verify repairs and configurations changes that have taken place with your cluster.
5. If any problems were found, fix them and restart this procedure. Continue fixing and restarting this procedure until you are satisfied that the repair or configuration change has been successful, and that neither has resulted in unexpected configuration changes.
c. If you did not use the –e parameter, look for configuration changes and fix any that you find. For more information, see “Finding and interpreting configuration changes” on page 180. This procedure ends here. Restarting or powering off an IBM system If you are restarting or powering off an IBM system for maintenance or repair, use this procedure to minimize impacts on the fabric, and to verify that the system host channel adapters (HCAs) have rejoined the fabric.
a. Run the all_analysis command, or the all_analysis -e command. For more information, see “Health checking” on page 157 and the Fast Fabric Toolset Users Guide. b. Look for errors and fix any that you find. For more information, see the “Table of symptoms” on page 187.
Each spine has two switch chips. To maintain cross-sectional bandwidth performance, you want a spine port for each cable port, so a single spine can support up to 48 ports. The standard sizes are 48, 96, 144, and 288 port switches, which require 1, 2, 3, and 6 spines respectively. A leaf-board has a single switch chip.
6. Start the Subnet Managers. If you had powered off the fabric management server running Subnet Managers, and the Subnet Managers were configured to auto-start, all you must do is start the fabric management server after you start the other servers. If the switches have embedded Subnet Managers configured for auto-start, then the Subnet Managers restarts when the switches come back online.
20g 2048 0x00025500106d1602 1 SW IBM G2 Logical Switch 1
SymbolErrorCounter: 1092 Exceeds Threshold: 6
<-> 0x00066a0007000de7 3 SW SilverStorm 9080 c938f4ql01 Leaf 3, Chip

c. Find the LID associated with this nodeGUID by substituting $nodeGUID in the following iba_report command. In this example, the LID is 0x000c. Also note the subnet in which it was found.
e. Re-enable the switch port by using the switch LID, switch port, and the fabric manager HCA and port found in the preceding steps: /sbin/iba_portenable -l $lid -m $switch_port -h $h -p $p 7. Clear all errors by using either the following command, or a script like the one in "Error counter clearing script"...
v "Diagnose a link problem based on error counters" on page 264
v "Error counter details" on page 265
v "Clearing error counters" on page 274

Interpreting error counters
If the only problems that exist in a fabric involve an occasional faulty link resulting in excessive SymbolErrors or PortRcvErrors, interpreting error counters can be routine.
a. Determine if the pattern of errors leads you through the fabric to a common point exhibiting link integrity problems. b. If there are no link integrity problems, see if there is a pattern to the errors that has a common leaf or spine, or if there is some configuration problem that is causing the error.
d. If the configuration has been changed it must be changed back again by using the ismChassisSetMtu command. e. If there is no issue with the configuration, then perform the procedures to isolate local link integrity errors (“Diagnose a link problem based on error counters” on page 264). Otherwise, go to step 3.
Note: By design, the IBM GX HCA increases the PortRcvError count if SymbolErrors occur on data packets. If a SymbolError occurs on an idle character, the PortRcvError would not be incremented. Therefore, HCA SymbolErrors reported in the absence of other errors, indicates that the errors are occurring only on idle patterns and therefore are not impacting performance.
Figure 16. Reference for Link Integrity Error Diagnosis
Interpreting remote errors Both PortXmitDiscards and PortRcvRemotePhysicalErrors are considered to be "remote errors" in that they most often indicate a problem elsewhere in the fabric. With PortXmitDiscards, a problem elsewhere is preventing the progress of a packet to such a degree that its lifetime in the fabric exceeds the timeout values of a packet in a chip or in the fabric.
Example PortXmitDiscard analyses: Several figures are presented, with descriptions preceding them. The following figure is an example of an HCA detecting a problem with a link, where the pattern of PortXmitDiscards leads to the conclusion that the link errors are the root cause of the PortXmitDiscards.
Figure 19. Failing leaf chip causing PortXmitDiscards

Example PortRcvRemotePhysicalErrors analyses: Several figures would be presented with descriptions preceding them. The following figure is an example of an HCA detecting a problem with a link, and the pattern of PortRcvRemotePhysicalErrors leading to the conclusion that the link errors are the root cause of the PortRcvRemotePhysicalErrors.
Figure 21. Leaf-Spine link causing PortRcvRemotePhysicalErrors

The following figure is an example of all PortRcvRemotePhysicalErrors being associated with a single leaf, with no link errors to which to attribute them. You can see the transmit discards “dead-ending” at the leaf chip. It is important to first assure yourself that all of the other errors in the network have thresholds low enough to be seen.
Figure 23. Failing HCA CRC generator causing PortRcvRemotePhysicalErrors

Interpreting security errors
Security errors do not apply to clusters running SubnetManager code at the 4.3.x level or previous levels. Call your next level of support upon seeing PortXmitConstraintErrors or PortRcvConstraintErrors.

Diagnose a link problem based on error counters
You would have been directed here from another procedure.
c. For links to HCAs, replace the HCA (this impacts fewer CECs). For spine-to-leaf links, it is easier to replace the spine first. This affects performance on all nodes, but replacing a leaf might stop communication altogether on the nodes connected to that leaf. d.
Table 100. Error Counter Categories (continued)

Error Counter                Category
PortXmitDiscards             Congestion or Remote Link Integrity
PortXmitConstraintErrors     Security
PortRcvConstraintErrors      Security
VL15Dropped                  SMA Congestion
PortRcvSwitchRelayErrors     Routing

Link Integrity Errors
These are errors that are localized to a particular link. If they are not caused by some user action or outside event influencing the status of the link, these are generally indicative of a problem on the link.
If it appears that the link is recovering on its own without outside influences, typical link isolation techniques must be used. For more information, see “Diagnose a link problem based on error counters” on page 264.

Performance impact: Because a link error recovery error is often associated with either a link that is taking many errors or one that has stopped communicating, there would be a performance impact for any communication going over the link that is experiencing these errors.
L11P01 MTUCap=5(4096 bytes) VLCap=3(4 VLs) <- Leaf 11 Port 11; 4K MTU and 4 VLs
S3BL19 MTUCap=5(4096 bytes) VLCap=3(4 VLs) <- Spine 3 chip B to Leaf 19 interface

The default for VLCap is 3. The default for MTUCap is 4. However, clusters with all DDR HCAs are typically configured with an MTUCap of 5.
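Capability lines like the ones above can be scanned mechanically. The following is a minimal sketch that flags any port whose MTUCap or VLCap differs from the expected all-DDR values (MTUCap=5, VLCap=3). The sample data, the temporary file path, and the exact report format are assumptions for illustration; in practice the input would come from the actual port report.

```shell
#!/bin/sh
# Sketch: flag ports whose MTUCap or VLCap differ from the expected
# values for an all-DDR cluster (MTUCap=5, VLCap=3, as described in
# the text). The input format mimics the report excerpt shown above;
# the sample data and file path are illustrative assumptions.

REPORT=/tmp/port_caps.txt
cat > "$REPORT" <<'EOF'
L11P01 MTUCap=5(4096 bytes) VLCap=3(4 VLs)
S3BL19 MTUCap=4(2048 bytes) VLCap=3(4 VLs)
EOF

# Extract the numeric MTUCap and VLCap from each line and report
# any port that does not match the expected configuration.
MISMATCHES=$(awk '{
    mtu = $2; vl = $4
    sub(/MTUCap=/, "", mtu); sub(/\(.*/, "", mtu)
    sub(/VLCap=/, "", vl);   sub(/\(.*/, "", vl)
    if (mtu != 5 || vl != 3)
        print $1 ": MTUCap=" mtu " VLCap=" vl " (expected MTUCap=5 VLCap=3)"
}' "$REPORT")

echo "$MISMATCHES"
```

Only the misconfigured S3BL19 line is reported here; a correctly configured fabric would produce no output at all.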
only checked at the destination HCA. This is a difficult situation to isolate to root cause. The technique is to perform methodical point-to-point communication and note which combination of HCAs causes the errors. Also, for every PortRcvRemotePhysical reported by an IBM Galaxy HCA, a PortRcvError would be reported.
It indicates that an invalid combination of bits was received. While it is possible to get other link integrity errors on a link without SymbolErrors, this is not typical. Often, if zero SymbolErrors are found but there are LinkDowns or LinkErrorRecoveries, another read of the SymbolError counter reveals that you happened to read it just after it had been reset on a link recovery action.
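The re-read check just described can be sketched as follows. The two sample readings are hypothetical stand-ins for successive queries of the same port's SymbolError counter; a second reading lower than the first implies the counter was cleared in between, for example by a link recovery action.

```shell
#!/bin/sh
# Sketch: decide whether a SymbolError counter was reset between two
# reads. The sample values are hypothetical; in practice they would
# come from two successive error-counter queries of the same port.
first_read=37
second_read=2

if [ "$second_read" -lt "$first_read" ]; then
    # A lower second reading means the counter was cleared in
    # between, consistent with a link recovery action.
    verdict="counter was reset between reads (likely link recovery)"
else
    delta=$((second_read - first_read))
    verdict="counter advanced by $delta symbol errors"
fi
echo "$verdict"
```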
Threshold: maximum in 24 hours = 10

Remote Link Errors (including congestion and link integrity)
These errors (PortRcvRemotePhysicalErrors and PortXmitDiscards) are typically indicative of an error on a remote link that is affecting a local link.

PortRcvRemotePhysicalErrors: PortRcvRemotePhysicalErrors indicate that a received packet was marked bad. Because of cut-through routing, depending on where the head of the packet is within the fabric relative to this port, the packet might have been forwarded on toward the destination.
There are several reasons for such XmitDiscards:
v The packet switch lifetime limit has been exceeded. This is the most common issue and is caused by congestion or by a downstream link that went down. It can be common for certain applications with communication patterns like all-to-all or all-to-one.
Security errors
Security errors (PortXmitConstraintErrors and PortRcvConstraintErrors) do not apply until the QLogic code level reaches 4.4.

PortXmitConstraintErrors: PortXmitConstraintErrors indicate Partition Key violations, which are not expected with the 4.3 and earlier SM. With the QLogic 4.4 and later SM, they can indicate an incorrect Virtual Fabrics configuration or an Application configuration that is inconsistent with the SM configuration.
Threshold: minimum actionable = IGNORE except under debug. Threshold: maximum in 24 hours = IGNORE except under debug.

PortRcvSwitchRelayErrors: PortRcvSwitchRelayErrors indicate the number of discarded packets.

Note: There is a known bug in the Anafa2 switch chip that incorrectly increments this counter for multicast traffic (for example, IPoIB).
It is further suggested that you clear all error counters at a regular interval of every 24 hours. There are several ways to clear all error counters:
v The simplest method is to run a cron job that uses the iba_report command to reset the errors on the entire fabric.
v A configuration script that is called by the other scripts to set up common variables.
One key thing to remember is that this set of scripts must also be run from cron; therefore, full path information is important. This set of scripts does not address how to deal with more accurate error counter thresholds for individual links that have had their error counters cleared at a different time from the other links.
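A cron-driven counter clear along the lines described above might be sketched as follows. The iba_report options used here (-C to clear the counters, -o none to suppress the report body) are assumptions to be verified against the Fast Fabric documentation for your code level; the script falls back to printing the command when the tool is not present, so it can be reviewed safely.

```shell
#!/bin/sh
# Sketch: daily error-counter clear, intended to run from cron.
# The iba_report options (-C to clear counters, -o none to suppress
# the report body) are assumptions; confirm them against the Fast
# Fabric documentation for your code level. Full paths matter
# because cron provides a minimal environment.

IBA_REPORT=/sbin/iba_report
CLEAR_CMD="$IBA_REPORT -C -o none"

if [ -x "$IBA_REPORT" ] && [ -z "$DRYRUN" ]; then
    # Clear the error counters across the entire fabric.
    $CLEAR_CMD
else
    # Review mode: show what would run instead of touching the fabric.
    echo "would run: $CLEAR_CMD"
fi
```

A crontab entry such as `0 0 * * * /usr/local/sbin/clear_ib_errors.sh` (path hypothetical) would run it every 24 hours, matching the clearing interval suggested above.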
Healthcheck control script
This script not only chooses the appropriate iba_mon.conf file and calls all_analysis, but also adds entries to a log file ($ANALYSISLOG, which is set up in the configuration file). It is assumed that the user has set up /etc/sysconfig/fastfabric.conf appropriately for the configuration. The user checks the $ANALYSISLOG file on a regular basis to see whether problems are being reported.
#---------------------------------------------------------------- # Run all_analysis with the appropriate iba_mon file based on the # number of hours since the last clear ($diffh). # This relies on the default set up for FF_FABRIC_HEALTH in the # /etc/sysconfig/fastfabric.conf file. # Log the STDOUT and STDERR of all_analysis. #---------------------------------------------------------------- /sbin/all_analysis -s -c $IBAMON.$diffh >>...
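The $diffh value used in the call above could be derived as in the following sketch. The lastclear timestamp file, its location, and its format (epoch seconds) are assumptions; the control script would write such a file whenever the counters are cleared.

```shell
#!/bin/sh
# Sketch: compute $diffh, the whole hours since the last counter
# clear, from a timestamp file written at clear time. The file name,
# path, and epoch-seconds format are illustrative assumptions.

LASTCLEAR=/tmp/lastclear.epoch

# For illustration, pretend the last clear happened 24 hours ago.
now=$(date +%s)
echo $((now - 24 * 3600)) > "$LASTCLEAR"

last=$(cat "$LASTCLEAR")
diffh=$(( (now - last) / 3600 ))

# $diffh then selects the matching iba_mon.conf.<hours> file, as in
# the all_analysis invocation shown above.
echo "hours since last clear: $diffh"
```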
See /var/opt/iba/analysis/latest/fabric.2:2.errors
fabric_analysis: Failure information saved to: /var/opt/iba/analysis/2009-03-06-21:00:01/
fabric_analysis: Possible fabric errors or changes found
chassis_analysis: Chassis OK
all_analysis: Possible errors or changes found

The following example illustrates reading error counters 24 hours since the last error counter clear, which triggers healthcheck to call all_analysis to also clear the errors after reading them.
Finally, to ensure that no data is lost between calls of all_analysis, there must be a sleep between each call. The sleep must be at least one second to ensure that error results are written to a separate directory. The following section illustrates this logic.
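The timing requirement can be sketched as follows. Here mkdir stands in for the all_analysis call, and the per-second directory naming mirrors the date-stamped /var/opt/iba/analysis/ directories shown earlier; the base path is an illustrative assumption.

```shell
#!/bin/sh
# Sketch: two consecutive analysis passes separated by at least one
# second, so each pass lands in a distinct timestamped results
# directory. mkdir stands in for the real all_analysis call; the
# base path is an illustrative assumption.

BASE=/tmp/analysis_demo
mkdir -p "$BASE"

dir1="$BASE/$(date +%Y-%m-%d-%H:%M:%S)"
mkdir -p "$dir1"

sleep 1   # guarantees the next timestamp differs by at least a second

dir2="$BASE/$(date +%Y-%m-%d-%H:%M:%S)"
mkdir -p "$dir2"

echo "$dir1"
echo "$dir2"
```

Without the sleep, two passes in the same second would share a directory name and the second pass would overwrite the first pass's results.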
timestamp in the name, not the one with “latest” in the name. Also, if the result does not have “all_analysis: All OK”, set $HEALTHY=0.

Run ls lastclear.*:* to get the list of link-clear files
Loop through the list of link-clear files {
    Get the nodeguid ($nodeguid), the node port ($nodeport), the Fabric MS HCA ($hca)
    and HCA port ($hcaport) from the link-clear filename
    # needs the space before $nodeport...
if $HEALTHY == 0 {
    write to analysis log file, 'HEALTHCHECK problems'
} else {
    write to analysis log file, 'HEALTHCHECK "All OK"'
}
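The final health test in the pseudocode above might be implemented along these lines. The captured result text is a hypothetical sample modeled on the earlier example output; a real run would capture the output of all_analysis instead.

```shell
#!/bin/sh
# Sketch: set $HEALTHY from a captured all_analysis result, following
# the pseudocode above. The sample result text is hypothetical,
# modeled on the example output shown earlier in this section.

RESULT='fabric_analysis: Possible fabric errors or changes found
chassis_analysis: Chassis OK
all_analysis: Possible errors or changes found'

HEALTHY=1
case "$RESULT" in
    *"all_analysis: All OK"*) ;;   # healthy run, leave HEALTHY=1
    *) HEALTHY=0 ;;                # anything else counts as a problem
esac

if [ "$HEALTHY" -eq 0 ]; then
    echo "HEALTHCHECK problems"
else
    echo 'HEALTHCHECK "All OK"'
fi
```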
Notices
This information was developed for products and services offered in the U.S.A. The manufacturer may not offer the products, services, or features discussed in this document in other countries. Consult the manufacturer's representative for information on the products and services currently available in your area.
The manufacturer's prices shown are the manufacturer's suggested retail prices, are current and are subject to change without notice. Dealer prices may vary. This information is for planning purposes only. The information herein is subject to change before the products described become available. This information contains examples of data and reports used in daily business operations.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. Other product and service names might be trademarks of IBM or other companies.

Electronic emission notices
When attaching a monitor to the equipment, you must use the designated monitor cable and any interference suppression devices supplied with the monitor.
Technical Regulations, Department M456
IBM-Allee 1, 71139 Ehningen, Germany
Tele: +49 7032 15-2937
email: tjahn@de.ibm.com

Warning: This is a Class A product. In a domestic environment, this product may cause radio interference, in which case the user may be required to take adequate measures.

VCCI Statement - Japan
The following is a summary of the VCCI Japanese statement: This is a Class A product based on the standard of the VCCI Council.
Electromagnetic Interference (EMI) Statement - Taiwan
The following is a summary of the EMI Taiwan statement. Warning: This is a Class A product. In a domestic environment this product may cause radio interference, in which case the user will be required to take adequate measures.

IBM Taiwan Contact Information:

Electromagnetic Interference (EMI) Statement - Korea

Germany Compliance Statement
EN 55022 Class A devices must carry the following warning: "Warning: This is a Class A device. This device can cause radio interference in residential areas; in that case, the operator may be required to take appropriate measures and to bear the cost of doing so."

Deutschland: Compliance with the law on electromagnetic compatibility of devices
This product complies with the German "Law on the Electromagnetic Compatibility of Devices (EMVG)".
Except as expressly granted in this permission, no other permissions, licenses or rights are granted, either express or implied, to the publications or any information, data, software or other intellectual property contained therein. The manufacturer reserves the right to withdraw the permissions granted herein whenever, in its discretion, the use of the publications is detrimental to its interest or, as determined by the manufacturer, the above instructions are not being properly followed.