QLogic Corporation reserves the right to change product specifications at any time without notice. Applications described in this document for any of these products are for illustrative purposes only. QLogic Corporation makes no representation nor warranty that such applications are suitable for the specified use without further testing or modification.
Preface The QLogic OFED+ Host Software User Guide shows end users how to use the installed software to set up the fabric. End users include both the cluster administrator and the Message-Passing Interface (MPI) application programmers, who have different but overlapping interests in the details of the technology.
License Agreements Refer to the QLogic Software End User License Agreement for a complete listing of all license agreements affecting this product.
QLogic offers training on its InfiniBand (IB) and Fibre Channel products. From the main QLogic web page at www.qlogic.com, click the Support tab at the top, and then click Training and Certification on the left. The QLogic Global Training portal offers online courses, certification exams, and scheduling of in-person training.
Technical Support Knowledge Database The QLogic knowledge database is an extensive collection of QLogic product information that you can search for specific solutions. We are constantly adding to the collection of information in our database to provide answers to your most urgent questions.
Introduction How this Guide is Organized The QLogic OFED+ Host Software User Guide is organized into these sections: Section 1 provides an overview and describes interoperability. Section 2 describes how to set up your cluster to run high-performance MPI jobs.
The QLogic InfiniBand® Fabric Software Installation Guide contains information on QLogic software installation. Overview The material in this documentation pertains to a QLogic OFED+ cluster. A cluster is defined as a collection of nodes, each attached to an InfiniBand®-based fabric...
QLogic offers the QLogic Embedded Fabric Manager (FM) for both DDR and QDR switch product lines supplied by your IB switch vendor. Alternatively, a host-based subnet manager can be used; QLogic provides the QLogic Fabric Manager (FM) as part of the QLogic InfiniBand Fabric Suite (IFS).
This guide assumes that adapter hardware installation has been completed according to the instructions in the QLogic InfiniBand® Adapter Hardware Installation Guide, and that software installation and driver configuration have been completed according to the instructions in the QLogic InfiniBand® Fabric Software Installation Guide. To minimize management problems, the compute nodes of the cluster must have very similar hardware configurations and identical software installations.
“Checking Cluster and Software Status” on page 3-44. Using MPI Verify that the QLogic hardware and software have been installed on all the nodes you will be using, and that ssh is set up on your cluster (see all the steps in the Cluster Setup checklist).
The IB driver ib_qib, QLogic Performance Scaled Messaging (PSM), accelerated Message-Passing Interface (MPI) stack, the protocol and MPI support libraries, and other modules are components of the QLogic OFED+ software. This software provides the foundation that supports the MPI implementation.
License information is found only in /usr/share/doc/infinipath. QLogic OFED+ Host Software user documentation can be found on the QLogic web site on the software download page for your distribution. Configuration files are found in: /etc/sysconfig Init scripts are found in: /etc/init.d...
OpenSM. This component is disabled at startup. QLogic recommends using the QLogic Fabric Manager (FM), which is included with the IFS or optionally available within the QLogic switches. QLogic FM or OpenSM can be installed on one or more nodes with only one node being the master SM.
3–InfiniBand Cluster Setup and Administration ® IPoIB Network Interface Configuration This example assumes that no hosts files exist, the host being configured has the IP address 10.1.17.3, and DHCP is not used. NOTE Instructions are only for this static IP address case. Configuration methods for using DHCP will be supplied in a later release.
Refer to the QLogic InfiniBand® Fabric Software Installation Guide for more information on using the QLogic IFS Installer TUI. Refer to the QLogic FastFabric User Guide for more information on using FastFabric. To configure the IPoIB driver from the command line, use the following commands.
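The commands themselves are not reproduced above; the following is a minimal sketch of a static IPoIB configuration consistent with the example (IP address 10.1.17.3). The interface name ib0 and the netmask are assumptions for illustration, not values taken from this guide.
$ ifconfig ib0 10.1.17.3 netmask 255.255.255.0
To make the setting persistent on a RHEL-style system, an /etc/sysconfig/network-scripts/ifcfg-ib0 file along these lines can be used (again, a sketch):
DEVICE=ib0
ONBOOT=yes
BOOTPROTO=static
IPADDR=10.1.17.3
NETMASK=255.255.255.0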
The IB bonding driver is based on the Linux Ethernet Bonding Driver and was adapted to work with IPoIB. Support for IPoIB interfaces is provided only for the active-backup mode; other modes should not be used. QLogic supports bonding across HCA ports and bonding port 1 and port 2 on the same HCA.
Red Hat EL5 and EL6
The following is an example for bond0 (master). The file is named /etc/sysconfig/network-scripts/ifcfg-bond0:
DEVICE=bond0
IPADDR=192.168.1.1
NETMASK=255.255.255.0
NETWORK=192.168.1.0
BROADCAST=192.168.1.255
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
MTU=65520
BONDING_OPTS="primary=ib0 updelay=0 downdelay=0"
The following is an example for ib0 (slave). The file is named /etc/sysconfig/network-scripts/ifcfg-ib0:
DEVICE=ib0
USERCTL=no...
SuSE Linux Enterprise Server (SLES) 10 and 11
The following is an example for bond0 (master). The file is named /etc/sysconfig/network/ifcfg-bond0:
DEVICE="bond0"
TYPE="Bonding"
IPADDR="192.168.1.1"
NETMASK="255.255.255.0"
NETWORK="192.168.1.0"
BROADCAST="192.168.1.255"
BOOTPROTO="static"
USERCTL="no"
STARTMODE="onboot"
BONDING_MASTER="yes"
BONDING_MODULE_OPTS="mode=active-backup miimon=100 primary=ib0 updelay=0 downdelay=0"...
Verify that the following line is set to the value of yes in /etc/sysconfig/boot:
RUN_PARALLEL="yes"
Verify IB Bonding is Configured
After the configuration scripts are updated, and the network service is restarted or the server is rebooted, use the following CLI commands to verify that IB bonding is configured.
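The standard Linux bonding interfaces can be used for this check; the two commands below are a sketch (the bond0 name matches the examples above). The first shows the bonding driver's view of the bond (mode, active slave, MII status), and the second shows the interface configuration and traffic counters; the RX/TX byte counts that follow are typical of the second command's output.
cat /proc/net/bonding/bond0
ifconfig bond0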
RX bytes:141223648 (134.6 Mb) TX bytes:147950000 (141.0 Mb)
Subnet Manager Configuration
QLogic recommends using the QLogic Fabric Manager to manage your fabric. Refer to the QLogic Fabric Manager User Guide for information on configuring the QLogic Fabric Manager.
You cannot use OpenSM if any of your IB switches provide a subnet manager, or if you are running a host-based SM, for example the QLogic Fabric Manager.
Applications that use Distributed SA The QLogic PSM Library has been extended to take advantage of the Distributed SA. Therefore, all MPIs that use the QLogic PSM library can take advantage of the Distributed SA. Other applications must be modified specifically to take advantage of it.
Virtual Fabrics and the Distributed SA The IBTA standard states that applications can be identified by a Service ID (SID). The QLogic Fabric Manager uses SIDs to identify applications. One or more applications can be associated with a Virtual Fabric using the SID. The Distributed SA is designed to be aware of Virtual Fabrics, but to only store records for those Virtual Fabrics that match the SIDs in the Distributed SA's configuration file.
® QLogic Distributed Subnet Administration If you are using the QLogic Fabric Manager in its default configuration, and you are using the standard QLogic PSM SIDs, this arrangement will work fine and you will not need to modify the Distributed SA's configuration file - but notice that the Distributed SA has restricted the range of SIDs it cares about to those that were defined in its configuration file.
3–InfiniBand Cluster Setup and Administration ® QLogic Distributed Subnet Administration Figure 3-4. Distributed SA Multiple Virtual Fabrics Configured Example Virtual Fabrics with Overlapping Definitions As defined, SIDs should never be shared between Virtual Fabrics. Unfortunately, it is very easy to accidentally create such overlaps.
Figure 3-6. Virtual Fabrics with PSM_MPI Virtual Fabric Enabled
In Figure 3-6, the administrator enabled the “PSM_MPI” fabric, and then added a new “Reserved” fabric that uses one of the SID ranges that “PSM_MPI” uses.
Second, the Distributed SA handles overlaps by taking advantage of the fact that Virtual Fabrics have unique numeric indexes. These indexes are assigned by the QLogic Fabric Manager in the order in which the Virtual Fabrics appear in the configuration file. These indexes can be seen by using the iba_saquery -o vfinfo command.
The SIDs identify applications which will use the Distributed SA to determine their path records. The default configuration for the Distributed SA includes all the SIDs defined in the default QLogic Fabric Manager configuration for use by MPI.
Generally, this will produce too much information for normal use. (Includes Dbg=5) Dbg=7: Debugging This should only be turned on at the request of QLogic Support. This will generate so much information that system operation will be impacted. (Includes Dbg=6) Other Settings The remaining configuration settings for the Distributed SA are generally only useful in special circumstances and are not needed in normal operation.
Changing the MTU Size
The Maximum Transfer Unit (MTU) size enabled by the IB HCA and set by the driver is 4KB. To see the current MTU size, and the maximum supported by the adapter, type the command:
$ ibv_devinfo
If the switches are set at 2K MTU size, then the HCA will automatically use this as...
This should be executed on every switch and both hemispheres of the 9240s. For the 12000 switches, refer to the QLogic FastFabric User Guide for externally managed switches, and to the QLogic FastFabric CLI Reference Guide for the internally managed switches.
Start, Stop, or Restart ib_qib Driver Restart the software if you install a new QLogic OFED+ Host Software release, change driver options, or do manual testing. QLogic recommends using /etc/init.d/openibd to stop, start, and restart the ib_qib driver, as shown below.
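For reference, the corresponding commands (run as root) take the usual init-script form:
# /etc/init.d/openibd stop
# /etc/init.d/openibd start
# /etc/init.d/openibd restart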
You can check to see if opensmd is configured to autostart by using the following command (as a root user); if there is no output, opensmd is not configured to autostart:
# /sbin/chkconfig --list opensmd | grep -w on
Unload the Driver/Modules Manually...
The /ipathfs/driver_stats file contains general driver statistics. There is one numbered subdirectory per IB device on the system. Each numbered subdirectory contains the following per-device files:
/ipathfs/1/counter_names
/ipathfs/1/counters
3–InfiniBand Cluster Setup and Administration ® Performance Settings and Management Tips Performance Tuning Tuning compute or storage (client or server) nodes with IB HCAs for MPI and verbs performance can be accomplished in several ways: Run the ipath_perf_tuning script in automatic mode (See “Performance Tuning using ipath_perf_tuning Tool”...
3–InfiniBand Cluster Setup and Administration ® Performance Settings and Management Tips If cpuspeed or powersaved are being used as part of implementing Turbo modes to increase CPU speed, then they can be left on. With these daemons left on, IB micro-benchmark performance results may be more variable from run-to-run.
3–InfiniBand Cluster Setup and Administration ® Performance Settings and Management Tips Increasing the number of kernel receive queues allows more CPU cores to be involved in the processing of verbs traffic. This is important when using parallel file systems such as Lustre or IBM's GPFS (General Parallel File System). The module parameter that sets this number is krcvqs.
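As a hedged example, krcvqs can be set in a modprobe configuration file in the same way as the other ib_qib options shown in this section; the value 4 and the file name below are illustrative assumptions only (an appropriate value depends on the number of CPU cores and the verbs workload).
# /etc/modprobe.d/ib_qib.conf (file name assumed; the value 4 is only an example)
options ib_qib krcvqs=4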
CPUs:
options ib_qib pcie_caps=0x51 numa_aware=1
On AMD systems, the pcie_caps=0x51 setting will result in a line of the lspci -vv output associated with the QLogic HCA reading, in the "DevCtl" section: MaxPayload 128 bytes, MaxReadReq 4096 bytes.
AMD Interlagos CPU Systems...
For setting all C-States to 0 where there is no BIOS support:
Add the following kernel boot option: processor.max_cstate=0
Reboot the system.
If the node uses a single-port HCA, and is not a part of a parallel file system cluster, there is no need for performance tuning changes to a modprobe configuration file.
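As a sketch of where the processor.max_cstate=0 boot option described above is placed, assuming a GRUB-based boot loader (the file path, kernel image name, and other arguments are placeholders, not values from this guide):
# In /boot/grub/grub.conf (or menu.lst), append the option to the kernel line, then reboot:
kernel /vmlinuz-<version> ro root=<root-device> processor.max_cstate=0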
3–InfiniBand Cluster Setup and Administration ® Performance Settings and Management Tips High Risk Tuning for Intel Harpertown CPUs For tuning the Harpertown generation of Intel Xeon CPUs that entails a higher risk factor, but includes a bandwidth benefit, the following can be applied: For nodes with Intel Harpertown, Xeon 54xx CPUs, you can add pcie_caps=0x51 and pcie_coalesce=1 to the modprobe.conf file.
3–InfiniBand Cluster Setup and Administration ® Performance Settings and Management Tips Additional Driver Module Parameter Tunings Available Setting driver module parameters on Per-unit or Per-port basis The ib_qib driver allows the setting of different driver parameter values for the individual HCAs and ports. This allows the user to specify different values for each port on a HCA or different values for each HCA in the system.
value is the parameter value for the particular unit or port. The fields in the square brackets are optional; however, either a default or a per-unit/per-port value is required. Example usage: To set the default IB MTU to 1K for all ports on all units:
ibmtu=3
This command lets the driver automatically decide on the allocation behavior and disables this feature on platforms with AMD and Intel Westmere-or-earlier CPUs, while enabling it on newer Intel CPUs. Tunable options:
options ib_qib numa_aware=0
This command disables the NUMA awareness when allocating memory
For example:
# cat /etc/modprobe.d/ib_ipoib.conf
alias ib0 ib_ipoib
alias ib1 ib_ipoib
options ib_ipoib recv_queue_size=512
Performance Tuning using ipath_perf_tuning Tool
The ipath_perf_tuning tool is intended to adjust parameters to the IB QIB driver to optimize the IB and application performance.
Table 3-3. Checks Performed by ipath_perf_tuning Tool
cstates: Check whether (and which) C-States are enabled. C-States should be turned off for best performance.
services: Check whether certain system services (daemons) are enabled.
3–InfiniBand Cluster Setup and Administration ® Performance Settings and Management Tips AUTOMATIC vs. INTERACTIVE MODE The tool performs different functions when running in automatic mode compared to running in the interactive mode. The differences include the node type selection, test execution, and applying the results of the executed tests. Node Type Selection The tool is capable of configuring compute nodes or storage nodes (see Compute...
Table 3-5. Test Execution Modes
services: The test is performed in both modes, but the user is notified of running services only if the tool is in interactive mode.
Adapter and Other Settings The following adapter and other settings can be adjusted for better performance. NOTE For the most current information on performance tuning refer to the QLogic OFED+ Host Software Release Notes. Use an IB MTU of 4096 bytes instead of 2048 bytes, if available, with the QLE7340, and QLE7342.
3–InfiniBand Cluster Setup and Administration ® Performance Settings and Management Tips Remove Unneeded Services The cluster administrator can enhance application performance by minimizing the set of system services running on the compute nodes. Since these are presumed to be specialized computing appliances, they do not need many of the service daemons normally running on a general Linux computer.
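As a hedged illustration, a service can be disabled at boot and stopped immediately with chkconfig and its init script; the cups service below is only an example, and which daemons are safe to disable is site-specific.
# /sbin/chkconfig cups off
# /etc/init.d/cups stop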
See “Erratic Performance” on page D-10 for more information. Host Environment Setup for MPI After the QLogic OFED+ Host software and the GNU (GCC) compilers have been installed on all the nodes, the host environment can be set up for running MPI programs.
3–InfiniBand Cluster Setup and Administration ® Host Environment Setup for MPI “Configuring for ssh Using ssh-agent” on page 3-43 shows how an individual user can accomplish the same thing using ssh-agent The example in this section assumes the following: Both the cluster nodes and the front end system are running the openssh package as distributed in current Linux systems.
On each of the IB node systems, create or edit the file /etc/ssh/ssh_known_hosts. You will need to copy the contents of the file /etc/ssh/ssh_host_dsa_key.pub from ip-fe to this file (as a single line), and then edit that line to insert...
At this point, any end user should be able to log in to the front end system ip-fe and use ssh to log in to any IB node without being prompted for a password or pass phrase.
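A quick way to confirm the setup is sketched below; node01 is a hypothetical IB node name, and neither step should prompt for a password or pass phrase.
$ ssh ip-fe
$ ssh node01 hostname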
IB status, link speed, and PCIe bus width can be checked by running the program ipath_control. Sample usage and output are as follows:
$ ipath_control -iv
QLogic OFED.VERSION yyyy_mm_dd.hh_mm_ss
0: Version: ChipABI VERSION, InfiniPath_QLE7340, InfiniPath1 VERSION, SW Compat 2
0: Serial: RIB0935M31511 LocalBus: PCIe,5000MHz,x8...
3–InfiniBand Cluster Setup and Administration ® Checking Cluster and Software Status iba_opp_query iba_opp_query is used to check the operation of the Distributed SA. You can run it from any node where the Distributed SA is installed and running, to verify that the replica on that node is working correctly.
rate pkt_life 0x10 preference resv2 resv3
ibstatus
Another useful program is ibstatus, which reports on the status of the local HCAs. Sample usage and output are as follows:
$ ibstatus
Infiniband device 'qib0' port 1 status:
default gid: fe80:0000:0000:0000:0011:7500:005a:6ad0...
Running MPI on QLogic Adapters This section provides information on using the Message-Passing Interface (MPI) on QLogic IB HCAs. Examples are provided for setting up the user environment, and for compiling and running MPI programs. Introduction The MPI standard is a message-passing library or collection of routines used in distributed-memory parallel programming.
Follow the instructions in the QLogic Fabric Software Installation Guide for installing Open MPI. Newer versions of Open MPI released after this QLogic OFED+ release will not be supported (refer to the OFED+ Host Software Release Notes for version numbers). QLogic does not recommend installing any newer versions of Open MPI.
(gcc, icc, pgcc, etc.) to determine what options to use for your application. QLogic strongly encourages using the wrapper compilers instead of attempting to link to the Open MPI libraries manually. This allows the specific implementation of Open MPI to change without forcing changes to linker directives in users' Makefiles.
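For example (the source file names are hypothetical):
$ mpicc -o mpi_app mpi_app.c
$ mpif90 -o mpi_app_f90 mpi_app.f90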
4–Running MPI on QLogic Adapters Open MPI The first choice will use verbs by default, and any with the _qlc string will use PSM by default. If you chose openmpi_gcc_qlc-1.4.3, for example, then the following simple mpirun command would run using PSM:...
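The command itself is not reproduced here; a representative invocation, using an mpihosts file as described later in this section and a hypothetical program name, would look like the following.
$ mpirun -np 4 -machinefile mpihosts ./mpi_app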
F77=mpif77 F90=mpif90 CXX=mpicxx In some cases, the configuration process may specify the linker. QLogic recommends that the linker be specified as mpicc, mpif90, etc. in these cases. This specification automatically includes the correct flags and libraries, rather than trying to configure to pass the flags and libraries explicitly. For example:...
4–Running MPI on QLogic Adapters Open MPI The easiest way to use other compilers with any MPI that comes with QLogic OFED+ is to use mpi-selector to change the selected MPI/compiler combination, see “Managing MVAPICH, and MVAPICH2 with the mpi-selector Utility”...
Normally MPI jobs are run with each node program (process) being associated with a dedicated QLogic IB adapter hardware context that is mapped to a CPU. If the number of node programs is greater than the available number of hardware contexts, software context sharing increases the number of node programs that can be run.
Table 4-5. Available Hardware and Software Contexts (columns: Adapter; Available Hardware Contexts, same as the number of supported CPUs; Available Contexts when Software Context Sharing is Enabled; row: QLE7342/QLE7340). The default hardware context/CPU mappings can be changed on the QDR IB Adapters (QLE734x).
IB contexts to satisfy the job requirement and try to give a context to each process. When context sharing is enabled on a system with multiple QLogic IB adapter boards (units) and the IPATH_UNIT environment variable is set, the number of IB contexts made available to MPI jobs is restricted to the number of contexts available on that unit.
PSM environment variables. Setting PSM_SHAREDCONTEXTS_MAX=8 as a clusterwide default would unnecessarily penalize nodes that are dedicated to running single jobs. QLogic recommends using a per-node setting, or some level of coordination with the job scheduler, when setting this environment variable.
PSM contexts. Clean up these processes before restarting the job. Running in Shared Memory Mode Open MPI supports running exclusively in shared memory mode; no QLogic adapter is required for this mode of operation. This mode is used for running applications on a single node rather than on a cluster of nodes.
This is a different behavior than MVAPICH or the no-longer-supported QLogic MPI. In the second format, process_count can be different for each host, and is normally the number of available processors on the node.
The command line option -hostfile can be used as shown in the following command line:
$ mpirun -np n -hostfile mpihosts [other options] program-name
The option -machinefile is a synonym for -hostfile. In this case, if the named file cannot be opened, the MPI job fails.
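A minimal mpihosts sketch for the second format described above, assuming Open MPI's slots= syntax (the host names are hypothetical):
# One host per line; slots gives the per-host process count
node01 slots=4
node02 slots=4
node03 slots=8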
4–Running MPI on QLogic Adapters Open MPI This option spawns n instances of program-name. These instances are called node programs. Generally, mpirun tries to distribute the specified number of processes evenly among the nodes listed in the hostfile. However, if the number of processes exceeds the number of nodes listed in the hostfile, then some nodes will be assigned more than one instance of the program.
4–Running MPI on QLogic Adapters Open MPI NOTE The node that invoked mpirun need not be the same as the node where the MPI_COMM_WORLD rank 0 process resides. Open MPI handles the redirection of mpirun's standard input to the rank 0 process.
4–Running MPI on QLogic Adapters Open MPI Open MPI adds the base-name of the current node’s bindir (the directory where Open MPI’s executables are installed) to the prefix and uses that to set the PATH on the remote node. Similarly, Open MPI adds the base-name of the current node’s libdir (the directory where Open MPI’s libraries are installed) to the...
4–Running MPI on QLogic Adapters Open MPI Setting MCA Parameters The -mca switch allows the passing of parameters to various Modular Component Architecture (MCA) modules. MCA modules have direct impact on MPI programs because they allow tunable parameters to be set at run time (such as which BTL communication device driver to use, what parameters to pass to that BTL, and so on.).
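For example, an MCA parameter can be passed on the mpirun command line or exported through the environment; the parameter shown here (selecting the PSM MTL) is an illustration and assumes a PSM-enabled Open MPI build, and the program name is hypothetical.
$ mpirun -np 4 -machinefile mpihosts -mca mtl psm ./mpi_app
Equivalently, via the environment:
$ export OMPI_MCA_mtl=psm
$ mpirun -np 4 -machinefile mpihosts ./mpi_app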
4–Running MPI on QLogic Adapters Open MPI Environment Variables Table 4-6 contains a summary of the environment variables that are relevant to any PSM including Open MPI. Table 4-7 is more relevant for the MPI programmer or script writer, because these variables are only active after the mpirun command has been issued and while the MPI processes are active.
Table 4-6. Environment Variables Relevant for any PSM (Continued)
IPATH_NO_CPUAFFINITY: When set to 1, the PSM library will skip trying to set processor affinity. This is also skipped if the processor affinity mask is set to a list smaller than the number of processors prior to MPI_Init() being called.
Table 4-6. Environment Variables Relevant for any PSM (Continued)
LD_LIBRARY_PATH: This variable specifies the path to the run-time library. Default: Unset
Table 4-7. Environment Variables Relevant for Open MPI
4–Running MPI on QLogic Adapters Open MPI and Hybrid MPI/OpenMP Applications Open MPI and Hybrid MPI/OpenMP Applications Open MPI supports hybrid MPI/OpenMP applications, provided that MPI routines are called only by the master OpenMP thread. This application is called the funneled thread model.
4–Running MPI on QLogic Adapters Debugging MPI Programs NOTE With Open MPI, and other PSM-enabled MPIs, you will typically want to turn off PSM's CPU affinity controls so that the OpenMP threads spawned by an MPI process are not constrained to stay on the CPU core of that process, causing over-subscription of that CPU.
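A sketch of launching a hybrid job under these guidelines (the process and thread counts and the program name are illustrative):
$ export OMP_NUM_THREADS=4
$ mpirun -np 2 -npernode 1 -x OMP_NUM_THREADS -x IPATH_NO_CPUAFFINITY=1 ./hybrid_app
Here one MPI process runs per node with four OpenMP threads each, and IPATH_NO_CPUAFFINITY=1 keeps PSM from pinning each process to a single core.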
NOTE: The TotalView® debugger can be used with the Open MPI supplied in this release. Consult the TotalView documentation for more information: http://www.open-mpi.org/faq/?category=running#run-with-tv
Open MPI 1.4.3: compiled with GCC, Intel; runs over Verbs. Provides some MPI-2 functionality (one-sided operations and dynamic processes). Available as part of the QLogic download. Can be managed by mpi-selector.
MVAPICH version 1.2: compiled with GCC, Intel; runs over Verbs. Provides MPI-1 functionality. Available as part of the QLogic download.
By default, the MVAPICH, MVAPICH2, and Open MPI are installed in the following directory tree: /usr/mpi/$compiler/$mpi-mpi_version The QLogic-supplied MPIs precompiled with the GCC, PGI, and the Intel compilers will also have -qlc appended after the MPI version number. For example: /usr/mpi/gcc/openmpi-VERSION-qlc If a prefixed installation location is used, /usr is replaced by $prefix.
Open MPI is an open source MPI-2 implementation from the Open MPI Project. Pre-compiled versions of Open MPI version 1.4.3 that run over PSM and are built with the GCC, PGI, and Intel compilers are available with the QLogic download. Details on Open MPI operation are provided in...
MVAPICH2 can be managed with the mpi-selector utility, as described in “Managing MVAPICH, and MVAPICH2 with the mpi-selector Utility” on page 5-5. Compiling MVAPICH2 Applications As with Open MPI, QLogic recommends that you use the included wrapper scripts that invoke the underlying compiler (see Table 5-3).
MVAPICH MVAPICH2 The mpi-selector is an OFED utility that is installed as a part of QLogic OFED+ 1.5.4. Its basic functions include: Listing available MPI implementations Setting a default MPI to use (per user or site wide) ...
5–Using Other MPIs Platform MPI 8 The example shell scripts mpivars.sh and mpivars.csh, for registering with mpi-selector, are provided as part of the mpi-devel RPM in $prefix/share/mpich/mpi-selector-{intel, gnu, pgi} directories. For all non-GNU compilers that are installed outside standard Linux search paths, set up the paths so that compiler binaries and runtime libraries can be resolved.
5–Using Other MPIs Intel MPI MPI_ICMOD_PSM__PSM_PATH = "^" Compiling Platform MPI 8 Applications As with Open MPI, QLogic recommends that you use the included wrapper scripts that invoke the underlying compiler (see Table 5-4). Table 5-4. Platform MPI 8 Wrapper Scripts...
QLogic OFED+ Host Software package. They can be installed either with the QLogic OFED+ Host Software installation or using the rpm files after the QLogic OFED+ Host Software tar file has been unpacked. For example: Using DAPL 1.2.
Using DAPL 2.0:
$ rpm -qa | grep dapl
dapl-devel-static-2.0.19-1
compat-dapl-1.2.14-1
dapl-2.0.19-1
dapl-debuginfo-2.0.19-1
compat-dapl-devel-static-1.2.14-1
dapl-utils-2.0.19-1
compat-dapl-devel-1.2.14-1
dapl-devel-2.0.19-1
Verify that there is a /etc/dat.conf file. It should be installed by the dapl RPM. The file dat.conf contains a list of interface adapters supported by uDAPL service providers.
Substitute bin if using 32-bit. Compiling Intel MPI Applications As with Open MPI, QLogic recommends that you use the included wrapper scripts that invoke the underlying compiler. The default underlying compiler is GCC, including gfortran. Note that there are more compiler drivers (wrapper...
uDAPL 1.2: -genv I_MPI_DEVICE rdma:OpenIB-cma
uDAPL 2.0: -genv I_MPI_DEVICE rdma:ofa-v2-ib
To help with debugging, you can add these options to the Intel mpirun command:
TMI: -genv TMI_DEBUG 1
uDAPL: -genv I_MPI_DEBUG 2
Further Information on Intel MPI
For more information on using Intel MPI, see: http://www.intel.com/
5–Using Other MPIs Improving Performance of Other MPIs Over IB Verbs Improving Performance of Other MPIs Over IB Verbs Performance of MPI applications when using an MPI implementation over IB Verbs can be improved by tuning the IB MTU size. NOTE No manual tuning is necessary for PSM-based MPIs, since the PSM layer determines the largest possible IB MTU for each source/destination path.
SHMEM is packaged with the QLogic IFS or QLogic OFED+ Host software. Every node in the cluster must have a QLogic IB adapter card and be running Red Hat Enterprise Linux (RHEL) 6, 6.1, or 6.2. One or more Message Passing Interface (MPI) implementations are required, and Performance Scaled Messaging (PSM) support must be enabled within the MPI.
The -qlc suffix denotes that this is the QLogic PSM version.
MVAPICH version 1.2.0 compiled for PSM. This is provided by QLogic IFS and can be found in the following directories:
/usr/mpi/gcc/mvapich-1.2.0-qlc
/usr/mpi/intel/mvapich-1.2.0-qlc
/usr/mpi/pgi/mvapich-1.2.0-qlc
The -qlc suffix denotes that this is the QLogic PSM version.
By default QLogic SHMEM is installed with a prefix of /usr/shmem/qlogic into the following directory structure:
/usr/shmem/qlogic
/usr/shmem/qlogic/bin
/usr/shmem/qlogic/bin/mvapich
/usr/shmem/qlogic/bin/mvapich2
/usr/shmem/qlogic/bin/openmpi
/usr/shmem/qlogic/lib64
/usr/shmem/qlogic/lib64/mvapich
/usr/shmem/qlogic/lib64/mvapich2
/usr/shmem/qlogic/lib64/openmpi
/usr/shmem/qlogic/include
QLogic recommends that /usr/shmem/qlogic/bin is added onto your $PATH.
SHMEM library. The shmemcc script automatically determines the correct directories by finding them relative to its own location. The standard directory layout of the QLogic SHMEM software is assumed. The default C compiler is gcc, and can be overridden by specifying a compiler with the $SHMEM_CC environment variable.
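For example (the source file name is hypothetical):
$ shmemcc -o shmem_hello shmem_hello.c
To use a different underlying compiler for this one compile:
$ SHMEM_CC=icc shmemcc -o shmem_hello shmem_hello.c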
There is no need to couple the application binary to a particular MPI, and these symbols will be correctly resolved at run-time. The advantage of this approach is that SHMEM application binaries will be portable across different implementations of the QLogic SHMEM library, including portability over different underlying MPIs. Running SHMEM Programs Using shmemrun The shmemrun script is a wrapper script for running SHMEM programs using mpirun.
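A hypothetical invocation is sketched below; it assumes that shmemrun accepts mpirun-style -np and -machinefile options and passes them through, as the wrapper description implies, and the program name is illustrative.
$ shmemrun -np 16 -machinefile mpihosts ./shmem_hello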
The libraries can be found at: $SHMEM_DIR/lib64/$MPI Where $SHMEM_DIR denotes the top-level directory of the SHMEM installation, typically /usr/shmem/qlogic, and $MPI is your choice of MPI (one of mvapich, mvapich2, or openmpi). Additionally, the PSM receive thread and back-trace must be disabled using the...
These binaries are portable across all MPI implementations supported by QLogic SHMEM. This is true of the get/put micro-benchmarks provided by QLogic SHMEM. The desired MPI can be selected at run time simply by placing the desired mpirun on $PATH, or by using the $SHMEM_MPIRUN environment variable.
MPI implementation. The slurm web pages describe 3 approaches. Please refer to points 1, 2 and 3 on the following web-page: https://computing.llnl.gov/linux/slurm/mpi_guide.html Below are various options for integration of the QLogic SHMEM and slurm. Full Integration This approach fully integrates QLogic SHMEM start-up into slurm and is available when running over MVAPICH2.
6–SHMEM Description and Configuration Sizing Global Shared Memory The salloc allocates 16 nodes and runs one copy of shmemrun on the first allocated node which then creates the SHMEM processes. shmemrun invokes mpirun, and mpirun determines the correct set of hosts and required number of processes based on the slurm allocation that it is running inside of.
$SHMEM_SHMALLOC_INIT_SIZE can also be changed to pre-allocate more memory up front rather than dynamically. By default QLogic SHMEM will use the same base address for the symmetric heap across all PEs in the job. This address can be changed using the $SHMEM_SHMALLOC_BASE_ADDR environment variable.
SHMEM one-sided operations. Passive progress means that progress on SHMEM one-sided operations can occur without the application needing to call into SHMEM. Active progress is the default mode of operation for QLogic SHMEM. Passive progress can be selected using an environment variable where required.
SHMEM, since progress will not occur and the program will hang. Instead, SHMEM applications should use one of the wait synchronization primitives provided by SHMEM. In active progress mode QLogic SHMEM will achieve full performance. Passive Progress...
16KB by default. Active versus Passive Progress It is expected that most applications will be run with QLogic SHMEM's active progress mode since this gives full performance. The passive progress mode will typically be used in the following circumstances: ...
Table 6-1. SHMEM Run Time Library Environment Variables (Continued)
$SHMEM_SHMALLOC_CHECK: Shared memory consistency checks; set to 0 to disable and 1 to enable. These are good checks for correctness but degrade the performance of shmalloc() and shfree().
When the timeout value is reached, the mpirun is killed. This variable is intended for testing use. Implementation Behavior Some SHMEM properties are not fully specified by the SHMEM API specification. This section discusses the behavior of the QLogic SHMEM implementation.
Additional properties of the QLogic SHMEM implementation are: The QLogic SHMEM implementation makes no guarantees as to the ordering in which the bytes of a put operation are delivered into the remote memory. It is *not* a safe assumption to poll or read certain bytes of the put destination buffer (for example, the last 8 bytes) to look for a change in value and then infer that the entirety of the put has arrived.
6–SHMEM Description and Configuration Application Programming Interface 8 byte put to a sync location Target side: Wait for the sync location to be written Now it is safe to make observations on all puts prior to fence ...
6–SHMEM Description and Configuration Application Programming Interface Table 6-3. SHMEM Application Programming Interface Calls Operation Calls shmem_short_max_to_all shmem_complexd_sum_to_all complex collectives are not implemented shmem_complexf_sum_to_all complex collectives are not implemented shmem_double_sum_to_all shmem_float_sum_to_all shmem_int_sum_to_all shmem_long_sum_to_all shmem_longdouble_sum_to_all shmem_longlong_sum_to_all shmem_short_sum_to_all shmem_complexd_prod_to_all complex collectives are not implemented shmem_complexf_prod_to_all complex collectives are not implemented shmem_double_prod_to_all...
6–SHMEM Description and Configuration Application Programming Interface Table 6-3. SHMEM Application Programming Interface Calls Operation Calls shmem_clear_lock shmem_test_lock Events clear_event set_event wait_event test_event General Operations globalexit (for compatibility) allows any process to abort the job shmem_finalize call to terminate the SHMEM library shmem_pe_accessible tests PE for accessibility shmem_addr_accessible...
SHMEM performance within a single node. The micro-benchmarks have the command line options shown in Table 6-4.
Table 6-4. QLogic SHMEM micro-benchmarks options
-a: log2 of desired alignment for buffers (default = 12)
Usage: shmem-rand [options] [list of message sizes]. Message sizes are specified in bytes (default = 8). Options: see Table 6-5.
Table 6-5. QLogic SHMEM random access benchmark options
Use automatic (NULL) handles for NB ops (default: explicit handles)
Table 6-5. QLogic SHMEM random access benchmark options (continued)
-o OP: choose OP from get, getnb, put, putnb
For blocking puts, no quiet every window (this is the default)
For blocking puts, use quiet every window
Table 6-6. QLogic SHMEM all-to-all benchmark options
Enable communication to local ranks (including self)
-m INTEGER[K]: memory size in MB (default = 8MB), or in KB with a K suffix
Use non-pipelined mode for NB ops (default: pipelined)
(vFabric) integration, allowing users to specify IB Service Level (SL) and Partition Key (PKey), or to provide a configured Service ID (SID) to target a vFabric. Support for using IB path record queries to the QLogic Fabric Manager during connection setup is also available, enabling alternative switch topologies such as Mesh/Torus.
PSM. Sixteen unique Service IDs have been allocated for PSM-enabled MPI vFabrics to ease their testing; however, any Service ID can be used. Refer to the QLogic Fabric Manager User Guide on how to configure vFabrics.
PSM_IB_SERVICE_ID=SID # Service ID to use
SL2VL Mapping from the Fabric Manager
PSM is able to use the SL2VL table as programmed by the QLogic Fabric Manager. Prior releases required manual specification of the SL2VL mapping via an environment variable.
iba_saquery can be used to get the SL2VL mapping for any given port; however, QLogic 7300 series adapters export the SL2VL mapping via sysfs files. These files are used by PSM to implement the SL2VL tables automatically. The SL2VL tables are per port and available under /sys/class/infiniband/hca name/ports/port #/sl2vl.
Dispersive Routing InfiniBand® uses deterministic routing that is keyed from the Destination LID (DLID) of a port. The Fabric Manager programs the forwarding tables in a switch to determine the egress port a packet takes based on the DLID. Deterministic routing can create hotspots even in full bisection bandwidth (FBB) fabrics for certain communication patterns if the communicating node pairs map onto a common upstream link, based on the forwarding tables.
8–Dispersive Routing Internally, PSM utilizes dispersive routing differently for small and large messages. Large messages are any messages greater-than or equal-to 64K. For large messages, the message is split into message fragments of 128K by default (called a window). Each of these message windows is sprayed across a distinct path between ports.
8–Dispersive Routing Static_Dest: The path selection is based on the CPU index of the destination process. Multiple paths can be used if data transfer is to different remote processes within a node. If multiple processes from Node A send a message to a single process on Node B only one path will be used across all processes.
 A boot server or HTTP server (can be the same as the DHCP server)
 A node to be booted; use a QLE7340 or QLE7342 adapter for the node.
The following software is included with the QLogic OFED+ installation software package:
 gPXE boot image
...
Required Steps
Download a copy of the gPXE image. Located at:
The executable to flash the EXPROM on the QLogic IB adapters is located at: /usr/sbin/ipath_exprom
The gPXE driver for QLE7300 series IB adapters (the EXPROM image) is located at: /usr/share/infinipath/gPXE/iba7322.rom...
DHCP server runs on a machine that supports IP over IB. NOTE Prior to installing DHCP, make sure that QLogic OFED+ is already installed on your DHCP server. Download and install the latest DHCP server from www.isc.org.
9–gPXE Preparing the DHCP Server in Linux Configuring DHCP From the client host, find the GUID of the HCA by using p1info or look at the GUID label on the IB adapter. Turn the GUID into a MAC address and specify the port of the IB adapter that is going to be used at the end, using b0 for port0 or b1 for port1.
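A hedged sketch of the resulting dhcpd.conf host entry follows; the host name, hardware address, IP address, and boot file name are placeholders only, with the hardware address standing in for the MAC-style value derived from the GUID and port suffix as described above.
host ib-node01 {
  # MAC-style address derived from the HCA GUID plus the b0/b1 port suffix (placeholder)
  hardware ethernet 00:11:75:00:00:79:a3:b0;
  fixed-address 10.252.252.100;
  filename "http://10.252.252.1/images/uniboot/uniboot.php";
}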
NOTE The dhcpd and apache configuration files referenced in this example are included as examples, and are not part of the QLogic OFED+ installed software. Your site boot servers may be different, see their documentation for equivalent information.
Install Apache. Create an images.conf file and a kernels.conf file and place them in the /etc/httpd/conf.d directory. This sets up aliases for /images and /kernels and tells Apache where to find them:
/images — http://10.252.252.1/images/
/kernels — http://10.252.252.1/kernels/
The following is an example of the images.conf file:
Alias /images /vault/images
<Directory "/vault/images">...
To add an IB driver into the initrd file, the IB modules need to be copied to the diskless image. The host machine needs to be pre-installed with the QLogic OFED+ Host Software that is appropriate for the kernel version the diskless image will run. The QLogic OFED+ Host Software is available for download from http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/default.aspx...
9–gPXE Netbooting Over IB The infinipath rpm will install the file /usr/share/infinipath/gPXE/gpxe-qib-modify-initrd with contents similar to the following example. You can either run the script to generate a new initrd image, or use it as an example, and customize as appropriate for your site. # This assumes you will use the currently running version of linux, and # that you are starting from a fully configured machine of...
9–gPXE Netbooting Over IB # extract previous contents gunzip -dc ../initrd-ib-${kern}.img | cpio --quiet -id # add infiniband modules mkdir -p lib/ib find /lib/modules/${kern}/updates -type f | \ egrep '(iw_cm|ib_(mad|addr|core|sa|cm|uverbs|ucm|umad|ipoib|qib ).ko|rdma_|ipoib_helper)' | \ xargs -I '{}' cp -a '{}' lib/ib # Some distros have ipoib_helper, others don't require it if [ -e lib/ib/ipoib_helper ];...
9–gPXE Netbooting Over IB IFS=' ' v6cmd='/sbin/insmod /lib/'${xfrm}'.ko '"$v6cmd" crypto=$(modinfo -F depends $xfrm) if [ ${crypto} ]; then cp $(find /lib/modules/$(uname -r) -name ${crypto}.ko) lib IFS=' ' v6cmd='/sbin/insmod /lib/'${crypto}'.ko '"$v6cmd" # we need insmod to load the modules; if not present it, copy it mkdir -p sbin grep -q insmod ../Orig-listing || cp /sbin/insmod sbin...
9–gPXE Netbooting Over IB /sbin/insmod /lib/ib/ib_sa.ko /sbin/insmod /lib/ib/ib_cm.ko /sbin/insmod /lib/ib/ib_uverbs.ko /sbin/insmod /lib/ib/ib_ucm.ko /sbin/insmod /lib/ib/ib_umad.ko /sbin/insmod /lib/ib/iw_cm.ko /sbin/insmod /lib/ib/rdma_cm.ko /sbin/insmod /lib/ib/rdma_ucm.ko $dcacmd /sbin/insmod /lib/ib/ib_qib.ko $helper_cmd /sbin/insmod /lib/ib/ib_ipoib.ko echo "finished loading IB modules" # End of IB module block # first get line number where we append (after last insmod if any, otherwse # at start line=$(egrep -n insmod init | sed -n '$s/:.*//p')
9–gPXE Netbooting Over IB # and show the differences. echo -e '\nChanges in files in initrd image\n' diff Orig-listing New-listing # copy the new initrd to wherever you have configure the dhcp server to look # for it (here we assume it's /images) mkdir -p /images initrd-${kern}.img /images echo -e '\nCompleted initrd for IB'...
9–gPXE Netbooting Over IB The following is an example of a uniboot.php file: <? header ( 'Content-type: text/plain' ); function strleft ( $s1, $s2 ) { return substr ( $s1, 0, strpos ( $s1, $s2 ) ); function baseURL() { $s = empty ( $_SERVER["HTTPS"] ) ? '' : ( $_SERVER["HTTPS"] == "on"...
9–gPXE HTTP Boot Setup This is the kernel that will boot. This file can be copied from any machine that has RHEL5.3 installed. Start httpd Steps on the gPXE Client Ensure that the HCA is listed as the first bootable device in the BIOS. Reboot the test node(s) and enter the BIOS boot setup.
Create an images.conf file and a kernels.conf file using the examples in Step 2, Boot Server Setup, and place them in the /etc/httpd/conf.d directory.
Edit the /etc/dhcpd.conf file to boot the clients using HTTP:
filename "http://172.26.32.9/images/uniboot/uniboot.php";
Restart the DHCP server.
Start HTTP if it is not already running: /etc/init.d/httpd start
They are not representations of actual IB performance characteristics. For additional MPI sample applications refer to Section 5 of the QLogic FastFabric Command Line Interface Reference Guide. Benchmark 1: Measuring MPI Latency Between...
The program osu_latency, from Ohio State University, measures the latency for a range of message sizes from 0 bytes to 4 megabytes. It uses a ping-pong method, where the rank zero process initiates a series of sends and the rank one process echoes them back, using the blocking MPI send and receive calls for all operations.
A–Benchmark Programs Benchmark 1: Measuring MPI Latency Between Two Nodes -H (or --hosts) allows the specification of the host list on the command line instead of using a host file (with the -m or -machinefile option). Since only two hosts are listed, this implies that two host programs will be started (as if -np 2 were specified).
A–Benchmark Programs Benchmark 2: Measuring MPI Bandwidth Between Two Nodes Benchmark 2: Measuring MPI Bandwidth Between Two Nodes The osu_bw benchmark measures the maximum rate that you can pump data between two nodes. This benchmark also uses a ping-pong mechanism, similar to the osu_latency code, except in this case, the originator of the messages pumps a number of them (64 in the installed version) in succession using the non-blocking MPI_I send function, while the receiving node consumes them as...
A–Benchmark Programs Benchmark 3: Messaging Rate Microbenchmarks Benchmark 3: Messaging Rate Microbenchmarks OSU Multiple Bandwidth / Message Rate test (osu_mbw_mr) osu_mbw_mr is a multi-pair bandwidth and message rate test that evaluates the aggregate uni-directional bandwidth and message rate between multiple pairs of processes.
An Enhanced Multiple Bandwidth / Message Rate test (mpi_multibw) mpi_multibw is a version of osu_mbw_mr which has been enhanced by QLogic to, optionally, run in a bidirectional mode and to scale better on the larger multi-core nodes available today. This benchmark is a modified form of the OSU Network-Based Computing Lab’s osu_mbw_mr benchmark (as shown in the...
N/2 is dynamically calculated at the end of the run. You can use the -b option to get bidirectional message rate and bandwidth results. Scalability has been improved for larger core-count nodes.
The following is an example output when running mpi_multibw:
$ mpirun -H host1,host2 -npernode 12 /usr/mpi/gcc/openmpi-1.4.3-qlc/tests/qlogic/mpi_multibw
# PathScale Modified OSU MPI Bandwidth Test (OSU Version 2.2, PathScale $Revision: 1.1.2.1 $)
# Running on 12 procs per node (uni-directional traffic for...
The following is an example output when running with the bidirectional option (-b): $ mpirun -H host1,host2 -np 24 /usr/mpi/gcc/openmpi-1.4.3-qlc/tests/qlogic/mpi_multibw -b # PathScale Modified OSU MPI Bandwidth Test (OSU Version 2.2, PathScale $Revision: 1.1.2.1 $) # Running on 12 procs per node (bi-directional traffic for...
Note the higher peak bi-directional messaging rate of 34.6 million messages per second at the 1 byte size, compared to 25 million messages per second when run unidirectionally.
SRP Upper Layer Protocol (ULP). SRP storage can be treated as another device. In this release, two versions of SRP are available: QLogic SRP and OFED SRP. QLogic SRP is available as part of the QLogic OFED Host Software, QLogic IFS, Rocks Roll, and Platform PCM downloads.
OFED modules. The Linux kernel will not allow those OFED modules to be unloaded. QLogic SRP Configuration The QLogic SRP is installed as part of the QLogic OFED+ Host Software or the QLogic IFS. The following sections provide procedures to set up and configure the QLogic SRP.
B–SRP Configuration QLogic SRP Configuration Stopping, Starting and Restarting the SRP Driver To stop the qlgc_srp driver, use the following command: /etc/init.d/qlgc_srp stop To start the qlgc_srp driver, use the following command: /etc/init.d/qlgc_srp start To restart the qlgc_srp driver, use the following command: /etc/init.d/qlgc_srp restart...
B–SRP Configuration QLogic SRP Configuration By the port GUID of the IOC, or By the IOC profile string that is created by the VIO device (i.e., a string containing the chassis GUID, the slot number and the IOC number). FVIC creates the device in this manner, other devices have their own naming method.
The system returns output similar to the following:
st187:~/qlgc-srp-1_3_0_0_1 # ib_qlgc_srp_query
QLogic Corporation. Virtual HBA (SRP) SCSI Query Application, version 1.3.0.0.1
1 IB Host Channel Adapter present in system.
HCA Card 0 : 0x0002c9020026041c
Port 1 GUID
B–SRP Configuration QLogic SRP Configuration 0x0000494353535250 service 3 : name SRP.T10:0000000000000004 id 0x0000494353535250 Target Path(s): HCA 0 Port 1 0x0002c9020026041d -> Target Port GID 0xfe8000000000000000066a21dd000021 HCA 0 Port 2 0x0002c9020026041e -> Target Port GID 0xfe8000000000000000066a21dd000021 SRP IOC Profile : Chassis 0x00066A0050000135, Slot 5, IOC 1...
# qlgc_srp.cfg file generated by /usr/sbin/ib_qlgc_srp_build_cfg, version 1.3.0.0.17, on Mon Aug 25 13:42:16 EDT 2008
#Found QLogic OFED SRP
registerAdaptersInOrder: ON
=============================================================
# IOC Name: BC2FC in Chassis 0x0000000000000000, Slot 6, Ioc 1
# IOC GUID: 0x00066a01e0000149
SRP IU SIZE : 320
service 0 : name SRP.T10:0000000000000001 id
noverify: 0
description: "SRP Virtual HBA 0"
The ib_qlgc_srp_build_cfg command creates a configuration file based on discovered target devices. By default, the information is sent to stdout. In order to create a configuration file, output should be redirected to a disk file. Enter -h for a list and description of the option flags.
B–SRP Configuration QLogic SRP Configuration NOTE When using this method, if the port GUIDs are changed, they must also be changed in the configuration file. Specifying a SRP Target Port The SRP target can be specified in two different ways. To connect to a particular SRP target no matter where it is in the fabric, use the first method (By IOCGUID).
B–SRP Configuration QLogic SRP Configuration Specifying a SRP Target Port of a Session by IOCGUID The following example specifies a target by IOC GUID: session begin card: 0 port: 1 targetIOCGuid: 0x00066A013800016c #IOC GUID of the InfiniFibre port 0x00066a10dd000046 ...
Restarting the SRP Module
For changes to take effect, including changes to the SRP map on the VIO card, SRP will need to be restarted. To restart the qlgc_srp driver, use the following command:
/etc/init.d/qlgc_srp restart
Configuring an Adapter with Multiple Sessions
Each adapter can have an unlimited number of sessions attached to it.
When the qlgc_srp module encounters an adapter command, that adapter is assigned all previously defined sessions (that have not been assigned to other adapters). This makes it easy to configure a system for multiple SRP adapters.
adapter
begin
description: "Test Device 1"
Configuring Fibre Channel Failover
Fibre Channel failover is essentially failing over from one session in an adapter to another session in the same adapter. Following is a list of the different types of failover scenarios:

B–SRP Configuration QLogic SRP Configuration Failover Configuration File 1: Failing over from one SRP Initiator port to another In this failover configuration file, the first session (using adapter Port 1) is used to reach the SRP Target Port. If a problem is detected in this session (e.g., the IB cable on port 1 of the adapter is pulled) then the 2nd session (using adapter Port 2) will be used.
adapterIODepth: 1000
lunIODepth: 16
adapterMaxIO: 128
adapterMaxLUNs: 512
adapterNoConnectTimeout: 60
adapterDeviceRequestTimeout: 2
# set to 1 if you want round robin load balancing
roundrobinmode: 0
# set to 1 if you do not want target connectivity...
B–SRP Configuration QLogic SRP Configuration On the VIO hardware side, the following needs to be ensured: The target device is discovered and configured for each of the ports that is involved in the failover. The SRP Initiator is discovered and configured once for each different initiatorExtension.
B–SRP Configuration QLogic SRP Configuration On the VIO hardware side, the following need to be ensured on each FVIC involved in the failover: The target device is discovered and configured through the appropriate FC port The SRP Initiator is discovered and configured once for the proper initiatorExtension.
B–SRP Configuration QLogic SRP Configuration The target device is discovered and configured through the appropriate FC port The SRP Initiator is discovered and configured once for the proper initiatorExtension. The SRP map created for the initiator connects to the same target...
B–SRP Configuration QLogic SRP Configuration 2 Adapter Ports and 2 Ports on a Single VIO Module In this example, traffic is load balanced between adapter Port 2/VIO hardware Port 1 and adapter Port1/VIO hardware Port 1. If one of the sessions goes down (due to an IB cable failure or an FC cable failure), all traffic will begin using the other session.
B–SRP Configuration QLogic SRP Configuration Using the roundrobinmode Parameter In this example, the two sessions use different VIO hardware cards as well as different adapter ports. Traffic will be load-balanced between the two sessions. If there is a failure in one of the sessions (e.g., one of the VIO hardware cards is rebooted) traffic will begin using the other session.
roundrobinmode: 0
# set to 1 if you do not want target connectivity verification
noverify: 0
description: "SRP Virtual HBA 0"
Note the correlation between the output of ib_qlgc_srp_query and qlgc_srp.cfg:
Target Path(s):
HCA 0 Port 1 0x0002c9020026041d -> Target Port GID 0xfe8000000000000000066a11dd000021
HCA 0 Port 2 0x0002c9020026041e ->...
B–SRP Configuration OFED SRP Configuration Additional Details All LUNs found are reported to the Linux SCSI mid-layer. Linux may need the (2.4 kernels) or (2.6 kernels) max_scsi_luns max_luns parameter configured in scsi_mod Troubleshooting For troubleshooting information, refer to “Troubleshooting SRP Issues”...
Choose the device you want to use, and run the command again with the -c option (as a root user):
# ibsrpdm -c
id_ext=200400A0B8114527,ioc_guid=0002c90200402c04,dgid=fe800000000000000002c90200402c05,pkey=ffff,service_id=200400a0b8114527
id_ext=200500A0B8114527,ioc_guid=0002c90200402c0c,dgid=fe800000000000000002c90200402c0d,pkey=ffff,service_id=200500a0b8114527
id_ext=21000001ff040bf6,ioc_guid=21000001ff040bf6,dgid=fe8000000000000021000001ff040bf6,pkey=ffff,service_id=f60b04ff01000021
Find the result that corresponds to the target you want, and echo it into the file:
# /sbin/fuser -v /dev/ipath* lsof can also take the same form: # lsof /dev/ipath* The following command terminates all processes using the QLogic interconnect: # /sbin/fuser -k /dev/ipath For more information, see the man pages for fuser(1) and lsof(8). NOTE...
C–Integration with a Batch Queuing System
Clean-up PSM Shared Memory Files
#!/bin/sh
files=`/bin/ls /dev/shm/psm_shm.* 2> /dev/null`;
for file in $files;
do
    /sbin/fuser $file > /dev/null 2>&1;
    if [ $? -ne 0 ]; then
        /bin/rm $file > /dev/null 2>&1;
    fi;
done;
When the system is idle, administrators can remove all of the shared memory files, including stale files, by using the following command:
# rm -rf /dev/shm/psm_shm.*
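One hedged way to hook the clean-up script into a batch system, sketched for a Torque/PBS installation; the script name cleanup_psm_shm.sh is hypothetical and the spool path varies by installation:
# cp cleanup_psm_shm.sh /var/spool/torque/mom_priv/epilogue
# chown root:root /var/spool/torque/mom_priv/epilogue
# chmod 500 /var/spool/torque/mom_priv/epilogue
Torque requires the epilogue to be owned by root and executable only by root; other schedulers have equivalent job-exit hooks.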
D–Troubleshooting
This appendix covers System Administration Troubleshooting, Performance Issues, and Open MPI Troubleshooting. Troubleshooting information for hardware installation is found in the QLogic InfiniBand® Adapter Hardware Installation Guide, and software installation information is found in the QLogic InfiniBand® Fabric Software Installation Guide.
Table D-1. LED Link and Data Indicators (Continued)
LED States: Green ON, Amber OFF
Indication: Signal detected and the physical link is up. Ready to talk to the SM to bring the link fully up. If this state persists, the SM may be missing or the link may not be configured.
If you upgrade the kernel, you must reboot and then rebuild or reinstall the InfiniPath kernel modules (drivers). QLogic recommends using the IFS Software Installation TUI to perform this rebuild or reinstall. Refer to the QLogic Fabric Software Installation Guide for more information.
Kernel and Initialization Issues
A zero count in all CPU columns means that no InfiniPath interrupts have been delivered to the processor (a quick check is sketched below). The possible causes of this problem are:
Booting the Linux kernel with ACPI disabled on either the boot command line or in the BIOS configuration
...
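A quick way to inspect those per-CPU interrupt counts (a sketch; the pattern simply matches the driver names used elsewhere in this guide):
$ egrep -i 'ib_qib|infinipath' /proc/interrupts
Each matching row shows one interrupt line for the adapter, with one count column per CPU.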
If the driver loaded, but MPI or other programs are not working, check to see if problems were detected during the driver and QLogic hardware initialization with the command:
$ dmesg | grep -i ib_qib
This command may generate more than one screen of output.
Managers) and InfiniPath.
Stop InfiniPath Services Before Stopping/Restarting InfiniPath
The following InfiniPath services must be stopped before stopping/starting/restarting InfiniPath:
QLogic Fabric Manager
OpenSM
Here is a sample command and the corresponding error messages:
# /etc/init.d/openibd stop
Unloading infiniband modules: sdp cm umad uverbs ipoib sa ipath mad core
FATAL: Module ib_umad is in use.
OpenFabrics and InfiniPath Issues
Manual Shutdown or Restart May Hang if NFS in Use
If you are using NFS over IPoIB and use the manual /etc/init.d/openibd stop (or restart) command, the shutdown process may silently hang on the fuser command contained within the script. This is because fuser cannot traverse down the tree from the mount point once the mount point has disappeared.
/etc/sysconfig/network-scripts/ifcfg-eth2 (for RHEL)
/etc/sysconfig/network/ifcfg-eth2 (for SLES)
QLogic recommends using the IP over IB protocol (IPoIB-CM), included in the standard OpenFabrics software releases, as a replacement for ipath_ether.
System Administration Troubleshooting
The following sections provide details on locating problems related to system administration.
See your switch vendor for more information. QLogic recommends using FastFabric to help diagnose this problem. If FastFabric is not installed in the fabric, there are two diagnostic tools, ibhosts and ibtracert, that may also be helpful. The tool ibhosts lists all the IB nodes that the subnet manager recognizes.
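A hedged sketch of using the two diagnostic tools; the LIDs are placeholders, and ibtracert also accepts GUIDs with the -G option:
# ibhosts
# ibtracert 0x31 0x75
ibhosts prints one line per IB host node known to the subnet manager; ibtracert prints the switch-by-switch path between the source and destination.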
Performance Issues
Erratic Performance
Sometimes erratic performance is seen on applications that use interrupts. An example is inconsistent SDP latency when running a program such as netperf. This may be seen on AMD-based systems using the QLE7240 or QLE7280 adapters.
This method is not the first choice because, on some systems, there may be two rows of ib_qib output, and you will not know which one of the two numbers to choose. However, if you cannot find $my_irq listed under /proc/irq (Method 1), this type of system most likely has only one line for ib_qib listed in /proc/interrupts, so you can use Method 2.
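A minimal sketch of Method 2, assuming the intent described above: read the IRQ number from the single ib_qib line in /proc/interrupts and then pin it by writing a CPU mask to smp_affinity (the mask value 1, meaning CPU 0, is only an example):
# my_irq=$(awk '/ib_qib/ {sub(/:$/, "", $1); print $1; exit}' /proc/interrupts)
# echo 1 > /proc/irq/$my_irq/smp_affinity
If irqbalance is running, it may later override this setting, so it may need to be disabled for the pinning to persist.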
(for example, not FE80000000000000) based on the Fabric Manager configuration file. The config_generate tool for the Fabric Manager will help generate such files. Refer to the QLogic Fabric Manager User Guide for more information about the config_generate tool.
E–ULP Troubleshooting
Troubleshooting VirtualNIC and VIO Hardware Issues
To verify that an IB host can access an Ethernet system through the EVIC, issue a ping command to the Ethernet system from the IB host. Make certain that the route to the Ethernet system is using the VIO hardware by using the Linux route command on the IB host, then verify that the route to the subnet is using one of the virtual Ethernet interfaces (i.e., an EIOC).
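A hedged example of those two checks (the Ethernet host address is a placeholder; substitute a real host on the Ethernet side):
# ping -c 3 192.168.1.10
# route -n | grep eioc
The route output should show the destination subnet reached through one of the eiocX interfaces rather than a conventional Ethernet port.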
Verify that the proper VirtualNIC driver is running
Check that a VirtualNIC driver is running by issuing an lsmod command on the IB host. Make sure that qlgc_vnic is displayed in the list of modules. Following is an example:
st186:~ # lsmod
Module...
Verifying that the host can communicate with the I/O Controllers (IOCs) of the VIO hardware
To display the Ethernet VIO cards that the host can see and communicate with, issue the command ib_qlgc_vnic_query. The system returns information similar to the following:
IO Unit Info:
port LID:...
Chassis 0x00066A00010003F2, Slot 1, IOC 3
service entries: 2
service[ 0]: 1000066a00000003 / InfiniNIC.InfiniConSys.Control:03
service[ 1]: 1000066a00000103 / InfiniNIC.InfiniConSys.Data:03
When ib_qlgc_vnic_query is run with the -e option, it reports the IOCGUID information. With the -s option, it reports the IOCSTRING information for the Virtual I/O hardware IOCs present on the fabric.
If the host cannot see the applicable IOCs, there are two things to check. First, verify that the adapter port specified in the eioc definition of the /etc/infiniband/qlgc_vnic.cfg file is active. This is done using the ibv_devinfo command on the host, then checking the value of the state field.
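For example, the output format matches the ibv_devinfo sample shown later in this appendix:
# ibv_devinfo -i 1 | grep -i state
        state:          PORT_ACTIVE (4)
A value other than PORT_ACTIVE (4) indicates the port is not up.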
Another reason why the host might not be able to see the necessary IOCs is that the subnet manager has gone down. Issue an iba_saquery command to make certain that the response shows all of the nodes in the fabric. If an error is returned and the adapter is physically connected to the fabric, then the subnet manager has gone down, and this situation needs to be corrected.
DEVICE=eioc1
BOOTPROTO=static
IPADDR=172.26.48.132
BROADCAST=172.26.63.130
NETMASK=255.255.240.0
NETWORK=172.26.48.0
ONBOOT=yes
TYPE=Ethernet
Example of ifcfg-eiocx setup for SuSE and SLES systems:
BOOTPROTO='static'
IPADDR='172.26.48.130'
BROADCAST='172.26.63.255'
NETMASK='255.255.240.0'
NETWORK='172.26.48.0'
STARTMODE='hotplug'
TYPE='Ethernet'
Verify the physical connection between the VIO hardware and the Ethernet network
If the interface is displayed in an ifconfig and a ping between the IB host and the Ethernet host is still unsuccessful, verify that the VIO hardware Ethernet ports...
There are up to 6 IOC GUIDs on each VIO hardware module (6 for the IB/Ethernet Bridge Module, 2 for the EVIC), one for each Ethernet port. If a VIO hardware module can be seen from a host, the ib_qlgc_vnic_query -s output displays information similar to:
EVIC in Chassis 0x00066a000300012a, Slot 19, Ioc 1
EVIC in Chassis 0x00066a000300012a, Slot 19, Ioc 2...
Troubleshooting SRP Issues
ib_qlgc_srp_stats showing session in disconnected state
Problem: If the session is part of a multi-session adapter, ib_qlgc_srp_stats will show it to be in the disconnected state. For example:
SCSI Host #        : 17 | Mode ROUNDROBIN
Trgt Adapter Depth : 1000...
Solution: Perhaps an interswitch cable has been disconnected, or the VIO hardware is offline, or the Chassis/Slot does not contain a VIO hardware card. Instead of looking at this file, use the ib_qlgc_srp_query command to verify that the desired adapter port is in the active state.
Solution 1: The host initiator has not been configured as an SRP initiator on the VIO hardware SRP Initiator Discovery screen. Via Chassis Viewer, bring up the SRP Initiator Discovery screen and do one of the following:
Click 'Add New' to add a wildcarded entry with the initiator extension set to match what is in the session entry in the qlgc_srp.cfg file, or
Click the Start button to discover the adapter port GUID, then click 'Configure' on the row containing the adapter port GUID and give the entry...
Solution: This indicates a problem in the path between the VIO hardware and the target storage device. After an SRP host has connected to the VIO hardware successfully, the host sends a "Test Unit Ready" command to the storage device.
Which port does a port GUID refer to?
Solution: A QLogic HCA port GUID is of the form 00066appa0iiiiii, where pp gives the port number (0-relative) and iiiiii gives the individual ID number of the adapter. Thus, 00066a00a0iiiiii is the port GUID of the first port of the adapter, and 00066a01a0iiiiii is the port GUID of the second port of the adapter.
In a failover configuration, if everything is configured correctly, one session will be Active and the rest will be Connected. The transition of a session from Connected to Active will not be attempted until that session needs to become Active, due to the failure of the previously Active session.
The system displays information similar to the following:
st106:~ # ibv_devinfo -i 1
hca_id: mthca0
        fw_ver:             5.1.9301
        node_guid:          0006:6a00:9800:6c9f
        sys_image_guid:     0006:6a00:9800:6c9f
        vendor_id:          0x066a
        vendor_part_id:     25218
        hw_ver:             0xA0
        board_id:           SS_0000000005
        phys_port_cnt:      2
        port:
                state:          PORT_ACTIVE (4)
                max_mtu:        2048 (4)
                active_mtu:     2048 (4)
                sm_lid:         71...
Need to determine the SRP driver version.
Solution: To determine the SRP driver version number, enter the command modinfo -d qlgc-srp, which returns information similar to the following:
st159:~ # modinfo -d qlgc-srp
QLogic Corp. Virtual HBA (SRP) SCSI Driver, version 1.0.0.0.3
F–Write Combining
Introduction
Write combining improves write bandwidth to the QLogic driver by writing multiple words in a single bus transaction (typically 64 bytes). Write combining applies only to x86_64 systems. The x86 Page Attribute Table (PAT) mechanism allocates Write Combining (WC) mappings for the PIO buffers, and is the default mechanism for WC.
Use the ipath_mtrr Script to Fix MTRR Issues
QLogic also provides a script, ipath_mtrr, which sets the MTRR registers, enabling maximum performance from the InfiniPath driver. This Python script is available as a part of the InfiniPath software download, and is contained in the infinipath* RPM.
The test results will list any problems, if they exist, and provide suggestions on what to do. To fix the MTRR registers, use:
# ipath_mtrr -w
Restart the driver after fixing the registers. This script needs to be run after each system reboot.
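Because the script must be re-run after every reboot, one hedged way to automate it (assuming your distribution executes /etc/rc.local at boot and that ipath_mtrr is on root's PATH):
# echo 'ipath_mtrr -w' >> /etc/rc.local
Alternatively, run it from any site-specific init script that executes after the InfiniPath driver loads.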
Use the following items as a checklist for verifying homogeneity; a quick cross-node check is sketched after this list. A difference in any one of these items in your cluster may cause problems:
Kernels
Distributions
Versions of the QLogic boards
Runtime and build environments
Files from different compilers
Libraries
...
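A minimal sketch of checking the first two items across the cluster, assuming passwordless ssh and a hostsfile listing one node per line (as used by ipath_checkout):
$ for h in $(cat hostsfile); do echo -n "$h: "; ssh $h uname -r; done
$ for h in $(cat hostsfile); do echo -n "$h: "; ssh $h 'cat /etc/*release | head -1'; done
Any node whose output differs from the rest is a candidate for the problems described above.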
iba_hca_rev — Scans the system and reports hardware and firmware information about all the HCAs in the system.
iba_manage_switch — Allows management of externally managed switches (including 12200, 12200-18, and HP BLc QLogic 4X QDR) without the IFS software.
iba_packet_capture — Enables packet capture and subsequent dump to file...
This script gathers the same information contained in boardversion, status_str, and version.
ipath_mtrr — A Python script that sets the MTRR registers.
ipath_pkt_test — Tests the IB link and bandwidth between two QLogic IB adapters, or, using an IB loopback connector, tests within a single QLogic IB adapter...
It is useful for checking for initialization problems. You can check to see if problems were detected during the driver and QLogic hardware initialization with the command:
$ dmesg | egrep -i 'infinipath|qib'
This command may generate more than one screen of output.
G–Commands and Files
Summary and Descriptions of Commands
-S/--sgid GID — Source GID. (Can be in GID ("0x########:0x########") or inet6 format ("##:##:##:##:##:##:##:##"))
-D/--dgid GID — Destination GID. (Can be in GID ("0x########:0x########") or inet6 format ("##:##:##:##:##:##:##:##"))
-k/--pkey pkey — Partition Key
-i/--sid sid — ...
resv2
resv3
Explanation of Sample Output: This is a simple query, specifying the source and destination LIDs and the desired SID. The first half of the output shows the full "query" that will be sent to the Distributed SA.
Examples:
Query by LID and SID:
iba_opp_query -s 0x31 -d 0x75 -i 0x107
iba_opp_query --slid 0x31 --dlid 0x75 --sid 0x107
Queries using octal or decimal numbers:
iba_opp_query --slid 061 --dlid 0165 --sid 0407 (using octal numbers)
iba_opp_query --slid 49 --dlid 113 --sid 263 (using decimal numbers)
iba_hca_rev
This command scans the system and reports hardware and firmware information about all the HCAs in the system. Running iba_hca_rev -v (as a root user) produces output similar to the following when run from a node on the IB fabric:
# iba_hca_rev -v
######################
st2092...
######################
iba_manage_switch (Switch)
Allows management of externally managed switches (including 12200, 12200-18, and HP BLc QLogic 4X QDR) without using the IFS software. It is designed to operate on one switch at a time, taking a mandatory target GUID parameter.
linkwidth (link width supported) – use -i for integer value (1=1X, 2=4X, 3=1X/4X, 4=8X, 5=1X/8X, 6=4X/8X, 7=1X/4X/8X)
vlcreditdist (VL credit distribution) – use -i for integer value (0, 1, 2, 3, or 4)
linkspeed (link speed supported) – ...
Example
iba_manage_switch -t 0x00066a00e3001234 -f QLogic_12000_V1_firmware.7.0.0.0.27.emfw fwUpdate
iba_manage_switch -t 0x00066a00e3001234 reboot
iba_manage_switch -t 0x00066a00e3001234 showFwVersion
iba_manage_switch -t 0x00066a00e3001234 -s i12k1234 setIBNodeDesc
iba_manage_switch -t 0x00066a00e3001234 -C mtucap -i 4 setConfigValue
iba_manage_switch -H
The results are recorded in the iba_manage_switch.res file in the current directory.
alarm – number of seconds for alarm trigger to dump capture and exit
maxblocks – max 64 byte blocks of data to capture in units of Mi (1024*1024)
-v – verbose output
To stop capture and trigger dump, kill with SIGINT (Ctrl-C) or SIGUSR1 (with the kill command).
Following is a sample output for the DDR adapters:
# ibstatus
Infiniband device 'qib0' port 1 status:
        default gid:     fe80:0000:0000:0000:0011:7500:0078:a5d2
        base lid:
        sm lid:
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      InfiniBand
        0x00
        link_layer:
ident
The ident strings are available in ib_qib.ko. Running ident provides driver information similar to the following. For QLogic RPMs on a SLES distribution, it will look like the following example:
ident /lib/modules/OS_version/updates/kernel/drivers/infiniband/hw/qib/ib_qib.ko
/lib/modules/OS_version/updates/kernel/drivers/infiniband/hw/qib/ib_qib.ko:
$Id: QLogic OFED Release x.x.x $...
NOTE
For QLogic RPMs on a RHEL distribution, the drivers folder is in the updates folder instead of the kernels folder, as follows:
/lib/modules/OS_version/updates/drivers/infiniband/hw/qib/ib_qib.ko
If the /lib/modules/OS_version/updates directory is not present, then the driver in use is the one that comes with the core kernel.
NOTE
The hostnames in the nodefile are Ethernet hostnames, not IPv4 addresses.
To create a nodefile, use the ibhosts program. It will generate a list of available nodes that are already connected to the switch.
ipath_checkout performs the following seven tests on the cluster:
Executes the ping command to all nodes to verify that they all are reachable from the front end.
Table G-2. ipath_checkout Options (Continued)
-k, --keep — This option keeps intermediate files that were created while performing tests and compiling reports. Results are saved in a directory created by mktemp and named infinipath_XXXXXX, or in the directory name given to --workdir.
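A hedged usage sketch combining the option above with the run level shown later in this appendix (hostsfile and the work directory are placeholders):
$ ipath_checkout -k --workdir=/tmp/ipath_ck --run=5 hostsfile
The kept intermediate files then land in /tmp/ipath_ck instead of a mktemp-generated infinipath_XXXXXX directory.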
MTRR is used by the InfiniPath driver to enable write combining to the QLogic on-chip transmit buffers. This improves write bandwidth to the QLogic chip by writing multiple words in a single bus transaction (typically 64 bytes). This applies only to x86_64 systems.
Test the IB link and bandwidth between two InfiniPath IB adapters.
Using an IB loopback connector, test the link and bandwidth within a single InfiniPath IB adapter.
The ipath_pkt_test program runs in either ping-pong mode (send a packet, wait for a reply, repeat) or in stream mode (send packets as quickly as possible, receive responses as they come back).
mpirun
mpirun determines whether the program is being run against a QLogic or non-QLogic driver. It is installed from the mpi-frontend RPM. Sample commands and results are shown in the following paragraphs.
This option poisons receive buffers at initialization and after each receive; buffers are pre-initialized with random data so that any parts that are not being correctly updated with received data can be observed later. See the mpi_stress(1) man page for more information.
To check the contents of an installed RPM, use these commands:
$ rpm -qa infinipath\* mpi-\*
$ rpm -q --info infinipath # (etc)
Table G-3. Common Tasks and Commands Summary
Function: Check the system state
Commands:
  ipath_checkout [options] hostsfile
  ipathbug-helper -m hostsfile > ipath-info-allhosts
  mpirun -m hostsfile -ppn 1 -np numhosts -nonmpi ipath_control -i
  Also see the file: /sys/class/infiniband/ipath*/device/status_str
Table G-3. Common Tasks and Commands Summary (Continued)
Function: Show the status of host IB ports
Commands:
  ipathbug-helper -m hostsfile > ipath-info-allhosts
  mpirun -m hostsfile -ppn 1 -np numhosts -nonmpi ipath_control -i
Function: Verify that the hosts see each other
Command: ipath_checkout --run=5 hostsfile
Summary and Descriptions of Useful Files
This information is useful for reporting problems to Technical Support.
NOTE
This file returns information on the form factor of the adapter that is installed. The PCIe half-height, short form factor is referred to as the QLE7140, QLE7240, QLE7280, QLE7340, or QLE7342.
You can check the version of the installed InfiniPath software by looking in:
/sys/class/infiniband/qib0/device/driver/version
QLogic-built drivers have contents similar to:
$Id: QLogic OFED Release x.x.x $
$Date: Day mmm dd hh:mm:ss timezone yyyy $
Non-QLogic-built drivers (in this case kernel.org) have contents similar to:
$Id: QLogic kernel.org driver $...
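A trivial example of reading that file (the device name qib0 follows the sample above; substitute your own adapter name):
$ cat /sys/class/infiniband/qib0/device/driver/version
$Id: QLogic OFED Release x.x.x $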
Summary of Configuration Files
Table G-7. Configuration Files
/etc/modprobe.conf — Specifies options for modules when they are added or removed by the modprobe command. Also used for creating aliases. The PAT write-combining option is set here. For Red Hat 5.X systems.
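A hedged illustration of the two uses described above; the exact lines are examples only, not required settings, and the eioc alias follows the VirtualNIC interface naming used in Appendix E:
alias eioc1 qlgc_vnic
options scsi_mod max_luns=255
Module options take the general form 'options <module_name> <parameter>=<value>'.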
H–Recommended Reading
Reference material for further reading is provided in this appendix.
References for MPI
The MPI Standard specification documents are located at: http://www.mpi-forum.org/docs
The MPICH implementation of MPI and its documentation are located at: http://www-unix.mcs.anl.gov/mpi/mpich/
The ROMIO distribution and its documentation are located at: http://www.mcs.anl.gov/romio
Books for Learning MPI Programming
Gropp, William, Ewing Lusk, and Anthony Skjellum, Using MPI, Second Edition,...
OpenFabrics
Information about the OpenFabrics Alliance (OFA) is located at: http://www.openfabrics.org
Clusters
Gropp, William, Ewing Lusk, and Thomas Sterling, Beowulf Cluster Computing with Linux, Second Edition, 2003, MIT Press, ISBN 0-262-69292-9
Networking
The Internet Frequently Asked Questions (FAQ) archives contain an extensive Request for Comments (RFC) section.