IBM eServer pSeries High Performance Switch Tuning and Debug Guide
Version 1.0
April 2005
IBM Systems and Technology Group
Cluster Performance Department
Poughkeepsie, NY...
This paper is intended to help you tune and debug the performance of the IBM® pSeries® High Performance Switch (HPS) on IBM Cluster 1600 systems. It is not intended to be a comprehensive guide, but rather to help in initial tuning and debugging of performance issues.
2.0 Tunables and settings for switch software
To optimize the HPS, you can set shell variables for Parallel Environment MPI-based workloads and for IP-based workloads. This section reviews the shell variables that are most often used for performance tuning. For a complete list of tunables and their usage, see the documentation listed in section 7 of this paper.
2.1.4 MEMORY_AFFINITY
The POWER4™ and POWER4+™ models of the pSeries 690 have more than one multi-chip module (MCM). An MCM contains eight CPUs and frequently has two local memory cards. On these systems, application performance can improve when each CPU and the memory it accesses are on the same MCM.
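On AIX, a process requests this placement through the MEMORY_AFFINITY environment variable, and MCM is the value that asks for memory local to the MCM where the process runs. The fragment below is a minimal sketch of setting it in a job's shell. It assumes that memory affinity support is already enabled system-wide, which the fragment itself does not do.

```shell
#!/bin/sh
# Ask AIX to satisfy this process's memory requests from the local MCM.
# Assumption: system-wide memory affinity support is already enabled.
MEMORY_AFFINITY=MCM
export MEMORY_AFFINITY
```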
2.1.5 MP_TASK_AFFINITY
Setting MP_TASK_AFFINITY to SNI tells the Parallel Operating Environment (POE) to bind each task to the MCM containing the HPS adapter it will use, so that the adapter, CPU, and memory used by any task are all local to the same MCM. To prevent multiple tasks from sharing the same CPU, do not set MP_TASK_AFFINITY to SNI if more than four tasks share any HPS adapter.
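The four-task limit above can be checked before a job is launched. The following is a minimal sketch, not part of POE: the function name choose_task_affinity and its two arguments (tasks per node and HPS adapters per node) are hypothetical, and it assumes that tasks are spread evenly across the adapters.

```shell
#!/bin/sh
# Hypothetical helper: prints "SNI" when no adapter would serve more than
# four tasks, otherwise prints nothing.
# Assumption: tasks are spread evenly across the adapters on a node.
choose_task_affinity() {
    tasks=$1
    adapters=$2
    # Ceiling division gives the task count on the busiest adapter.
    per_adapter=$(( (tasks + adapters - 1) / adapters ))
    if [ "$per_adapter" -le 4 ]; then
        echo "SNI"
    fi
}

# Example: 8 tasks on a node with 2 adapters (4 tasks per adapter).
affinity=$(choose_task_affinity 8 2)
if [ -n "$affinity" ]; then
    MP_TASK_AFFINITY=$affinity
    export MP_TASK_AFFINITY
fi
```

With nine tasks on two adapters, one adapter would serve five tasks, so the helper prints nothing and MP_TASK_AFFINITY is left unset.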
The IP buffer pools are allocated in partitions of up to 16MB each. Each increase in the buffer that crosses a 16 MB boundary allocates an additional partition. If you are running a pSeries 655 system with two HPS links, allocate two partitions (32MB) of buffer space. If you are running a p690+ system with eight HPS links, set the buffer size to 128MB.
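Both examples above work out to one 16MB partition per HPS link. The sketch below performs that arithmetic and prints the result in hexadecimal. It assumes the same rule holds for other link counts, which the text states only for the two-link and eight-link cases, and hps_pool_size is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: one 16MB partition per HPS link, printed as hex.
# Assumption: the per-link rule generalizes beyond the 2- and 8-link cases.
hps_pool_size() {
    links=$1
    printf '0x%08x\n' $(( links * 16 * 1024 * 1024 ))
}

hps_pool_size 2   # pSeries 655 with two links: 32MB
hps_pool_size 8   # p690+ with eight links: 128MB
```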
rfifosize 0x1000000
rpoolsize 0x02000000
spoolsize 0x02000000

3.0 Tunables and settings for AIX 5L
Several settings in AIX 5L affect the performance of the HPS. These include settings for the IP and memory subsystems. The following sections provide a brief overview of the most commonly used tunables.
Read the vmo man page before changing these tunables, and test any vmo changes incrementally. Always consult IBM service before changing the vmo tunables strict_maxperm and strict_maxclient.
3.3.1 svmon
The svmon command provides information about the virtual memory usage by the kernel and user processes in the system at any given time. For example, to see system-wide information about the segments (256MB chunks of virtual memory), type the following command as root:
svmon -S
The command prints out segment information sorted according to values in the Inuse field, which shows the number of virtual pages in the segment that are mapped into the process address space.
PageSize Inuse 448221 3687 16MB

Vsid     Esid      Type  Description
1f187f   11        work  text data BSS heap
218a2    70000000  work  default shmat/mmap
131893   17        work  text data BSS heap
0                  work  kernel segment
1118b1   8001000a  work  private load
d09ad    90000000  work  loader segment
1611b6   90020014  work  shared library text
31823    10        clnt  text data BSS heap...
statistics in 5-second intervals, with the first set of statistics being the statistics since the node or LPAR was last booted.
vmstat 5
The pi and po columns of the page group show the number of 4KB pages read from and written to the paging device between consecutive samplings.
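Nonzero pi or po values are the samples worth a closer look. The sketch below filters vmstat output down to those samples and converts the page counts to kilobytes at 4KB per page. It assumes the AIX column order "r b avm fre re pi po ...", so that pi is field 6 and po is field 7, and paging_samples is a hypothetical name.

```shell
#!/bin/sh
# Reads vmstat output on stdin and prints only the samples that show paging.
# Assumption: AIX column order "r b avm fre re pi po ...", so pi is
# field 6 and po is field 7. Header and blank lines are skipped because
# their first field is not a number.
paging_samples() {
    awk '$1 ~ /^[0-9]+$/ && ($6 > 0 || $7 > 0) {
        printf "pi=%d po=%d (%d KB in, %d KB out)\n", $6, $7, $6 * 4, $7 * 4
    }'
}

# Typical use: vmstat 5 | paging_samples
```

Because the first sample reports activity since boot, a nonzero value on the first line alone does not indicate current paging.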
adapter is configured. The volume of reservation is proportional to the number of user windows configured on the HPS adapter. A private window is required for each MPI task. Here is a formula to calculate the number of TLPs needed by the HPS adapter. In the formula below, number_of_sni refers to the number of sniX logical interfaces present in the partition.
3.5 Large pages and IP support
One of the most important ways to improve IP performance on the HPS is to ensure that large pages are enabled. You must allocate a number of large pages, which the HPS IP driver uses at boot time.
4.1 RSCT daemons
If you are using RSCT Peer Domain (for example, with VSD, GPFS, LAPI striping, or failover), check the IBM.ConfigRMd daemon and the hats_nim daemon. If you see these daemons consuming CPU cycles, restart the daemons with AIXTHREAD_SCOPE=S.
pshpstuningguidewp040105.doc...
4.2 LoadLeveler daemons
The LoadLeveler® daemons are needed for MPI applications using HPS. However, you can lower the impact on a parallel application by changing the default settings for these daemons. You can lower the impact of the LoadLeveler daemons by:
- Reducing the number of daemons running
- Reducing daemon communication or placing daemons on a switch
- Reducing logging...
SCHEDD_DEBUG = -D_ALWAYS

4.3 Settings for AIX 5L threads
Several variables help you use AIX 5L threads to tune performance. These are the recommended initial settings for AIX 5L threads when using HPS. Set them in the /etc/environment file.
AIXTHREAD_SCOPE=S
AIXTHREAD_MNRATIO=1:1
AIXTHREAD_COND_DEBUG=OFF
AIXTHREAD_GUARDPAGES=4...
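A node can be checked against these settings with a short script. The sketch below is illustrative only: check_thread_settings is a hypothetical name, and it covers only the four settings visible above, not any that follow in the full list.

```shell
#!/bin/sh
# Hypothetical helper: reports which of the four AIXTHREAD settings shown
# above are missing from an environment file (default /etc/environment).
# Returns 0 when all four are present, 1 otherwise.
check_thread_settings() {
    env_file=${1:-/etc/environment}
    missing=0
    for setting in \
        AIXTHREAD_SCOPE=S \
        AIXTHREAD_MNRATIO=1:1 \
        AIXTHREAD_COND_DEBUG=OFF \
        AIXTHREAD_GUARDPAGES=4
    do
        # -x matches the whole line, -F treats the setting as a fixed string.
        if ! grep -qxF "$setting" "$env_file"; then
            echo "missing: $setting"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_thread_settings with no argument inspects /etc/environment. Passing a path makes it possible to check a copy gathered from another node.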
5.1.1 driver_debug setting
The driver_debug setting is used to increase the amount of information collected by the HPS device drivers. Leave this setting at its default value unless you are directed to change it by IBM service.

5.1.2 ip_trc_lvl setting
The ip_trc_lvl setting is used to change the amount of data collected by the IP driver.
5.3 Affinity LPARs
On p690 systems, if you are running with more than one LPAR for each CEC, make sure you are running affinity LPARs. To check affinity between CPU, memory, and HPS links, run the associativity scripts on the LPARs. To check the memory affinity setting, run the vmo command.
-n 1 -p0
If the lsswtopol command calls out links as "service required," but these links do not show up in Service Focal Point, contact IBM service.

5.9 Multiple versions of MPI libraries
One common problem on clustered systems is having different MPI library levels on various nodes.
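Mismatched levels show up as differing checksums for the same library file across nodes. The sketch below tallies the distinct checksums in collected output; more than one line of output means the nodes disagree. It assumes each input line has the form "node: checksum blocks path", which is an assumption about how the collected output is prefixed, and checksum_levels is a hypothetical name.

```shell
#!/bin/sh
# Reads "node: checksum blocks path" lines on stdin and prints each
# distinct checksum with the number of nodes reporting it, most common
# first.
# Assumption: the node name is field 1 and the checksum is field 2.
checksum_levels() {
    awk '{ count[$2]++ } END { for (c in count) print count[c], c }' | sort -rn
}
```

The least common checksum usually identifies the nodes that need to be brought to the level of the rest of the cluster.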
For HAL libraries:
dsh sum /usr/sni/aix52/lib/libhal_r.a
For MPI libraries:
dsh sum /usr/lpp/ppe.poe/lib/libmpi_r.a (or run with MP_PRINTENV=yes)
To make sure you are running the correct combination of HAL, LAPI, and MPI, check the Service Pack Release Notes.

5.10 MP_PRINTENV
If you set MP_PRINTENV=YES or MP_PRINTENV=script_name, the output includes the following information about environment variables.
Run the following command:
/usr/sbin/ifsn_dump -a
The data is collected in sni.snap (sni_dump.out.Z), and provides useful information, such as the local MAC address:
mac_addr 0:0:0:40:0:0
If you are seeing arpq drops, ensure the source has the correct mac_addr for its destination. The ndd statistics listed in ifsn_dump are useful for measuring packet drops in relation to the overall number of packets sent and received.
To help you isolate the exact cause of packet drops, the ifsn_dump -a command also lists the following debug statistics. If you isolate packet drops to these statistics, you will probably need to contact IBM support.
dbg: sNet_drop sRTF_drop sMbuf_drop...
There are two routes.
sending packet using route No. 1
ml ip address structure, starting:
ml flag (ml interface up or down) = 0x00000000
ml tick = 0
ml ip address = 0xc0a80203, 192.168.2.3
There are two preferred route pairs:
from local if 0 to remote if 0
from local if 1 to remote if 1
There are two actual routes (two preferred).
MAC WOF (2F870): Bit: 1 [. . .]

5.12.4 Packets dropped in the switch hardware
If a packet is dropped within the switch hardware itself (for example, when traversing the link between two switch chips), evidence of the packet drop is on the HMC, where the switch Federation Network Manager (FNM) runs.
5.14 LAPI_DEBUG_COMM_TIMEOUT
If the LAPI protocol experiences communication timeouts, set the environment variable LAPI_DEBUG_COMM_TIMEOUT to PAUSE. This causes the application to issue a pause() call when encountering a timeout, which stops the application instead of closing it.

5.15 LAPI_DEBUG_PERF
The LAPI_DEBUG_PERF flag is not supported and should not be used in production. However, it can provide useful information about packet loss.
HPS performs very well. If tuning is needed, several good tools are available to help diagnose performance problems.

7.0 Additional reading
This section lists documents that contain additional information about the topics in this white paper.

7.1 HPS documentation
pSeries High Performance Switch - Planning, Installation and Service, GA22-7951-02
AIX 5L Version 5.2 Performance Tools Guide and Reference, SC23-4859-03

7.4 IBM Redbooks™
AIX 5L Performance Tools Handbook, SG24-6039-01

7.5 POWER4
POWER4 Processor Introduction and Tuning Guide, SG24-7041-00
How to Control Resource Affinity on Multiple MCM or SCM pSeries Architecture in an HPC Environment
http://www.redbooks.ibm.com/redpapers/abstracts/redp3932.html
“AS IS” and no warranties or guarantees are expressed or implied by IBM. The IBM home page on the Internet can be found at http://www.ibm.com. The pSeries home page on the Internet can be found at http://www.ibm.com/servers/eserver/pseries.