IBM pSeries Tuning Manual

High performance switch tuning and debug guide
IBM eServer pSeries
High Performance Switch
Tuning and Debug Guide
Version 1.0
April 2005
IBM Systems and Technology Group
Cluster Performance Department
Poughkeepsie, NY


Summary of Contents for IBM pSeries

  • Page 1 IBM ~ pSeries High Performance Switch Tuning and Debug Guide Version 1.0 April 2005 IBM Systems and Technology Group Cluster Performance Department Poughkeepsie, NY...
  • Page 2: Table Of Contents

    Contents
    1.0 Introduction ... 4
    2.0 Tunables and settings for switch software ... 5
    2.1 MPI tunables for Parallel Environment ... 5
    2.1.1 MP_EAGER_LIMIT ... 5
    2.1.2 MP_POLLING_INTERVAL and MP_RETRANSMIT_INTERVAL ... 5
    2.1.3 MP_REXMIT_BUF_SIZE and MP_REXMIT_BUF_CNT ... 6
    2.1.4 MEMORY_AFFINITY ... 6
    2.1.5 MP_TASK_AFFINITY ... 7
    2.1.6 MP_CSS_INTERRUPT ...
  • Page 3

    5.16 AIX 5L trace for daemon activity ... 30
    6.0 Conclusions and summary ... 30
    7.0 Additional reading ... 30
    7.1 HPS documentation ... 30
    7.2 MPI documentation ... 31
    7.3 AIX 5L performance guides ... 31
    7.4 IBM Redbooks ... 31
    7.5 POWER4 ... 31
    pshpstuningguidewp040105.doc Page 3
  • Page 4: Introduction

    This paper is intended to help you tune and debug the performance of the IBM ® pSeries® High Performance Switch (HPS) on IBM Cluster 1600 systems. It is not intended to be a comprehensive guide, but rather to help in initial tuning and debugging of performance issues.
  • Page 5: Tunables And Settings For Switch Software

    2.0 Tunables and settings for switch software To optimize the HPS, you can set shell variables for Parallel Environment MPI-based workloads and for IP-based workloads. This section reviews the shell variables that are most often used for performance tuning. For a complete list of tunables and their usage, see the documentation listed in section 7 of this paper.
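As a quick illustration of how these tunables are applied, the Parallel Environment variables are ordinary shell environment variables exported before the job is launched. The values below are assumptions chosen for the example, not tuned recommendations for any particular workload; see the documentation in section 7 for guidance on choosing values.

```shell
# Illustrative only: exporting PE MPI tunables before launching a job.
# The numeric values here are example assumptions, not recommendations.
export MP_EAGER_LIMIT=65536        # eager-send message-size threshold (bytes)
export MP_POLLING_INTERVAL=400000  # polling interval (microseconds)
# poe ./my_app -procs 32           # then launch under POE (site-specific)
```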
  • Page 6: Mp_Rexmit_Buf_Size And Mp_Rexmit_Buf_Cnt

    2.1.4 MEMORY_AFFINITY The POWER4™ and POWER4+™ models of the pSeries 690 have more than one multi-chip module (MCM). An MCM contains eight CPUs and frequently has two local memory cards. On these systems, application performance can improve when each CPU and the memory it accesses are on the same MCM.
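A minimal sketch of requesting memory affinity on such a system, using the MEMORY_AFFINITY environment variable described above:

```shell
# Ask AIX to satisfy page allocations from memory local to the MCM on
# which the allocating thread is running.
export MEMORY_AFFINITY=MCM
```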
  • Page 7: Mp_Task_Affinity

    2.1.5 MP_TASK_AFFINITY Setting MP_TASK_AFFINITY to SNI tells parallel operating environment (POE) to bind each task to the MCM containing the HPS adapter it will use, so that the adapter, CPU, and memory used by any task are all local to the same MCM. To prevent multiple tasks from sharing the same CPU, do not set MP_TASK_AFFINITY to SNI if more than four tasks share any HPS adapter.
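The binding described above can be requested as follows; per the caveat in the text, this sketch assumes no more than four tasks share any HPS adapter:

```shell
# Have POE bind each task to the MCM containing the HPS adapter the
# task will use, keeping adapter, CPU, and memory on one MCM.
export MP_TASK_AFFINITY=SNI
```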
  • Page 8: Chgsni Command

    The IP buffer pools are allocated in partitions of up to 16 MB each. Each increase in buffer size that crosses a 16 MB boundary allocates an additional partition. If you are running a pSeries 655 system with two HPS links, allocate two partitions (32 MB) of buffer space. If you are running a p690+ system with eight HPS links, set the buffer size to 128 MB.
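A sketch of how the pool sizes might be raised with the chgsni command for the two-link pSeries 655 case. The interface name sni0 and the exact flag syntax are assumptions for this example; verify the chgsni syntax against your installation's documentation before applying it.

```shell
# Sketch only (verify syntax): set 32 MB (two 16 MB partitions) of send
# and receive pool space on HPS interface sni0 for a two-link system.
chgsni -l sni0 -a spoolsize=0x2000000 -a rpoolsize=0x2000000
```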
  • Page 9: Tunables And Settings For Aix 5L

    rfifosize 0x1000000
    rpoolsize 0x02000000
    spoolsize 0x02000000

    3.0 Tunables and settings for AIX 5L
    Several settings in AIX 5L impact the performance of the HPS. These include the IP and memory subsystems. The following sections provide a brief overview of the most commonly used tunables.
  • Page 10: Svmon And Vmstat Commands

    Read the vmo man page before changing these tunables, and test any vmo changes incrementally. Always consult IBM service before changing the vmo tunables strict_maxperm and strict_maxclient.
  • Page 11: Svmon

    3.3.1 svmon
    The svmon command provides information about virtual memory usage by the kernel and user processes in the system at any given time. For example, to see system-wide information about segments (256 MB chunks of virtual memory), run the following command as root:
    svmon -S
    The command prints segment information sorted by the Inuse field, which shows the number of virtual pages in the segment that are mapped into the process address space.
  • Page 12: Vmstat

    PageSize   Inuse
    4KB        448221
    16MB       3687

    Vsid     Esid      Type  Description
    1f187f   11        work  text data BSS heap
    218a2    70000000  work  default shmat/mmap
    131893   17        work  text data BSS heap
    0                  work  kernel segment
    1118b1   8001000a  work  private load
    d09ad    90000000  work  loader segment
    1611b6   90020014  work  shared library text
    31823    10        clnt  text data BSS heap...
  • Page 13: Large

    The following command reports statistics in 5-second intervals, with the first set of statistics covering the period since the node or LPAR was last booted:
    vmstat 5
    The pi and po columns of the page group show the number of 4KB pages read from and written to the paging device between consecutive samplings.
  • Page 14 adapter is configured. The volume of reservation is proportional to the number of user windows configured on the HPS adapter. A private window is required for each MPI task. Here is a formula to calculate the number of TLPs needed by the HPS adapter. In the formula below, number_of_sni refers to the number of sniX logical interfaces present in the partition.
  • Page 15: Large Pages And Ip Support

    3.5 Large pages and IP support
    One of the most important ways to improve IP performance on the HPS is to ensure that large pages are enabled, because the HPS IP driver allocates a number of large pages for its own use at boot time.
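Large pages are typically enabled through the vmo reboot tunables; a sketch under the assumption of a 16 MB large-page size, with an example region count that is workload-dependent and not a recommendation:

```shell
# Sketch: reserve 64 large (16 MB) pages at the next reboot.
# lgpg_regions=64 is an assumed example value; size it to your workload.
vmo -r -o lgpg_regions=64 -o lgpg_size=16777216
bosboot -ad /dev/ipldevice   # rebuild the boot image; reboot to take effect
```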
  • Page 16: Debug Settings In The Aix 5L Kernel

    4.1 RSCT daemons
    If you are using RSCT peer domain functions (such as VSD, GPFS, LAPI striping, or failover), check the IBM.ConfigRMd daemon and the hats_nim daemon. If you see these daemons taking cycles, restart them with AIXTHREAD_SCOPE=S.
  • Page 17: Loadleveler Daemons

    4.2 LoadLeveler daemons
    The LoadLeveler® daemons are needed for MPI applications using HPS. However, you can lower the impact of the LoadLeveler daemons on a parallel application by changing their default settings:
    • Reducing the number of daemons running
    • Reducing daemon communication or placing daemons on a switch
    • Reducing logging...
  • Page 18: Settings For Aix 5L Threads

    SCHEDD_DEBUG = -D_ALWAYS

    4.3 Settings for AIX 5L threads
    Several variables help you use AIX 5L threads to tune performance. These are the recommended initial settings for AIX 5L threads when using HPS. Set them in the /etc/environment file:
    AIXTHREAD_SCOPE=S
    AIXTHREAD_MNRATIO=1:1
    AIXTHREAD_COND_DEBUG=OFF
    AIXTHREAD_GUARDPAGES=4...
  • Page 19: Debug Settings And Data Collection Tools

    5.1.1 driver_debug setting
    The driver_debug setting increases the amount of information collected by the HPS device drivers. Leave this setting at its default value unless you are directed to change it by IBM service.
    5.1.2 ip_trc_lvl setting
    The ip_trc_lvl setting changes the amount of data collected by the IP driver.
  • Page 20: Affinity Lpars

    5.3 Affinity LPARs On p690 systems, if you are running with more than one LPAR for each CEC, make sure you are running affinity LPARs. To check affinity between CPU, memory, and HPS links, run the associativity scripts on the LPARs. To check the memory affinity setting, run the vmo command.
  • Page 21: Errpt Command

    -n 1 -p0
    If the lsswtopol command reports links as "service required" but these links do not show up in Service Focal Point, contact IBM service.
    5.9 Multiple versions of MPI libraries
    One common problem on clustered systems is having different MPI library levels on various nodes.
  • Page 22: Mp_Printenv

    For HAL libraries:
    dsh sum /usr/sni/aix52/lib/libhal_r.a
    For MPI libraries:
    dsh sum /usr/lpp/ppe.poe/lib/libmpi_r.a
    (or run with MP_PRINTENV=yes)
    To make sure you are running the correct combination of HAL, LAPI, and MPI, check the Service Pack Release Notes.
    5.10 MP_PRINTENV
    If you set MP_PRINTENV=YES or MP_PRINTENV=script_name, the output includes the following information about environment variables.
  • Page 23: Mp_Statistics

    MEMORY_AFFINITY
    Single Thread Usage (MP_SINGLE_THREAD)
    Hints Filtered (MP_HINTS_FILTERED)
    MPI-I/O Buffer Size (MP_IO_BUFFER_SIZE)
    MPI-I/O Error Logging (MP_IO_ERRLOG)
    MPI-I/O Node File (MP_IO_NODEFILE)
    MPI-I/O Task List (MP_IO_TASKLIST)
    System Checkpointable (CHECKPOINT)
    LoadLeveler Gang Scheduler
    DMA Receive FIFO Size (Bytes)
    Max outstanding packets
    LAPI Max Packet Size (Bytes)
    LAPI Ack Threshold (MP_ACK_THRESH)
    LAPI Max retransmit buf size (MP_REXMIT_BUF_SIZE)
    LAPI Max retransmit buf count (MP_REXMIT_BUF_CNT)
  • Page 24: Dropped Switch Packets

    MPCI: sends = 14
    MPCI: sendsComplete = 14
    MPCI: sendWaitsComplete = 17
    MPCI: recvs = 17
    MPCI: recvWaitsComplete = 13
    MPCI: earlyArrivals = 5
    MPCI: earlyArrivalsMatched = 5
    MPCI: lateArrivals = 8
    MPCI: shoves = 10
    MPCI: pulls = 13
    MPCI: threadedLockYields = 0
    MPCI: unorderedMsgs = 0
    LAPI: Tot_dup_pkt_cnt=0...
  • Page 25 Run the following command: /usr/sbin/ifsn_dump -a The data is collected in sni.snap (sni_dump.out.Z), and provides useful information, such as the local mac address: mac_addr 0:0:0:40:0:0 If you are seeing arpq drops, ensure the source has the correct mac_addr for its destination. The ndd statistics listed in ifsn_dump are useful for measuring packet drops in relation to the overall number of packets sent and received.
  • Page 26: Packets Dropped In The Ml0 Interface

    To help you isolate the exact cause of packet drops, the ifsn_dump -a command also lists the following debug statistics. If you isolate packet drops to these statistics, you will probably need to contact IBM support. dbg: sNet_drop sRTF_drop sMbuf_drop...
  • Page 27: Packets Dropped Because Of A Hardware Problem On An Endpoint

    There are two routes. Sending packet using route No. 1
    ml ip address structure, starting:
    ml flag (ml interface up or down) = 0x00000000
    ml tick = 0
    ml ip address = 0xc0a80203, 192.168.2.3
    There are two preferred route pairs:
    from local if 0 to remote if 0
    from local if 1 to remote if 1
    There are two actual routes (two preferred).
  • Page 28: Packets Dropped In The Switch Hardware

    MAC WOF (2F870): Bit: 1 [. . .] 5.12.4 Packets dropped in the switch hardware If a packet is dropped within the switch hardware itself (for example, when traversing the link between two switch chips), evidence of the packet drop is on the HMC, where the switch Federation Network Manager (FNM) runs.
  • Page 29: Lapi_Debug_Comm_Timeout

    5.14 LAPI_DEBUG_COMM_TIMEOUT
    If the LAPI protocol experiences communication timeouts, set the environment variable LAPI_DEBUG_COMM_TIMEOUT to PAUSE. This causes the application to issue a pause() call when it encounters a timeout, which suspends the application instead of terminating it.
    5.15 LAPI_DEBUG_PERF
    The LAPI_DEBUG_PERF flag is not supported and should not be used in production. However, it can provide useful information about packet loss.
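The timeout behavior described above is enabled in the job's environment before launch; a minimal sketch:

```shell
# Make LAPI pause (rather than exit) on a communication timeout, so the
# stopped tasks can be examined before the job is cleaned up.
export LAPI_DEBUG_COMM_TIMEOUT=PAUSE
```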
  • Page 30: Aix 5L Trace For Daemon Activity

    HPS performs very well. If tuning is needed, there are several good tools to use to determine performance problems.
    7.0 Additional reading
    This section lists documents that contain additional information about the topics in this white paper.
    7.1 HPS documentation
    pSeries High Performance Switch - Planning, Installation and Service, GA22-7951-02
  • Page 31: Mpi Documentation

    AIX 5L Version 5.2 Performance Tools Guide and Reference, SC23-4859-03
    7.4 IBM Redbooks™
    AIX 5L Performance Tools Handbook, SG24-6039-01
    7.5 POWER4
    POWER4 Processor Introduction and Tuning Guide, SG24-7041-00
    How to Control Resource Affinity on Multiple MCM or SCM pSeries Architecture in an HPC Environment
    http://www.redbooks.ibm.com/redpapers/abstracts/redp3932.html
  • Page 32 “AS IS” and no warranties or guarantees are expressed or implied by IBM. The IBM home page on the Internet can be found at http://www.ibm.com. The pSeries home page on the Internet can be found at http://www.ibm.com/servers/eserver/pseries.
