Additionally, to achieve optimal performance, floating-point and VMX/VSX data have different alignment requirements. For example, the preferred VSX alignment is 16 bytes instead of the
element size of the data type being used. This situation means that VSX data that is smaller
than 16 bytes in length must be padded out to 16 bytes. The compilers introduce padding as
necessary to provide optimal alignment for vector data types.
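That padding is visible at the language level. The following C sketch is illustrative only: on POWER compilers the natural 16-byte type is the built-in vector float (enabled with -maltivec under GCC), but a portable 16-byte-aligned structure stands in for it here so the example compiles anywhere, and the type and field names are assumptions, not from this guide:

#include <stddef.h>
#include <stdio.h>

/* Stand-in for a 16-byte VSX vector type such as "vector float". */
typedef struct {
    float f[4];
} __attribute__((aligned(16))) vec4f;

struct record {
    float xyz[3]; /* 12 bytes of scalar payload                     */
    vec4f v;      /* the compiler inserts 4 bytes of padding so that
                     v starts on a 16-byte boundary                 */
};

int main(void)
{
    /* Expect offset 16 and total size 32: 12 (xyz) + 4 (pad) + 16 (v). */
    printf("offsetof(struct record, v) = %zu\n", offsetof(struct record, v));
    printf("sizeof(struct record)      = %zu\n", sizeof(struct record));
    return 0;
}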
Sensitivity of scaling to more cores
Different processor chip versions and system models provide different degrees of scaling of LPARs and workloads to cores. Different processor chips and systems might also have different bus widths and latencies. All of these factors mean that the sensitivity of an application's or workload's performance to the number of cores it runs on varies with the processor chip version and system model.
In general terms, an application that tends to not access memory much (that is, one that is core-centric) scales nearly perfectly across more cores. Performance loss when scaling across multiple cores tends to come from one or more of the following sources:
- Increased cache misses (often from invalidations of data by other processor cores, especially for locks)
- The increased cost of cache misses, which in turn drives overall memory and interconnect fabric traffic into the region of bandwidth limitations (saturating the memory busses and interconnect)
- The additional cores that are being added to the workload in other nodes, resulting in increased latency in reaching memory and caches in those nodes
Briefly, cache miss requests and returning data can end up being routed through busses that
connect multiple chips and memory, which have particular bandwidth and latency
characteristics. The goal for scaling across multiple cores, then, is to minimize the change in
the potential penalties that are associated with cache misses and data requests as the
workload size grows.
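The first of these sources, cross-core invalidations, can occur even without an explicit lock when two cores write different variables that happen to share a cache line (false sharing). The following C sketch shows the usual mitigation: padding each thread's counter out to a full 128-byte POWER7 cache line so that stores from one core do not invalidate a line another core is using. The thread count, sizes, and names are illustrative assumptions:

#include <pthread.h>
#include <stdio.h>

#define LINE  128              /* POWER7 cache-line size in bytes */
#define ITERS 100000000L

/* One counter per cache line. Removing the padding makes both
   counters share a line, and each store then invalidates the
   other core's copy of that line. */
struct padded {
    volatile long n;
    char pad[LINE - sizeof(long)];
} __attribute__((aligned(LINE)));

static struct padded counter[2];

static void *spin(void *arg)
{
    struct padded *c = arg;
    for (long i = 0; i < ITERS; i++)
        c->n++;                /* the line stays in this core's cache */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, spin, &counter[0]);
    pthread_create(&b, NULL, spin, &counter[1]);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("%ld %ld\n", counter[0].n, counter[1].n);
    return 0;
}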
It is difficult to assess what strategies are effective for scaling to more cores without
considering the complex aspects of a specific application. For example, if all of the cores that
the application is running across eventually access all of the data, then it might be wise to
interleave data across the processor sockets (each typically a grouping of processor chips) to optimize memory bus utilization. However, if the access
pattern to data is more localized so that, for most of the data, separate processor cores are
accessing it most of the time, the application might obtain better performance if the data is
close to the processor core that is accessing that data the most (maintaining affinity between
the application thread and the data it is accessing). For the latter case, where the data ought
to be close to the processor core that is accessing the data, the AIX MEMORY_AFFINITY=MCM
environment variable can be set to achieve this behavior.
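A minimal sketch of that pattern follows. It assumes the documented AIX first-touch behavior: with MEMORY_AFFINITY=MCM set, page faults are satisfied from memory local to the MCM where the faulting thread runs, so each worker allocates and touches its own buffer rather than having main() prepare it. Explicit thread-to-core binding (for example, with bindprocessor()) is omitted for brevity, and the worker count and buffer size are illustrative:

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NWORKERS 4
#define CHUNK    (16 * 1024 * 1024)   /* 16 MB per worker */

static void *worker(void *arg)
{
    (void)arg;
    /* Allocate and touch pages from this thread, not from main(),
       so that under MEMORY_AFFINITY=MCM they are backed by memory
       near the core this thread runs on. */
    char *buf = malloc(CHUNK);
    if (buf == NULL)
        return NULL;
    memset(buf, 0, CHUNK);            /* first touch places the pages */
    /* ... compute on buf from this thread only ... */
    free(buf);
    return NULL;
}

int main(void)
{
    pthread_t tid[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}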
When multiple processor cores are accessing the same data and that data is being held by a lock, resulting in the data line in the cache being invalidated, programs can suffer. This phenomenon is often referred to as hot locks, where a lock is holding data that has a high rate of contention. Hot locks result in intervention and can easily limit the ability to scale a workload because all updates to the lock are serialized. Tools such as splat (see "AIX trace-based analysis tools" on page 165) can be used to identify hot locks.
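The shape of a hot lock is easy to reproduce. In the following C sketch, every thread serializes on a single mutex to update one shared counter, so adding cores adds contention and coherence traffic rather than throughput; the thread and iteration counts are illustrative:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8
#define ITERS    1000000

static pthread_mutex_t hot = PTHREAD_MUTEX_INITIALIZER;
static long shared_count = 0;

static void *bump(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&hot);   /* the line holding the lock and
                                       counter bounces between cores */
        shared_count++;
        pthread_mutex_unlock(&hot);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, bump, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("count = %ld\n", shared_count);
    return 0;
}

The usual remedies are finer-grained locks or per-thread counters that are combined at the end, so that most updates touch data that is private to one core.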
