IBM Power7 Optimization And Tuning Manual page 49

Prefetching to avoid cache miss penalties
Prefetching to avoid cache miss penalties is another technique that is used to improve application performance. The idea is to fetch blocks of data into the cache a number of cycles before the data is needed, which hides the penalty of waiting for the data to be read from main storage. Prefetching can be speculative: depending on the conditional path that is taken through the code, the prefetched data might end up not being required, so the benefit of prefetching depends on how often the prefetched data is actually used. Although prefetching is not strictly related to cache geometry, it is an important technique.
A caveat to prefetching is that, although the technique commonly improves performance in single-thread, single-core, and low-utilization environments, it can actually decrease performance in high-utilization environments with a high thread count per socket.
Most systems today virtualize processors and the memory that is used by the workload.
Because of this situation, the application designer must consider that, although an LPAR
might be assigned only a few cores, the overall system likely has a large number of cores.
Further, if the LPARs are sharing processor cores, the problem becomes compounded.
The dcbt and dcbtst instructions are commonly used to prefetch data [23, 24]. The white paper Power Architecture ISA 2.06 Stride N Prefetch Engines to boost Application's performance [25] provides an overview of how these instructions can be used to improve application performance. These instructions can be used directly in hand-tuned assembly language code, or they can be accessed through compiler built-ins or directives.
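As a minimal sketch of the built-in route, assuming GCC or a compatible compiler (where __builtin_prefetch typically lowers to a dcbt touch on Power, and where the prefetch distance is a tuning assumption, not a POWER7-mandated value):

```c
#include <stddef.h>

/* Prefetch distance in elements; an assumption to be tuned so that the
 * touch covers the memory latency divided by the per-iteration cost. */
#define PREFETCH_DISTANCE 64

/* Sum an array while touching a cache line a fixed distance ahead of
 * the element currently being processed. */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* rw = 0 (read), locality = 3 (keep in cache); on Power this
             * maps to a data cache block touch (dcbt). */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        }
        sum += a[i];
    }
    return sum;
}
```

For stores, __builtin_prefetch(p, 1, 3) is the analogue of dcbtst. Whether the prefetch helps depends on the access pattern and, as noted above, on how contended the memory subsystem is.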
Prefetching is also automatically done by the POWER7 hardware and is configurable, as
described in 2.3.7, "Data prefetching using d-cache instructions and the Data Streams
Control Register (DSCR)" on page 46.
Alignment of data
Processors are optimized for accessing data elements on their naturally aligned boundaries.
Unaligned data accesses might require extra processing time by the processor for individual load or store instructions, and in some cases a trap and emulation by the host operating system. Ensuring natural data alignment also ensures that individual accesses do not span cache line boundaries.
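As a small illustration of why this matters (the helper name is made up), an access spans a cache line exactly when its first and last bytes fall in different 128-byte blocks, and a naturally aligned access never does:

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 128  /* Power Systems cache line size */

/* Returns 1 if a load or store of `size` bytes starting at `addr`
 * crosses a 128-byte cache line boundary, 0 otherwise. A naturally
 * aligned access (addr % size == 0, size a power of two <= 128)
 * can never cross. */
static int spans_cache_line(uintptr_t addr, size_t size)
{
    return (addr / CACHE_LINE) != ((addr + size - 1) / CACHE_LINE);
}
```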
Similar to the idea of splitting structures into hot and cold elements, the concept of data
alignment seeks to optimize cache performance by ensuring that data does not span across
multiple cache lines. The cache line size in Power Systems is 128 bytes.
The general technique for alignment is to keep operands (data) on natural boundaries, such as a word or doubleword boundary (that is, an int would be aligned to be on a word boundary in memory). This technique might involve padding and reordering data structures to avoid cases such as the interleaving of chars and doubles (char; double; char; double). High-level language compilers do automatic data alignment. However, padding must be carefully analyzed to ensure that it does not result in more cache misses or page misses (especially for rarely referenced groupings of data).
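A hypothetical pair of layouts sketches the effect (the struct and member names are invented; the exact sizes assume a typical ABI where a double is 8 bytes and 8-byte aligned):

```c
#include <stddef.h>

/* Interleaved layout: each double must start on an 8-byte boundary,
 * so the compiler inserts 7 bytes of padding after each char. */
struct interleaved {
    char   c1;  /* 1 byte, then 7 bytes of padding */
    double d1;  /* 8 bytes */
    char   c2;  /* 1 byte, then 7 bytes of padding */
    double d2;  /* 8 bytes */
};              /* typically 32 bytes */

/* Reordered layout: widest members first, chars packed at the end. */
struct reordered {
    double d1;  /* 8 bytes */
    double d2;  /* 8 bytes */
    char   c1;
    char   c2;  /* 2 bytes, then 6 bytes of tail padding */
};              /* typically 24 bytes */
```

The reordered form occupies fewer bytes, so more instances fit in a 128-byte cache line, and every double remains naturally aligned.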
23. dcbt (Data Cache Block Touch) Instruction, available at: http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.aixassem/doc/alangref/idalangref_dcbt_instrs.htm
24. dcbtst (Data Cache Block Touch for Store) Instruction, available at: http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.aixassem/doc/alangref/idalangref_dcbstst_instrs.htm
25. Power Architecture ISA 2.06 Stride N prefetch Engines to boost Application's performance, available at: https://www.power.org/documentation/whitepaper-on-stride-n-prefetch-feature-of-isa-2-06/ (registration required)
Chapter 2. The POWER7 processor
